regex - Handle Unicode characters with Python regexes
I'm writing a simple application where I want to replace certain words with other words. I'm running into problems with words that use single quotes, such as aren't, ain't, isn't.

I have a text file with the following:
aren’t=ain’t
hello=hey

I parse the text file and create a dictionary out of it:

u'aren\u2019t' = u'ain\u2019t'
u'hello' = u'hey'

Then I try to replace the words in a given text:
import re

text = u"aren't"

def replace_all(text, dict):
    for i, k in dict.iteritems():
        # replace whole-word occurrences of i with k in the lower-cased text; regex = \bstring\b
        text = re.sub(r"\b" + i + r"\b", k, text.lower())
    return text

The problem is that re.sub() doesn't match u'aren\u2019t' against u"aren't".
What can I do to make the replace_all() function match both "hello" and "aren't" and replace them with the appropriate text? Can I do something in Python so that the dictionary doesn't contain Unicode? Could I convert my text to use the Unicode character, or modify the regex to match the Unicode character as well as the other text?
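I guess your problem is:

text = u"aren't"

instead of:

text = u"aren’t"

(Note the different apostrophes: the first is the plain ASCII apostrophe, the second is the typographic one, \u2019, that your dictionary contains.)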
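To make the difference between the two apostrophes visible, here is a small sketch that prints the code point and Unicode name of each one. It assumes Python 3 (where every str is Unicode), unlike the Python 2 code in the question, and the helper name `describe` is made up for illustration.

```python
import unicodedata

straight = "aren't"        # typed with the ASCII apostrophe
curly = "aren\u2019t"      # typed with the typographic apostrophe


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


# Index 4 is the apostrophe in both words: a-r-e-n-'-t
print(describe(straight[4]))
print(describe(curly[4]))

# The two words look alike but are different strings,
# so a pattern built from one cannot match the other.
print(straight == curly)
```

Because the strings differ in that one character, a regex built from the dictionary key has nothing to match in the text.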
Here's your code modified to make it work:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

d = {
    u'aren’t': u'ain’t',
    u'hello': u'hey'
}

#text = u"aren't"
text = u"aren’t"


def replace_all(text, d):
    for i, k in d.iteritems():
        # replace whole-word occurrences of i with k in the lower-cased text; regex = \bstring\b
        text = re.sub(r"\b" + i + r"\b", k, text.lower())
    return text

if __name__ == '__main__':
    newtext = replace_all(text, d)
    print newtext

Output:
ain’t
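The code above only matches when the text uses the same \u2019 apostrophe as the dictionary. If the input might contain either apostrophe, one possible approach is to normalize both the text and the dictionary entries to a single character before matching. The following is only a sketch under some assumptions: it is written for Python 3 rather than the Python 2 used above, `normalize_apostrophes` is a hypothetical helper that is not part of the original code, and it adds `re.escape` so that dictionary keys are treated as literal text.

```python
import re

CURLY_APOSTROPHE = "\u2019"


def normalize_apostrophes(s):
    """Map the typographic apostrophe U+2019 to the ASCII apostrophe U+0027."""
    return s.replace(CURLY_APOSTROPHE, "'")


def replace_all(text, replacements):
    """Replace whole words in the lower-cased text, ignoring apostrophe style."""
    text = normalize_apostrophes(text.lower())
    for old, new in replacements.items():
        # Escape the key so it is matched literally, then require word boundaries.
        pattern = r"\b" + re.escape(normalize_apostrophes(old.lower())) + r"\b"
        new_norm = normalize_apostrophes(new)
        # A function replacement avoids backslash processing in the new text.
        text = re.sub(pattern, lambda _match: new_norm, text)
    return text


d = {"aren\u2019t": "ain\u2019t", "hello": "hey"}

print(replace_all("Hello, they aren't here", d))
print(replace_all("Hello, they aren\u2019t here", d))
```

Note that this sketch returns text with ASCII apostrophes regardless of what the input used. If the typographic apostrophe should be kept in the output, the normalization could go the other way instead.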