regex - Handle Unicode characters with Python regexes
I'm writing a simple application in which I want to replace certain words with other words. I'm running into problems with words that use single quotes, such as aren't, ain't, and isn't.
I have a text file with the following:
aren’t=ain’t
hello=hey
I parse the text file and create a dictionary out of it:
u'aren\u2019t' = u'ain\u2019t'
u'hello' = u'hey'
Then I try to replace characters in a given text:
import re

text = u"aren't"

def replace_all(text, dict):
    for i, k in dict.iteritems():
        # replace whole words of i with k in lower cased text, regex = \bstring\b
        text = re.sub(r"\b" + i + r"\b", k, text.lower())
    return text
The problem is that re.sub() doesn't match u'aren\u2019t' against u"aren't".
What can I do so that my replace_all() function matches both "hello" and "aren't", and replaces them with the appropriate text? Can I do something in Python so that my dictionary doesn't contain Unicode? Could I convert my text to use the Unicode character, or modify the regex to match the Unicode character as well as all the other text?
I guess your problem is:
text = u"aren't"
instead of:
text = u"aren’t"
(note the different apostrophes?)
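A quick way to confirm that the two apostrophes really are different characters is to print their code points. This is only a small sketch; the variable names ascii_apostrophe and curly_apostrophe are made up for illustration:

```python
# -*- coding: utf-8 -*-
import unicodedata

# Illustrative names, not from the original code.
ascii_apostrophe = u"'"       # the character typed in u"aren't"
curly_apostrophe = u"\u2019"  # the character stored in the dictionary key

# Print the code point and official Unicode name of each character.
for ch in (ascii_apostrophe, curly_apostrophe):
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# The two spellings compare unequal, so a regex built from one
# cannot match the other.
print(u"aren't" == u"aren\u2019t")
```

The first character is U+0027 APOSTROPHE and the second is U+2019 RIGHT SINGLE QUOTATION MARK, so the comparison prints False.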
Here's your code modified to make it work:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

d = {
    u'aren’t': u'ain’t',
    u'hello': u'hey'
}

#text = u"aren't"
text = u"aren’t"

def replace_all(text, d):
    for i, k in d.iteritems():
        # replace whole words of i with k in lower cased text, regex = \bstring\b
        text = re.sub(r"\b" + i + r"\b", k, text.lower())
    return text

if __name__ == '__main__':
    newtext = replace_all(text, d)
    print newtext
Output:
ain’t
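If the input text can't be guaranteed to use the same apostrophe as the dictionary, one possible approach (not part of the code above, just a sketch) is to normalize both sides to the ASCII apostrophe before matching. The helper normalize_apostrophes is a hypothetical name, and the choice to normalize toward ASCII rather than toward U+2019 is an assumption. The sketch also uses re.escape and .items() so that it runs on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
import re

def normalize_apostrophes(s):
    # Hypothetical helper: map the typographic apostrophe (U+2019)
    # to the plain ASCII apostrophe (U+0027).
    return s.replace(u"\u2019", u"'")

def replace_all(text, mapping):
    # Lower-case and normalize the text once, up front.
    text = normalize_apostrophes(text.lower())
    for old, new in mapping.items():
        # Normalize the key too, and escape it so that any regex
        # metacharacters in a word are matched literally.
        pattern = r"\b" + re.escape(normalize_apostrophes(old)) + r"\b"
        text = re.sub(pattern, normalize_apostrophes(new), text,
                      flags=re.UNICODE)
    return text

d = {u"aren\u2019t": u"ain\u2019t", u"hello": u"hey"}

print(replace_all(u"Hello, aren't you coming?", d))
print(replace_all(u"aren\u2019t", d))
```

With this, both u"aren't" and u"aren’t" are replaced, at the cost of the output always using the ASCII apostrophe.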