regex - Handle Unicode characters with Python regexes


I'm writing a simple application in which I want to replace certain words with other words. I'm running into problems with words that use single quotes, such as aren't, ain't, isn't.

I have a text file containing the following:

aren’t=ain’t
hello=hey

I parse the text file and create a dictionary out of it:

u'aren\u2019t' = u'ain\u2019t'
u'hello' = u'hey'

Then I try to replace the words in a given text:

import re

text = u"aren't"

def replace_all(text, dict):
    for i, k in dict.items():
        # replace whole words of i with k in lower-cased text, regex = \bstring\b
        text = re.sub(r"\b" + i + r"\b", k, text.lower())
    return text

The problem is that re.sub() doesn't match u'aren\u2019t' against u"aren't".

What can I do to my replace_all() function so that it matches both "hello" and "aren't" and replaces them with the appropriate text? Can I do something in Python so that my dictionary doesn't contain Unicode? Could I convert my text to use the Unicode character, or could I modify the regex to match the Unicode character as well as the other text?

I guess the problem is that you have:

text = u"aren't" 

instead of:

text = u"aren’t" 

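(Note the different apostrophes: the first is the ASCII apostrophe, the second is the curly U+2019 character that your dictionary contains.)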
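To see the difference for yourself, a quick check along these lines prints the code point and Unicode name of each character. The variable names are only illustrative; `unicodedata` is part of the standard library.

```python
# -*- coding: utf-8 -*-
import unicodedata

ascii_apostrophe = u"'"        # what most keyboards type
curly_apostrophe = u"\u2019"   # what the parsed text file contains

for ch in (ascii_apostrophe, curly_apostrophe):
    # Show the code point and the official Unicode character name.
    print(u"U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# The two spellings look alike but are different strings.
print(u"aren't" == u"aren\u2019t")
```

This should print U+0027 APOSTROPHE, then U+2019 RIGHT SINGLE QUOTATION MARK, then False, which is why the regex built from the dictionary key never matches the ASCII text.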

Here's your code, modified to make it work:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

d = {
    u'aren’t': u'ain’t',
    u'hello': u'hey'
    }
#text = u"aren't"
text = u"aren’t"


def replace_all(text, d):
    for i, k in d.items():
        # replace whole words of i with k in lower-cased text, regex = \bstring\b
        text = re.sub(r"\b" + i + r"\b", k, text.lower())
    return text

if __name__ == '__main__':
    newtext = replace_all(text, d)
    print(newtext)

output:

ain’t 
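If you can't control which apostrophe shows up in the input, one option is to normalize both the text and the dictionary before substituting. The following is only a sketch under that assumption: `normalize` and `APOSTROPHES` are hypothetical names, not something from your code, and it maps just the left and right single quotation marks (U+2018, U+2019) to the ASCII apostrophe. It also passes each key through `re.escape` so that any regex metacharacters in a word are matched literally.

```python
import re

# Hypothetical helper pattern: the two typographic single quotes.
APOSTROPHES = re.compile(u"[\u2018\u2019]")


def normalize(s):
    # Lower-case and map typographic apostrophes to the ASCII apostrophe.
    return APOSTROPHES.sub(u"'", s.lower())


def replace_all(text, d):
    text = normalize(text)
    for old, new in d.items():
        # Escape the key so it is treated as literal text inside the regex.
        pattern = r"\b" + re.escape(normalize(old)) + r"\b"
        text = re.sub(pattern, normalize(new), text)
    return text


d = {u"aren\u2019t": u"ain\u2019t", u"hello": u"hey"}
print(replace_all(u"Hello, aren't you coming?", d))
print(replace_all(u"Hello, aren\u2019t you coming?", d))
```

Both calls should give the same result, with an ASCII apostrophe in the output, because the replacement values are normalized too. If you would rather keep the curly apostrophe in the output, leave `new` unnormalized.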
