regex - Handle Unicode characters with Python regexes

- June 15, 2010

I'm writing a simple application in which I want to replace certain words with other words. I'm running into problems with words that use single quotes, such as aren't, ain't, and isn't.

I have a text file with the following:

    aren’t=ain’t
    hello=hey

I parse the text file and create a dictionary out of it:

    u'aren\u2019t' = u'ain\u2019t'
    u'hello' = u'hey'
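The parsing step is not shown in the question, so the following is only a minimal sketch of how such a file might be read into a dictionary of Unicode strings. The function names `parse_replacements` and `load_replacements`, the one-pair-per-line layout, and the UTF-8 encoding are all assumptions for illustration, not the asker's actual code.

```python
import io


def parse_replacements(lines):
    """Turn an iterable of 'key=value' lines into a dict.

    Hypothetical helper: assumes one pair per line, split on the first '='.
    """
    replacements = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        replacements[key.strip()] = value.strip()
    return replacements


def load_replacements(path, encoding="utf-8"):
    """Read the file as text (assumed UTF-8) so the keys are Unicode strings."""
    with io.open(path, encoding=encoding) as handle:
        return parse_replacements(handle)
```

Decoding the file on the way in is what produces keys such as `u'aren\u2019t'`, where the apostrophe is the typographic U+2019 character rather than the ASCII one.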

Then I try to replace the characters in a given text:

    import re

    text = u"aren't"

    def replace_all(text, dict):
        for i, k in dict.iteritems():
            # replace whole words matching i with k in the lower-cased text, regex = \bstring\b
            text = re.sub(r"\b" + i + r"\b", k, text.lower())
        return text

The problem is that re.sub() doesn't match u'aren\u2019t' with u"aren't".

What can I do so that my replace_all() function will match both "hello" and "aren't" and replace them with the appropriate text? Can I do something in Python so that my dictionary doesn't contain Unicode? Could I convert my text to use a Unicode character, or could I modify the regex to match the Unicode character as well as all the other text?

I guess your problem is:

text = u"aren't"

instead of:

text = u"aren’t"

(Note the different apostrophes?)
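A quick way to confirm that the two apostrophes really are different characters is to look up their Unicode names. This is a small illustrative check, not part of the original answer; the helper name `describe` is made up for the example.

```python
import unicodedata


def describe(char):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(char), unicodedata.name(char))


ascii_quote = describe(u"'")        # the character typed in u"aren't"
curly_quote = describe(u"\u2019")   # the character stored in the dictionary key

print(ascii_quote)
print(curly_quote)
```

The first is the plain ASCII apostrophe (U+0027) and the second is the right single quotation mark (U+2019), so a regex built from one will never match text containing the other.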

Here's your code modified to make it work:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import re

    d = {
        u'aren’t': u'ain’t',
        u'hello': u'hey'
        }
    #text = u"aren't"
    text = u"aren’t"


    def replace_all(text, d):
        for i, k in d.iteritems():
            # replace whole words matching i with k in the lower-cased text, regex = \bstring\b
            text = re.sub(r"\b" + i + r"\b", k, text.lower())
        return text

    if __name__ == '__main__':
        newtext = replace_all(text, d)
        print newtext

output:

ain’t
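If the input text cannot be guaranteed to use the same apostrophe as the dictionary, one option raised in the question is to make the regex accept either form. The sketch below is one possible way to do that and is not the answerer's code: the helper `key_to_pattern` and the `APOSTROPHES` constant are hypothetical, it uses `re.escape` so keys are treated literally, and it matches case-insensitively instead of lower-casing the whole text, which is a deliberate change from the original behaviour.

```python
import re

# ASCII apostrophe and right single quotation mark (assumed to be the
# only two variants that need to be treated as equivalent)
APOSTROPHES = u"'\u2019"


def key_to_pattern(key):
    """Build a whole-word regex in which any apostrophe matches either form."""
    parts = []
    for char in key:
        if char in APOSTROPHES:
            parts.append(u"['\u2019]")
        else:
            parts.append(re.escape(char))
    return re.compile(u"\\b" + u"".join(parts) + u"\\b",
                      re.UNICODE | re.IGNORECASE)


def replace_all(text, replacements):
    """Apply every key -> value replacement to text, one key at a time."""
    for key, value in replacements.items():
        # A function replacement keeps backslashes in value literal.
        text = key_to_pattern(key).sub(lambda match: value, text)
    return text


d = {u"aren\u2019t": u"ain\u2019t", u"hello": u"hey"}
result = replace_all(u"aren't", d)
```

With this approach both `u"aren't"` and `u"aren’t"` are rewritten to the dictionary's value, and the rest of the text keeps its original casing.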
