No explanation for weird behavior in re module!

Jason Orendorff jason at jorendorff.com
Mon Feb 11 00:16:08 EST 2002


synthespian writes:
> The problem is that I can't make Python read anything with
> non-ASCII character set.

import re
import codecs

pattern = re.compile(ur'^(der|die|das)\s+(\w+)', re.UNICODE)

f = codecs.open('article.txt', 'r', 'iso-8859-1')
lines = f.readlines()
f.close()

f = codecs.open('article.out.txt', 'w', 'iso-8859-1')
for line in lines:
    match = pattern.match(line)
    if match is None:
        # Line does not start with der/die/das; skip it.
        continue
    article = match.group(1)
    noun = match.group(2)
    f.write(u"article: %s ... noun: %s\n" % (article, noun))
f.close()


I've got Python 2.2, but I think it should work for you too.
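
In case it helps, here is a small sketch of what the re.UNICODE flag does to \w in the pattern above. The sample lines are made up for illustration (they are not from your file), and I've written the pattern with a plain u'' literal and doubled backslashes so it doesn't depend on the ur'' prefix.

```python
import re

# Same pattern as above, written without the ur'' prefix.
pattern = re.compile(u'^(der|die|das)\\s+(\\w+)', re.UNICODE)

# Hypothetical sample lines, already decoded to unicode.
# \xdf is sharp s and \xe4 is a-umlaut in ISO-8859-1.
samples = [u'die Stra\xdfe\n', u'das M\xe4dchen\n', u'kein Artikel\n']

results = []
for line in samples:
    match = pattern.match(line)
    if match is None:
        # No leading article on this line; skip it.
        continue
    results.append((match.group(1), match.group(2)))
```

With the flag set, \w+ should pick up the whole noun, non-ASCII letters included. As far as I know, without it \w only matches ASCII letters, digits and underscore, so the noun would be cut off at the first accented character.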

## Jason Orendorff    http://www.jorendorff.com/





