No explanation for weird behavior in re module!
Matthias Huening
mhuening at zedat.fu-berlin.de
Mon Feb 11 04:58:07 EST 2002
"Tim Peters" <tim.one at home.com> wrote in
news:mailman.1013386937.27383.python-list at python.org:
>
>> Other than the fact that 'Tür' has the 'ü' unicode character, I fail
>> to see any difference!
>
> Heh. Leaving this joy to someone else <wink>.
>
Okay, I'll try...
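If your string comes in as 'Latin-1' bytes, you will have to tell Python to
treat it as Unicode, i.e. decode it to a unicode object before matching.
On a plain byte string, \w (with no locale or Unicode flag set) only matches
ASCII letters, digits and the underscore, so it stops at the 'ü'. And when
you want to print the result afterwards, you'll have to encode it as
'Latin-1' again.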
---
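import re

# A byte string as it would arrive from a Latin-1 source.
txt = 'die Tür, Türen'
# Decode the bytes to a unicode object before matching.
txt = unicode(txt, 'latin-1')
# re.UNICODE makes \w follow the Unicode character database, so 'ü' counts.
pattern = re.compile(ur'(der|die|das)\s+(\w+)', re.UNICODE)
match = pattern.match(txt)
article = match.group(1)
noun = match.group(2)
# Encode back to Latin-1 for output.
print article.encode('latin-1')
print noun.encode('latin-1')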
---
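To see the difference side by side, here is a minimal sketch of the same
idea. It assumes a Python 3 interpreter, which is not what the code above was
written for: there, str is already Unicode and \w in a str pattern is
Unicode-aware without any flag. The names raw, noun_from_bytes and
noun_from_text are only for illustration.

```python
import re

# Hypothetical input: 'die Tür, Türen' as Latin-1 bytes ('\xfc' is 'ü').
raw = 'die T\xfcr, T\xfcren'.encode('latin-1')

# Bytes pattern: \w only covers ASCII letters, digits and underscore,
# so the second group ends in front of the 0xFC byte that encodes 'ü'.
byte_match = re.match(rb'(der|die|das)\s+(\w+)', raw)
noun_from_bytes = byte_match.group(2)

# Decode first, then match: in a str pattern \w is Unicode-aware.
text = raw.decode('latin-1')
text_match = re.match(r'(der|die|das)\s+(\w+)', text)
noun_from_text = text_match.group(2)

# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(noun_from_bytes))
print(ascii(noun_from_text))
```

The bytes match gives only the 'T', while the decoded match gives the whole
noun, which is probably the kind of difference the original question ran into.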
Hope this helps.
Matthias