No explanation for weird behavior in re module!

Matthias Huening mhuening at zedat.fu-berlin.de
Mon Feb 11 04:58:07 EST 2002


"Tim Peters" <tim.one at home.com> wrote in
news:mailman.1013386937.27383.python-list at python.org: 

> 
>> Other than the fact that 'Tür' has the 'ü' unicode character, I fail
>> to see any difference!
> 
> Heh.  Leaving this joy to someone else <wink>.
> 

Okay, I'll try...
If your string comes in as a 'Latin-1' byte string, you have to convert it 
to Unicode first and compile the pattern with re.UNICODE; otherwise \w will 
not match the 'ü'. And when you want to print the result afterwards, you 
have to encode it as 'Latin-1' again.

---
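# -*- coding: latin-1 -*-
# Python 2; assumes this file is saved as Latin-1.
import re

txt = 'die Tür, Türen'            # byte string in Latin-1
txt = unicode(txt, 'latin-1')     # decode to a Unicode string
# re.UNICODE makes \w match non-ASCII letters such as 'ü'
pattern = re.compile(ur'(der|die|das)\s+(\w+)', re.UNICODE)

match = pattern.match(txt)
article = match.group(1)
noun = match.group(2)

# encode back to Latin-1 for output
print article.encode('latin-1')
print noun.encode('latin-1')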
---
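
For anyone reading this with Python 3: the same idea can be sketched roughly 
as below. This is only an illustration, not part of the original example. It 
assumes the text arrives as Latin-1 encoded bytes; the names raw, 
bytes_pattern and bytes_noun are made up for the comparison. In Python 3, \w 
in a str pattern is Unicode-aware by default, so re.UNICODE is not needed 
there, while a bytes pattern only matches ASCII word characters.

```python
import re

# Assumption: the input arrives as Latin-1 encoded bytes, as in the post.
raw = 'die Tür, Türen'.encode('latin-1')

# Matching directly on bytes: \w is ASCII-only for bytes patterns,
# so the noun is cut off in front of the 'ü' byte (0xFC).
bytes_pattern = re.compile(rb'(der|die|das)\s+(\w+)')
bytes_noun = bytes_pattern.match(raw).group(2)

# Decode first, then match on str: \w now also matches 'ü'.
txt = raw.decode('latin-1')
pattern = re.compile(r'(der|die|das)\s+(\w+)')
match = pattern.match(txt)
article = match.group(1)
noun = match.group(2)

print(bytes_noun)
print(article, noun)
```

Decoding at the input boundary and matching on str keeps the pattern 
independent of the encoding the data arrived in.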

Hope this helps.
Matthias
