RE + UTF-8
cepl@surfbest.net
ceplma at gmail.com
Sat Sep 24 19:48:11 EDT 2005
I am working on an extension of the genericwiki.py plugin for PyBlosxom
and I have problems with UTF-8 and RE. Given the following wiki line,
the plugin cuts the URL off too early:
[http://en.wikipedia.org/wiki/Petr_Chelcický Petr Chelcický's]
work(s) into English.
and creates
[<a
href="http://en.wikipedia.org/wiki/Petr_Chel">http://en.wikipedia.org/wiki/Petr_Chel</a>cický
Petr Chelcický's]
The RE that genericwiki uses to parse this is:
# WikiName pattern used in your wiki
wikinamepattern = r'\b(([A-Z]\w+){2,})\b' # original
mailurlpattern = r'mailto\:[\"\-\_\.\w]+\@[\-\_\.\w]+\w'
newsurlpattern = r'news\:(?:\w+\.){1,}\w+'
fileurlpattern = r'(?:http|https|file|ftp):[/-_.\w-]+[\/\w][?&+=%\w/-_.#]*'
[...]
# Turn '[xxx:address label]' into labeled link
body = re.sub(r'\[(' +
              fileurlpattern + '|' +
              mailurlpattern + '|' +
              newsurlpattern + ')\s+(.+?)\]',
              r'<a href="\1">\2</a>', body, re.U)
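One possible explanation, sketched here under assumptions: in re.sub the fourth positional argument is count, so the re.U in the call above is taken as a limit on the number of replacements, not as a flag. The sketch below hands the flag to re.compile instead and runs the pattern on text that is already decoded to Unicode. It is written for a current Python 3, where str is Unicode, and the helper name link_labeled_urls is made up for illustration; it is not part of genericwiki.

```python
import re

# Patterns copied from the plugin excerpt above.
mailurlpattern = r'mailto\:[\"\-\_\.\w]+\@[\-\_\.\w]+\w'
newsurlpattern = r'news\:(?:\w+\.){1,}\w+'
fileurlpattern = r'(?:http|https|file|ftp):[/-_.\w-]+[\/\w][?&+=%\w/-_.#]*'

# Giving the flag to re.compile keeps it out of the count slot of re.sub.
labeled_link = re.compile(
    r'\[(' + fileurlpattern + '|' + mailurlpattern + '|' + newsurlpattern
    + r')\s+(.+?)\]',
    re.UNICODE,
)


def link_labeled_urls(body):
    """Turn '[xxx:address label]' into a labeled link (hypothetical helper)."""
    return labeled_link.sub(r'<a href="\1">\2</a>', body)


line = ("[http://en.wikipedia.org/wiki/Petr_Chelcický Petr Chelcický's] "
        "work(s) into English.")
print(link_labeled_urls(line))
```

With a Unicode-aware \w the final "cický" stays inside the href instead of being left behind after the closing </a>.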
I have tried to test RE and UTF-8 in Python generally and the results
are even more confusing (done with locale cs_CZ.UTF-8 in konsole):
>>> locale.getpreferredencoding()
'UTF-8'
>>> print re.sub("(\w*)","X","[Chelcický]",re.L)
X[X?Xý]
>>> print re.sub("(\w*)","X","[Chelcický]",re.UNICODE)
X[X?X?X]X
>>>
I would expect both print commands to give just a plain X, but
apparently Python doesn't understand that. What's the problem?
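A small sketch of what may be going on in that session, written for Python 3 and therefore only an approximation of the Python 2 behaviour: when pattern and subject are UTF-8 bytes, \w covers only ASCII letters, digits and the underscore, so the two bytes that encode ý split the word. On decoded text \w covers ý as well. The brackets are not word characters in either case, so a lone X would not come out even then. The sketch uses \w+ instead of \w* to leave the empty matches out of the picture.

```python
import re

word = "[Chelcický]"

# On UTF-8 bytes, \w matches only ASCII word characters, so the two bytes
# that encode "ý" are left over after the replacement.
as_bytes = re.sub(rb"\w+", b"X", word.encode("utf-8"))

# On decoded text, \w also matches "ý" and the whole word is replaced.
as_text = re.sub(r"\w+", "X", word, flags=re.UNICODE)

print(as_bytes)
print(as_text)
```

In Python 2 the closest equivalent would be to decode the byte string before matching, for instance by using a u'' literal for the subject.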
Thanks for any reply,
Matej