[Python-3000] Regular expressions, py3k and unicode
Antoine Pitrou
solipsis at pitrou.net
Sat Jun 28 22:45:31 CEST 2008
Hello,
Several posters (including a certain GvR) in the bug tracker (*) have been
baffled by an apparent bug where the re.IGNORECASE flag didn't imply
case-insensitivity for non-ASCII characters. It turns out that, although the
pattern was a string object and although Py3k is supposed to be
unicode-friendly, you still need to supply the re.UNICODE flag if you want the
re module to use unicode-aware case-insensitive matching.
Wouldn't it be more natural that, at least when the pattern is a str object
rather than a bytes object, the re.UNICODE flag be implied by default?
(*) http://bugs.python.org/issue2834
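To check the case-insensitivity issue on a given interpreter, a minimal sketch along these lines should do. The variable names are only illustrative, and the result of the second check is an assumption that depends on the Python version: with the behaviour described above it would not match, whereas an interpreter that implies re.UNICODE for str patterns would match.

```python
import re

# With re.UNICODE given explicitly, case-insensitive matching
# covers non-ASCII letters such as 'Á' / 'á'.
pat_unicode = re.compile('Á', re.IGNORECASE | re.UNICODE)
matched_with_flag = pat_unicode.match('á') is not None

# Without the flag, the outcome depends on whether the interpreter
# implies re.UNICODE for str patterns (version-dependent assumption).
pat_default = re.compile('Á', re.IGNORECASE)
matched_default = pat_default.match('á') is not None

print('with re.UNICODE:   ', matched_with_flag)
print('without re.UNICODE:', matched_default)
```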
Another question in the same vein: is it normal that we can match a bytes object
with a str pattern and vice versa?
import re
pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
pat.match('á'.encode('latin1'))
# gives <_sre.SRE_Match object at 0xb7c66c60>
pat = re.compile('Á'.encode('latin1'), re.IGNORECASE | re.UNICODE)
pat.match('á')
# gives <_sre.SRE_Match object at 0xb7c66c60>
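The two mixed-type cases above can be probed with a small helper like the one below. This is only a sketch: try_match is a hypothetical name, not part of the re module, and what it reports depends on the interpreter. The session above shows both mixed cases matching; I would expect recent Python 3 releases to reject them with an exception instead.

```python
import re

def try_match(pattern, subject, flags=0):
    """Hypothetical helper: report 'match', 'no match', or the
    name of the exception raised while compiling or matching."""
    try:
        result = re.compile(pattern, flags).match(subject)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return 'match' if result else 'no match'

flags = re.IGNORECASE | re.UNICODE

# str pattern against a bytes subject
str_on_bytes = try_match('Á', 'á'.encode('latin1'), flags)
# bytes pattern against a str subject
bytes_on_str = try_match('Á'.encode('latin1'), 'á', flags)
# str pattern against a str subject, for comparison
str_on_str = try_match('Á', 'á', flags)

print('str pattern, bytes subject:', str_on_bytes)
print('bytes pattern, str subject:', bytes_on_str)
print('str pattern, str subject:  ', str_on_str)
```

Reporting the exception name rather than letting it propagate keeps all three cases visible in one run.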
Regards
Antoine.