[Python-3000] Regular expressions, py3k and unicode

Sat Jun 28 22:45:31 CEST 2008

Hello,

Several posters (including a certain GvR) in the bug tracker (*) have been
baffled by an apparent bug where the re.IGNORECASE flag didn't imply
case-insensitivity for non-ASCII characters. It turns out that, although the
pattern was a string object and although Py3k is supposed to be
unicode-friendly, you still need to supply the re.UNICODE flag if you want the
re module to use unicode-aware case-insensitive matching.
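
To make the case concrete, here is a minimal sketch of the combination described above, with re.UNICODE passed explicitly alongside re.IGNORECASE. The variable names are only illustrative, and the ASCII comparison is added for contrast; it is not taken from the tracker issue.

```python
import re

# Explicit re.UNICODE: case-insensitive matching also covers non-ASCII letters.
pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
unicode_match = pat.match('á') is not None

# For plain ASCII letters, re.IGNORECASE alone is enough.
ascii_match = re.compile('A', re.IGNORECASE).match('a') is not None

print(unicode_match, ascii_match)
```

What happens to the first pattern when re.UNICODE is left out depends on the interpreter version, which is exactly the point in question.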

Wouldn't it be more natural for the re.UNICODE flag to be implied by default,
at least when the pattern is a str object rather than a bytes object?

(*) http://bugs.python.org/issue2834

Another question in the same vein: is it expected that we can match a bytes
object with a str pattern and vice versa?

 import re

 pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
 pat.match('á'.encode('latin1'))
 # gives <_sre.SRE_Match object at 0xb7c66c60>

 pat = re.compile('Á'.encode('latin1'), re.IGNORECASE | re.UNICODE)
 pat.match('á')
 # gives <_sre.SRE_Match object at 0xb7c66c60>
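
For comparison, a sketch of the type-consistent alternative: decode the bytes first so that the pattern and the subject are both str. It assumes the data is latin1-encoded, as in the snippets above; the names data and text are only illustrative.

```python
import re

pat = re.compile('Á', re.IGNORECASE | re.UNICODE)

# Bytes as they might arrive from a file or socket (latin1 assumed).
data = 'á'.encode('latin1')

# Decode explicitly so that a str pattern is matched against a str subject.
text = data.decode('latin1')
match = pat.match(text)

print(match is not None)
```

This keeps the encoding decision visible in the calling code instead of leaving it to the re module.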

Regards

Antoine.