UNICODE mode for regular expressions - time to change the default?

John Nagle nagle at animats.com
Thu Apr 5 15:50:45 EDT 2007


   Regular expressions are compiled in ASCII mode unless
Unicode mode is specified to "re.compile".  The difference is that regular
expressions in ASCII mode don't recognize things like
Unicode whitespace, even when applied to Unicode strings.
For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
This can create some strange bugs.
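
   A minimal sketch of the difference, not taken from any real code.
So that the result doesn't depend on which default the interpreter
uses, the ASCII-mode meaning of "\s" is written out as an explicit
character class (an assumption for illustration), and the Unicode
version passes re.UNICODE explicitly.  The names "ascii_ws",
"unicode_ws" and "text" are made up for this example.

```python
import re

# What "\s" means in ASCII mode, spelled out explicitly so the outcome
# does not depend on the interpreter's default flags.
ascii_ws = re.compile(r"[ \t\n\r\f\v]+")

# The same pattern with Unicode mode requested explicitly.
unicode_ws = re.compile(r"\s+", re.UNICODE)

# Two words separated by U+00A0 NO-BREAK SPACE, as you might get from
# HTML text that contained "&nbsp;".
text = u"foo\u00a0bar"

ascii_parts = ascii_ws.split(text)      # no split: U+00A0 is not matched
unicode_parts = unicode_ws.split(text)  # split into the two words

print(ascii_parts)
print(unicode_parts)
```

   The ASCII-style pattern leaves the string in one piece, while the
Unicode-mode pattern splits it into "foo" and "bar".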

   Is the current default good?  Or is it time to compile all regular
expressions in Unicode mode by default?  It shouldn't hurt processing of
ASCII strings to do that.  The current setup is really a legacy of when
most things in Python didn't work in Unicode mode, and you didn't want to
introduce Unicode unnecessarily.   It's another one of those obscure
Unicode "gotchas" that really should go away.

					John Nagle


