UNICODE mode for regular expressions - time to change the default?
Steve Holden
steve at holdenweb.com
Thu Apr 5 17:44:01 EDT 2007
John Nagle wrote:
> Regular expressions are compiled in ASCII mode unless
> Unicode mode is specified to "re.compile". The difference is that regular
> expressions in ASCII mode don't recognize things like
> Unicode whitespace, even when applied to Unicode strings.
> For example, Unicode character 0x00A0 is a "NO-BREAK SPACE", which is
> a form of whitespace. It's the Unicode equivalent of HTML's "&nbsp;".
> This can create some strange bugs.
>
> Is the current default good? Or is it time to compile all regular
> expressions in Unicode mode by default? It shouldn't hurt processing of
> ASCII strings to do that. The current setup is really a legacy of when
> most things in Python didn't work in Unicode mode, and you didn't want to
> introduce Unicode unnecessarily. It's another one of those obscure
> Unicode "gotchas" that really should go away.
>
> John Nagle
Personally I'd leave it to go away with Python 3.0, when all strings
will be Unicode.
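
For illustration, here is a minimal sketch of the difference, assuming a
Python 3 interpreter as it eventually shipped, where str patterns are
Unicode-aware by default. The variable names are made up for the example,
and re.ASCII is used only to reproduce the ASCII-only matching John
describes; under Python 2 the equivalent switch would be passing re.UNICODE
to re.compile.

```python
import re

# U+00A0 NO-BREAK SPACE sits between the two words.
text = "hello\u00a0world"

# Python 3: a str pattern is Unicode-aware unless told otherwise,
# so \s also matches Unicode whitespace such as U+00A0.
unicode_ws = re.compile(r"\s")

# re.ASCII restricts \s to [ \t\n\r\f\v], which mimics the old
# ASCII-mode default discussed in this thread.
ascii_ws = re.compile(r"\s", re.ASCII)

unicode_split = unicode_ws.split(text)
ascii_split = ascii_ws.split(text)

print(unicode_split)  # ['hello', 'world']
print(ascii_split)    # ['hello\xa0world']
```

The ASCII-mode pattern silently leaves the string in one piece, which is
the kind of strange bug being described.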
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com