[Python-3000] Review needed: regular expressions and unicode

Antoine Pitrou solipsis at pitrou.net
Mon Jul 28 18:51:30 CEST 2008


Hi,

I've posted my final patch to adapt the re module to the py3k standards of
bytes/unicode separation.

Here is a short summary of the changes:
- mixing bytes and str among the pattern, the search string and the
replacement string raises a TypeError
- re.UNICODE and (?u) become almost no-ops: they are the default for unicode
strings, and forbidden for bytes strings
- re.ASCII and (?a) are introduced: for unicode strings, they request
old-style ASCII matching (for example, \d will only match [0-9] rather than all
ranges of unicode decimal digits); for bytes strings, this is the only
available behaviour
- mixing re.UNICODE and re.ASCII is forbidden
- the stdlib is adapted so that (hopefully) all places which rely on ASCII
matching of unicode patterns don't break
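
The rules summarised above can be checked with a short sketch along these lines. It is not taken from the patch: it assumes a Python 3 interpreter whose re module follows the rules listed, the raises() helper and the variable names are invented for illustration, and the use of ValueError for the forbidden flag combinations reflects current CPython rather than anything stated in this message.

```python
import re

# U+0663 ARABIC-INDIC DIGIT THREE: a Unicode decimal digit outside [0-9].
ARABIC_INDIC_THREE = "\u0663"


def raises(exc_type, func, *args):
    """Hypothetical helper: True if func(*args) raises exc_type."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


# A bytes pattern cannot be applied to a str search string.
mixed_types_rejected = raises(TypeError, re.search, b"a", "abc")

# For str patterns, Unicode matching is the default, so re.UNICODE adds nothing.
unicode_is_default = re.match(r"\d", ARABIC_INDIC_THREE) is not None
unicode_flag_is_noop = re.match(r"\d", ARABIC_INDIC_THREE, re.UNICODE) is not None

# re.ASCII and (?a) restrict \d to [0-9] for str patterns.
ascii_flag_restricts = re.match(r"\d", ARABIC_INDIC_THREE, re.ASCII) is None
ascii_inline_restricts = re.match(r"(?a)\d", ARABIC_INDIC_THREE) is None

# For bytes patterns, ASCII matching is the only behaviour:
# byte 0xE9 is not treated as a word character.
bytes_are_ascii_only = re.match(rb"\w", b"\xe9") is None

# Assumption: current CPython reports the forbidden combinations as ValueError.
unicode_on_bytes_rejected = raises(ValueError, re.compile, rb"\d", re.UNICODE)
ascii_plus_unicode_rejected = raises(
    ValueError, re.compile, r"\d", re.ASCII | re.UNICODE
)
```

Each variable should be True on an interpreter that implements the rules as described.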

From the above description you might infer that we should deprecate re.UNICODE
and (?u). That is a possible decision, but I think we should leave it to a later
patch. The status of re.LOCALE is yet another issue.

The issue is at http://bugs.python.org/issue2834
and the patch can be reviewed at http://codereview.appspot.com/2439

Thanks

Antoine.
