[issue12749] lib re cannot match non-BMP ranges (all versions, all builds)

Ezio Melotti report at bugs.python.org
Sun Aug 14 21:56:22 CEST 2011


Ezio Melotti <ezio.melotti at gmail.com> added the comment:

> Perhaps I am doing something wrong?

That's weird, I tried on a wide Python 2.6.6 too and it works even there.  Maybe a bug that got fixed between 2.6.2 and 2.6.6?  Or maybe something else?

> Is there a way to easily have these co-exist on the same system?

Here I have different HG clones, one for each release (2.7, 3.2, 3.3), and I run ./configure (--with-wide-unicode) && make -j2.  Then I just run ./python from there without installing it in the system.
You might do the same or look at "make altinstall".  If you run "make install" it will install it as the default Python, so that's probably not what you want.  Another option is to use virtualenv.

> The Python2 version is *much* noisier.  

Yes, Python 3 fixed many of these things and it's a much "cleaner" language.

> (1) You have keep remembering to u"..." everything because neither
>        # -*- coding: UTF-8 -*-
>    nor even
>        from __future__ import unicode_literals
>    suffices.  

Before Unicode support was added, Python only had plain (byte) strings; when Unicode strings were introduced, the u"..." syntax was chosen to distinguish them.  On Python 3, "..." is a Unicode string, whereas b"..." is used for bytes.
"# -*- coding: UTF-8 -*-" only declares the encoding used to save the source file, and doesn't affect anything else.  UTF-8 is the default on Python 3, so the declaration isn't necessary there anymore (the default is ASCII on Python 2).
"from __future__ import unicode_literals" allows you to use "..." and b"..." instead of u"..." and "..." on Python 2.  In my example I used u"..." to be explicit and because I was running from the terminal without using unicode_literals.

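To make the literal types concrete, here is a minimal sketch, assuming Python 3 (where no u"..." prefix is needed).  The variable names are only for illustration:

```python
# Python 3: plain literals are Unicode strings (str); b"..." literals are bytes.
text = "caf\u00e9"            # str with 4 code points ("café")
data = text.encode("utf-8")   # bytes; the e-acute takes two bytes in UTF-8

print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Decoding with the same codec gives the original string back.
print(data.decode("utf-8") == text)
```

On Python 2 the same literals would need u"..." (or the unicode_literals import) to get a Unicode string.
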
> (2) You have to manually encode every string, which is utterly
> bizarre to me.

re works with both bytes and Unicode strings, on both Python 2 and Python 3.  I was encoding them to see if it was able to handle the range when it was in a UTF-8 encoded string, rather than a Unicode string.  Even if it didn't fail with an exception, it failed with a wrong result (and that's even worse).
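
A rough sketch of the kind of check I mean, assuming Python 3.3 or later (where str is indexed by code point regardless of build).  The sample text is made up for illustration:

```python
import re

# Character class covering every non-BMP code point.
pattern = "[\U00010000-\U0010FFFF]"
# One non-BMP character (U+12345) and one BMP character (U+2603 SNOWMAN).
text = "a\U00012345b\u2603"

# On Unicode strings the range is a range of code points.
as_str = re.findall(pattern, text)

# After UTF-8 encoding, re sees a class of individual byte values, so it
# matches single bytes, including the bytes of the BMP snowman.
as_bytes = re.findall(pattern.encode("utf-8"), text.encode("utf-8"))

print(as_str)
print(as_bytes)
```

No exception is raised in the bytes case; the result is just not what the pattern was meant to express.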

> (3) Plus you then have turn around and tell re, "Hey by the way, you
> know those Unicode strings I just passed you?  Those are Unicode 
> strings, you know."
> Like it couldn't tell that already by realizing it got Unicode not
> byte strings.  So weird.

The re.UNICODE flag affects the behavior of e.g. \w and \d; it's not telling re that we are passing Unicode strings rather than bytes.  By default on Python 2 those only match ASCII letters and digits.  This is also fixed on Python 3, where by default they match non-ASCII letters and digits (unless you pass re.ASCII).
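
For example, a small sketch assuming Python 3 and a made-up sample string:

```python
import re

# "café" followed by U+0663 ARABIC-INDIC DIGIT THREE.
text = "caf\u00e9 \u0663"

# Default on Python 3 str patterns: \w and \d use Unicode semantics.
words_unicode = re.findall(r"\w+", text)
digits_unicode = re.findall(r"\d", text)

# With re.ASCII they only match [a-zA-Z0-9_] and [0-9].
words_ascii = re.findall(r"\w+", text, re.ASCII)
digits_ascii = re.findall(r"\d", text, re.ASCII)

print(words_unicode, digits_unicode)
print(words_ascii, digits_ascii)
```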

> *  Requiring explicitly coded callouts to a library are at best 
> tedious and annoying.  ICU4J's UCharacter and JDK7's Character 
> classes both have
>         String  getName(int codePoint)

FWIW we have unicodedata.lookup('SNOWMAN')
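
A short sketch of both directions, using only the stdlib unicodedata module (unicodedata.name() is the closer counterpart of getName, going from character to name):

```python
import unicodedata

# Name -> character.
snowman = unicodedata.lookup("SNOWMAN")
print(hex(ord(snowman)))

# Character -> name.
print(unicodedata.name(snowman))

# name() raises ValueError for unnamed code points unless a default is given.
# U+E000 is a private-use code point, so it has no name.
fallback = unicodedata.name("\ue000", "<no name>")
print(fallback)
```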

> One question: If one really must use code point numbers in strings, 
> does Python have any clean uniform way to enter them besides having
> to choose the clunky \uHHHH vs \UHHHHHHHH thing?

Nope.  OTOH it doesn't happen too often that you need those (especially the \U version), so I'm not sure that it's worth adding something else just to save a few chars (also \x{12345} is only one char less than \U00012345).
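
For reference, a sketch of the existing spellings, assuming Python 3.  chr() and \N{...} are not a uniform escape for code point numbers, but they avoid counting hex digits:

```python
# Fixed-width escapes: \u takes exactly 4 hex digits, \U exactly 8.
bmp = "\u2603"
astral = "\U00012345"

# chr() accepts any code point number without padding.
same_astral = chr(0x12345)

# \N{...} uses the character name instead of the number.
named = "\N{SNOWMAN}"

print(astral == same_astral)
print(bmp == named)
```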

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue12749>
_______________________________________

