[Tutor] why is unichr(sys.maxunicode) blank?

eryksun eryksun at gmail.com
Sat May 18 21:15:07 CEST 2013


On Sat, May 18, 2013 at 12:45 PM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>
> It seems that the result of str.isalpha() and str.isdigit() *might* be different depending
> on the setting of locale.LC_CTYPE.

Yes, str() in 2.x uses the locale predicates from <ctype.h>:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/ctype.h.html

However, 2.x bytearray uses the bytes_methods from 3.x, which use pyctype:

2.7.5 source:
http://hg.python.org/cpython/file/ab05e7dd2788/Include/pyctype.h
http://hg.python.org/cpython/file/ab05e7dd2788/Python/pyctype.c
http://hg.python.org/cpython/file/ab05e7dd2788/Include/bytes_methods.h
http://hg.python.org/cpython/file/ab05e7dd2788/Objects/stringlib/ctype.h

Note that the table in pyctype.c is only defined for ASCII.
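A quick way to see the ASCII-only behavior, as a rough sketch (the
sample names and the `results` dict are just mine for illustration).
A byte above 0x7F is never alphabetic or a digit for bytearray,
whatever LC_CTYPE is set to:

```python
# bytearray predicates use the ASCII-only pyctype table, so no locale
# lookup is involved. The sample labels below are illustrative only.
samples = {
    "ascii letter": bytearray(b"a"),
    "ascii digit": bytearray(b"7"),
    "latin-1 e-acute": bytearray(b"\xe9"),  # 0xE9 is 'e' with acute in ISO-8859-1
}

# Map each label to (isalpha, isdigit) for its one-byte buffer.
results = {name: (buf.isalpha(), buf.isdigit()) for name, buf in samples.items()}

for name in sorted(results):
    print(name, results[name])
```

The same byte in a 2.x str could come back as alphabetic under a
Latin-1 locale, which is the difference being described above.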

> It is pretty sick that all these things can be adjusted separately (what is the use of having:
> danish collation, russian case conversion, english decimal sign, japanese codepage ;-)

Here's a non-sick example. A system in the US might customize
LC_MEASUREMENT to use SI units and LC_TIME to have Monday as the first
day of the week.

> That one is the clearest IMHO. Oh no, now I see the possible impact on regexes. The
> meaning of e.g. "\s+" might change depending on the locale.LC_CTYPE setting!!

The re module has the re.L flag to enable limited locale support. It
only affects the alphanumeric category (\w, \W) and word boundaries
(\b, \B), so "\s+" is not affected. You're probably better off using
re.U and the Unicode database.
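Here's a small sketch of what re.U buys you, assuming a unicode
pattern and subject (the sample text and variable names are made up
for the example). It compares that with a plain byte pattern run over
the UTF-8 encoding of the same text, which only knows ASCII:

```python
import re

text = u"na\u00efve caf\u00e9"  # "naive cafe" with i-diaeresis and e-acute

# Unicode pattern with re.U: \w consults the Unicode database.
unicode_words = re.findall(u"\\w+", text, re.U)

# Byte pattern without re.L: \w is ASCII-only, so the non-ASCII
# UTF-8 bytes split the words apart.
byte_words = re.findall(br"\w+", text.encode("utf-8"))

# With re.U, \s also matches Unicode whitespace such as NO-BREAK SPACE.
nbsp_is_space = re.match(u"\\s", u"\u00a0", re.U) is not None

print(unicode_words)
print(byte_words)
print(nbsp_is_space)
```

Neither result depends on the process locale, which is the main
reason to prefer re.U over re.L.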

>> Narrow builds create UTF-16 surrogate pairs from \U literals, but
>> these aren't treated as an atomic unit for slicing, iteration, or
>> string length.
>
> That is a nice way of putting it. So if you slice a multibyte char "mb", mb[0] will return the
> first byte? That is annoying.

It's 2 bytes, not one: mb[0] is the lead surrogate, a 16-bit code
unit. If you use a non-BMP \U escape on a narrow build, it creates a
surrogate pair. Each surrogate has a 10-bit range in a 2-byte code.
The lead surrogate is in the range 0xD800-0xDBFF, and the trail is in
the range 0xDC00-0xDFFF.
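To make the arithmetic concrete, here's a minimal sketch of how a
non-BMP code point maps to a pair. The helper name to_surrogates is
mine, not something from the stdlib. It subtracts 0x10000, leaving a
20-bit value, then puts the high 10 bits in the lead and the low 10
bits in the trail. The result is cross-checked against the utf-16-be
codec:

```python
import struct

def to_surrogates(code_point):
    """Split a non-BMP code point into (lead, trail) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point")
    offset = code_point - 0x10000        # 20-bit value
    lead = 0xD800 + (offset >> 10)       # high 10 bits
    trail = 0xDC00 + (offset & 0x3FF)    # low 10 bits
    return lead, trail

lead, trail = to_surrogates(0x1F600)

# Cross-check: the codec emits the same two 16-bit code units.
encoded = u"\U0001F600".encode("utf-16-be")
codec_pair = struct.unpack(">2H", encoded)

print(hex(lead), hex(trail))
print(len(encoded))
```

On a narrow build, len() of that one-character literal is 2, and
indexing it gives you one surrogate at a time.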

