[Tutor] why is unichr(sys.maxunicode) blank?

Albert-Jan Roskam fomcl at yahoo.com
Sat May 18 18:45:32 CEST 2013


> 
>>  East Asian languages. But later on Joel Spolsky's "standard" page about unicode
>>  I read that it goes to 6 bytes. That's what I implied when I mentioned "utf8".
> 
> Each surrogate in a UTF-16 surrogate pair is 10 bits, for a total of
> 20-bits. Thus UTF-16 sets the upper bound on the number of code points
> at 2**20 + 2**16 (BMP). UTF-8 only needs 4 bytes for this number of
> codes.
> 
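A quick check of that arithmetic, as a minimal sketch. It assumes Python 3.3 or later (the thread itself is about Python 2, where chr() would be unichr()); the variable names are only illustrative.

```python
import sys

bmp = 2**16             # code points in the Basic Multilingual Plane
supplementary = 2**20   # code points reachable through surrogate pairs
total = bmp + supplementary

# The highest code point is one less than the total count.
highest = chr(total - 1)

# Number of bytes UTF-8 needs for the highest code point.
utf8_len = len(highest.encode("utf-8"))

print(hex(total), hex(sys.maxunicode), utf8_len)
```

On such a build sys.maxunicode equals total - 1, so four UTF-8 bytes cover the whole range.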
>>  A certain locale implies a certain codepage (on Windows), but where does the locale
>>  category LC_CTYPE fit in this story?
> 
> LC_CTYPE is the locale category that classifies characters. In Debian
> Linux, the English-language locales copy LC_CTYPE from the i18n
> (internationalization) locale:
 
Thanks for the links. Without examples it remains pretty abstract, but I think I know what is meant by this locale category now: "The LC_CTYPE category shall define character classification, case conversion, and other character attributes." So if you switch from one locale to another, certain attributes of a character set might change. A switch from locale A to locale B might affect the attribute "casing", so the mapping from lowercase to uppercase *might* differ by locale. In stupid country X, "a".upper() may return "B".

It seems that the result of str.isalpha() and str.isdigit() *might* be different depending on the setting of locale.LC_CTYPE.
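A hedged sketch of how one could poke at this. It assumes Python 3, which behaves differently from the Python 2 byte strings discussed in this thread: there, str methods consult the Unicode database and bytes methods are ASCII-only, so neither follows LC_CTYPE. Only the "C" locale is used because it is the one locale that should be available everywhere.

```python
import locale

# Query the current LC_CTYPE setting without changing it.
current = locale.setlocale(locale.LC_CTYPE, None)

# Switch only the LC_CTYPE category, leaving collation, numeric, etc. alone.
locale.setlocale(locale.LC_CTYPE, "C")

# Python 3 str: classification comes from the Unicode database.
text_is_alpha = "\u00e9".isalpha()   # LATIN SMALL LETTER E WITH ACUTE

# Python 3 bytes: classification is ASCII-only.
bytes_is_alpha = b"\xe9".isalpha()

# Restore the original setting.
locale.setlocale(locale.LC_CTYPE, current)

print(text_is_alpha, bytes_is_alpha)
```

In Python 2, the second check on a byte string is the one that can change with the locale.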

It is pretty sick that all these things can be adjusted separately (what is the use of having Danish collation, Russian case conversion, an English decimal sign and a Japanese codepage? ;-)

 
> The i18n locale is defined by the ISO/IEC technical report 14652, as
> an instance of an upward compatible extension to the POSIX locale
> specification called the FDCC-set (i.e. Set of Formal Definitions of
> Cultural Conventions). Here it is in all its glory, if you like
> reading technical reports:
> 
> http://www.open-std.org/jtc1/sc22/wg20/docs/n972-14652ft.pdf

> If that's not enough, here's the POSIX 1003.1 locale spec:
> 
> short: http://goo.gl/aOJUx
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html


That one is the clearest IMHO. Oh no, now I see the possible impact on regexes. The meaning of e.g. "\s+" might change depending on the locale.LC_CTYPE setting!!
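To make the regex concern concrete, here is a small sketch. It assumes Python 3, where the re.LOCALE flag applies only to bytes patterns and str patterns are Unicode-aware by default; the sample string is made up for illustration.

```python
import re

text = "a\u00a0b"   # two letters separated by a NO-BREAK SPACE

# str pattern, default flags: \s is Unicode-aware and matches U+00A0.
unicode_match = re.search(r"\s+", text)

# re.ASCII restricts \s to [ \t\n\r\f\v], so U+00A0 no longer matches.
ascii_match = re.search(r"\s+", text, re.ASCII)

# bytes pattern without re.LOCALE: ASCII-only, so byte 0xA0 does not match.
bytes_match = re.search(rb"\s+", b"a\xa0b")

print(unicode_match is not None, ascii_match is None, bytes_match is None)
```

Whether 0xA0 counts as whitespace for a bytes pattern compiled with re.LOCALE would then depend on the active LC_CTYPE setting.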


>>  Isn't UCS-2 the internal unicode encoding for CPython (narrow builds)?
> 
> Narrow builds create UTF-16 surrogate pairs from \U literals, but
> these aren't treated as an atomic unit for slicing, iteration, or
> string length.

That is a nice way of putting it. So if you slice a multibyte char "mb", mb[0] will return the first byte? That is annoying.
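For what it's worth, indexing on a narrow build yields a 16-bit UTF-16 code unit (one surrogate), not a byte. Narrow builds no longer exist as of Python 3.3, so the sketch below only simulates the situation by splitting a non-BMP character into its UTF-16 code units. The character U+1F600 is an arbitrary example chosen for illustration.

```python
import struct

ch = "\U0001F600"   # illustrative non-BMP character (an assumption)

# Encode as UTF-16 (little-endian, no BOM) and read back the 16-bit code units.
data = ch.encode("utf-16-le")
units = struct.unpack("<%dH" % (len(data) // 2), data)

# Rebuild the pair by hand: each surrogate carries 10 payload bits.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

print([hex(u) for u in units], len(units), len(ch))
```

A narrow build would report a length of 2 for this string, and index 0 would be the high surrogate on its own.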


