[Python-Dev] Reject characters bigger than U+10FFFF and Solaris issues

Victor Stinner victor.stinner at haypocalc.com
Thu Dec 8 02:43:40 CET 2011


I would like to forbid the creation of a Unicode string containing characters 
outside the range [U+0000; U+10FFFF]. The check is already present in some 
places (e.g. the builtin chr() function), but not everywhere. The last 
important function lacking it is PyUnicode_FromWideChar, the function used to 
decode text from the OS.
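
To make the intended check concrete, here is a minimal Python sketch of the 
range test. check_code_points() is a hypothetical helper written for this 
message, not an existing API, and it is not the C code of 
PyUnicode_FromWideChar; it only mirrors what chr() already enforces.

```python
MAX_UNICODE = 0x10FFFF  # upper bound of the Unicode code space

def check_code_points(code_points):
    """Build a str from integers, rejecting values outside [0, 0x10FFFF].

    Hypothetical helper for illustration only.
    """
    for cp in code_points:
        if not 0 <= cp <= MAX_UNICODE:
            raise ValueError(
                "character U+%X is not in range [U+0000; U+10FFFF]" % cp)
    return "".join(map(chr, code_points))
```

With this helper, check_code_points([0x41, 0x10FFFF]) succeeds, while a value 
such as 0x30000020 raises ValueError.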

The problem is that test_locale fails on Solaris with such checks. I would 
like to know how to handle the Solaris issues. One possible solution is not to 
handle them at all: just raise exceptions and skip the failing tests on 
Solaris ;-) Another solution is to modify locale.strxfrm() on all platforms to 
return a list of int instead of a str. The type of the result is not really 
important; we just have to be able to compare two results (equal, greater, 
less than or equal, etc.). Is there another solution?
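
A small sketch of why a list of int would be enough for comparisons. The two 
keys below are made-up values in the range reported on Solaris, not the output 
of a real wcsxfrm() call, and the variable names are only for illustration.

```python
# Hypothetical transformed keys; the values are invented for this example.
key_a = [0x1010163, 0x1010101, 0x1010103]
key_b = [0x1010164, 0x1010101, 0x1010103]

# Lists of int compare element by element, so ordering still works.
a_sorts_first = key_a < key_b
ordered = sorted([key_b, key_a])

# chr() refuses these values, so they could not be stored in a str
# once the range check is applied everywhere.
try:
    chr(key_a[0])
    fits_in_str = True
except ValueError:
    fits_in_str = False
```

The keys stay usable with sorted() and the comparison operators; only code 
that relies on the result being a str would need to change.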


The two Solaris issues:

 - in the hu_HU locale, localeconv() returns U+30000020 for the thousands 
separator
 - locale.strxfrm() calls wcsxfrm(), which returns characters in the range 
[0x1000000; 0x1FFFFFF]

For localeconv(), the value is the b'\xA0' byte string decoded from an 
encoding looking like ISO-8859-?? (b'\xA0' is not decodable from UTF-8). It 
looks like a bug in the decoder. It also looks like OpenIndiana doesn't use 
ISO-8859 locales anymore, only UTF-8 locales (which is much better!). I'm 
unable to reproduce the issue on my OpenIndiana VM.
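
For comparison, this is how Python's own codecs treat that byte. ISO-8859-2 is 
only an assumption here (the exact encoding used by the Solaris locale is not 
known), and the variable names are illustrative.

```python
raw = b"\xA0"

# A lone 0xA0 byte is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumed encoding: ISO-8859-2, where 0xA0 is the no-break space.
text = raw.decode("iso8859-2")
code_point = ord(text)
```

Under that assumption a correct decoder yields U+00A0, which is well inside 
[U+0000; U+10FFFF], unlike the U+30000020 seen on Solaris.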

For wcsxfrm(), I'm not sure of the range. Example of a result: {0x1010163, 
0x1010101, 0x1010103, 0x1010101, 0x1010103, 0x1010101, 0x1010101}. It looks 
like wcsxfrm() takes the result of strxfrm(), groups the bytes 3 by 3 and adds 
0x1000000 to each group. Example of strxfrm() output for the same input: 
{0x01, 0x01, 0x63, 0x01, 0x01, 0x01, ...}.
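
The guess above can be checked against the visible part of the example. 
group_bytes() is a hypothetical helper that implements my reading of the 
pattern (big-endian groups of 3 bytes plus 0x1000000); it is not the Solaris 
implementation, and how a trailing incomplete group is handled is unknown, so 
the sketch simply drops it.

```python
def group_bytes(data, offset=0x1000000):
    """Pack bytes three at a time (big-endian) and add offset to each group.

    Hypothetical reconstruction; an incomplete trailing group is ignored.
    """
    groups = []
    usable = len(data) - len(data) % 3
    for i in range(0, usable, 3):
        value = int.from_bytes(data[i:i + 3], "big")
        groups.append(value + offset)
    return groups

# Only the first six strxfrm() bytes are known from the example.
strxfrm_prefix = bytes([0x01, 0x01, 0x63, 0x01, 0x01, 0x01])
wcsxfrm_prefix = group_bytes(strxfrm_prefix)
```

The two groups computed from the known bytes are 0x1010163 and 0x1010101, 
which match the first two values of the wcsxfrm() result.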

See http://bugs.python.org/issue13441 for more information.

