[Python-Dev] Reject characters bigger than U+10FFFF and Solaris issues

Thu Dec 8 02:43:40 CET 2011

Hi,

I would like to forbid the creation of a Unicode string containing characters 
outside the range [U+0000; U+10FFFF]. The check is already present in some 
places (e.g. the builtin chr() function), but not everywhere. The last 
important function is PyUnicode_FromWideChar, the function used to decode 
text from the OS.
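
To make the intended check concrete, here is a rough Python sketch of the 
range test. It is not the C code of PyUnicode_FromWideChar; the helper name 
decode_wide_chars and the error message are made up for illustration.

```python
MAX_UNICODE = 0x10FFFF  # largest valid code point


def decode_wide_chars(values):
    """Hypothetical helper: build a str from a sequence of wchar_t values.

    Raises ValueError if any value is outside [U+0000; U+10FFFF], which is
    the behaviour proposed in this message.
    """
    for value in values:
        if not 0 <= value <= MAX_UNICODE:
            raise ValueError(
                "character 0x%X is not in range [U+0000; U+10FFFF]" % value
            )
    return "".join(chr(value) for value in values)


print(decode_wide_chars([0x48, 0x69]))  # prints "Hi"

try:
    # Value reported by localeconv() on Solaris in the hu_HU locale.
    decode_wide_chars([0x30000020])
except ValueError as exc:
    print("rejected:", exc)
```

chr() already behaves this way: chr(0x110000) raises ValueError.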

The problem is that test_locale fails on Solaris with such checks. I would 
like to know how to handle these Solaris issues. One possible solution is not 
to work around them at all: just raise exceptions and skip the failing tests 
on Solaris ;-) Another solution is to modify locale.strxfrm() on all 
platforms to return a list of int instead of a str. The type of the result is 
not really important, we just have to be able to compare two results (equal, 
greater, less than or equal, etc.). Any other ideas?
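
A small sketch of why a list of int would be enough: Python compares lists 
element by element, so such keys can be compared and used as sort keys even 
when the values are larger than 0x10FFFF. The key values below are invented 
for the example, they are not real wcsxfrm() output.

```python
# Hypothetical transformed keys, as a list-returning strxfrm() might give.
key_a = [0x1010161, 0x1010101]
key_b = [0x1010163, 0x1010101]

# Lists of int support every comparison operator.
print(key_a == key_b)  # False
print(key_a < key_b)   # True
print(key_a >= key_b)  # False

# They also work directly as sort keys.
keys = {"b": key_b, "a": key_a}
print(sorted(keys, key=keys.get))  # ['a', 'b']
```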

--

The two Solaris issues:

 - in the hu_HU locale, localeconv() returns U+30000020 for the thousands 
separator 
 - locale.strxfrm() calls wcsxfrm() which returns characters in the range 
[0x1000000; 0x1FFFFFF]

For localeconv(), it is the b'\xA0' byte string decoded from an encoding 
looking like ISO-8859-?? (b'\xA0' is not decodable from UTF-8). It looks like 
a bug in the decoder. It also looks like OpenIndiana doesn't use ISO-8859 
locales anymore, only UTF-8 locales (which is much better!). I'm unable to 
reproduce the issue on my OpenIndiana VM.
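
For reference, this is how the same byte behaves with Python's own codecs. 
The exact Solaris encoding is not known; ISO-8859-1 and ISO-8859-2 are used 
here only as an assumption, and decode_or_none is a helper made up for this 
sketch.

```python
def decode_or_none(raw, encoding):
    """Hypothetical helper: decode raw, or return None if it is undecodable."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


raw = b"\xA0"

# 0xA0 alone is a UTF-8 continuation byte, so it cannot be decoded.
print(decode_or_none(raw, "utf-8"))  # None

# In ISO-8859-1 and ISO-8859-2 it is U+00A0 (NO-BREAK SPACE).
for encoding in ("iso-8859-1", "iso-8859-2"):
    text = decode_or_none(raw, encoding)
    print(encoding, "U+%04X" % ord(text))  # U+00A0 for both
```

So a correct decoder would be expected to give U+00A0 here, not 0x30000020.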

For wcsxfrm(), I'm not sure of the range. Example of a result: {0x1010163, 
0x1010101, 0x1010103, 0x1010101, 0x1010103, 0x1010101, 0x1010101}. It looks 
like wcsxfrm() takes the result of strxfrm(), groups the bytes three at a 
time, and adds 0x1000000 to each group. Example of strxfrm() output for the 
same input: {0x01, 0x01, 0x63, 0x01, 0x01, 0x01, ...}.
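
The guess above can be checked with a few lines of Python. This is only a 
model of the observed values, not the Solaris implementation: the function 
name, the big-endian byte order inside a group, and dropping an incomplete 
trailing group are all assumptions.

```python
def group_strxfrm_bytes(data, offset=0x1000000):
    """Hypothetical model: pack bytes 3 by 3 and add offset to each group."""
    groups = []
    usable = len(data) - len(data) % 3  # ignore an incomplete last group
    for i in range(0, usable, 3):
        b0, b1, b2 = data[i:i + 3]
        groups.append(offset + (b0 << 16) + (b1 << 8) + b2)
    return groups


# First six bytes of the strxfrm() output quoted above.
strxfrm_bytes = bytes([0x01, 0x01, 0x63, 0x01, 0x01, 0x01])
print([hex(value) for value in group_strxfrm_bytes(strxfrm_bytes)])
# ['0x1010163', '0x1010101']
```

The two values match the first two elements of the wcsxfrm() result.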

See http://bugs.python.org/issue13441 for more information.

Victor