[Python-Dev] Reject characters bigger than U+10FFFF and Solaris issues
Victor Stinner
victor.stinner at haypocalc.com
Thu Dec 8 02:43:40 CET 2011
Hi,
I would like to forbid the creation of a Unicode string containing characters
outside the range [U+0000; U+10FFFF]. The check is already present in some
places (e.g. the builtin chr() function), but not everywhere. The last
important function lacking it is PyUnicode_FromWideChar, the function used to
decode text from the OS.
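To illustrate the kind of check being discussed, here is a minimal Python
sketch. It is not the actual C code of PyUnicode_FromWideChar; the helper name
check_code_points and the wording of the error message are assumptions made
for this example.

```python
MAX_UNICODE = 0x10FFFF  # highest valid Unicode code point


def check_code_points(values):
    """Hypothetical helper: build a str from integer code points,
    rejecting any value outside [U+0000; U+10FFFF]."""
    for value in values:
        if not 0 <= value <= MAX_UNICODE:
            raise ValueError(
                "character U+%X is not in range [U+0000; U+10FFFF]" % value
            )
    return "".join(chr(value) for value in values)


# The builtin chr() already performs the same kind of check.
try:
    chr(MAX_UNICODE + 1)
    chr_rejects = False
except ValueError:
    chr_rejects = True
```

With such a check in place, a value like 0x30000020 would raise ValueError
instead of silently ending up in a string.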
The problem is that test_locale fails on Solaris with such checks. I would
like to know how to handle the Solaris issues. One possible solution is not to
work around them at all: just raise exceptions and skip the failing tests on
Solaris ;-) Another solution is to modify locale.strxfrm() on all platforms to
return a list of int instead of a str. The type of the result is not really
important, we just have to be able to compare two results (equal, greater,
less than or equal, etc.). Any other solution?
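A rough sketch of why a list of int would be enough for comparisons: Python
lists compare element by element, so equality and ordering work even for
values that cannot be stored in a str. The helper collation_key is
hypothetical, and the second key below is made up for the comparison; only the
first three values of key_a come from observed wcsxfrm() output.

```python
def collation_key(raw_values):
    """Hypothetical helper: keep raw wcsxfrm()-style values as a list of int,
    without any range check."""
    return [int(value) for value in raw_values]


key_a = collation_key([0x1010163, 0x1010101, 0x1010103])
key_b = collation_key([0x1010163, 0x1010101, 0x1010105])  # invented values

# Lists support ==, <, <=, >, >= and can be used as sort keys.
same = key_a == collation_key([0x1010163, 0x1010101, 0x1010103])
a_before_b = key_a < key_b
ordered = sorted([key_b, key_a])
```

The cost of this approach is an API change: callers that expect a str from
locale.strxfrm() would have to be adapted.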
--
The two Solaris issues:
- in the hu_HU locale, localeconv() returns U+30000020 for the thousands
separator
- locale.strxfrm() calls wcsxfrm() which returns characters in the range
[0x1000000; 0x1FFFFFF]
For localeconv(), the value comes from the b'\xA0' byte string decoded from an
encoding looking like ISO-8859-?? (b'\xA0' is not decodable from UTF-8). It
looks like a bug in the decoder. It also looks like OpenIndiana doesn't use
ISO-8859 locales anymore, only UTF-8 locales (which is much better!). I'm
unable to reproduce the issue on my OpenIndiana VM.
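For comparison, this is what Python's own codecs do with that byte. The choice
of ISO-8859-2 is an assumption for this example (the exact encoding of the
Solaris locale is not known here); the point is only that the expected result
is U+00A0, far from U+30000020.

```python
raw = b"\xA0"

# A lone 0xA0 byte is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_decodable = True
except UnicodeDecodeError:
    utf8_decodable = False

# Assumed encoding: ISO-8859-2, where 0xA0 is the no-break space.
decoded = raw.decode("iso8859-2")
code_point = ord(decoded)

# The value reported by localeconv() on Solaris is outside the Unicode range.
solaris_value = 0x30000020
in_range = solaris_value <= 0x10FFFF
```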
For wcsxfrm(), I'm not sure of the range. Example of a result: {0x1010163,
0x1010101, 0x1010103, 0x1010101, 0x1010103, 0x1010101, 0x1010101}. It looks
like wcsxfrm() takes the result of strxfrm(), groups its bytes three at a
time, and adds 0x1000000 to each group. Example of strxfrm() output for the
same input: {0x01, 0x01, 0x63, 0x01, 0x01, 0x01, ...}.
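The guessed transformation can be sketched as follows. This only models the
hypothesis above, it is not the Solaris implementation; the function name
group_bytes, the big-endian byte order, and the decision to drop an incomplete
trailing group are assumptions.

```python
def group_bytes(data, group_size=3, offset=0x1000000):
    """Hypothetical model: read `data` in big-endian groups of `group_size`
    bytes and add `offset` to each group. An incomplete trailing group is
    ignored (assumption)."""
    usable = len(data) - len(data) % group_size
    result = []
    for start in range(0, usable, group_size):
        chunk = data[start:start + group_size]
        result.append(offset + int.from_bytes(chunk, "big"))
    return result


# First six bytes of the strxfrm() output quoted above.
strxfrm_bytes = bytes([0x01, 0x01, 0x63, 0x01, 0x01, 0x01])
wide = group_bytes(strxfrm_bytes)
```

Here wide is [0x1010163, 0x1010101], which matches the first two values of the
wcsxfrm() result. With three bytes per group, every value falls in
[0x1000000; 0x1FFFFFF], all above U+10FFFF.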
See http://bugs.python.org/issue13441 for more information.
Victor