Reject characters bigger than U+10FFFF and Solaris issues
Hi,

I would like to deny the creation of a Unicode string containing characters outside the range [U+0000; U+10FFFF]. The check is already present in some places (e.g. the builtin chr() function), but not everywhere. The last important function is PyUnicode_FromWideChar(), the function used to decode text from the OS.

The problem is that test_locale fails on Solaris with such checks. I would like to know how to handle the Solaris issues.

One possible solution is to not handle them at all: just raise exceptions and skip the failing tests on Solaris ;-)

Another solution is to modify locale.strxfrm() on all platforms to return a list of int instead of a str. The exact type of the result is not really important; we just have to be able to compare two results (equal, greater, less than or equal, etc.).

Another solution?

--

The two Solaris issues:

- in the hu_HU locale, localeconv() returns U+30000020 for the thousands separator
- locale.strxfrm() calls wcsxfrm(), which returns characters in the range [0x1000000; 0x1FFFFFF]

For localeconv(), the culprit is the byte string b'\xA0', decoded from an encoding that looks like ISO-8859-?? (b'\xA0' is not decodable from UTF-8). It looks like a bug in the decoder. It also looks like OpenIndiana no longer uses ISO-8859 locales, only UTF-8 locales (which is much better!). I'm unable to reproduce the issue on my OpenIndiana VM.

For wcsxfrm(), I'm not sure of the range. Example of a result: {0x1010163, 0x1010101, 0x1010103, 0x1010101, 0x1010103, 0x1010101, 0x1010101}. It looks like wcsxfrm() takes the result of strxfrm(), groups the bytes 3 by 3, and adds 0x1000000 to each group. Example of strxfrm() output for the same input: {0x01, 0x01, 0x63, 0x01, 0x01, 0x01, ...}.

See http://bugs.python.org/issue13441 for more information.

Victor
Victor Stinner
For localeconv(), it is the b'\xA0' byte string decoded from an encoding looking like ISO-8859-?? (b'\xA0' is not decodable from UTF-8). It looks like a bug in the decoder. It also looks like OpenIndiana doesn't use ISO-8859 locale anymore, only UTF-8 locales (which is much better!). I'm unable to reproduce the issue on my OpenIndiana VM.
I think that b'\xA0' is a valid thousands separator; the 'fi_FI' locale also uses it. Decimal.__format__() has to handle the 'n' specifier, which takes the thousands separator directly from localeconv(). Currently I have this horrible function to deal with the problem:

    /* Convert decimal_point or thousands_sep, which may be multibyte or
       in the range [128, 255], to a UTF-8 string. */
    static PyObject *
    dotsep_as_utf8(const char *s)
    {
        PyObject *utf8;
        PyObject *tmp;
        wchar_t buf[2];
        size_t n;

        n = mbstowcs(buf, s, 2);
        if (n != 1) {
            /* Issue #7442 */
            PyErr_SetString(PyExc_ValueError,
                            "invalid decimal point or unsupported "
                            "combination of LC_CTYPE and LC_NUMERIC");
            return NULL;
        }
        tmp = PyUnicode_FromWideChar(buf, n);
        if (tmp == NULL) {
            return NULL;
        }
        utf8 = PyUnicode_AsUTF8String(tmp);
        Py_DECREF(tmp);
        return utf8;
    }

The main issue is that there is no portable function mbst_to_utf8() that uses the current locale. If possible, it would be great to have such a thing in the C-API.

I'm not sure why the b'\xA0' problem only occurs on Solaris. Many systems have this thousands separator.

Stefan Krah
Stefan Krah
I'm not sure why the b'\xA0' problem only occurs in Solaris. Many systems have this thousands separator.
Are LC_CTYPE and LC_NUMERIC set to the same value on the buildbot? Otherwise you run into http://bugs.python.org/issue7442.

Stefan Krah
On 08/12/2011 10:17, Stefan Krah wrote:
I think that b'\xA0' is a valid thousands separator.
I agree, but that's not the point: the problem is that b'\xA0' is decoded to a strange U+30000020 character by mbstowcs().
Currently I have this horrible function to deal with the problem:
    ...
    n = mbstowcs(buf, s, 2);
    ...
    tmp = PyUnicode_FromWideChar(buf, n);
    if (tmp == NULL) {
        return NULL;
    }
    utf8 = PyUnicode_AsUTF8String(tmp);
    Py_DECREF(tmp);
    return utf8;
It would not help with this specific issue: b'\xA0' is not decodable from UTF-8.
I'm not sure why the b'\xA0' problem only occurs in Solaris. Many systems have this thousands separator.
The problem is not directly in the C localeconv() function, but in mbstowcs() with the hu_HU locale. You can try my test program for this issue: http://bugs.python.org/file23876/localeconv_wchar.c

My test may not be correct, because it only sets LC_ALL, which is a little different from what the Python tests do (see below).

--

I don't remember on which buildbot the issue occurred :-(

- "sparc solaris10 gcc 3.x" has the "LANG=C" and "TZ=Europe/Berlin" environment variables
- "x86 OpenIndiana 3.x" and "AMD64 OpenIndiana 3.x" have "TZ=Europe/London" and no locale variable!?

The issue occurred for example in test_lc_numeric_basic() of test__locale, which sets the LC_NUMERIC and LC_CTYPE locales (but not LC_ALL). LC_ALL and LC_NUMERIC are different in this test, but LC_NUMERIC and LC_CTYPE are the same.

--

Stefan: would you accept that locale.localeconv() and locale.strxfrm() stop working (instead of returning invalid data) on Solaris in certain cases (it looks like the issue depends on the locale and the OS version)? It could be a motivation to fix the root of the issue ;-)

Victor
Victor Stinner
The problem is not directly in the C localeconv() function, but in mbstowcs() with the hu_HU locale.
Ah, I see.
You can try my test program for this issue: http://bugs.python.org/file23876/localeconv_wchar.c
Can't test on OpenSolaris, since Oracle removed the package repo and I need the ISO locales.
Stefan: would you accept that locale.localeconv() and locale.strxfrm() stop working (instead of returning invalid data) on Solaris in certain cases (it looks like the issue depends on the locale and the OS version)? It could be a motivation to fix the root of the issue ;-)
Yes, if the cause is a broken mbstowcs(), that sounds good.

Stefan Krah