[Python-Dev] Reject characters bigger than U+10FFFF and Solaris issues

Thu Dec 8 13:24:51 CET 2011

On 08/12/2011 10:17, Stefan Krah wrote:
> I think that b'\xA0' is a valid thousands separator.

I agree, but that's not the point: the problem is that mbstowcs() 
decodes b'\xA0' to a strange U+30000020 character.
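
To make the problem concrete: Unicode code points stop at U+10FFFF, so 
U+30000020 cannot be a real character. A minimal sketch of the range 
check in Python (the helper name is_valid_code_point is hypothetical, 
not an existing API):

```python
MAX_UNICODE = 0x10FFFF  # highest valid Unicode code point

def is_valid_code_point(value):
    """Return True if value lies in the Unicode range U+0000..U+10FFFF."""
    return 0 <= value <= MAX_UNICODE

bogus = 0x30000020  # the value reported for b'\xA0' in this thread
print(is_valid_code_point(bogus))  # False
print(is_valid_code_point(0xA0))   # True, U+00A0 is NO-BREAK SPACE

# Python 3's chr() applies the same limit and raises ValueError
try:
    chr(bogus)
except ValueError as exc:
    print("chr() rejects it:", exc)
```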

> Currently I have this horrible function to deal with the problem:
>
> ...
>          n = mbstowcs(buf, s, 2);
> ...
>          tmp = PyUnicode_FromWideChar(buf, n);
>          if (tmp == NULL) {
>                  return NULL;
>          }
>          utf8 = PyUnicode_AsUTF8String(tmp);
>          Py_DECREF(tmp);
>          return utf8;

It would not help with this specific issue: b'\xA0' is not decodable 
from UTF-8.
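
A quick way to see this from Python: the lone byte 0xA0 is a UTF-8 
continuation byte, so it cannot start a character, while a single-byte 
encoding such as ISO-8859-2 maps it to U+00A0 (NO-BREAK SPACE). That the 
hu_HU locale on the affected machines uses ISO-8859-2 is an assumption 
here, not something verified on the buildbots.

```python
raw = b'\xA0'

# Decoding as UTF-8 fails: 0xA0 is only valid as a continuation byte
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    utf8_error = exc
    print("not UTF-8:", exc.reason)
else:
    utf8_error = None

# Assumed locale encoding: ISO-8859-2 maps 0xA0 to NO-BREAK SPACE
text = raw.decode('iso8859-2')
print(hex(ord(text)))  # 0xa0
```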

> I'm not sure why the b'\xA0' problem only occurs in Solaris. Many systems
> have this thousands separator.

The problem is not directly in the C localeconv() function, but in 
mbstowcs() with the hu_HU locale.

You can try my test program for this issue:
http://bugs.python.org/file23876/localeconv_wchar.c

My test may not be correct, because it only sets LC_ALL, which is 
slightly different from what the Python tests do (see below).

--

I don't remember on which buildbot the issue occurred :-(

  - "sparc solaris10 gcc 3.x" has the "LANG=C" and "TZ=Europe/Berlin" 
environment variables
  - "x86 OpenIndiana 3.x" and "AMD64 OpenIndiana 3.x" have 
"TZ=Europe/London" and no locale variable!?

The issue occurred for example in test_lc_numeric_basic() of 
test__locale which sets LC_NUMERIC and LC_CTYPE locales (but not 
LC_ALL). LC_ALL and LC_NUMERIC are different in this test, but 
LC_NUMERIC and LC_CTYPE are the same.
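
The locale setup used by that test can be approximated as follows. This 
is only a sketch: the helper thousands_sep_for is hypothetical (it is 
not the code of test__locale), the locale name 'hu_HU' may not be 
installed everywhere, and the helper returns None in that case.

```python
import locale

def thousands_sep_for(name):
    """Set LC_CTYPE and LC_NUMERIC (but not LC_ALL) to `name` and return
    the thousands separator, or None if the locale is unavailable."""
    old_ctype = locale.setlocale(locale.LC_CTYPE)      # query only
    old_numeric = locale.setlocale(locale.LC_NUMERIC)  # query only
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        locale.setlocale(locale.LC_NUMERIC, name)
        return locale.localeconv()['thousands_sep']
    except locale.Error:
        return None
    finally:
        # Always restore the previous settings
        locale.setlocale(locale.LC_CTYPE, old_ctype)
        locale.setlocale(locale.LC_NUMERIC, old_numeric)

print(repr(thousands_sep_for('hu_HU')))
```

On an affected system, the interesting part is what localeconv() does 
with the separator once LC_CTYPE selects the encoding used to decode it.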

--

Stefan: would you accept that locale.localeconv() and locale.strxfrm() 
stop working (instead of returning invalid data) on Solaris in certain 
cases (it looks like the issue depends on the locale and the OS 
version)? It could be a motivation to fix the root of the issue ;-)

Victor