[Python-Dev] Reject characters bigger than U+10FFFF and Solaris issues

Stefan Krah stefan at bytereef.org
Thu Dec 8 10:17:52 CET 2011


Victor Stinner <victor.stinner at haypocalc.com> wrote:
> For localeconv(), it is the b'\xA0' byte string decoded from an encoding 
> looking like ISO-8859-?? (b'\xA0' is not decodable from UTF-8). It looks like 
> a bug in the decoder. It also looks like OpenIndiana doesn't use ISO-8859 
> locale anymore, only UTF-8 locales (which is much better!). I'm unable to 
> reproduce the issue on my OpenIndiana VM.

I think that b'\xA0' is a valid thousands separator. The 'fi_FI' locale also
uses it. Decimal.__format__() has to handle the 'n' specifier, which takes the
thousands separator directly from localeconv(). Currently I have this horrible
function to deal with the problem:

/* Convert decimal_point or thousands_sep, which may be multibyte or in
   the range [128, 255], to a UTF8 string. */
static PyObject *
dotsep_as_utf8(const char *s)
{
        PyObject *utf8;
        PyObject *tmp;
        wchar_t buf[2];
        size_t n;

        n = mbstowcs(buf, s, 2);
        if (n != 1) { /* Issue #7442 */
                PyErr_SetString(PyExc_ValueError,
                    "invalid decimal point or unsupported "
                    "combination of LC_CTYPE and LC_NUMERIC");
                return NULL;
        }
        tmp = PyUnicode_FromWideChar(buf, n);
        if (tmp == NULL) {
                return NULL;
        }
        utf8 = PyUnicode_AsUTF8String(tmp);
        Py_DECREF(tmp);
        return utf8;
}
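
To check which bytes a given platform actually reports, the separator can be
dumped as hex. This is only a sketch: bytes_as_hex() and
thousands_sep_as_hex() are hypothetical helper names, not part of any
existing API, and the result depends on whatever LC_NUMERIC locale the
caller has set with setlocale().

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Write the bytes of s as uppercase hex pairs into out (NUL-terminated).
   Returns 0 on success, -1 if out is too small. */
static int
bytes_as_hex(const char *s, char *out, size_t outsize)
{
        size_t len = strlen(s);
        size_t i;

        if (outsize < 2 * len + 1) {
                return -1;
        }
        for (i = 0; i < len; i++) {
                /* Go through unsigned char so bytes >= 0x80 are not
                   sign-extended. */
                sprintf(out + 2 * i, "%02X",
                        (unsigned int)(unsigned char)s[i]);
        }
        out[2 * len] = '\0';
        return 0;
}

/* Hypothetical helper: hex dump of the current locale's thousands_sep. */
static int
thousands_sep_as_hex(char *out, size_t outsize)
{
        struct lconv *lc = localeconv();

        return bytes_as_hex(lc->thousands_sep, out, outsize);
}
```

In the default "C" locale the separator is the empty string, so the dump is
empty. A locale that uses a no-break space in a single-byte encoding would
be expected to give "A0".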


The main issue is that there is no portable function mbst_to_utf8()
that uses the current locale. If possible, it would be great to have
such a thing in the C-API.
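
Without the Python C-API, such a function might look roughly like the
sketch below. The names locale_mbs_to_utf8() and encode_utf8() are made up
for illustration. The sketch assumes that wchar_t values produced by
mbstowcs() are Unicode code points, which C does not guarantee for every
platform and locale, and it rejects surrogates and anything above U+10FFFF.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes required).
   Returns the number of bytes written, or 0 if cp is a surrogate or
   larger than U+10FFFF. */
static size_t
encode_utf8(unsigned long cp, unsigned char *out)
{
        assert(out != NULL);
        if (cp < 0x80) {
                out[0] = (unsigned char)cp;
                return 1;
        }
        if (cp < 0x800) {
                out[0] = (unsigned char)(0xC0 | (cp >> 6));
                out[1] = (unsigned char)(0x80 | (cp & 0x3F));
                return 2;
        }
        if (cp >= 0xD800 && cp <= 0xDFFF) {
                return 0; /* surrogates are not valid scalar values */
        }
        if (cp < 0x10000) {
                out[0] = (unsigned char)(0xE0 | (cp >> 12));
                out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                out[2] = (unsigned char)(0x80 | (cp & 0x3F));
                return 3;
        }
        if (cp <= 0x10FFFF) {
                out[0] = (unsigned char)(0xF0 | (cp >> 18));
                out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
                out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                out[3] = (unsigned char)(0x80 | (cp & 0x3F));
                return 4;
        }
        return 0;
}

/* Hypothetical helper: convert the multibyte string s, interpreted in the
   current LC_CTYPE locale, to a NUL-terminated UTF-8 string in out.
   Returns the number of bytes written (excluding the NUL), or (size_t)-1
   if s is invalid, too long, or does not fit into out.
   ASSUMPTION: wchar_t holds Unicode code points. */
static size_t
locale_mbs_to_utf8(const char *s, char *out, size_t outsize)
{
        wchar_t wbuf[64];
        size_t n, i, pos = 0;

        if (outsize == 0) {
                return (size_t)-1;
        }
        n = mbstowcs(wbuf, s, 64);
        if (n == (size_t)-1 || n == 64) {
                return (size_t)-1; /* undecodable or possibly truncated */
        }
        for (i = 0; i < n; i++) {
                unsigned char tmp[4];
                size_t k = encode_utf8((unsigned long)wbuf[i], tmp);

                if (k == 0 || pos + k >= outsize) {
                        return (size_t)-1;
                }
                memcpy(out + pos, tmp, k);
                pos += k;
        }
        out[pos] = '\0';
        return pos;
}
```

The caller is expected to have set LC_CTYPE to match the encoding of the
string, which is the same LC_CTYPE/LC_NUMERIC mismatch that dotsep_as_utf8()
reports as an error.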

I'm not sure why the b'\xA0' problem only occurs on Solaris. Many systems
have this thousands separator.



Stefan Krah



