[I18n-sig] error handling in charmap-based codecs

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 20 Dec 2000 12:36:16 +0100

> Most standard codecs based on the charmap codec, such as
> iso8859_2 and koi8_r, appear not to do correct error handling.
> Although the default error handling scheme is "strict",
> characters that are not in a mapping are passed through without
> decoding/encoding.  Worse, a specified error handling scheme is
> completely ignored.

Indeed. I have filed a bug report, "Unicode encoders don't report
errors properly",


Unfortunately, there is disagreement about whether this is a bug, or
what the nature of the bug is.

>        /* Get mapping (char ordinal -> integer, Unicode char or None) */
>        w = PyInt_FromLong((long)ch);
>        if (w == NULL)
>            goto onError;
>        x = PyObject_GetItem(mapping, w);
>        Py_DECREF(w);
>        if (x == NULL) {
>            if (PyErr_ExceptionMatches(PyExc_LookupError)) {
>                /* No mapping found: default to Latin-1 mapping */
>                PyErr_Clear();
>                *p++ = (Py_UNICODE)ch;
>                continue;
>            }
>            goto onError;
>        }
>
> Evidently, a character not in the 'mapping' object is passed as
> it is.  I'm not sure why the if statement shown above has been
> put here.
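
To make the quoted behaviour concrete, here is a rough pure-Python model
of that lookup loop. It is only a sketch: the function name
charmap_decode_passthrough and the simplified handling of "strict",
"ignore" and "replace" are assumptions for illustration, not the actual
codec code.

```python
def charmap_decode_passthrough(data, mapping, errors="strict"):
    """Hypothetical model of the quoted C loop (not the real codec)."""
    out = []
    for ch in data:  # iterating over bytes yields ints in range(256)
        try:
            x = mapping[ch]
        except LookupError:
            # No mapping found: the character is passed through as its
            # Latin-1 code point, and 'errors' is never consulted.
            out.append(chr(ch))
            continue
        if x is None:
            # Only an explicit None reaches the error handling scheme.
            if errors == "strict":
                raise UnicodeError("character maps to <undefined>")
            elif errors == "replace":
                out.append("\ufffd")
            elif errors != "ignore":
                raise ValueError("unknown error handling scheme: %r" % errors)
            continue
        out.append(chr(x) if isinstance(x, int) else x)
    return "".join(out)


# 0x80 is absent from the mapping, yet "strict" decoding succeeds.
print(repr(charmap_decode_passthrough(b"A\x80", {0x41: "A"})))
```

In this model the byte 0x80 comes out as U+0080 even under "strict",
which is the pass-through described above.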

I'm not sure, either. There is no documentation of what the function
is supposed to do, so it is hard to tell whether it does that correctly.
IMO, it should read

       if (x == NULL) {
           if (PyErr_ExceptionMatches(PyExc_LookupError)) {
               /* No mapping found: treat the character as undefined */
               PyErr_Clear();
               x = Py_None;
               Py_INCREF(x);
           } else
               goto onError;
       }

I can't see any reason for defaulting to *Latin-1*.
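
The effect of that change can be sketched in pure Python. This is a
model under assumptions, not the C implementation: the function name
charmap_decode_fixed is made up, and only the "strict", "ignore" and
"replace" schemes are modelled.

```python
def charmap_decode_fixed(data, mapping, errors="strict"):
    """Hypothetical model: a missing key is treated like None."""
    out = []
    for pos, ch in enumerate(data):  # ch is an int in range(256)
        try:
            x = mapping[ch]
        except LookupError:
            x = None  # missing key now means "undefined"
        if x is None:
            # Undefined characters go through the error handling scheme.
            if errors == "strict":
                raise UnicodeError(
                    "byte 0x%02x at position %d maps to <undefined>" % (ch, pos))
            elif errors == "replace":
                out.append("\ufffd")
            elif errors != "ignore":
                raise ValueError("unknown error handling scheme: %r" % errors)
            continue
        out.append(chr(x) if isinstance(x, int) else x)
    return "".join(out)


mapping = {0x41: "A"}  # 0x80 is deliberately left undefined
print(repr(charmap_decode_fixed(b"A\x80", mapping, "ignore")))
print(repr(charmap_decode_fixed(b"A\x80", mapping, "replace")))
```

With this behaviour a plain dictionary is enough: "strict" raises for
the undefined byte, while "ignore" and "replace" take effect as requested.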

> An error handling scheme works as expected if the mapping object
> returns None for an undefined key.  So, I've added the following
> code to charmap-based codecs of mine:

Yes, that is also the solution proposed in response to my bug report.
I don't like it at all as a solution, although it is an OK work-around.
As a solution, it is stupid: all codecs will have to pay the cost of
UserDict accesses, and no codec makes use of this 1:1 "feature", when
the real solution is a three-line change.
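
For readers who have not seen this kind of work-around, it might look
roughly like the sketch below. The code from the quoted message is not
reproduced here, so the class name NoneForMissing is hypothetical, and
collections.UserDict is the modern spelling of the UserDict module.

```python
from collections import UserDict


class NoneForMissing(UserDict):
    """Hypothetical mapping wrapper: undefined keys yield None."""

    def __getitem__(self, key):
        # Every lookup now passes through this Python-level method,
        # which is the per-character cost objected to above.
        return self.data.get(key)


decoding_map = NoneForMissing({0x41: "A"})
print(decoding_map[0x41])
print(decoding_map[0x80])
```

A missing key returns None instead of raising KeyError, so the codec's
lookup sees "undefined" and the error handling scheme gets a chance to run.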

Just my 0.02EUR,