Mailman 3 PEP 293, Codec Error Handling Callbacks - Python-Dev

June 19, 2002


      Here's another new PEP.

Bye,
    Walter Dörwald

----------------------------------------------------------------------
PEP: 293
Title: Codec Error Handling Callbacks
Version: $Revision: 1.1 $
Last-Modified: $Date: 2002/06/19 03:22:11 $
Author: Walter Dörwald
Status: Draft
Type: Standards Track
Created: 18-Jun-2002
Python-Version: 2.3
Post-History:


Abstract

     This PEP aims at extending Python's fixed codec error handling
     schemes with a more flexible callback based approach.

     Python currently uses a fixed error handling for codec error
     handlers.  This PEP describes a mechanism which allows Python to
     use function callbacks as error handlers.  With these more
     flexible error handlers it is possible to add new functionality to
     existing codecs by e.g. providing fallback solutions or different
     encodings for cases where the standard codec mapping does not
     apply.


Specification

     Currently the set of codec error handling algorithms is fixed to
     either "strict", "replace" or "ignore" and the semantics of these
     algorithms is implemented separately for each codec.

     The proposed patch will make the set of error handling algorithms
     extensible through a codec error handler registry which maps
     handler names to handler functions.  This registry consists of the
     following two C functions:

         int PyCodec_RegisterError(const char *name, PyObject *error)

         PyObject *PyCodec_LookupError(const char *name)

     and their Python counterparts

         codecs.register_error(name, error)

         codecs.lookup_error(name)

     PyCodec_LookupError raises a LookupError if no callback function
     has been registered under this name.

     Similar to the encoding name registry there is no way of
     unregistering callback functions or iterating through the
     available functions.

     The callback functions will be used in the following way by the
     codecs: when the codec encounters an encoding/decoding error, the
     callback function is looked up by name, the information about the
     error is stored in an exception object and the callback is called
     with this object.  The callback returns information about how to
     proceed (or raises an exception).

     For encoding, the exception object will look like this:

        class UnicodeEncodeError(UnicodeError):
            def __init__(self, encoding, object, start, end, reason):
                UnicodeError.__init__(self,
                    "encoding '%s' can't encode characters " +
                    "in positions %d-%d: %s" % (encoding,
                        start, end-1, reason))
                self.encoding = encoding
                self.object = object
                self.start = start
                self.end = end
                self.reason = reason

     This type will be implemented in C with the appropriate setter and
     getter methods for the attributes, which have the following
     meaning:

       * encoding: The name of the encoding;
       * object: The original unicode object for which encode() has
         been called;
       * start: The position of the first unencodable character;
       * end: (The position of the last unencodable character)+1 (or
         the length of object, if all characters from start to the end
         of object are unencodable);
       * reason: The reason why object[start:end] couldn't be encoded.

     If object has consecutive unencodable characters, the encoder
     should collect those characters for one call to the callback if
     those characters can't be encoded for the same reason.  The
     encoder is not required to implement this behaviour but may call
     the callback for every single character, but it is strongly
     suggested that the collecting method is implemented.

     The callback must not modify the exception object.  If the
     callback does not raise an exception (either the one passed in, or
     a different one), it must return a tuple:

         (replacement, newpos)

     replacement is a unicode object that the encoder will encode and
     emit instead of the unencodable object[start:end] part, newpos
     specifies a new position within object, where (after encoding the
     replacement) the encoder will continue encoding.

     If the replacement string itself contains an unencodable character
     the encoder raises the exception object (but may set a different
     reason string before raising).

     Should further encoding errors occur, the encoder is allowed to
     reuse the exception object for the next call to the callback.
     Furthermore the encoder is allowed to cache the result of
     codecs.lookup_error.

     If the callback does not know how to handle the exception, it must
     raise a TypeError.

     Decoding works similar to encoding with the following differences:
     The exception class is named UnicodeDecodeError and the attribute
     object is the original 8bit string that the decoder is currently
     decoding.

     The decoder will call the callback with those bytes that
     constitute one undecodable sequence, even if there is more than
     one undecodable sequence that is undecodable for the same reason
     directly after the first one.  E.g. for the "unicode-escape"
     encoding, when decoding the illegal string "\\u00\\u01x", the
     callback will be called twice (once for "\\u00" and once for
     "\\u01").  This is done to be able to generate the correct number
     of replacement characters.

     The replacement returned from the callback is a unicode object
     that will be emitted by the decoder as-is without further
     processing instead of the undecodable object[start:end] part.

     There is a third API that uses the old strict/ignore/replace error
     handling scheme:

         PyUnicode_TranslateCharmap/unicode.translate

     The proposed patch will enhance PyUnicode_TranslateCharmap, so
     that it also supports the callback registry.  This has the
     additional side effect that PyUnicode_TranslateCharmap will
     support multi-character replacement strings (see SF feature
     request #403100 [1]).

     For PyUnicode_TranslateCharmap the exception class will be named
     UnicodeTranslateError.  PyUnicode_TranslateCharmap will collect
     all consecutive untranslatable characters (i.e. those that map to
     None) and call the callback with them.  The replacement returned
     from the callback is a unicode object that will be put in the
     translated result as-is, without further processing.

     All encoders and decoders are allowed to implement the callback
     functionality themselves, if they recognize the callback name
     (i.e. if it is a system callback like "strict", "replace" and
     "ignore").  The proposed patch will add two additional system
     callback names: "backslashreplace" and "xmlcharrefreplace", which
     can be used for encoding and translating and which will also be
     implemented in-place for all encoders and
     PyUnicode_TranslateCharmap.

     The Python equivalent of these five callbacks will look like this:

         def strict(exc):
             raise exc

         def ignore(exc):
             if isinstance(exc, UnicodeError):
                 return (u"", exc.end)
             else:
                 raise TypeError("can't handle %s" % exc.__name__)

        def replace(exc):
             if isinstance(exc, UnicodeEncodeError):
                 return ((exc.end-exc.start)*u"?", exc.end)
             elif isinstance(exc, UnicodeDecodeError):
                 return (u"\\ufffd", exc.end)
             elif isinstance(exc, UnicodeTranslateError):
                 return ((exc.end-exc.start)*u"\\ufffd", exc.end)
             else:
                 raise TypeError("can't handle %s" % exc.__name__)

        def backslashreplace(exc):
             if isinstance(exc,
                 (UnicodeEncodeError, UnicodeTranslateError)):
                 s = u""
                 for c in exc.object[exc.start:exc.end]:
                    if ord(c)<=0xff:
                        s += u"\\x%02x" % ord(c)
                    elif ord(c)<=0xffff:
                        s += u"\\u%04x" % ord(c)
                    else:
                        s += u"\\U%08x" % ord(c)
                 return (s, exc.end)
             else:
                 raise TypeError("can't handle %s" % exc.__name__)

        def xmlcharrefreplace(exc):
             if isinstance(exc,
                 (UnicodeEncodeError, UnicodeTranslateError)):
                 s = u""
                 for c in exc.object[exc.start:exc.end]:
                    s += u"&#%d;" % ord(c)
                 return (s, exc.end)
             else:
                 raise TypeError("can't handle %s" % exc.__name__)

     These five callback handlers will also be accessible to Python as
     codecs.strict_error, codecs.ignore_error, codecs.replace_error,
     codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.


Rationale

     Most legacy encoding do not support the full range of Unicode
     characters.  For these cases many high level protocols support a
     way of escaping a Unicode character (e.g. Python itself supports
     the \x, \u and \U convention, XML supports character references
     via &#xxx; etc.).

     When implementing such an encoding algorithm, a problem with the
     current implementation of the encode method of Unicode objects
     becomes apparent: For determining which characters are unencodable
     by a certain encoding, every single character has to be tried,
     because encode does not provide any information about the location
     of the error(s), so

         # (1)
         us = u"xxx"
         s = us.encode(encoding)

     has to be replaced by

         # (2)
         us = u"xxx"
         v = []
         for c in us:
             try:
                 v.append(c.encode(encoding))
             except UnicodeError:
                 v.append("&#%d;" % ord(c))
         s = "".join(v)

     This slows down encoding dramatically as now the loop through the
     string is done in Python code and no longer in C code.

     Furthermore this solution poses problems with stateful encodings.
     For example UTF-16 uses a Byte Order Mark at the start of the
     encoded byte string to specify the byte order.  Using (2) with
     UTF-16, results in an 8 bit string with a BOM between every
     character.

     To work around this problem, a stream writer - which keeps state
     between calls to the encoding function - has to be used:

         # (3)
         us = u"xxx"
         import codecs, cStringIO as StringIO
         writer = codecs.getwriter(encoding)

         v = StringIO.StringIO()
         uv = writer(v)
         for c in us:
             try:
                 uv.write(c)
             except UnicodeError:
                 uv.write(u"&#%d;" % ord(c))
         s = v.getvalue()

     To compare the speed of (1) and (3) the following test script has
     been used:

         # (4)
         import time
         us = u"äa"*1000000
         encoding = "ascii"
         import codecs, cStringIO as StringIO

         t1 = time.time()

         s1 = us.encode(encoding, "replace")

         t2 = time.time()

         writer = codecs.getwriter(encoding)

         v = StringIO.StringIO()
         uv = writer(v)
         for c in us:
             try:
                 uv.write(c)
             except UnicodeError:
                 uv.write(u"?")
         s2 = v.getvalue()

         t3 = time.time()

         assert(s1==s2)
         print "1:", t2-t1
         print "2:", t3-t2
         print "factor:", (t3-t2)/(t2-t1)

     On Linux this gives the following output (with Python 2.3a0):

         1: 0.274321913719
         2: 51.1284689903
         factor: 186.381278466

     i.e. (3) is 180 times slower than (1).

     Codecs must be stateless, because as soon as a callback is
     registered it is available globally and can be called by multiple
     encode() calls.  To be able to use stateful callbacks, the errors
     parameter for encode/decode/translate would have to be changed
     from char * to PyObject *, so that the callback could be used
     directly, without the need to register the callback globally.  As
     this requires changes to lots of C prototypes, this approach was
     rejected.

     Currently all encoding/decoding functions have arguments

         const Py_UNICODE *p, int size

     or

         const char *p, int size

     to specify the unicode characters/8bit characters to be
     encoded/decoded.  So in case of an error the codec has to create a
     new unicode or str object from these parameters and store it in
     the exception object.  The callers of these encoding/decoding
     functions extract these parameters from str/unicode objects
     themselves most of the time, so it could speed up error handling
     if these object were passed directly.  As this again requires
     changes to many C functions, this approach has been rejected.


Implementation Notes

     A sample implementation is available as SourceForge patch #432401
     [2].  The current version of this patch differs from the
     specification in the following way:

       * The error information is passed from the codec to the callback
         not as an exception object, but as a tuple, which has an
         additional entry state, which can be used for additional
         information the codec might want to pass to the callback.
       * There are two separate registries (one for
         encoding/translating and one for decoding)

     The class codecs.StreamReaderWriter uses the errors parameter for
     both reading and writing.  To be more flexible this should
     probably be changed to two separate parameters for reading and
     writing.

     The errors parameter of PyUnicode_TranslateCharmap is not
     availably to Python, which makes testing of the new functionality
     of PyUnicode_TranslateCharmap impossible with Python scripts.  The
     patch should add an optional argument errors to unicode.translate
     to expose the functionality and make testing possible.

     Codecs that do something different than encoding/decoding from/to
     unicode and want to use the new machinery can define their own
     exception classes and the strict handlers will automatically work
     with it. The other predefined error handlers are unicode specific
     and expect to get a Unicode(Encode|Decode|Translate)Error
     exception object so they won't work.


Backwards Compatibility

     The semantics of unicode.encode with errors="replace" has changed:
     The old version always stored a ? character in the output string
     even if no character was mapped to ?  in the mapping.  With the
     proposed patch, the replacement string from the callback callback
     will again be looked up in the mapping dictionary.  But as all
     supported encodings are ASCII based, and thus map ? to ?, this
     should not be a problem in practice.


References

     [1] SF feature request #403100
         "Multicharacter replacements in PyUnicode_TranslateCharmap"
         http://www.python.org/sf/403100

     [2] SF patch #432401 "unicode encoding error callbacks"
         http://www.python.org/sf/432401


Copyright

     This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End:

PEP 293, Codec Error Handling Callbacks

M.-A. Lemburg

Oren Tirosh

M.-A. Lemburg

Oren Tirosh

Oren Tirosh

M.-A. Lemburg

Oren Tirosh

M.-A. Lemburg

M.-A. Lemburg

M.-A. Lemburg

M.-A. Lemburg

M.-A. Lemburg

M.-A. Lemburg

Jack Jansen

M.-A. Lemburg

M.-A. Lemburg

Oren Tirosh

M.-A. Lemburg

Oren Tirosh

Oren Tirosh

M.-A. Lemburg

Oren Tirosh

M.-A. Lemburg

M.-A. Lemburg

M.-A. Lemburg

M.-A. Lemburg

M.-A. Lemburg

M.-A. Lemburg

Jack Jansen

M.-A. Lemburg

tags

participants (7)