[Python-Dev] Proposed PEP: Codec error handling callbacks
Walter Dörwald
walter@livinglogic.de
Thu, 04 Oct 2001 20:22:01 +0200
Here's a first try at a PEP for making Python's codec error
handling customizable. This has been briefly discussed before
on the I18N-SIG mailing list.
A sample implementation of the proposed feature is available
as a SourceForge patch.
Bye,
Walter Dörwald
---------------------------------------------------------------------------
PEP: ???
Title: Codec error handling callbacks
Version: $Revision$
Last-Modified: $Date$
Author: walter@livinglogic.de (Walter Dörwald)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: ??-???-2001
Post-History: 04-Oct-2001
Abstract
This PEP aims at extending Python's fixed codec error handling
schemes with a more flexible callback-based approach.
Python currently only supports a fixed set of error handling
schemes for its codecs. This PEP describes a mechanism which
allows Python to
use function callbacks as error handlers. With these more flexible
error handlers it is possible to add new functionality to existing
codecs by e.g. providing fallback solutions or different encodings
for cases where the standard codec mapping does not apply.
Problem
A typical case is where a character encoding does not support
the full range of Unicode characters. For these cases many high
level protocols support a way of escaping a Unicode character
(e.g. Python itself supports the \x, \u and \U convention, XML
supports character references via &#xxxx; etc.).
When implementing such an encoding algorithm, a problem with the
current implementation of the encode method of Unicode objects
becomes apparent: For determining which characters are unencodable
by a certain encoding, every single character has to be tried,
because encode does not provide any information about the location
of the error(s), so
# (1)
us = u"xxx"
s = us.encode(encoding)
has to be replaced by
# (2)
us = u"xxx"
v = []
for c in us:
    try:
        v.append(c.encode(encoding))
    except UnicodeError:
        v.append("&#%d;" % ord(c))
s = "".join(v)
This slows down encoding dramatically, as the loop through the
string is now done in Python code and no longer in C code.
Furthermore, this solution poses problems with stateful encodings.
For example, UTF-16 uses a Byte Order Mark at the start of the
encoded byte string to specify the byte order. Using (2) with
UTF-16 results in an 8-bit string with a BOM between every
character.
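For illustration (this snippet is not part of the proposal; the
byte values shown assume a little-endian machine):
u"ab".encode("utf-16")
# -> '\xff\xfea\x00b\x00'          (a single BOM at the start)
"".join([c.encode("utf-16") for c in u"ab"])
# -> '\xff\xfea\x00\xff\xfeb\x00'  (a BOM before every character)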
To work around this problem, a stream writer - which keeps
state between calls to the encoding function - has to be used:
# (3)
us = u"xxx"
import codecs, cStringIO as StringIO
writer = codecs.lookup(encoding)[3]
v = StringIO.StringIO()
uv = writer(v)
for c in us:
    try:
        uv.write(c)
    except UnicodeError:
        uv.write(u"&#%d;" % ord(c))
s = v.getvalue()
To compare the speed of (1) and (3) the following test script
has been used:
# (4)
import time
us = u"äa"*1000000
encoding = "ascii"
import codecs, cStringIO as StringIO
t1 = time.time()
s1 = us.encode(encoding, "replace")
t2 = time.time()
writer = codecs.lookup(encoding)[3]
v = StringIO.StringIO()
uv = writer(v)
for c in us:
    try:
        uv.write(c)
    except UnicodeError:
        uv.write(u"?")
s2 = v.getvalue()
t3 = time.time()
assert s1 == s2
print "1:", t2-t1
print "2:", t3-t2
print "factor:", (t3-t2)/(t2-t1)
On Linux with Python 2.1 this gives the
following output:
1: 0.395456075668
2: 126.909595966
factor: 320.919575586
i.e. (3) is 320 times slower than (1).
Solution 1
Add a position attribute to UnicodeError instances.
When the encoder encounters an unencodable character
it raises an exception that specifies the position in the
Unicode object where the unencodable character occurs.
The application can then reencode the Unicode object up
to the offending character, replace the offending character
with something appropriate and retry encoding with the
rest of the Unicode string until encoding is finished.
A stream writer will write everything up to the offending
character and then raise an exception, so the application
only has to replace the offending character and retry the rest
of the string.
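Application code under this solution might look roughly like the
following sketch (hypothetical, since the position attribute does
not exist in current Python; the attribute name pos is an
assumption):
us = u"xxx"
v = []
pos = 0
while pos < len(us):
    try:
        v.append(us[pos:].encode(encoding))
        break
    except UnicodeError, exc:
        # encode everything up to the offending character,
        # replace it, and retry with the rest of the string
        bad = pos + exc.pos          # hypothetical position attribute
        v.append(us[pos:bad].encode(encoding))
        v.append("&#%d;" % ord(us[bad]))
        pos = bad + 1
s = "".join(v)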
Advantage
Requires only minor changes to all encoders/stream writers.
Disadvantage
As the encoder has to be called multiple times, this
won't work with stateful encodings, so a stream writer
has to be used in all cases.
If unencodable characters occur very often, Solution 1
will probably not be much faster than (3). E.g. for the
string u"a"*10000+u"ä" all the characters but the last one
will have to be encoded twice when using an encoder
(but only once when using a stream writer).
This solution is specific to encoding and can't
be extended to decoding easily.
Solution 2
Add additional error handling names for every needed
replacement scheme (e.g. "xmlcharrefreplace" for "&#%d;"
or "escapereplace" for "\\x%02x" / "\\u%04x" / "\\U%08x")
Advantage
Requires only minor changes to all encoders/stream writers.
As the replacement scheme is implemented directly in
the encoder, this should be the fastest solution.
Disadvantages
The available replacement schemes are hardwired.
Additional replacement schemes require that all
encoders/decoders are changed again. This is especially
problematic for decoding where users might want to
implement various error handlers for handling broken
text files.
Solution 3
The errors argument is no longer a string, but a callback
function: This callback function gets passed the original
Unicode object and the position of the unencodable character
and returns a new Unicode object that should be encoded instead
of the unencodable one. (Note that the encoder *must* be able
to encode what is returned from the handler; if it can't, a
normal UnicodeError will be raised.)
Example code could look like this:
us = u"xxx"
def xmlescape(uni, pos):
    return u"&#%d;" % ord(uni[pos])
s = us.encode(encoding, xmlescape)
Advantages
This makes the error handling completely customizable. The error
handler may choose to raise an exception or do any kind of
replacement required.
The interface between the encoder and the error handler can be
designed in a way that this mechanism can be used for decoding too:
The original 8-bit string is passed to the error handler and the
error handler returns a replacement Unicode object and a
resynchronization position where decoding should continue.
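For example, a decoding error handler following this idea could
look roughly like the following sketch (the two-argument signature
is an assumption; the PEP only outlines the interface):
def replacebroken(s, pos):
    # return a replacement character for the undecodable byte
    # and the position where decoding should resume
    return (u"\ufffd", pos + 1)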
Disadvantages
This solution requires changes to every C function
that has a "const char *errors" argument, e.g.

PyUnicode_EncodeLatin1(
    const Py_UNICODE *p,
    int size,
    const char *errors)

has to be changed to

PyUnicode_EncodeLatin1(
    const Py_UNICODE *p,
    int size,
    PyObject *errors)

(To provide backwards compatibility the PyUnicode_EncodeLatin1
signature remains the same; a new function
PyUnicode_EncodeLatin1Ex is introduced with the new
signature. PyUnicode_EncodeLatin1 simply calls the new function.)
Solution 4
The errors argument is still a string, but this string
is used to look up an error handling callback function
from a callback registry.
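The registry itself could be as simple as a dictionary mapping
names to callables; a minimal sketch (the names register_error
and lookup_error are assumptions for illustration, not part of
this proposal):
# a minimal sketch of such a callback registry
error_handlers = {}

def register_error(name, handler):
    error_handlers[name] = handler

def lookup_error(name):
    return error_handlers[name]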
Advantage
No changes to the Python C API are required. Well-known
error handling schemes could be implemented directly in
the encoder/decoder for maximum speed. Like Solution 3,
this can be done for both encoders and decoders.
Disadvantages
Currently all encoding/decoding functions have arguments
const Py_UNICODE *p, int size
or
const char *p, int size
to specify the Unicode characters/8-bit characters to be
encoded/decoded. In case of an error a new unicode or str
object has to be created from this information and passed
to the error handler. This has to be done either for every
error that occurs, or once on the first error, in which case
the result object must be kept until the end of the function
in case another error occurs. Most functions that call the
codec functions work with unicode/str objects anyway, so they
have to extract the const Py_UNICODE */int arguments and pass
them to the codec functions, which then have to reconstruct
the object in case of an error.
Sample implementation
A sample implementation is available on SourceForge [1]. This
patch implements a combination of solutions 3 and 4, i.e. it
is possible to pass functions *and* registered callback names
to unicode.encode.
The interface between the encoder and the handler has been
extended to be able to support the same interface for encoding
and decoding. The encoder/decoder passes a tuple to the callback
with the following entries:
0: the name of the encoding
1: the original Unicode object/the original 8-bit string
2: the position of the unencodable character/byte
3: a string giving the reason why the character/byte can't
   be encoded/decoded (e.g. "character not in range(128)"
   for encoding or "unexpected end of data", "invalid
   code byte" etc. for decoding)
4: an additional object that can be used to pass state
   information to the callback. All implemented
   encoders/decoders currently pass None for this.
The callback must return a tuple with the following info:
0: a Unicode object which will be encoded instead of the
   offending character (for encoders), or a Unicode object
   that will be used as a replacement for the undecodable
   bytes in the decoded Unicode object (for decoders)
1: a position where the encoding/decoding process should
   continue after processing the replacement string.
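With this interface, an encoding error handler equivalent to the
"xmlcharrefreplace" scheme could look roughly like the following
(a sketch based on the description above; the handler name and
the exact calling details are defined by the patch, not here):
def xmlcharref_handler(info):
    # info is the tuple described above:
    # (encoding, object, position, reason, state)
    encoding, obj, pos, reason, state = info
    # replace the unencodable character with an XML character
    # reference and continue encoding right after it
    return (u"&#%d;" % ord(obj[pos]), pos + 1)

us = u"xxx"
s = us.encode("ascii", xmlcharref_handler)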
The patch includes several preregistered encode error
handling schemes with the following names:
"strict", "ignore", "replace", "xmlcharrefreplace",
"escapereplace"
and decode error handlers with the names:
"strict", "ignore", "replace"
The patch includes the other change to the C API described in
Solution 4. The parameters
const Py_UNICODE *p, int size
have been replaced by
PyObject *p
so that all functions get the Unicode object directly
as a PyObject * and pass this directly along to the error
callback.
For further details see the patch on SourceForge.
References
[1] Python patch #432401 "unicode encoding error callbacks"
    http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=432401
[2] Previous discussions on this topic on the I18N-SIG mailing list
    http://mail.python.org/pipermail/i18n-sig/2000-December/000587.html
    http://mail.python.org/pipermail/i18n-sig/2001-June/001173.html
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: