[Python-Dev] Proposed PEP: Codec error handling callbacks
Walter Dörwald
walter@livinglogic.de
Thu, 04 Oct 2001 20:22:01 +0200
Here's a first try at a PEP for making Python's codec error
handling customizable. This has been briefly discussed before
on the I18N-SIG mailing list.
A sample implementation of the proposed feature is available
as a SourceForge patch.
Bye,
Walter Dörwald
---------------------------------------------------------------------------
PEP: ???
Title: Codec error handling callbacks
Version: $Revision$
Last-Modified: $Date$
Author: walter@livinglogic.de (Walter Dörwald)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: ??-???-2001
Post-History: 04-Oct-2001
Abstract
This PEP aims at extending Python's fixed codec error handling
schemes with a more flexible callback-based approach.
Python currently only supports a fixed set of error handling
schemes for its codecs. This PEP describes a mechanism which
allows Python to
use function callbacks as error handlers. With these more flexible
error handlers it is possible to add new functionality to existing
codecs by e.g. providing fallback solutions or different encodings
for cases where the standard codec mapping does not apply.
Problem
A typical case is where a character encoding does not support
the full range of Unicode characters. For these cases many high
level protocols support a way of escaping a Unicode character
(e.g. Python itself supports the \x, \u and \U convention, XML
supports character references via &#xxxx; etc.).
When implementing such an encoding algorithm, a problem with the
current implementation of the encode method of Unicode objects
becomes apparent: For determining which characters are unencodable
by a certain encoding, every single character has to be tried,
because encode does not provide any information about the location
of the error(s), so
# (1)
us = u"xxx"
s = us.encode(encoding)
has to be replaced by
# (2)
us = u"xxx"
v = []
for c in us:
    try:
        v.append(c.encode(encoding))
    except UnicodeError:
        v.append("&#%d;" % ord(c))
s = "".join(v)
This slows down encoding dramatically, as the loop through the
string is now done in Python code and no longer in C code.
Furthermore, this solution poses problems with stateful encodings.
For example, UTF-16 uses a Byte Order Mark at the start of the
encoded byte string to specify the byte order. Using (2) with
UTF-16 results in an 8-bit string with a BOM between every
character.
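For illustration (this snippet is not part of the proposal; the
byte values shown assume a little-endian machine):
u"ab".encode("utf-16")
# -> '\xff\xfea\x00b\x00'          (a single BOM at the start)
"".join([c.encode("utf-16") for c in u"ab"])
# -> '\xff\xfea\x00\xff\xfeb\x00'  (a BOM before every character)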
To work around this problem, a stream writer - which keeps
state between calls to the encoding function - has to be used:
# (3)
us = u"xxx"
import codecs, cStringIO as StringIO
writer = codecs.lookup(encoding)[3]
v = StringIO.StringIO()
uv = writer(v)
for c in us:
    try:
        uv.write(c)
    except UnicodeError:
        uv.write(u"&#%d;" % ord(c))
s = v.getvalue()
To compare the speed of (1) and (3) the following test script
has been used:
# (4)
import time
us = u"äa"*1000000
encoding = "ascii"
import codecs, cStringIO as StringIO
t1 = time.time()
s1 = us.encode(encoding, "replace")
t2 = time.time()
writer = codecs.lookup(encoding)[3]
v = StringIO.StringIO()
uv = writer(v)
for c in us:
    try:
        uv.write(c)
    except UnicodeError:
        uv.write(u"?")
s2 = v.getvalue()
t3 = time.time()
assert s1 == s2
print "1:", t2-t1
print "2:", t3-t2
print "factor:", (t3-t2)/(t2-t1)
On Linux with Python 2.1 this gives the
following output:
1: 0.395456075668
2: 126.909595966
factor: 320.919575586
i.e. (3) is 320 times slower than (1).
Solution 1
Add a position attribute to UnicodeError instances.
When the encoder encounters an unencodable character
it raises an exception that specifies the position in the
Unicode object where the unencodable character occurs.
The application can then reencode the Unicode object up
to the offending character, replace the offending character
with something appropriate and retry encoding with the
rest of the Unicode string until encoding is finished.
A stream writer will write everything up to the offending
character and then raise an exception, so the application
only has to replace the offending character and retry the rest
of the string.
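Application code under this solution might look roughly like the
following sketch (hypothetical, since the position attribute does
not exist in current Python; the attribute name pos is an
assumption):
us = u"xxx"
v = []
pos = 0
while pos < len(us):
    try:
        v.append(us[pos:].encode(encoding))
        break
    except UnicodeError, exc:
        # encode everything up to the offending character,
        # replace it, and retry with the rest of the string
        bad = pos + exc.pos          # hypothetical position attribute
        v.append(us[pos:bad].encode(encoding))
        v.append("&#%d;" % ord(us[bad]))
        pos = bad + 1
s = "".join(v)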
Advantage
Requires only minor changes to all encoders/stream writers.
Disadvantage
As the encoder has to be called multiple times, this
won't work with stateful encodings, so a stream writer
has to be used in all cases.
If unencodable characters occur very often, Solution 1
will probably not be much faster than (3). E.g. for the
string u"a"*10000+u"ä" all the characters but the last one
will have to be encoded twice when using an encoder
(but only once when using a stream writer).
This solution is specific to encoding and can't
be extended to decoding easily.
Solution 2
Add additional error handling names for every needed
replacement scheme (e.g. "xmlcharrefreplace" for "&#%d;"
or "escapereplace" for "\\x%02x" / "\\u%04x" / "\\U%08x")
Advantage
Requires only minor changes to all encoders/stream writers.
As the replacement scheme is implemented directly in
the encoder, this should be the fastest solution.
Disadvantages
The available replacement schemes are hardwired.
Additional replacement schemes require that all
encoders/decoders are changed again. This is especially
problematic for decoding where users might want to
implement various error handlers for handling broken
text files.
Solution 3
The errors argument is no longer a string, but a callback
function: This callback function gets passed the original
Unicode object and the position of the unencodable character
and returns a new Unicode object that should be encoded instead
of the unencodable one. (Note that the encoder *must* be able
to encode what is returned from the handler; if it can't, a
normal UnicodeError will be raised.)
Example code could look like this:
us = u"xxx"
def xmlescape(uni, pos):
    return u"&#%d;" % ord(uni[pos])
s = us.encode(encoding, xmlescape)
Advantages
This makes the error handling completely customizable. The error
handler may choose to raise an exception or do any kind of
replacement required.
The interface between the encoder and the error handler can be
designed in a way that this mechanism can be used for decoding too:
The original 8-bit string is passed to the error handler and the
error handler returns a replacement Unicode object and a
resynchronization position where decoding should continue.
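For example, a decoding error handler following this idea could
look roughly like the following sketch (the two-argument signature
is an assumption; the PEP only outlines the interface):
def replacebroken(s, pos):
    # return a replacement character for the undecodable byte
    # and the position where decoding should resume
    return (u"\ufffd", pos + 1)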
Disadvantages
This solution requires changes to every C function
that has a "const char *errors" argument, e.g.

PyUnicode_EncodeLatin1(
    const Py_UNICODE *p,
    int size,
    const char *errors)

has to be changed to

PyUnicode_EncodeLatin1(
    const Py_UNICODE *p,
    int size,
    PyObject *errors)

(To provide backwards compatibility the PyUnicode_EncodeLatin1
signature remains the same; a new function
PyUnicode_EncodeLatin1Ex is introduced with the new
signature. PyUnicode_EncodeLatin1 simply calls the new function.)
Solution 4
The errors argument is still a string, but this string
is used to look up an error handling callback function
from a callback registry.
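The registry itself could be as simple as a dictionary mapping
names to callables; a minimal sketch (the names register_error
and lookup_error are assumptions for illustration, not part of
this proposal):
# a minimal sketch of such a callback registry
error_handlers = {}

def register_error(name, handler):
    error_handlers[name] = handler

def lookup_error(name):
    return error_handlers[name]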
Advantage
No changes to the Python C API are required. Well-known
error handling schemes could be implemented directly in
the encoder/decoder for maximum speed. Like Solution 3,
this can be done for both encoders and decoders.
Disadvantages
Currently all encoding/decoding functions have arguments
const Py_UNICODE *p, int size
or
const char *p, int size
to specify the Unicode characters/8-bit characters to be
encoded/decoded. In case of an error a new unicode or str
object has to be created from this information and passed
to the error handler. This has to be done either for every
error that occurs, or once on the first error, in which case
the result object must be kept until the end of the function
in case another error occurs. Most functions that call the
codec functions work with unicode/str objects anyway, so they
have to extract the const Py_UNICODE */int arguments and pass
them to the codec functions, which then have to reconstruct
the object in case of an error.
Sample implementation
A sample implementation is available on SourceForge [1]. This
patch implements a combination of solutions 3 and 4, i.e. it
is possible to pass functions *and* registered callback names
to unicode.encode.
The interface between the encoder and the handler has been
extended to be able to support the same interface for encoding
and decoding. The encoder/decoder passes a tuple to the callback
with the following entries:
0: the name of the encoding
1: the original Unicode object/the original 8-bit string
2: the position of the unencodable character/byte
3: a string giving the reason why the character/byte can't
   be encoded/decoded (e.g. "character not in range(128)"
   for encoding or "unexpected end of data", "invalid
   code byte" etc. for decoding)
4: an additional object that can be used to pass state
   information to the callback. All implemented
   encoders/decoders currently pass None for this.
The callback must return a tuple with the following info:
0: a Unicode object which will be encoded instead of the
   offending character (for encoders), or a Unicode object
   that will be used as a replacement for the undecodable
   bytes in the decoded Unicode object (for decoders)
1: a position where the encoding/decoding process should
   continue after processing the replacement string.
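With this interface, an encoding error handler equivalent to the
"xmlcharrefreplace" scheme could look roughly like the following
(a sketch based on the description above; the handler name and
the exact calling details are defined by the patch, not here):
def xmlcharref_handler(info):
    # info is the tuple described above:
    # (encoding, object, position, reason, state)
    encoding, obj, pos, reason, state = info
    # replace the unencodable character with an XML character
    # reference and continue encoding right after it
    return (u"&#%d;" % ord(obj[pos]), pos + 1)

us = u"xxx"
s = us.encode("ascii", xmlcharref_handler)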
The patch includes several preregistered encode error
handling schemes with the following names:
"strict", "ignore", "replace", "xmlcharrefreplace",
"escapereplace"
and decode error handlers with the names:
"strict", "ignore", "replace"
The patch includes the other change to the C API described in
Solution 4. The parameters
const Py_UNICODE *p, int size
have been replaced by
PyObject *p
so that all functions get the Unicode object directly
as a PyObject * and pass this directly along to the error
callback.
For further details see the patch on SourceForge.
References
[1] Python patch #432401 "unicode encoding error callbacks"
    http://sourceforge.net/tracker/?group_id=5470&atid=305470&func=detail&aid=432401
[2] Previous discussions on this topic on the I18N-SIG mailing list
    http://mail.python.org/pipermail/i18n-sig/2000-December/000587.html
    http://mail.python.org/pipermail/i18n-sig/2001-June/001173.html
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: