[Python-Dev] PEP 293, Codec Error Handling Callbacks

Walter Dörwald walter@livinglogic.de
Mon, 12 Aug 2002 12:47:33 +0200


I'm back from vacation. Comments on the thread and a list
of open issues are below.

Guido van Rossum wrote:
 > M.-A. Lemburg wrote:
 > > Walter has written a pretty good test suite for the patch
 > > and I have a good feeling about it. I'd like Walter to check
 > > it into CVS and then see whether the alpha tests bring up any
 > > quirks. The patch only touches the codecs and adds some new
 > > exceptions. There are no other changes involved.
 > >
 > > I think that together with PEP 263 (source code encoding) this
 > > is a great step forward in Python's i18n capabilities.
 > >
 > > BTW, the test script contains some examples of how to put the
 > > error callbacks to use:
 > >
 > > http://sourceforge.net/tracker/download.php?group_id=5470&atid=305470&file_id=27815&aid=432401
 >
 > Sounds like a plan then.

Does this mean we can check in the patch?

Documentation is still missing, and encoding-specific
decoding tests should be added to the test script.

Has anybody except me and Marc-André tried the patch?
On anything other than Linux/Intel? With UCS2 and UCS4?

Martin v. Loewis wrote:
 > If you look at the large blocks of new code, you find that it is in
 >
 > - charmap_encoding_error, which insists on implementing known error
 >   handling algorithms inline,

This is done for performance reasons.

 > - the default error handlers, of which at least
 >   PyCodec_XMLCharRefReplaceErrors should be pure-Python

The PyCodec_XMLCharRefReplaceErrors functionality is
independent of the rest, so moving this to Python
won't reduce complexity that much. And it will
slow down "xmlcharrefreplace" handling for those
codecs that don't implement it inline.
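
For reference, a pure-Python handler of the kind Martin suggests
could look roughly like the sketch below. The handler name
"xmlcharrefreplace_py" and the function name are made up for this
illustration; this is not the code from the patch, only the same
idea written against the callback interface described in the PEP.

```python
import codecs

def xmlcharrefreplace_py(exc):
    # Hypothetical pure-Python counterpart of the C handler.
    # Only encoding errors can be expressed as character references.
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("don't know how to handle %r" % type(exc).__name__)
    unencodable = exc.object[exc.start:exc.end]
    replacement = "".join("&#%d;" % ord(ch) for ch in unencodable)
    # Return the replacement text and the position where encoding resumes.
    return replacement, exc.end

codecs.register_error("xmlcharrefreplace_py", xmlcharrefreplace_py)

text = "a\u00e4\u20ac"
py_result = text.encode("ascii", "xmlcharrefreplace_py")
builtin_result = text.encode("ascii", "xmlcharrefreplace")
print(py_result, builtin_result)
```

Both calls should produce the same bytes; the difference is only
in where the replacement is computed.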

 > - PyCodec_BackslashReplaceErrors, likewise,
 >
 > - the UnicodeError exception methods (which could be omitted, IMO).

Those methods were implemented so that we can easily
move to new style exceptions. The exception attributes
can then be members of the C struct and the accessor functions
can be simple macros.

I guess some of the methods could be removed by moving
duplicate ones to the base class UnicodeError, but
this would break backwards compatibility.

Oren Tirosh wrote:
 > Some of my reservations about PEP 293:
 >
 > It overloads the meaning of the error handling argument in an unintuitive
 > way.  It gets to the point where it's much more than just error handling -
 > it's actually extending the functionality of the codec.
 >
 > Why implement yet another name-based registry?  There must be a simpler way
 > to do it.

The registry is name-based because this is required by the current C API.
Passing the error handler directly as a function object would be
simpler, but this can't be done, as it would require vast changes
to the C API (an old version of the patch did that). And this way
we gain the benefit of being able to implement well-known error
handling names inline.

It is "yet another" registry exactly because encoding and error handling
are completely orthogonal (at least for encoding). If you add a
new error handler all codecs can use it (as long as they are aware
of the new error handling way) and if you define a new codec it will
work with all existing error handlers.
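
To illustrate the orthogonality, here is a minimal sketch that
registers one handler under a name and uses it with several
codecs. The handler name "bracket" and its output format are
invented for this example.

```python
import codecs

def bracket_errors(exc):
    # Hypothetical handler: show each unencodable character as [U+XXXX].
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    unencodable = exc.object[exc.start:exc.end]
    replacement = "".join("[U+%04X]" % ord(ch) for ch in unencodable)
    return replacement, exc.end

# Register once, under a name ...
codecs.register_error("bracket", bracket_errors)

# ... and every codec that supports the callback machinery can use it.
text = "price: 10\u20ac"
results = {}
for encoding in ("ascii", "latin-1", "cp1252"):
    results[encoding] = text.encode(encoding, "bracket")
    print(encoding, results[encoding])
```

The euro sign is encodable in cp1252, so the handler is never
called for that codec; for the other two it supplies the
replacement.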

 > Generating an exception for each character that isn't handled by simple
 > lookup probably adds quite a lot of overhead.

1. All encoders try to collect runs of unencodable characters to
minimize the number of calls to the callback.

2. The PEP explicitly states that the codec is allowed to
reuse the exception object. All codecs do this, so the
exception object will only be created once (at most;
when no error occurs, no exception object will be created).
The exception object is just a quick way to pass information
between the codec and the error handler and it could become
even faster as soon as we get new style exceptions.
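
Point 1 can be observed from Python with a handler that records
how it is called. This is a sketch only: the handler name
"recording" is invented, and the exact grouping of characters
into runs is up to each codec.

```python
import codecs

calls = []

def recording_errors(exc):
    # Hypothetical handler: remember the range reported by the codec
    # and replace every character in that range with "?".
    calls.append((exc.start, exc.end))
    return "?" * (exc.end - exc.start), exc.end

codecs.register_error("recording", recording_errors)

# Three consecutive unencodable characters in the middle of the string.
encoded = "ab\u20ac\u20ac\u20accd".encode("ascii", "recording")
print(encoded)
print(calls)
```

If the codec collects the run, the callback is invoked once for
the whole range instead of once per character.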

 > What are the use cases?  Maybe a simple extension to charmap would be enough
 > for all the practical cases?

Not all codecs are charmap based.



Open issues:

1. For each error handler two Python function objects are created:
One in the registry and a different one in the codecs module. This
means that e.g.
codecs.lookup_error("replace") != codecs.replace_errors

We can fix that by making the name of the Python function object
globally visible, or by changing the codecs init function to do a
lookup and use the result, or simply by removing codecs.replace_errors.
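
A quick way to check whether a given interpreter shows this
mismatch is sketched below. It assumes the module-level names
follow the pattern codecs.<name>_errors; the outcome depends on
which of the fixes above (if any) is in place.

```python
import codecs

same_object = {}
for name in ("strict", "ignore", "replace"):
    registered = codecs.lookup_error(name)        # object in the registry
    exported = getattr(codecs, name + "_errors")  # object in the module
    same_object[name] = registered is exported

print(same_object)
```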

2. Currently charmap encoding uses a safe way of reallocating
string storage, which tests available space on each output. This
slows charmap encoding down a bit. This should probably be changed
back to the old way: Test available space only for output strings
longer than one character.

3. Error reporting logic in the exception attribute setters/getters
may be non-standard. What is the standard way to report errors for
C functions that don't return object pointers?
==0 for error and !=0 for success
or
==0 for success and !=0 for error
PyArg_ParseTuple returns true on success, PyObject_SetAttr returns true
on failure. Which one is the exception and which one the rule?

4. Assigning to an attribute of an exception object does not
change the appropriate entry in the args attribute. Is this
worth changing?
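
A small sketch of what issue 4 looks like from Python, assuming
the exception is constructed with the five arguments (encoding,
object, start, end, reason) described in the PEP:

```python
exc = UnicodeEncodeError("ascii", "abc\u20ac", 3, 4,
                         "ordinal not in range(128)")
args_before = exc.args

# Change two attributes after construction.
exc.start = 0
exc.reason = "changed"

print(exc.start, exc.reason)  # the attributes reflect the assignment
print(exc.args)               # args still holds the constructor values
```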

5. UTF-7 decoding does not yet take full advantage of the machinery:
When an unterminated shift sequence is encountered (e.g. "+xxx")
the faulty byte sequence has already been emitted.

Bye,
    Walter Dörwald