[Python-Dev] PEP 293, Codec Error Handling Callbacks

06 Aug 2002 10:25:34 +0200

Oren Tirosh <oren-py-d@hishome.net> writes:

> > If you look at the patch, you see that it precisely does what you
> > propose to do: add a callback to the charmap codec:
> > 
> > - it deletes charmap_decoding_error
> > - it adds state to feed the callback function
> > - it replaces the old call to charmap_decoding_error by
> 
> But it's NOT an error. It's new encoding functionality.  

What is not an error? The handling? Certainly: the error and the error
handler are different things; error handlers are not errors. "ignore"
and "replace" are not errors, either, they are also new encoding
functionality. That is the very nature of handlers: they add
functionality.
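
To make the point concrete, here is a minimal sketch (written in current Python 3 syntax, which is an assumption relative to the interpreter discussed in this thread) of how the built-in handlers change what the codec produces. The sample string is made up for illustration.

```python
text = "na\u00efve caf\u00e9"  # hypothetical sample with two non-ASCII characters

# With the default "strict" handler this encode call would raise.
# The other handlers are selected by name and alter the output instead.
ignored = text.encode("ascii", "ignore")    # drops unencodable characters
replaced = text.encode("ascii", "replace")  # substitutes "?" for each one

print(ignored)
print(replaced)
```

Neither result is an error; each handler simply defines different output for the same input.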

> The real problem was some missing functionality in codecs. Here are two 
> approaches to solve the problem:
> 
> 1. Add the missing functionality.

That is not feasible, since you want that functionality also for
codecs you haven't heard of.

> 2. Keep the old, limited functionality, let it fail, catch the error,
> re-use an argument originally intended for an error handling strategy to 
> shoehorn a callback that can implement the missing functionality, add a new 
> name-based registry to overcome the fact that the argument must be a string.

That is possible, but inefficient. It is also the approach that people
use today, and the reason for this PEP to exist. The current
UnicodeError does not report any detail on the state that the codec
was in.
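
As a rough sketch of what the callback approach looks like in a Python that implements the PEP (Python 3 syntax assumed), the handler below receives an exception object that does carry the codec state: the input object and the start and end of the offending span. The handler name "demo.hexescape" and the function hex_escape are hypothetical, chosen only for this example.

```python
import codecs

def hex_escape(exc):
    # Hypothetical handler: only deals with encoding errors.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object is the full input; exc.start/exc.end delimit the bad span.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume.
    return replacement, exc.end

# Handlers are looked up by name, so the errors argument stays a string.
codecs.register_error("demo.hexescape", hex_escape)

encoded = "caf\u00e9 \u20ac5".encode("ascii", "demo.hexescape")
print(encoded)
```

The codec keeps running after each callback, so the string is traversed once rather than re-encoded after every failure.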

> Since this approach is conceptually stuck on treating it as an error it 
> actually creates and discards a new exception object for each character 
> converted via this path.

It's worse: If you find that the entire string cannot be encoded, you
typically have two choices:
- you perform a binary search. That may cause log n exceptions.
- you encode every character on its own. That reduces the number of
  exceptions to the number of unencodable characters, but it will also
  mean that the output is wrong for some encodings: You will always
  get the shift-in/shift-out sequences that your encoding may specify.
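
The second workaround can be sketched as follows (Python 3 syntax assumed). The helper encode_per_character and its fallback parameter are hypothetical names; iso2022_jp is used only as one example of a stateful encoding that emits escape sequences.

```python
def encode_per_character(text, encoding, fallback=b"?"):
    """Encode one character at a time, substituting fallback on failure."""
    pieces = []
    for ch in text:
        try:
            pieces.append(ch.encode(encoding))
        except UnicodeError:
            # One exception per unencodable character.
            pieces.append(fallback)
    return b"".join(pieces)

text = "\u3042\u3044"  # two hiragana characters

whole = text.encode("iso2022_jp")
piecewise = encode_per_character(text, "iso2022_jp")

# Each separately encoded character is wrapped in its own escape
# sequences, so the piecewise result is longer than the whole-string one.
print(len(whole), len(piecewise))
```

Both byte strings decode to the same text here, but the piecewise one carries a redundant pair of escape sequences for every character.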

On decoding, this is worse: feeding a byte at a time may fail
altogether if you happen to break a multibyte character, whereas
feeding the entire string would happily consume long sequences of
characters and only run into a single problem byte.
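
A small sketch of the decoding case, assuming Python 3 syntax and UTF-8 as the multibyte encoding; the input bytes are made up for illustration.

```python
# Valid UTF-8 (including one two-byte character) followed by a stray byte.
data = "h\u00e9llo".encode("utf-8") + b"\xff"

# Decoding the whole buffer runs into exactly one problem byte.
decoded = data.decode("utf-8", "replace")
whole_problems = decoded.count("\ufffd")

# Decoding one byte at a time also fails on both halves of the
# two-byte character, although those bytes are perfectly valid together.
failures = []
for i in range(len(data)):
    try:
        data[i:i + 1].decode("utf-8")
    except UnicodeDecodeError:
        failures.append(i)

print(whole_problems, failures)
```

The per-byte loop reports three failing positions where the whole-buffer decode sees only one.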

Regards,
Martin