[I18n-sig] Proposal: Extended error handling for unicode.encode

M.-A. Lemburg mal@lemburg.com
Fri, 05 Jan 2001 10:54:07 +0100

"Martin v. Loewis" wrote:
> > Sure, exceptions are not much of a problem, but how would the
> > callback tell the encoder/decoder to e.g. skip forward 2 bytes or
> > perhaps backward 10 bytes ?
> First, I'd like to point out that encoding and decoding is *not*
> symmetric with regards to error handling, so there is *no* need to
> make the interfaces appear symmetric; it is rather unfortunate that
> Python 2 gives this impression.
>
> The reason for the difference is that converting from some encoding to
> Unicode never fails for virtually all encodings because of missing
> characters in Unicode - Unicode is supposed to support almost
> everything, and code sets that cannot completely map into Unicode
> probably need special attention anyway (normally, by producing a
> non-reversible mapping). So the callback is not needed at all for
> decoding.
>
> For encoding, my claim is that error callbacks never want to skip
> forward 2 bytes. If anything, then go forward two characters - but I
> can't even imagine a scenario where that would be needed. Don't try to
> design an API that nobody will ever use.
>
> Walter has demonstrated how to implement the "skip the current
> character" case: by returning u"" from the callback.

The codec design is supposed to cover the general case of
encoding/decoding arbitrary data from and to arbitrary formats.

Please don't try to break everything down to Unicode<->8-bit
codecs. The design should be able to cover conversion between
image formats, audio formats, compression schemes and other
encodings just as well as between different text formats.

I agree that the case for Unicode codecs allows some simplification
to the codec API design, but restricting it to this range of
application only would cause us much trouble in the years
to come when other codec applications start to appear in the
Python universe.

Other applications do have a need to jump back and forth in
the data stream, e.g. say you want to decode a corrupt image
file or a truncated MP3 file.
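
To make the "jump in the data stream" case concrete, here is a rough
sketch of a decoder whose error callback returns the position at which
decoding should resume. Everything in it is hypothetical: the frame
format (a sync byte, a length byte, then the payload), the names
decode_frames and resync, and the callback signature are made up for
illustration and are not part of the codec API. It assumes Python 3
bytes semantics.

```python
# Hypothetical sketch: none of these names exist in the codec API.
SYNC = b"\xff"  # assumed frame marker of a made-up record format


def resync(data, pos):
    """Error callback: return the position at which decoding should resume.

    Looks ahead for the next SYNC marker; returning len(data) ends decoding.
    """
    nxt = data.find(SYNC, pos + 1)
    return len(data) if nxt == -1 else nxt


def decode_frames(data, on_error=None):
    """Decode frames of the form: SYNC, one length byte, payload."""
    frames, pos = [], 0
    while pos < len(data):
        ok = data[pos:pos + 1] == SYNC and pos + 1 < len(data)
        if ok:
            length = data[pos + 1]
            payload = data[pos + 2:pos + 2 + length]
            ok = len(payload) == length  # False for a truncated frame
        if ok:
            frames.append(payload)
            pos += 2 + length
        elif on_error is None:
            raise ValueError("corrupt data at offset %d" % pos)
        else:
            pos = on_error(data, pos)  # the callback decides where to continue
    return frames


# One good frame, some garbage, another good frame, then a truncated frame.
stream = SYNC + b"\x02hi" + b"garbage" + SYNC + b"\x03abc" + SYNC + b"\x09cut"
print(decode_frames(stream, on_error=resync))
```

The point of the sketch is only that the callback needs access to the
whole input and a way to report a new position; a callback that can
merely return replacement data could not express this.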

> > What if the callback would have to scan the stream from the
> > beginning to find out where to continue or look ahead a few hundred
> > bytes to find the next valid encodable sequence ?
> What would be the specific encoding, and what would be the specific
> error handling algorithm that would require such a service?

See above.

> > Again, you should keep in mind that the scheme has to work
> > for all encoding/decoding work, not only conversion from and
> > to Unicode.
> Why is that? That sounds like gross overgeneralization to me.
> Specifically, do you know anybody using that framework for anything
> but Unicode conversion? If so, who is that, and what is the specific
> application?

I am planning to add compression codecs based on zlib and
possibly cryptographic codecs which can then be used together
with stackable streams to provide seamless compression and/or
encryption to applications which otherwise do not provide this
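
As a minimal sketch of what "stackable" could mean here: a writer
object that wraps any other stream and compresses whatever passes
through it. The class name CompressingWriter and its interface are
assumptions for illustration, not the planned codec; only the zlib and
io calls are real library APIs. It assumes Python 3.

```python
import io
import zlib


class CompressingWriter:
    """Hypothetical stream wrapper: compresses everything written to it."""

    def __init__(self, stream):
        self.stream = stream  # any object with a write() method
        self._compressor = zlib.compressobj()

    def write(self, data):
        # compress() may buffer internally and return b"" for small inputs.
        self.stream.write(self._compressor.compress(data))

    def close(self):
        # Flush pending compressed data; the underlying stream stays open.
        self.stream.write(self._compressor.flush())


raw = io.BytesIO()
writer = CompressingWriter(raw)
writer.write(b"hello ")
writer.write(b"world")
writer.close()

# Round trip: the wrapped stream now holds a complete zlib stream.
assert zlib.decompress(raw.getvalue()) == b"hello world"
```

Because the wrapper only relies on write(), it can sit on top of a
file, a socket wrapper, or another such writer, which is the stacking
idea.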

> > If we were to provide a callback as optional method to
> > StreamReaders/Writers, the task could be done either statically
> > by subclassing the existing codec StreamReaders/Writers or
> > dynamically by asking the codec registry to return the StreamReader/
> > Writer classes.
> So how would the implementation of charmap_encode invoke this method?
> It currently doesn't even get hold of the codec object.

Through the extended API I proposed earlier on: the extra context
object would allow providing a callback mechanism. Alternatively,
the StreamReader/Writer classes could use their own specific
C coding functions.

> > Another option would be 'copy' which tries to simply copy input
> > to output in case this is reasonably possible given the encoding
> > (e.g. Unicode -> 8-bit encoding would copy all 8-bit Unicode chars as
> > is in case no mapping is defined). An option 'raise' could also
> > be valuable in conjunction with an exception context object to have
> > the codec raise customized exceptions. Provided the context
> > object points to another encoder/decoder, an option 'fallback'
> > could be used to tell the codec to pass the failing input data
> > to the alternate encoder/decoder in order to have it converted.
> > Etc. etc.
> >
> > There are many things one could do with the error string.
> I guess my question is different: Do you consider the error string to
> be of a well-defined finite enumerated set of possible values, or is
> it your view that it is up to the codec what error strings to accept?

Exactly. There is a set of error strings which the codec
must accept, but it is free to implement other schemes
as well.
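
For illustration, assuming the required set is the three values
'strict', 'ignore' and 'replace' that the standard codecs understand,
the behaviour looks roughly like this (written in Python 3 syntax; the
sample string is made up):

```python
text = "caf\xe9"  # the final character has no ASCII equivalent

# 'ignore' drops the character that cannot be encoded.
assert text.encode("ascii", "ignore") == b"caf"

# 'replace' substitutes a replacement character, '?' when encoding.
assert text.encode("ascii", "replace") == b"caf?"

# 'strict' is the default and raises a UnicodeError subclass.
try:
    text.encode("ascii", "strict")
except UnicodeError as exc:
    print("strict:", exc)
```

Anything beyond these values is up to the individual codec.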

> If so, why would they have to be strings?

I chose strings to simplify the implementation. Back when the
design was discussed, we figured that the codec should take
care of the error handling. Python's codec design is one of
the few which does allow setting error handling behaviour --
other implementations tend to simply raise an exception and leave
the user in the dark.

It's too late to *change* the design. We can only extend it.

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/