[I18n-sig] Extended error handling for codecs

"Walter Dörwald" walter@livinglogic.de
Fri, 26 Jan 2001 19:55:49 +0100


On 09.01.01 at 09:27 M.-A. Lemburg wrote:

[ I think this was supposed to go to the list ]

> "Walter Dörwald" wrote:
> > 
> > On 04.01.01 at 11:00 M.-A. Lemburg wrote:
> > 
> > > [...]
> > >
> > > How would such a scheme allow passing back control information
> > > such as: continue with the next character in the stream
> > 
> > def ignore(encoding, string, position):
> >         return u""
> > 
> > u"xxx".encode(encoding, 'callback', ignore)
> > 
> > > or break with an exception ?
> > 
> > def raiseAnException(encoding, string, position):
> >         raise FancyException(
> >                 "can't encode character %r at position %d "
> >                 "in string %r with encoding %s"
> >                 % (string[position], position, string, encoding))
> > 
> > u"xxx".encode(encoding, 'callback', raiseAnException)
> 
> Ok. I still think that we need to pass more information from
> and to the callback. How about this scheme (the internal error
> handlers work using a similar scheme):
> 
> def callback(encoding, inputdata, inputposition, 
>              outputdata, outputposition, errors):
>     ...
>     return (inputdata, inputposition, outputdata, outputposition)
> 
> This would give the callback enough information to do just
> about everything with the data in question. After having called
> the callback(), the encoder or decoder would then reinitialize
> itself using the returned data and positions.

Does that mean that the callback can feed replacement input data back
to the encoder? How does the callback tell the encoder to switch
back to the original input after the replacement input is exhausted?
Or does the callback have to construct a complete replacement input
string? As I see it, the callback can't modify the outputdata, because
the output data is already encoded, and the callback knows nothing
about the encoding.

How could an "xml-escape" handler be implemented with that?
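
For comparison, with the simpler three-argument callback from earlier
in the thread an XML escape only needs the offending character. The
following is just a sketch: encode_with_callback is a hypothetical
pure-Python stand-in for u"xxx".encode(encoding, 'callback', handler),
and it assumes a stateless codec that can be driven one character at
a time.

```python
def xml_escape(encoding, string, position):
    # Replace the unencodable character with a numeric character reference.
    return u"&#%d;" % ord(string[position])


def encode_with_callback(string, encoding, handler):
    # Hypothetical stand-in for u"xxx".encode(encoding, 'callback', handler).
    # Assumes a stateless codec that can be driven one character at a time.
    chunks = []
    for position, char in enumerate(string):
        try:
            chunks.append(char.encode(encoding))
        except UnicodeError:
            # The replacement is encoded with the same codec, so the
            # handler never needs to know anything about the encoding.
            replacement = handler(encoding, string, position)
            chunks.append(replacement.encode(encoding))
    return b"".join(chunks)


print(encode_with_callback(u"a\u20acb", "latin-1", xml_escape))
# prints b'a&#8364;b'
```

The handler returns unicode and the codec encodes it, so the same
xml_escape works unchanged for "ascii", "latin-1" or any other target.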

> > > > Looking again at the TR6 mechanism: Even if the error callback was
> > > > called, and even if it had to return bytes instead of unicodes, it
> > > > could still operate stateless: it would just output SQU as often as
> > > > required. I believe that most stateful encodings have a "escape to
> > > > known state" mechanism.
> > >
> > > Which is what I'm talking about all along: the codecs know best
> > > what to do, so better extend them than try to fiddle in some
> > > information using a callback.
> > 
> > The callback is only used in the situation when the codec does
> > not know what to do, i.e. when it encounters an unencodable
> > character. The callback is an *error handler* and not a
> > "I don't know how to implement my own encoding algorithm,
> > please help me"-handler. >;->
> 
> Let's put it this way: the error handler should have at least
> the same possibilities as the current builtin error handlers
> have.

There is a big difference: the generic callback should be able
to work without knowing the encoding. All current builtin error
handlers know the encoding because there's a specific error handler
for every encoding.

> If a codec needs more information to process an error
> condition, e.g. in case it holds internal state (encoder and
> decoder functions may not use external state per design),
> then it's the codec which has to be extended -- the error handler
> won't be able to help.

But the codec knows everything about its own internal state. What
it does not know is what kind of error handling is wanted. This
additional information can't be provided by the codec; it is
provided by the user, who doesn't know anything about the encoding
(e.g. if it is taken from a list of acceptable encodings in an HTTP
Accept-Charset header).
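
To make the Accept-Charset case concrete, here is a rough sketch. The
function name encode_for_client and the policy (prefer an encoding that
needs no escaping, otherwise escape into the first acceptable one) are
my assumptions for illustration. It also relies on the
'xmlcharrefreplace' error handler of current Python versions, which is
not one of the APIs being discussed in this thread.

```python
def encode_for_client(text, acceptable_encodings):
    # Hypothetical helper: the caller only has a list of encoding names
    # and an error policy, and knows nothing else about the encodings.
    for encoding in acceptable_encodings:
        try:
            # Prefer the first encoding that represents the text losslessly.
            return encoding, text.encode(encoding)
        except UnicodeError:
            continue
    # Nothing fits: use the first encoding and escape what it cannot hold.
    encoding = acceptable_encodings[0]
    return encoding, text.encode(encoding, "xmlcharrefreplace")


print(encode_for_client(u"\u20ac", ["latin-1", "ascii"]))
# prints ('latin-1', b'&#8364;')
```

The error policy is chosen by the caller and is the same for every
encoding in the list, which is the point: the codec cannot guess it.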

> Would this be a good compromise ?
>
> > > I don't object to adding callback support to the codec's
> > > error handlers, but we'll need a new set of APIs to allow
> > > this.
> > 
> > I could live with a
> >         u"xxx".encode(encoding, 'callback', handler)
> > on the Python side, but what does this mean for the C API?
> 
> Pretty much the same thing: we'll be adding PyUnicode_EncodeEx()
> and PyUnicode_DecodeEx() APIs which have the additional
> context object as PyObject*.

OK, but what are those objects supposed to know and do?

> > > > So I still think your objection is theoretical, whereas the problem
> > > > that Walter is trying to solve is real.
> > >
> > > I did propose a solution which would satisfy your needs: simply
> > > add a new error treatment 'xml-escape' to the builtin codecs
> > > which then does the needed XML escaping. XML is general enough
> > > to warrant such a step and the required changes are minor.
> > >
> > > Another candidate for a new error treatment would be 'unicode-escape'
> > > which then replaces the character in question with '\uXXXX'.
> > >
> > > For the general case, I'd rather add new PyUnicode_EncodeEx()
> > > and PyUnicode_DecodeEx() APIs which then take a Python
> > > context object as extra argument.
> > 
> > What should this extra argument be for the decoder?
> 
> A PyObject* just like for the encoder. The codec design is kept
> symmetric to simplify support for stackable streams and also
> to simplify the APIs (there aren't all that many API signatures
> to remember).

But the APIs are not really symmetric: there is no easy inverse of
    u"xxx".encode(encoding, "callback", xmlReplacementHandler)
that automatically generates characters from XML character entities.
How would the decoder know when a character entity is encountered?

Encoding errors simply mean that the encoding is not capable of
handling the data to be encoded. The error handling then has to
provide a way of making the unencodable part of the data encodable.
Ideally this should be independent of the encoding.

Decoding errors mean something completely different: The encoded
data does not conform to the format it claims to be in. Fixing
this kind of error requires an intimate knowledge of the encoding
and therefore cannot be encoding independent.
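
A small sketch of the difference. It assumes a Python whose codecs
raise UnicodeEncodeError and UnicodeDecodeError with start and reason
attributes; the two helper names are only illustrative.

```python
def encode_error(text, encoding):
    # Return the offending character, or None if the text is encodable.
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return text[exc.start]
    return None


def decode_error(data, encoding):
    # Return the codec's own explanation, or None if the data is well formed.
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


# Encoding: the answer is a character, whatever the target encoding is.
print(repr(encode_error(u"a\u20ac", "ascii")))
print(repr(encode_error(u"a\u20ac", "latin-1")))

# Decoding: the answer only makes sense in terms of UTF-8 byte sequences.
print(decode_error(b"\xe2\x82", "utf-8"))
```

On the encoding side a generic handler gets everything it needs (the
character and its position). On the decoding side the only useful
description of the problem is specific to the byte format.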

> > > The error treatment string
> > > would then define how to use this context object, e.g. 'callback'
> > > could be made to apply processing similar to what Walter
> > > suggested.
> > 
> > 'callback' seems too generic to me. Maybe there will be other callbacks
> > in the future that are used for different stuff. This is the
> > "give me a replacement or die" error handler.
> 
> The error handling string should provide enough room for
> extensions... what other short string would you recommend ?
> 'handler' or 'callcontext' ?

In theory "replace" would be the correct name, as the error handler
returns a replacement string to be encoded instead of the offending
character, but we could use "replacementhandler" or something like
that.

> [...]

Bye,
   Walter Dörwald

-- 
Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de