[I18n-sig] XML and codecs

M.-A. Lemburg mal@lemburg.com
Tue, 05 Jun 2001 21:46:57 +0200

"Martin v. Loewis" wrote:
> > Should be no problem since the exception will sort of freeze
> > the current state of the codec (provided it's a StreamWriter/Reader)
> > and let you use this state to take appropriate actions.
> What do you mean: "provided it's a StreamReader/Writer". What if I
> invoke the encode method found in codec lookup, and get an exception?

The encoders/decoders returned in the lookup tuple are not
supposed to store state. If you want or need to store state,
then you should use the factory functions (StreamWriter and
StreamReader) to first create an instance which can store state
and then use its .encode()/.decode() methods.
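
A minimal sketch of the two routes described above, written for a
current Python 3 interpreter rather than the 2001-era one discussed in
this thread. The choice of the "iso2022_jp" codec and the sample
character are assumptions made only for illustration.

```python
import codecs
import io

# codecs.lookup() returns the lookup tuple; its first four items are
# (encoder, decoder, StreamReader factory, StreamWriter factory).
info = codecs.lookup("iso2022_jp")

text = "\u3042"  # HIRAGANA LETTER A, needs a non-ASCII shift state

# Route 1: the stateless encoder. Every call is self-contained and
# returns (output, number of input characters consumed).
one_shot, consumed = info.encode(text)
again, _ = info.encode(text)

# Route 2: create an instance through the factory; the instance may
# carry the shift state from one write() to the next.
buffer = io.BytesIO()
writer = info.streamwriter(buffer)
writer.write(text)
writer.write(text)
writer.reset()  # let the writer flush anything it still holds
streamed = buffer.getvalue()
```

Both byte strings decode back to the same text, but only the writer
instance is in a position to remember what it emitted earlier.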

> The exception does not carry the state.

That's not what I meant. If you have created, say, a StreamWriter
object, then this object will store the state, and if its
.encode() method raises a UnicodeBreakError exception you
can use the current state stored in the object to take
some recovery action.
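
The recovery idea can be sketched with a toy writer. Everything here
is hypothetical: ToyShiftWriter, ToyEncodeError and
write_with_xml_escapes are invented names, the error class merely
stands in for the proposed UnicodeBreakError, and the toy knows only
two shift states (ASCII, plus JIS X 0201 katakana under the assumed
escape "\033(I" with half-width katakana U+FF61..U+FF9F mapped to
bytes 0x21..0x5F). It is not a real JIS codec.

```python
import io

ESC_ASCII = b"\x1b(B"     # escape sequence quoted in this thread
ESC_KATAKANA = b"\x1b(I"  # assumption: JIS X 0201 katakana escape


class ToyEncodeError(ValueError):
    """Hypothetical error carrying the position of the bad character."""

    def __init__(self, position):
        super().__init__("cannot encode character at position %d" % position)
        self.position = position


class ToyShiftWriter:
    """Hypothetical stateful writer that remembers the last escape sent."""

    def __init__(self, stream):
        self.stream = stream
        self.shift = ESC_ASCII  # a fresh stream starts in ASCII

    def _emit(self, shift, byte):
        # Send an escape sequence only when the shift state changes.
        if shift != self.shift:
            self.stream.write(shift)
            self.shift = shift
        self.stream.write(bytes([byte]))

    def write(self, text):
        for position, char in enumerate(text):
            code = ord(char)
            if code < 0x80:
                self._emit(ESC_ASCII, code)
            elif 0xFF61 <= code <= 0xFF9F:
                self._emit(ESC_KATAKANA, code - 0xFF40)
            else:
                raise ToyEncodeError(position)


def write_with_xml_escapes(writer, text):
    """Recover from errors by writing an XML character reference."""
    while text:
        try:
            writer.write(text)
            return
        except ToyEncodeError as exc:
            # Characters before exc.position are already in the stream,
            # and the writer still knows its shift state, so the ASCII
            # replacement gets the escape sequence it needs.
            writer.write("&#%d;" % ord(text[exc.position]))
            text = text[exc.position + 1:]


buffer = io.BytesIO()
writer = ToyShiftWriter(buffer)
write_with_xml_escapes(writer, "\uff71\u26aab")
result = buffer.getvalue()
```

The caller never touches escape sequences itself; it relies on the
state held by the writer object to get back to ASCII before the
character reference is written.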

> Suppose you encode into JIS X
> 0201.  That has four shift states:
>     "\033(B": US_ASCII,
>     "\033(J": JISX0201_1976,
>     "\033$@": JISX0208_1978,
>     "\033$B": JISX0208_1983,
> }
> Depending on which of the escape codes you've emitted last, the
> following bytes will have different meanings.
> Now, suppose we encode a string that cannot be translated to JIS
> X0201.  The codec will raise an exception, telling us how many bytes
> it has encoded. Now, suppose we want to replace this character with
> the string "&#9898;". If we are in the US_ASCII shift state, we can
> immediately encode it. If we are in a different shift state, we must
> issue the control sequence first.
> When the codec does not preserve state, it cannot correctly encode the
> entire string, since concatenating the results of encode() invocations
> might be incorrect.
> If you don't believe me, tell me how I can use your proposed interface
> to encode a Unicode string into JIS X 0201 + XML escapes, using the
> encode/decode functions only.
> > Not sure what you mean here, but the encoder and decoder
> > returned by codecs.lookup() must not maintain state. This
> > property is reserved for StreamWriters and Readers (see the
> > Unicode docs).
> You mean the sentence that says
> # The functions/methods are expected to work in a stateless mode.
> What is "expected to work"? Who expects they work in stateless mode,
> and why? And what happens if they don't?
> It also says
> # These must be functions or methods which have the same interface as
> # the encode()/decode() methods of Codec instances (see Codec
> # Interface).
> So surely, the result of codecs.lookup may be a method. If it is a
> method, it surely must be a bound method (or else, where does the self
> argument come from?) Since bound methods are allowed, the encode/decode
> functions *may* preserve state: A bound method always references state
> in form of the object it is bound to.
> So I think the sentence in the documentation saying "expected to work"
> is an error.

This is by design and not a mistake.

If the encoders/decoders (the first two items in the
lookup tuple) stored state, you would have serious problems
when reusing them for different inputs. I'm not even talking about
threading problems here.
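
A small sketch of the reuse problem, using a deliberately broken toy.
SharedStatefulEncoder is a hypothetical class, and the katakana escape
"\033(I" and its byte mapping are assumptions chosen for illustration
only.

```python
ESC_ASCII = b"\x1b(B"     # escape sequence quoted in this thread
ESC_KATAKANA = b"\x1b(I"  # assumption: JIS X 0201 katakana escape


class SharedStatefulEncoder:
    """Hypothetical encoder that wrongly keeps state between calls."""

    def __init__(self):
        self.shift = ESC_ASCII

    def encode(self, text):
        out = bytearray()
        for char in text:
            code = ord(char)
            if code < 0x80:
                wanted, byte = ESC_ASCII, code
            elif 0xFF61 <= code <= 0xFF9F:
                wanted, byte = ESC_KATAKANA, code - 0xFF40
            else:
                raise ValueError("no mapping in the toy table")
            if wanted != self.shift:
                out += wanted
                self.shift = wanted  # state leaks into the next call
            out.append(byte)
        return bytes(out)


shared = SharedStatefulEncoder()
first_document = shared.encode("\uff71")   # escape sequence + b"1"
second_document = shared.encode("\uff71")  # only b"1", escape is missing
```

The second result is meant for an unrelated output, yet it lacks the
escape sequence, so a reader starting in ASCII would see the digit "1"
instead of the katakana character.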

The other two entries were designed to provide stateful codec
interfaces, so your JIS codec would have to use those in order
to store shift states etc. or do more complex work on the data.

The encoder/decoder functions should only provide very basic
encoding/decoding facilities which do not require keeping
state (e.g. they might have additional keyword arguments to
parameterize them to work in different shift states).
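
One way to read the keyword-argument idea is sketched below. The
function toy_encode and its shift keyword are hypothetical and not
part of any real codec; the katakana escape "\033(I" and its byte
mapping are again assumptions for illustration.

```python
ESC_ASCII = b"\x1b(B"     # escape sequence quoted in this thread
ESC_KATAKANA = b"\x1b(I"  # assumption: JIS X 0201 katakana escape


def toy_encode(text, *, shift=ESC_ASCII):
    """Stateless toy encoder: the starting shift state is an argument.

    Returns (output, number of characters consumed), like the encoder
    in the lookup tuple. Nothing is remembered between calls.
    """
    out = bytearray()
    current = shift
    for position, char in enumerate(text):
        code = ord(char)
        if code < 0x80:
            wanted, byte = ESC_ASCII, code
        elif 0xFF61 <= code <= 0xFF9F:
            wanted, byte = ESC_KATAKANA, code - 0xFF40
        else:
            raise UnicodeEncodeError(
                "toy-jis", text, position, position + 1,
                "no mapping in the toy table")
        if wanted != current:
            out += wanted
            current = wanted
        out.append(byte)
    return bytes(out), len(text)


# The same input gives different output depending on the state the
# caller says the stream is in.
from_ascii = toy_encode("a")
from_katakana = toy_encode("a", shift=ESC_KATAKANA)
```

Because the state travels in the arguments, the function can be shared
freely; keeping track of the state between calls is left to the
caller.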

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/