[I18n-sig] Proposal: Extended error handling for unicode.encode

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sun, 7 Jan 2001 11:09:53 +0100


[need for codecs to go direct from one native encoding to another]
> I spent a year of my life on a very complex i18n project,
> corresponded with Ken Lunde and many other developers in the field,
> and got the same feedback from the developers at Digital Garage in
> Tokyo, who deal with this every day.

I then have to accept that this really happens in practice, although
I certainly hope that the cases where such direct conversion is
necessary become more and more rare.

Can you elaborate a bit on what the problem was in this complex
project? I.e. which were the encodings A and B that you needed direct
conversion for? Why couldn't you go through Unicode? If the reason was
that you could not have "correctly" recoded a certain subset of the
characters, then which characters would have suffered?

> The key requirements I had were that (a) the API should not be
> limited to Unicode <--> 8-bit, and 

I believe that requirement is not completely met. If you want to get
from A to B, and both A and B are byte-oriented encodings, then the
API offers

   b = a.encode("AtoB")

First, you need a codec name that describes both source and target
encoding; for the Unicode codecs, you only need one encoding in the
codec name.

However, that API does not work: the encode method of a byte string
assumes that the string is in the system encoding. It first tries to
decode the string into a Unicode object, then treats the codec name
as one going from Unicode to the target encoding.
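
To illustrate (assuming the default system encoding is ASCII, as it
is out of the box), the implicit decoding step fails as soon as the
source string contains a non-ASCII byte:

  a = '\xe4bc'             # byte string in some 8-bit encoding A
  b = a.encode('latin-1')  # raises UnicodeError: 0xe4 is not ASCII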

So instead, you have to write

  import codecs
  enc, dec, _, _ = codecs.lookup("AtoB")  # (encoder, decoder, reader, writer)
  b, _ = enc(a)                           # encoder returns (output, length)

That assumes that you have first registered your codec:

  import AtoB, codecs
  codecs.register(AtoB.lookup)
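
For completeness, here is a rough sketch of what such an AtoB module
might look like; the module and the transform_* helpers are
hypothetical, only the encoder and search-function signatures follow
the codec registry protocol:

  # Hypothetical byte-to-byte codec module AtoB (sketch).
  def encode(input, errors='strict'):
      output = transform_a_to_b(input)  # placeholder for the real mapping
      return (output, len(input))       # protocol: (output, length consumed)

  def decode(input, errors='strict'):
      output = transform_b_to_a(input)  # placeholder for the inverse
      return (output, len(input))

  def lookup(name):
      # Search function for codecs.register(); lookup() hands us the
      # codec name normalized to lowercase. No stream classes in this
      # sketch, hence the two None entries.
      if name == 'atob':
          return (encode, decode, None, None)
      return None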

In this case, it would be easier *not* to use the framework:

  import AtoB
  b = AtoB.encode(a)

> (b) you should be able to extend codec mappings and algorithms
> without needing a C compiler every time.

I don't know what you mean by "extend codec mappings". If you want to
register codecs written in Python and use them from C, that works
very well.

If you want to enhance an existing codec to support additional
characters, or to partially replace the output of an existing codec -
well, that is certainly not available, and it is the matter of the
current debate: it is currently not possible to extend an existing
codec so that it produces &#x4567; when U+4567 is not supported in
the target encoding.
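
As a point of comparison, the desired behaviour can be approximated
today in pure Python by encoding character by character, at
considerable cost; the function name is hypothetical:

  def encode_with_charrefs(u, encoding):
      # Substitute an XML character reference for every character
      # the target codec cannot represent.
      chunks = []
      for ch in u:
          try:
              chunks.append(ch.encode(encoding))
          except UnicodeError:
              chunks.append("&#x%X;" % ord(ch))
      return "".join(chunks)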

> I can provide lots of use cases if needed but they are hard to
> follow if you don't know a little Japanese.

Please assume I know a little Japanese, and present a single use
case. Since that would be mainly to satisfy my curiosity: don't
bother if it would take a longer essay.

> (2) there was much interest in the Java concept of 'stackable
> streams' and stream conversion tools.  The general case is
> clearly a stream of bytes, and Unicode strings are one 
> case of these.  Several of us also felt that with the right
> little state machine in the codec package, you could do very 
> powerful things in different spheres like compression, binary 
> encodings like base 64/85/whatever.  
> 
> Guido played a large part in the discussions and, I believe he
> fully understood and echoed the design goal you question
> at the top.

Indeed, that's what I question. Stackable things always look like a
good idea on paper, so people can easily be talked into approving
them. I'm not quite clear why the plain file API doesn't already give
you stackable streams; in fact, gzip.GzipFile is a demonstration that
such stacking is really possible.
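
For example, a gzip stream stacks on top of any object with a file
interface, here a StringIO buffer; no codec machinery is involved:

  import gzip, StringIO

  buf = StringIO.StringIO()
  f = gzip.GzipFile(fileobj=buf, mode='wb')  # stacked on the buffer
  f.write('some data')
  f.close()                                  # compressed bytes are in buf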

The question is whether anybody currently *has* written codecs that
don't deal with strings, yet use the codec interfaces. My claim is
that you never want to 'stack' more than one stream on top of
another; in that case, people are happy with whatever stacking API
the codec offers.

My concern is not so much the existence of the API, but that it is
taken as a rationale for preventing improvements to the usability of
the Unicode library.

Regards,
Martin