[I18n-sig] Proposal: Extended error handling for unicode.encode

Andy Robinson andy@reportlab.com
Sat, 6 Jan 2001 23:26:45 -0000

>> The codec design is supposed to cover the general case of
>> encoding/decoding arbitrary data from and to arbitrary formats.
> Where is it documented as such? I believe it is wishful thinking to
> assume they cover some general case, although I have to acknowledge
> that *your* wish is more relevant than other people's wishes.
>> Please don't try to break everything down to Unicode<->8-bit
>> codecs. The design should be able to cover conversion between
>> image formats, audio formats, compression schemes and other
>> encodings just as well as between different text formats.

> Is there any precedent that it is actually useful for 
> anything else?

I'm trying to catch up on this thread after a long absence.
I have not been able to do any i18n work this year and  
cannot give any opinions on the error handling
details, but I must comment on these paragraphs.

There was a great deal of discussion about keeping the codec
mechanism general-purpose on the python-dev list when the unicode
proposal was first put together.  This came from two directions:

(1) I argued long and hard then that i18n is not just Unicode; there
are many legacy problems where you want to be able to write
codecs to go direct from one native encoding to another without
going through Unicode.  Such codecs are never needed for
perfectly encoded data, but the need is pressing when you have
to deal with and clean up large amounts of misencoded data,
user-defined characters, etc. I spent a year of my life on
a very complex i18n project, corresponded with Ken Lunde 
and many other developers in the field, and got the same feedback
from the developers at Digital Garage in Tokyo, who deal with this
every day.  

The key requirements I had were that (a) the API should not be
limited to Unicode <--> 8-bit, and (b) you should be able to
extend codec mappings and algorithms without needing a C compiler
every time.  I can provide lots of use cases if needed but they
are hard to follow if you don't know a little Japanese.
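
To illustrate requirement (b), here is a minimal sketch of a codec
defined entirely in Python, with no C compiler involved.  It assumes
the `codecs` module as shipped with a modern Python 3 rather than the
API under discussion in 2001.  The codec name `latin1_udc` and the
remapping of byte 0x80 to a private-use code point are hypothetical,
chosen only to stand in for a vendor-specific user-defined character.

```python
import codecs

# Hypothetical mapping: start from Latin-1 and remap one byte to a
# user-defined character (a private-use code point).
_table = [chr(i) for i in range(256)]
_table[0x80] = "\ue000"  # assumption: stands in for a vendor UDC
DECODING_TABLE = "".join(_table)
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def _encode(text, errors="strict"):
    # Returns (bytes, number of characters consumed).
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)


def _decode(data, errors="strict"):
    # Returns (str, number of bytes consumed).
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def _search(name):
    # Search function consulted by the codec registry on lookup.
    if name == "latin1_udc":
        return codecs.CodecInfo(name="latin1_udc",
                                encode=_encode, decode=_decode)
    return None


codecs.register(_search)

encoded = "\ue000abc".encode("latin1_udc")
decoded = encoded.decode("latin1_udc")
```

Extending the mapping is then a matter of editing the table and
re-importing the module.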

(2) There was much interest in the Java concept of 'stackable
streams' and stream conversion tools.  The general case is
clearly a stream of bytes, and Unicode strings are one
special case of this.  Several of us also felt that with the right
little state machine in the codec package, you could do very
powerful things in different spheres, such as compression and
binary encodings like base 64/85/whatever.
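
The stacking idea can be sketched roughly as follows, assuming a
modern Python 3 standard library rather than the 2001 design.  A
text codec writer is layered on a gzip stream, which is layered on
an in-memory byte stream; the second half chains two bytes-to-bytes
codecs.  The sample text and payload are made up for illustration.

```python
import codecs
import gzip
import io

text = "日本語のテキスト"  # sample data, not from the original post

# Stack: Shift-JIS writer -> gzip compressor -> in-memory byte stream.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
    writer = codecs.getwriter("shift_jis")(gz)
    writer.write(text)

# Unwind the same stack in reverse to read the text back.
with gzip.GzipFile(fileobj=io.BytesIO(raw.getvalue()), mode="rb") as gz:
    reader = codecs.getreader("shift_jis")(gz)
    recovered = reader.read()

# Bytes-to-bytes codecs: compress, then base64-encode, and back again.
payload = b"any binary payload" * 4
packed = codecs.encode(codecs.encode(payload, "zlib_codec"), "base64_codec")
unpacked = codecs.decode(codecs.decode(packed, "base64_codec"), "zlib_codec")
```

Each layer only sees the stream beneath it, so the order of the
layers can be changed without touching the codecs themselves.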

Guido played a large part in the discussions and, I believe, he
fully understood and echoed the design goal you question
at the top.

Since then, Marc-Andre has done a fantastic amount of largely
unpaid work, but I have not been able to follow up with the 
work I wanted to do on Asian codecs.  If I had, you'd have 
plenty of use cases for keeping things general purpose.  I 
am, however, confident that whenever we get around to building
the right codec package (which depends a lot on when ReportLab
gets its first Asian customers), people in the field will
see that Python's i18n support is way ahead of that of Java.


Andy Robinson
(still flat out keeping a startup going and failing to do
my duties as sig moderator, sadly)