[Python-Dev] Some thoughts on the codecs...

Andy Robinson andy@robanal.demon.co.uk
Mon, 15 Nov 1999 07:30:45 -0800 (PST)


Some thoughts on the codecs...

1. Stream interface
At the moment a codec has dump and load methods which
read a (slice of a) stream into a string in memory and
vice versa.  As the proposal notes, this could lead to
errors if you take a slice out of a stream.   This is
not just due to character truncation; some Asian
encodings are modal and have shift-in and shift-out
sequences as they move from Western single-byte
characters to double-byte ones.   It also seems a bit
pointless to me as the source (or target) is still a
Unicode string in memory.

This is a real problem - a filter to convert big files
between two encodings should be possible without
knowledge of the particular encoding, as should one on
the input/output of some server.  We can still give a
default implementation for single-byte encodings.

What's a good API for real stream conversion?  Just
Codec.encodeStream(infile, outfile)?  Or is it more
useful to feed the codec with data a chunk at a time?
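
For the sake of argument, here is roughly what I mean
by the chunked version; the codec arguments and their
decode/encode methods are placeholder names, not an
existing API:

# Rough sketch only - convert a big file between two
# encodings a chunk at a time, never holding the whole
# thing in memory.
CHUNK = 8192

def convert_stream(infile, outfile, in_codec, out_codec):
    while 1:
        data = infile.read(CHUNK)
        if not data:
            break
        # decode to Unicode, then encode to the target;
        # a real codec must carry state across chunks so
        # that a multi-byte sequence (or a shift state)
        # split at a chunk boundary survives the cut.
        outfile.write(out_codec.encode(in_codec.decode(data)))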


2. Data driven codecs
I really like codecs being objects, and believe we
could build support for a lot more encodings, a lot
sooner than is otherwise possible, by making them data
driven rather than making each one compiled C code with
static mapping tables.  What do people think about the
approach below?

First of all, the ISO 8859 series are straight
single-byte mappings to Unicode code points.  So one
Python script could parse the published mapping files
and build the mapping tables, and a very small data
file could hold these encodings.  A compiled helper
function analogous to string.translate() could deal
with most of them.
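
To make this concrete: the mapping files the Unicode
Consortium publishes are essentially two columns of
hex numbers plus a comment, so the parsing script is
nearly trivial.  The sketch below uses the u''/unichr()
spelling from the proposal, so treat it as pseudo-code
for now:

# Rough sketch: parse an 8859-x mapping file (two hex
# columns per line, '#' starts a comment) into a
# 256-entry table of code points.
def load_mapping(path):
    table = [None] * 256        # None = byte undefined
    for line in open(path).readlines():
        line = line.split('#')[0].strip()
        if not line:
            continue
        fields = line.split()
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table

# The compiled helper would do this loop in C, much as
# string.translate() does for 8-bit targets.
def decode_single_byte(data, table):
    return u''.join([unichr(table[ord(c)]) for c in data])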

Secondly, the double-byte ones involve a mixture of
algorithms and data.  The worst cases I know are modal
encodings which need a single-byte lookup table, a
double-byte lookup table, and have some very simple
rules about escape sequences in between them.  A
simple state machine could still handle these (and the
single-byte mappings above become extra-simple special
cases); I could imagine feeding it a totally
data-driven set of rules.  
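
Roughly, the fixed part need be no more than a loop
like the one below, with the shift bytes and both
lookup tables supplied as data.  All the names are
made up, and error handling is left out:

# Toy sketch of a data-driven modal decoder.  Nothing
# about any particular encoding is hard-coded: the
# shift bytes and both tables arrive as arguments.
# double_table maps a 16-bit value to a code point.
def decode_modal(data, single_table, double_table,
                 shift_out='\016', shift_in='\017'):
    result = []
    mode = 'single'
    i = 0
    while i < len(data):
        c = data[i]
        if c == shift_out:        # enter double-byte mode
            mode = 'double'
            i = i + 1
        elif c == shift_in:       # back to single-byte mode
            mode = 'single'
            i = i + 1
        elif mode == 'single':
            result.append(unichr(single_table[ord(c)]))
            i = i + 1
        else:
            key = ord(c) * 256 + ord(data[i + 1])
            result.append(unichr(double_table[key]))
            i = i + 2
    return u''.join(result)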

Third, we can massively compress the mapping tables
using a notation which just lists contiguous ranges;
and very often there are relationships between
encodings.  For example, "cpXYZ is just like cpXYY but
with an extra 'smiley' at 0xFE32".  In these cases, a
script can build a family of related codecs in an
auditable manner. 
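
The range notation could be as simple as a list of
(first, last, first_codepoint) triples, and a derived
code page then only needs its differences recorded.
cpXYY, cpXYZ and the smiley are of course invented:

# Illustrative only: expand a compact range notation
# into a full mapping, then derive a second code page
# by patching in the differences.
def expand_ranges(ranges):
    table = {}
    for first, last, codepoint in ranges:
        for b in range(first, last + 1):
            table[b] = codepoint + (b - first)
    return table

cpXYY = expand_ranges([(0x00, 0x7F, 0x0000),    # ASCII half
                       (0xA0, 0xFF, 0x00A0)])   # straight across

# "cpXYZ is just like cpXYY but with an extra smiley"
cpXYZ = cpXYY.copy()
cpXYZ[0xFE32] = 0x263A      # WHITE SMILING FACE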

3. What encodings to distribute?
The only clean answers to this are 'almost none', or
'everything that Unicode 3.0 has a mapping for'.  The
latter is going to add some weight to the
distribution.  What are people's feelings?  Do we ship
any at all apart from the Unicode ones?  Should new
encodings be downloadable from www.python.org?  Should
there be an optional package outside the main
distribution?

Thanks,

Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.
