[Python-Dev] Some thoughts on the codecs...

M.-A. Lemburg mal@lemburg.com
Mon, 15 Nov 1999 20:20:55 +0100


Andy Robinson wrote:
> 
> Some thoughts on the codecs...
> 
> 1. Stream interface
> At the moment a codec has dump and load methods which
> read a (slice of a) stream into a string in memory and
> vice versa.  As the proposal notes, this could lead to
> errors if you take a slice out of a stream.   This is
> not just due to character truncation; some Asian
> encodings are modal and have shift-in and shift-out
> sequences as they move from Western single-byte
> characters to double-byte ones.   It also seems a bit
> pointless to me as the source (or target) is still a
> Unicode string in memory.
> 
> This is a real problem - a filter to convert big files
> between two encodings should be possible without
> knowledge of the particular encoding, as should one on
> the input/output of some server.  We can still give a
> default implementation for single-byte encodings.
> 
> What's a good API for real stream conversion?   just
> Codec.encodeStream(infile, outfile)  ?  or is it more
> useful to feed the codec with data a chunk at a time?

The idea was to use Unicode as the intermediate format
for all encoding conversions.

What you envision here are stream recoders. They can
easily be implemented as a useful addition to the Codec
subclasses, but I don't think that these have to go
into the core.
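
A minimal sketch of such a stream recoder, assuming the
incremental encoder/decoder objects provided by the codecs
module (chunk size and codec names are only illustrative):

  import codecs

  def recode_stream(infile, outfile, src_enc, dst_enc, chunk_size=64 * 1024):
      # Recode a byte stream from src_enc to dst_enc, using Unicode
      # as the intermediate. Incremental codecs keep state across
      # chunk boundaries, so shift sequences and split multi-byte
      # characters are handled correctly.
      decoder = codecs.getincrementaldecoder(src_enc)()
      encoder = codecs.getincrementalencoder(dst_enc)()
      while True:
          chunk = infile.read(chunk_size)
          if not chunk:
              break
          outfile.write(encoder.encode(decoder.decode(chunk)))
      # flush any pending state (e.g. a final shift-in sequence)
      outfile.write(encoder.encode(decoder.decode(b"", final=True), final=True))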
 
> 2. Data driven codecs
> I really like codecs being objects, and believe we
> could build support for a lot more encodings, a lot
> sooner than is otherwise possible, by making them data
> driven rather making each one compiled C code with
> static mapping tables.  What do people think about the
> approach below?
> 
> First of all, the ISO8859-1 series are straight
> mappings to Unicode code points.  So one Python script
> could parse these files and build the mapping table,
> and a very small data file could hold these encodings.
>   A compiled helper function analogous to
> string.translate() could deal with most of them.

The problem with these large tables is that, currently,
Python modules are not shared among processes, so every
process builds its own copy of the table.

Static C data has the advantage of being shareable at
the OS level.

You can of course implement Python-based lookup tables,
but these would be too large...
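
Just to make the trade-off concrete, a purely Python-based
single-byte codec boils down to little more than a 256-entry
table, and that table then lives separately in every process
(the table below is made up, not a real ISO 8859 variant):

  # Hypothetical decoding table: byte value -> Unicode character.
  DECODING_TABLE = [chr(i) for i in range(256)]
  DECODING_TABLE[0xFE] = '\u263a'      # custom slot, echoing the 'smiley' example
  ENCODING_TABLE = {c: i for i, c in enumerate(DECODING_TABLE)}

  def decode(data: bytes) -> str:
      return ''.join(DECODING_TABLE[b] for b in data)

  def encode(text: str) -> bytes:
      return bytes(ENCODING_TABLE[c] for c in text)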
 
> Secondly, the double-byte ones involve a mixture of
> algorithms and data.  The worst cases I know are modal
> encodings which need a single-byte lookup table, a
> double-byte lookup table, and have some very simple
> rules about escape sequences in between them.  A
> simple state machine could still handle these (and the
> single-byte mappings above become extra-simple special
> cases); I could imagine feeding it a totally
> data-driven set of rules.
> 
> Third, we can massively compress the mapping tables
> using a notation which just lists contiguous ranges;
> and very often there are relationships between
> encodings.  For example, "cpXYZ is just like cpXYY but
> with an extra 'smiley' at 0XFE32".  In these cases, a
> script can build a family of related codecs in an
> auditable manner.

These are all great ideas, but I think they unnecessarily
complicate the proposal.
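
Just to illustrate what the data-driven state machine Andy
describes for modal encodings might look like (the shift bytes
and both tables are invented for the example):

  # Toy decoder for a hypothetical modal encoding, driven entirely by
  # data: a single-byte table, a double-byte table and two shift bytes.
  # All values are invented; input is assumed to be well formed.
  SHIFT_OUT = 0x0E                      # switch to double-byte mode
  SHIFT_IN = 0x0F                       # switch back to single-byte mode
  SINGLE = {i: chr(i) for i in range(0x80)}
  DOUBLE = {(0x21, 0x21): '\u3000'}     # just one sample pair

  def decode_modal(data: bytes) -> str:
      out = []
      double_mode = False
      i = 0
      while i < len(data):
          b = data[i]
          if b == SHIFT_OUT:
              double_mode, i = True, i + 1
          elif b == SHIFT_IN:
              double_mode, i = False, i + 1
          elif double_mode:
              out.append(DOUBLE[(b, data[i + 1])])
              i += 2
          else:
              out.append(SINGLE[b])
              i += 1
      return ''.join(out)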
 
> 3. What encodings to distribute?
> The only clean answers to this are 'almost none', or
> 'everything that Unicode 3.0 has a mapping for'.  The
> latter is going to add some weight to the
> distribution.  What are people's feelings?  Do we ship
> any at all apart from the Unicode ones?  Should new
> encodings be downloadable from www.python.org?  Should
> there be an optional package outside the main
> distribution?

Since codecs can be registered at runtime, there is quite
some potential for extension writers to provide their own
fast codecs, e.g. one could use mxTextTools as a codec
engine working at C speed.
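
As a sketch of how such an extension codec could be plugged in
at runtime, using the search-function registry of the codecs
module (the codec name and the trivial encode/decode functions
are hypothetical placeholders):

  import codecs

  def _encode(text, errors='strict'):
      return text.encode('ascii', errors), len(text)

  def _decode(data, errors='strict'):
      return bytes(data).decode('ascii', errors), len(data)

  def _search(name):
      if name == 'my-fast-codec':        # hypothetical codec name
          return codecs.CodecInfo(_encode, _decode, name=name)
      return None

  codecs.register(_search)
  print('hello'.encode('my-fast-codec'))  # b'hello'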

I would propose adding only some very basic encodings to
the standard distribution, e.g. the ones mentioned under
Standard Codecs in the proposal:

  'utf-8':		8-bit variable length encoding
  'utf-16':		16-bit variable length encoding (little/big endian)
  'utf-16-le':		utf-16 but explicitly little endian
  'utf-16-be':		utf-16 but explicitly big endian
  'ascii':		7-bit ASCII codepage
  'latin-1':		Latin-1 codepage
  'html-entities':	Latin-1 + HTML entities;
			see htmlentitydefs.py from the standard Python Lib
  'jis' (a popular version XXX):
			Japanese character encoding
  'unicode-escape':	See Unicode Constructors for a definition
  'native':		Dump of the Internal Format used by Python

Perhaps not even 'html-entities' (even though it would make
a cool replacement for cgi.escape()), and maybe we should
also place the JIS encoding into a separate Unicode package.
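
For reference, this is how a few of the names in that list behave
in a current Python (the 'native' dump of the internal format has
no direct counterpart here and is left out):

  s = 'd\xe9j\xe0 vu'                   # "déjà vu" as a Unicode string

  print(s.encode('utf-8'))              # b'd\xc3\xa9j\xc3\xa0 vu'
  print(s.encode('latin-1'))            # b'd\xe9j\xe0 vu'
  print(s.encode('utf-16-le'))          # little endian, no BOM
  print(s.encode('utf-16'))             # BOM + native byte order
  print(s.encode('unicode-escape'))     # b'd\\xe9j\\xe0 vu'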

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/