[Python-Dev] Some thoughts on the codecs...

Guido van Rossum guido@CNRI.Reston.VA.US
Mon, 15 Nov 1999 16:37:28 -0500


> Andy Robinson wrote:
> > 
> > Some thoughts on the codecs...
> > 
> > 1. Stream interface
> > At the moment a codec has dump and load methods which
> > read a (slice of a) stream into a string in memory and
> > vice versa.  As the proposal notes, this could lead to
> > errors if you take a slice out of a stream.   This is
> > not just due to character truncation; some Asian
> > encodings are modal and have shift-in and shift-out
> > sequences as they move from Western single-byte
> > characters to double-byte ones.   It also seems a bit
> > pointless to me as the source (or target) is still a
> > Unicode string in memory.
> > 
> > This is a real problem - a filter to convert big files
> > between two encodings should be possible without
> > knowledge of the particular encoding, as should one on
> > the input/output of some server.  We can still give a
> > default implementation for single-byte encodings.
> > 
> > What's a good API for real stream conversion?   just
> > Codec.encodeStream(infile, outfile)  ?  or is it more
> > useful to feed the codec with data a chunk at a time?

M.-A. Lemburg responds:

> The idea was to use Unicode as intermediate for all
> encoding conversions. 
> 
> What you envision here are stream recoders.  They can
> easily be implemented as a useful addition to the Codec
> subclasses, but I don't think that these have to go
> into the core.

What I wanted was a codec API that acts somewhat like a buffered file;
the buffer makes it possible to handle shift states efficiently.  This
is not exactly what Andy shows, but it isn't what Marc's current spec
has either.

I had thought something more like what Java does: an output stream
codec's constructor takes a writable file object and the object
returned by the constructor has a write() method, a flush() method and
a close() method.  It acts like a buffering interface to the
underlying file; this allows it to generate the minimal number of
shift sequences.  Similar for input stream codecs.
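To make that concrete, here is a rough sketch of such a writer object
for a made-up modal encoding.  Everything about it is invented except
the shape -- write()/flush()/close() plus a remembered shift state --
so don't read it as part of the proposal:

class ShiftStateStreamWriter:

    SHIFT_OUT = '\016'   # invented: switch the stream to double-byte mode
    SHIFT_IN  = '\017'   # invented: switch back to single-byte mode

    def __init__(self, file):
        self.file = file
        self.double_byte_mode = 0

    def write(self, ustr):
        # ustr is meant to be a Unicode string (once we have those).
        # Modes are switched only when needed, so consecutive write()
        # calls don't emit redundant shift sequences.
        for ch in ustr:
            code = ord(ch)
            if code < 128:
                if self.double_byte_mode:
                    self.file.write(self.SHIFT_IN)
                    self.double_byte_mode = 0
                self.file.write(chr(code))
            else:
                if not self.double_byte_mode:
                    self.file.write(self.SHIFT_OUT)
                    self.double_byte_mode = 1
                self.file.write(chr(code >> 8) + chr(code & 0xFF))

    def flush(self):
        # Return to single-byte mode so the bytes written so far form a
        # self-contained chunk, then flush the underlying file.
        if self.double_byte_mode:
            self.file.write(self.SHIFT_IN)
            self.double_byte_mode = 0
        self.file.flush()

    def close(self):
        self.flush()
        self.file.close()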

Andy's file translation example could then be written as follows:

# assuming variables input_file, input_encoding, output_file,
# output_encoding, and constant BUFFER_SIZE

f = open(input_file, "rb")
f1 = unicodec.codecs[input_encoding].stream_reader(f)
g = open(output_file, "wb")
g1 = unicodec.codecs[output_encoding].stream_writer(g)

while 1:
    buffer = f1.read(BUFFER_SIZE)
    if not buffer:
        break
    g1.write(buffer)

g1.close()
f1.close()

Note that we could possibly make these the only APIs that a codec
needs to provide; the string object <--> unicode object conversions
can be done using this and the cStringIO module.  (On the other hand,
string <--> unicode conversion seems such a common case that a direct
API for it would still be quite useful.)
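For example (sketch only -- unicodec and its codecs registry are from
the proposal, the helper names here are made up):

import cStringIO
import unicodec   # the proposed codec registry module

def unicode_from_string(data, encoding):
    # Treat the byte string as a small input stream and decode all of it.
    reader = unicodec.codecs[encoding].stream_reader(cStringIO.StringIO(data))
    return reader.read()

def string_from_unicode(ustr, encoding):
    # Write the Unicode string through a stream writer into a memory
    # buffer; flush() is assumed to push out any pending shift sequence.
    buffer = cStringIO.StringIO()
    writer = unicodec.codecs[encoding].stream_writer(buffer)
    writer.write(ustr)
    writer.flush()
    return buffer.getvalue()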

> > 2. Data driven codecs
> > I really like codecs being objects, and believe we
> > could build support for a lot more encodings, a lot
> > sooner than is otherwise possible, by making them data
> > driven rather making each one compiled C code with
> > static mapping tables.  What do people think about the
> > approach below?
> > 
> > First of all, the ISO8859-1 series are straight
> > mappings to Unicode code points.  So one Python script
> > could parse these files and build the mapping table,
> > and a very small data file could hold these encodings.
> >   A compiled helper function analogous to
> > string.translate() could deal with most of them.
> 
> The problem with these large tables is that currently
> Python modules are not shared among processes since
> every process builds its own table.
> 
> Static C data has the advantage of being shareable at
> the OS level.

Don't worry about it.  128K is too small to care, I think...

> You can of course implement Python-based lookup tables,
> but these would be too large...
>  
> > Secondly, the double-byte ones involve a mixture of
> > algorithms and data.  The worst cases I know are modal
> > encodings which need a single-byte lookup table, a
> > double-byte lookup table, and have some very simple
> > rules about escape sequences in between them.  A
> > simple state machine could still handle these (and the
> > single-byte mappings above become extra-simple special
> > cases); I could imagine feeding it a totally
> > data-driven set of rules.
> > 
> > Third, we can massively compress the mapping tables
> > using a notation which just lists contiguous ranges;
> > and very often there are relationships between
> > encodings.  For example, "cpXYZ is just like cpXYY but
> > with an extra 'smiley' at 0XFE32".  In these cases, a
> > script can build a family of related codecs in an
> > auditable manner.
> 
> These are all great ideas, but I think they unnecessarily
> complicate the proposal.

Agreed, let's leave the *implementation* of codecs out of the current
efforts.
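(For the record, the kind of data-driven engine I picture when reading
Andy's description is roughly the sketch below -- invented tables and
shift bytes, purely illustrative, and explicitly not something I'm
proposing for the core.)

SHIFT_OUT = '\016'   # switch to the double-byte table
SHIFT_IN  = '\017'   # switch back to the single-byte table

single_byte_table = {}             # byte -> Unicode ordinal
for i in range(128):
    single_byte_table[chr(i)] = i

double_byte_table = {              # two bytes -> Unicode ordinal (tiny made-up table)
    '\041\101': 0x3042,            # say, HIRAGANA LETTER A
    '\041\102': 0x3044,            # say, HIRAGANA LETTER I
}

def decode(data):
    # Returns a list of Unicode ordinals (a stand-in for a Unicode string).
    result = []
    double_byte_mode = 0
    i = 0
    while i < len(data):
        c = data[i]
        if c == SHIFT_OUT:
            double_byte_mode = 1
            i = i + 1
        elif c == SHIFT_IN:
            double_byte_mode = 0
            i = i + 1
        elif double_byte_mode:
            result.append(double_byte_table[data[i:i+2]])
            i = i + 2
        else:
            result.append(single_byte_table[c])
            i = i + 1
    return result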

However I want to make sure that the *interface* to codecs is defined
right, because changing it will be expensive.  (This is Linus
Torvalds' philosophy on drivers -- he doesn't care about bugs in
drivers, as they will get fixed; however he greatly cares about
defining the driver APIs correctly.)

> > 3. What encodings to distribute?
> > The only clean answers to this are 'almost none', or
> > 'everything that Unicode 3.0 has a mapping for'.  The
> > latter is going to add some weight to the
> > distribution.  What are people's feelings?  Do we ship
> > any at all apart from the Unicode ones?  Should new
> > encodings be downloadable from www.python.org?  Should
> > there be an optional package outside the main
> > distribution?
> 
> Since Codecs can be registered at runtime, there is quite
> some potential there for extension writers coding their
> own fast codecs.  E.g. one could use mxTextTools as a codec
> engine working at C speeds.

(Do you think you'll be able to extort some money from HP for these? :-)
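If the registry really is just a mapping like the unicodec.codecs[...]
I used above, runtime registration could be as simple as this (the
module and codec names are made up):

  import unicodec        # proposed registry module
  import myfastcodecs    # hypothetical extension, e.g. built on mxTextTools

  unicodec.codecs['cp1252'] = myfastcodecs.CP1252Codec()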

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
> 
>   'utf-8':		8-bit variable length encoding
>   'utf-16':		16-bit variable length encoding (little/big endian)
>   'utf-16-le':		utf-16 but explicitly little endian
>   'utf-16-be':		utf-16 but explicitly big endian
>   'ascii':		7-bit ASCII codepage
>   'latin-1':		Latin-1 codepage
>   'html-entities':	Latin-1 + HTML entities;
> 			see htmlentitydefs.py from the standard Python Lib
>   'jis' (a popular version XXX):
> 			Japanese character encoding
>   'unicode-escape':	See Unicode Constructors for a definition
>   'native':		Dump of the Internal Format used by Python
> 
> Perhaps not even 'html-entities' (even though it would make
> a cool replacement for cgi.escape()) and maybe we should
> also place the JIS encoding into a separate Unicode package.

I'd drop html-entities, it seems too cutesie.  (And who uses these
anyway, outside browsers?)

For JIS (shift-JIS?) I hope that Andy can help us with some pointers
and validation.

And unicode-escape: now that you mention it, this is a section of
the proposal that I don't understand.  I quote it here:

| Python should provide a built-in constructor for Unicode strings which
| is available through __builtins__:
| 
|   u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What do you mean by this notation?  Since encoding names are not
always legal Python identifiers (most contain hyphens), I don't
understand what you really meant here.  Do you mean to say that it has
to be a keyword argument?  I would disagree; and then I would have
expected the notation [,encoding=<default encoding>].
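In other words, which of these two readings did you intend?

  # (a) plain positional argument, with a default when omitted:
  u = unicode("Fran\347ois", "latin-1")
  u = unicode("Fran\347ois")                 # uses the default encoding

  # (b) keyword argument, which is what the notation seems to say:
  u = unicode("Fran\347ois", encoding="latin-1")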

| With the 'unicode-escape' encoding being defined as:
| 
|   u = u'<unicode-escape encoded Python string>'
| 
| · for single characters (and this includes all \XXX sequences except \uXXXX),
|   take the ordinal and interpret it as Unicode ordinal;
| 
| · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX 
|   instead, e.g. \u03C0 to represent the character Pi.

I've looked at this several times and I don't see the difference
between the two bullets.  (Ironically, you are using a non-ASCII
character here that doesn't always display, depending on where I look
at your mail :-).

Can you give some examples?

Is u'\u0020' different from u'\x20' (a space)?

Does '\u0020' (no u prefix) have a meaning?

Also, I remember Tim Peters suggesting that a "raw unicode" notation
(ur"...") might be necessary, e.g. for writing regular expressions.  I
tend to agree.

While I'm on the topic, I don't see in your proposal a description of
the source file character encoding.  Currently, this is undefined, and
in fact can be (ab)used to enter non-ASCII in string literals.  For
example, a programmer named François might write a file containing
this statement:

  print "Written by François." # (There's a cedilla in there!)

(He assumes his source character encoding is Latin-1, and he doesn't
want to have to type \347 when he can type a cedilla on his keyboard.)

If his source file (or .pyc file!)  is executed by a Japanese user,
this will probably print some garbage.

Using the new Unicode strings, François could change his program as
follows:

  print unicode("Written by François.", "latin-1")

Assuming that François sets his sys.stdout to use Latin-1, while the
Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).
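(With the stream codec API sketched above, that could amount to
something like this -- purely illustrative:)

  import sys
  import unicodec    # proposed module

  # François' setup: everything printed goes out as Latin-1 ...
  sys.stdout = unicodec.codecs["latin-1"].stream_writer(sys.stdout)

  # ... while the Japanese user would instead wrap stdout like this:
  # sys.stdout = unicodec.codecs["shift-jis"].stream_writer(sys.stdout)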

But when the Japanese user views François' source file, he will again
see garbage.  If he uses a generic tool to translate latin-1 files to
shift-JIS (assuming shift-JIS has a cedilla character) the program
will no longer work correctly -- the string "latin-1" has to be
changed to "shift-jis".

What should we do about this?  The safest and most radical solution is
to disallow non-ASCII source characters; François will then have to
type

  print u"Written by Fran\u00E7ois."

but, knowing François, he probably won't like this solution very much
(since he didn't like the \347 version either).

--Guido van Rossum (home page: http://www.python.org/~guido/)