[Python-Dev] Some thoughts on the codecs...
Guido van Rossum
guido@CNRI.Reston.VA.US
Mon, 15 Nov 1999 16:37:28 -0500
> Andy Robinson wrote:
> >
> > Some thoughts on the codecs...
> >
> > 1. Stream interface
> > At the moment a codec has dump and load methods which
> > read a (slice of a) stream into a string in memory and
> > vice versa. As the proposal notes, this could lead to
> > errors if you take a slice out of a stream. This is
> > not just due to character truncation; some Asian
> > encodings are modal and have shift-in and shift-out
> > sequences as they move from Western single-byte
> > characters to double-byte ones. It also seems a bit
> > pointless to me as the source (or target) is still a
> > Unicode string in memory.
> >
> > This is a real problem - a filter to convert big files
> > between two encodings should be possible without
> > knowledge of the particular encoding, as should one on
> > the input/output of some server. We can still give a
> > default implementation for single-byte encodings.
> >
> > What's a good API for real stream conversion? just
> > Codec.encodeStream(infile, outfile) ? or is it more
> > useful to feed the codec with data a chunk at a time?
M.-A. Lemburg responds:
> The idea was to use Unicode as intermediate for all
> encoding conversions.
>
> What you envision here are stream recoders. They can
> easily be implemented as a useful addition to the Codec
> subclasses, but I don't think that these have to go
> into the core.
What I wanted was a codec API that acts somewhat like a buffered file;
the buffer makes it possible to efficiently handle shift states. This
is not exactly what Andy shows, but it's not what Marc's current spec
has either.
I had thought something more like what Java does: an output stream
codec's constructor takes a writable file object and the object
returned by the constructor has a write() method, a flush() method and
a close() method. It acts like a buffering interface to the
underlying file; this allows it to generate the minimal number of
shift sequences. Similar for input stream codecs.
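To make the buffered-file analogy concrete, here is a minimal sketch of
such an output stream codec. The class and attribute names are my own
invention, not part of any spec, and a real codec would encode
incrementally rather than joining the whole buffer at flush time:

```python
import io

class StreamWriter:
    """Buffering wrapper around a writable byte stream.  Encoding is
    deferred to flush() so that a modal codec could choose its
    shift-in/shift-out points over the whole buffered run."""

    def __init__(self, stream, encoding):
        self.stream = stream        # underlying writable file object
        self.encoding = encoding
        self._pending = []          # unicode text not yet encoded

    def write(self, text):
        self._pending.append(text)

    def flush(self):
        data = "".join(self._pending)
        self._pending = []
        self.stream.write(data.encode(self.encoding))

    def close(self):
        self.flush()

# Usage: wrap an in-memory stream and write through it.
out = io.BytesIO()
w = StreamWriter(out, "utf-8")
w.write("shift state demo: ")
w.write("\u03c0")                   # Greek letter pi
w.close()
```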
Andy's file translation example could then be written as follows:
# assuming variables input_file, input_encoding, output_file,
# output_encoding, and constant BUFFER_SIZE
f = open(input_file, "rb")
f1 = unicodec.codecs[input_encoding].stream_reader(f)
g = open(output_file, "wb")
g1 = unicodec.codecs[output_encoding].stream_writer(g)
while 1:
    buffer = f1.read(BUFFER_SIZE)
    if not buffer:
        break
    g1.write(buffer)
g1.close()
f1.close()
Note that we could possibly make these the only API that a codec needs
to provide; the string object <--> unicode object conversions can be
done using this and the cStringIO module. (On the other hand it seems
a common case that would be quite useful.)
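In modern terms that layering would look like the following sketch,
with io.BytesIO standing in for cStringIO and the codecs registry's
getreader/getwriter standing in for the hypothetical unicodec table:

```python
import codecs
import io

def decode_via_stream(data, encoding):
    # Wrap the byte string in an in-memory file and pull it back out
    # through a stream reader -- the cStringIO trick described above.
    reader = codecs.getreader(encoding)(io.BytesIO(data))
    return reader.read()

def encode_via_stream(text, encoding):
    # The reverse direction: write through a stream writer into an
    # in-memory file and take its contents.
    buf = io.BytesIO()
    writer = codecs.getwriter(encoding)(buf)
    writer.write(text)
    return buf.getvalue()
```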
> > 2. Data driven codecs
> > I really like codecs being objects, and believe we
> > could build support for a lot more encodings, a lot
> > sooner than is otherwise possible, by making them data
> > driven rather than making each one compiled C code with
> > static mapping tables. What do people think about the
> > approach below?
> >
> > First of all, the ISO8859-1 series are straight
> > mappings to Unicode code points. So one Python script
> > could parse these files and build the mapping table,
> > and a very small data file could hold these encodings.
> > A compiled helper function analogous to
> > string.translate() could deal with most of them.
>
> The problem with these large tables is that currently
> Python modules are not shared among processes since
> every process builds its own table.
>
> Static C data has the advantage of being shareable at
> the OS level.
Don't worry about it. 128K is too small to care, I think...
> You can of course implement Python based lookup tables,
> but these would be too large...
>
> > Secondly, the double-byte ones involve a mixture of
> > algorithms and data. The worst cases I know are modal
> > encodings which need a single-byte lookup table, a
> > double-byte lookup table, and have some very simple
> > rules about escape sequences in between them. A
> > simple state machine could still handle these (and the
> > single-byte mappings above become extra-simple special
> > cases); I could imagine feeding it a totally
> > data-driven set of rules.
> >
> > Third, we can massively compress the mapping tables
> > using a notation which just lists contiguous ranges;
> > and very often there are relationships between
> > encodings. For example, "cpXYZ is just like cpXYY but
> > with an extra 'smiley' at 0XFE32". In these cases, a
> > script can build a family of related codecs in an
> > auditable manner.
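For what it's worth, both the modal state machine and the
range-compressed tables fit in a few lines each.  This is only a toy
sketch; the shift bytes and the table formats are invented for
illustration, not taken from any real encoding:

```python
SO, SI = 0x0E, 0x0F   # invented shift-out/shift-in escape bytes

def decode_modal(data, single, double):
    # Tiny state machine: SO switches to the double-byte table,
    # SI switches back; everything else is a table lookup.
    out, wide, i = [], False, 0
    while i < len(data):
        b = data[i]
        if b == SO:
            wide, i = True, i + 1
        elif b == SI:
            wide, i = False, i + 1
        elif wide:
            out.append(double[data[i], data[i + 1]])
            i += 2
        else:
            out.append(single[b])
            i += 1
    return "".join(out)

def expand_ranges(ranges, overrides=None):
    # Range-compressed mapping data: (first, last, target) runs of
    # consecutive bytes mapped to consecutive code points, plus
    # per-codec overrides ("cpXYZ is cpXYY plus one character").
    table = {}
    for first, last, target in ranges:
        for offset in range(last - first + 1):
            table[first + offset] = target + offset
    table.update(overrides or {})
    return table
```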
>
> These are all great ideas, but I think they unnecessarily
> complicate the proposal.
Agreed, let's leave the *implementation* of codecs out of the current
efforts.
However I want to make sure that the *interface* to codecs is defined
right, because changing it will be expensive. (This is Linus
Torvalds's philosophy on drivers -- he doesn't care about bugs in
drivers, as they will get fixed; however he greatly cares about
defining the driver APIs correctly.)
> > 3. What encodings to distribute?
> > The only clean answers to this are 'almost none', or
> > 'everything that Unicode 3.0 has a mapping for'. The
> > latter is going to add some weight to the
> > distribution. What are people's feelings? Do we ship
> > any at all apart from the Unicode ones? Should new
> > encodings be downloadable from www.python.org? Should
> > there be an optional package outside the main
> > distribution?
>
> Since Codecs can be registered at runtime, there is quite
> some potential there for extension writers coding their
> own fast codecs. E.g. one could use mxTextTools as codec
> engine working at C speeds.
(Do you think you'll be able to extort some money from HP for these? :-)
> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
>
> 'utf-8': 8-bit variable length encoding
> 'utf-16': 16-bit variable length encoding (little/big endian)
> 'utf-16-le': utf-16 but explicitly little endian
> 'utf-16-be': utf-16 but explicitly big endian
> 'ascii': 7-bit ASCII codepage
> 'latin-1': Latin-1 codepage
> 'html-entities': Latin-1 + HTML entities;
> see htmlentitydefs.py from the standard Python Lib
> 'jis' (a popular version XXX):
> Japanese character encoding
> 'unicode-escape': See Unicode Constructors for a definition
> 'native': Dump of the Internal Format used by Python
>
> Perhaps not even 'html-entities' (even though it would make
> a cool replacement for cgi.escape()) and maybe we should
> also place the JIS encoding into a separate Unicode package.
I'd drop html-entities, it seems too cutesie. (And who uses these
anyway, outside browsers?)
For JIS (shift-JIS?) I hope that Andy can help us with some pointers
and validation.
And unicode-escape: now that you mention it, this is a section of
the proposal that I don't understand. I quote it here:
| Python should provide a built-in constructor for Unicode strings which
| is available through __builtins__:
|
| u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
What do you mean by this notation? Since encoding names are not
always legal Python identifiers (most contain hyphens), I don't
understand what you really meant here. Do you mean to say that it has
to be a keyword argument? I would disagree; and then I would have
expected the notation [,encoding=<default encoding>].
| With the 'unicode-escape' encoding being defined as:
|
| u = u'<unicode-escape encoded Python string>'
|
| · for single characters (and this includes all \XXX sequences except \uXXXX),
| take the ordinal and interpret it as Unicode ordinal;
|
| · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
| instead, e.g. \u03C0 to represent the character Pi.
I've looked at this several times and I don't see the difference
between the two bullets. (Ironically, you are using a non-ASCII
character here that doesn't always display, depending on where I look
at your mail :-).
Can you give some examples?
Is u'\u0020' different from u'\x20' (a space)?
Does '\u0020' (no u prefix) have a meaning?
Also, I remember reading Tim Peters who suggested that a "raw unicode"
notation (ur"...") might be necessary, to encode regular expressions.
I tend to agree.
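For the record, the language eventually settled all of these
questions; the following shows today's Python behavior, not anything
the 1999 proposal itself guaranteed:

```python
import re

# u'\u0020' and u'\x20' denote the same character, a space; and since
# every modern string literal is a unicode string, the bare '\u0020'
# (no u prefix) means exactly the same thing.
assert "\u0020" == "\x20" == " "

# The 'unicode-escape' codec turns the escape notation into characters:
assert b"\\u03C0".decode("unicode-escape") == "\u03c0"   # Greek pi

# The raw-unicode notation did land: in a raw string the backslash-u
# survives for the regex engine, which interprets it itself.
assert re.fullmatch(r"\u03c0", "\u03c0") is not None
```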
While I'm on the topic, I don't see in your proposal a description of
the source file character encoding. Currently, this is undefined, and
in fact can be (ab)used to enter non-ASCII in string literals. For
example, a programmer named François might write a file containing
this statement:
print "Written by François." # (There's a cedilla in there!)
(He assumes his source character encoding is Latin-1, and he doesn't
want to have to type \347 when he can type a cedilla on his keyboard.)
If his source file (or .pyc file!) is executed by a Japanese user,
this will probably print some garbage.
Using the new Unicode strings, François could change his program as
follows:
print unicode("Written by François.", "latin-1")
Assuming that François sets his sys.stdout to use Latin-1, while the
Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).
But when the Japanese user views François' source file, he will again
see garbage. If he uses a generic tool to translate latin-1 files to
shift-JIS (assuming shift-JIS has a cedilla character) the program
will no longer work correctly -- the string "latin-1" has to be
changed to "shift-jis".
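The failure mode is easy to demonstrate (modern Python shown here;
errors='replace' merely keeps the decode from raising outright):

```python
original = "Written by Fran\u00e7ois."
data = original.encode("latin-1")       # what François' editor saves

# Read back under the wrong codec: 0xE7 is a Shift-JIS lead byte, so
# it swallows the following 'o' and the text comes out garbled.
garbled = data.decode("shift_jis", errors="replace")
assert garbled != original
```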
What should we do about this? The safest and most radical solution is
to disallow non-ASCII source characters; François will then have to
type
print u"Written by Fran\u00E7ois."
but, knowing François, he probably won't like this solution very much
(since he didn't like the \347 version either).
--Guido van Rossum (home page: http://www.python.org/~guido/)