[Python-Dev] Codecs and StreamCodecs

Fredrik Lundh fredrik@pythonware.com
Tue, 16 Nov 1999 20:33:52 +0100


Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
> Also, I don't want to ignore the alternative interface that was
> suggested by /F.  It uses feed() similar to htmllib c.s.  This has
> some advantages (although we might want to define some
> compatibility so it can also feed directly into a file).

seeing this made me switch on my brain for a moment,
and recall how things are done in PIL (which is, as I've
bragged about before, another library with an internal
format, and many possible external encodings).  among
other things, PIL lets you read and write images to both
ordinary files and arbitrary file objects, but it also lets
you incrementally decode images by feeding it chunks
of data (through ImageFile.Parser).  and it's fast -- it has
to be, since images tend to contain lots of pixels...

anyway, here's what I came up with (code will follow,
if someone's interested).

--------------------------------------------------------------------
A PIL-like Unicode Codec Proposal
--------------------------------------------------------------------

In the PIL model, the codecs are called with a piece of data, and
return the result to the caller.  The codecs maintain internal state
when needed.

class decoder:

    def decode(self, s, offset=0):
        # decode as much data as we possibly can from the
        # given string.  if there's not enough data in the
        # input string to form a full character, return
        # what we've got this far (this might be an empty
        # string).

    def flush(self):
        # flush the decoding buffers.  this should usually
        # return None, unless knowing that the input stream
        # has ended means that the state can be interpreted
        # in a meaningful way.  however, if the state
        # indicates that the last character was not
        # finished, this method should raise a UnicodeError
        # exception.

class encoder:

    def encode(self, u, offset=0, buffersize=0):
        # encode data from the given offset in the input
        # unicode string into a buffer of the given size
        # (or slightly larger, if required to proceed).
        # if the buffer size is 0, the encoder is free
        # to pick a suitable size itself (if at all
        # possible, it should make it large enough to
        # encode the entire input string).  returns a
        # 2-tuple containing the encoded data, and the
        # number of characters consumed by this call.

    def flush(self):
        # flush the encoding buffers.  returns an ordinary
        # string (which may be empty), or None.

Note that a codec instance is meant to be used for a single string
(it may carry state); the codec registry should therefore hold codec
factories, not codec instances.  In
addition, you may use a single type or class to implement both
interfaces at once.
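
To make the "internal state" part of the decoder interface concrete, here
is a minimal sketch of a UTF-8 decoder shaped like the class above.  It is
written for present-day Python, so the input is a bytes object rather than
an 8-bit string; the class name Utf8Decoder and the pending attribute are
illustrative assumptions, not part of the proposal.

```python
class Utf8Decoder:
    """Hypothetical decoder following the proposed decode()/flush() shape."""

    def __init__(self):
        self.pending = b""  # bytes of a character that is not finished yet

    def decode(self, s, offset=0):
        data = self.pending + s[offset:]
        cut = len(data)
        # look at up to four trailing bytes for an unfinished sequence
        for back in range(1, min(4, len(data)) + 1):
            byte = data[-back]
            if byte & 0xC0 == 0x80:
                continue  # continuation byte, keep looking for the lead byte
            if byte >= 0xC0:
                # lead byte: work out how long the sequence should be
                need = 2 if byte < 0xE0 else 3 if byte < 0xF0 else 4
                if back < need:
                    cut = len(data) - back  # hold the partial character back
            break
        self.pending = data[cut:]
        return data[:cut].decode("utf-8")

    def flush(self):
        # nothing useful can be made of a partial character at end of input
        if self.pending:
            raise UnicodeError("input ended in the middle of a character")
        return None


if __name__ == "__main__":
    d = Utf8Decoder()
    raw = "na\u00efve \u20ac".encode("utf-8")
    # feed one byte at a time; incomplete characters yield empty strings
    print([d.decode(raw[i:i + 1]) for i in range(len(raw))])
    d.flush()
```

Because the partial character lives in the instance, sharing one instance
between two inputs would mix their bytes, which is one reason to hand out
fresh instances from factories.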

--------------------------------------------------------------------
Use Cases
--------------------------------------------------------------------

A null decoder:

    class decoder:
        def decode(self, s, offset=0):
            return s[offset:]
        def flush(self):
            pass

A null encoder:

    class encoder:
        def encode(self, s, offset=0, buffersize=0):
            if buffersize:
                s = s[offset:offset+buffersize]
            else:
                s = s[offset:]
            return s, len(s)
        def flush(self):
            pass

Decoding a string:

    def decode(s, encoding):
        c = registry.getdecoder(encoding)
        u = c.decode(s)
        t = c.flush()
        if not t:
            return u
        return u + t # not very common

Encoding a string:

    def encode(u, encoding):
        c = registry.getencoder(encoding)
        p = []
        o = 0
        while o < len(u):
            s, n = c.encode(u, o)
            p.append(s)
            o = o + n
        t = c.flush()
        if t:
            p.append(t) # pick up anything left in the encoder
        if len(p) == 1:
            return p[0]
        return string.join(p, "") # not very common
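
The loop above only runs more than once if the encoder honours a buffer
size.  The following sketch, in present-day Python, shows one way an
encoder might do that and how the loop then consumes the input piece by
piece.  Utf8Encoder and encode_all are hypothetical names, and passing the
encoder in directly (instead of going through a registry) is a
simplification.

```python
class Utf8Encoder:
    """Hypothetical encoder following the proposed encode()/flush() shape."""

    def encode(self, u, offset=0, buffersize=0):
        if not buffersize:
            # no limit given: encode the rest of the input in one go
            chunk = u[offset:]
            return chunk.encode("utf-8"), len(chunk)
        out = bytearray()
        n = 0
        for ch in u[offset:]:
            # may overshoot buffersize slightly, which the proposal allows
            out += ch.encode("utf-8")
            n += 1
            if len(out) >= buffersize:
                break
        return bytes(out), n

    def flush(self):
        return None  # this encoder keeps no state between calls


def encode_all(u, encoder, buffersize=0):
    parts = []
    o = 0
    while o < len(u):
        s, n = encoder.encode(u, o, buffersize)
        parts.append(s)
        o += n
    tail = encoder.flush()
    if tail:
        parts.append(tail)
    return b"".join(parts)


if __name__ == "__main__":
    print(encode_all("na\u00efve \u20ac", Utf8Encoder(), buffersize=4))
```

With a non-zero buffer size the encoder always consumes at least one
character per call, so the loop is guaranteed to make progress.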

Implementing stream codecs is left as an exercise (see the zlib
material in the eff-bot guide for a decoder example).
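
For readers who want a starting point, here is one possible sketch of a
stream reader built on the decoder interface, written for present-day
Python.  It is not the zlib example referred to above.  The Decoder class
adapts codecs.getincrementaldecoder from the current standard library as a
stand-in for a registry lookup, and StreamReader with its chunksize
parameter is a hypothetical name chosen for illustration.

```python
import codecs
import io


class Decoder:
    """Adapter giving a standard incremental decoder the proposed shape."""

    def __init__(self, encoding):
        self._inc = codecs.getincrementaldecoder(encoding)()

    def decode(self, s, offset=0):
        return self._inc.decode(s[offset:])

    def flush(self):
        # final=True raises UnicodeDecodeError (a UnicodeError subclass)
        # if the input stopped in the middle of a character
        tail = self._inc.decode(b"", final=True)
        return tail or None


class StreamReader:
    """Hypothetical wrapper that decodes a binary file object in chunks."""

    def __init__(self, fp, decoder, chunksize=4096):
        self.fp = fp
        self.decoder = decoder
        self.chunksize = chunksize

    def read(self):
        parts = []
        while True:
            chunk = self.fp.read(self.chunksize)
            if not chunk:
                break
            parts.append(self.decoder.decode(chunk))
        tail = self.decoder.flush()
        if tail:
            parts.append(tail)
        return "".join(parts)


if __name__ == "__main__":
    fp = io.BytesIO("na\u00efve \u20ac".encode("utf-8"))
    # a tiny chunk size forces characters to be split across reads
    print(StreamReader(fp, Decoder("utf-8"), chunksize=1).read())
```

The reader itself never inspects the bytes; all knowledge about character
boundaries stays inside the decoder, which is the point of the model.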

--- end of proposal