[Python-Dev] Codecs and StreamCodecs

M.-A. Lemburg mal@lemburg.com
Thu, 18 Nov 1999 17:23:09 +0100


Guido van Rossum wrote:
> 
> > The problem is that the encoding names are not Python identifiers,
> > e.g. iso-8859-1 is not allowed as an identifier.
> 
> This is easily taken care of by translating each string of consecutive
> non-identifier-characters to an underscore, so this would import the
> iso_8859_1.py module.  (I also noticed in an earlier post that the
> official name for Shift_JIS has an underscore, while most other
> encodings use hyphens.)

Right. That's one way of doing it.
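
A minimal sketch of that translation, assuming that only ASCII letters,
digits and the underscore count as identifier characters (module_name is
a hypothetical helper name, not part of the proposal):

```python
import re

def module_name(encoding):
    # Collapse each run of non-identifier characters into one underscore.
    return re.sub(r'[^A-Za-z0-9_]+', '_', encoding)

print(module_name('iso-8859-1'))   # iso_8859_1
print(module_name('Shift_JIS'))    # Shift_JIS
```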

> > This and
> > the fact that applications may want to ship their own codecs (which
> > do not get installed under the system wide encodings package)
> > make the registry necessary.
> 
> But it could be enough to register a package where to look for
> encodings (in addition to the system package).
> 
> Or there could be a registry for encoding search functions.  (See the
> import discussion.)

Like a path of search functions? Not a bad idea... I will still
want the internal dict for caching purposes though. I'm not sure
how often these encodings will be looked up, but even a few hundred
function calls will slow down the Unicode implementation quite a bit.

The implementation could proceed as follows:

def lookup(encoding):

    # Return the cached codec tuple if this encoding was seen before.
    codecs = _internal_dict.get(encoding,None)
    if codecs:
        return codecs
    # Otherwise ask each registered search function in turn.
    for query in sys.encoders:
        codecs = query(encoding)
        if codecs:
            break
    else:
        raise UnicodeError('unknown encoding: %s' % encoding)
    _internal_dict[encoding] = codecs
    return codecs
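
To see the caching at work, here is a self-contained sketch of the same
idea. The names _search_functions, register and latin1_search are
assumptions made for illustration (the list stands in for the proposed
sys.encoders), and the search function returns a 4-tuple whose stream
entries are simply left as None:

```python
_cache = {}
_search_functions = []   # hypothetical stand-in for the proposed sys.encoders

def register(search_function):
    _search_functions.append(search_function)

def lookup(encoding):
    entry = _cache.get(encoding)
    if entry is not None:
        return entry
    for query in _search_functions:
        entry = query(encoding)
        if entry is not None:
            _cache[encoding] = entry
            return entry
    raise UnicodeError('unknown encoding: %s' % encoding)

calls = []   # records every time the search function is consulted

def latin1_search(encoding):
    calls.append(encoding)
    if encoding != 'latin-1':
        return None
    def encode(u):
        return u.encode('latin-1')
    def decode(s):
        return s.decode('latin-1')
    return (encode, decode, None, None)

register(latin1_search)
encode, decode, writer, reader = lookup('latin-1')
```

A second lookup('latin-1') is answered from the dict, so the search
function is consulted only once per encoding name.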

For simplicity, codecs should be a tuple (encoder,decoder,
stream_writer,stream_reader) of factory functions.

...that is if we can agree on these 4 APIs :-) Here are my
current versions:
-----------------------------------------------------------------------
class Codec:

    """ Defines the interface for stateless encoders/decoders.
    """

    def __init__(self,errors='strict'):

	""" Creates a Codec instance.

	    The Codec may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are defined:

	     'strict' - raise a UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.errors = errors

    def encode(self,u,slice=None):
	
	""" Return the Unicode object u encoded as Python string.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is encoded.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	"""
	...

    def decode(self,s,offset=0):

	""" Decodes data from the Python string s and returns a tuple 
	    (Unicode object, bytes consumed).
	
	    If offset is given, the decoding process starts at
	    s[offset]. It defaults to 0.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	""" 
	...
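
As an illustration of this interface, a stateless Latin-1 codec might
look like the sketch below. Latin1Codec is a hypothetical name, and the
handling of the three error schemes is one possible reading of the
docstring above, not an agreed design:

```python
class Codec:
    def __init__(self, errors='strict'):
        self.errors = errors

class Latin1Codec(Codec):

    def encode(self, u, slice=None):
        if slice is not None:
            u = u[slice]
        if self.errors in ('strict', 'ignore'):
            # Python's own handlers cover these two schemes.
            return u.encode('latin-1', self.errors)
        # Otherwise errors is taken to be a single replacement character.
        # If that character is itself outside Latin-1, encoding still fails.
        replaced = ''.join([ch if ord(ch) < 256 else self.errors for ch in u])
        return replaced.encode('latin-1')

    def decode(self, s, offset=0):
        data = s[offset:]
        # Every byte is a valid Latin-1 character, so all of data is consumed.
        return data.decode('latin-1'), len(data)
```

No state is kept between calls: each encode or decode depends only on
its arguments and on the errors setting chosen at construction time.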


StreamWriter and StreamReader define the interface for stateful
encoders/decoders:

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

	""" Creates a StreamWriter instance.

	    stream must be a file-like object open for writing
	    (binary) data.

	    The StreamWriter may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are defined:

	     'strict' - raise a UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.stream = stream
	self.errors = errors

    def write(self,u,slice=None):

	""" Writes the Unicode object's contents encoded to self.stream
	    and returns the number of bytes written.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	... the base class should provide a default implementation
	    of this method using self.encode ...
	
    def flush(self):

	""" Flushes the codec buffers used for keeping state.

	    Return values are not defined. Implementations are free to
	    return None, raise an exception (in case there is pending
	    data in the buffers which could not be encoded) or
	    return any remaining data from the state buffers used.

	"""
	pass

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

	""" Creates a StreamReader instance.

	    stream must be a file-like object open for reading
	    (binary) data.

	    The StreamReader may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are defined:

	     'strict' - raise a UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.stream = stream
	self.errors = errors

    def read(self,chunksize=0):

	""" Decodes data from the stream self.stream and returns a tuple 
	    (Unicode object, bytes consumed).

	    chunksize indicates the approximate maximum number of
	    bytes to read from the stream for decoding purposes. The
	    decoder can modify this setting as appropriate. The default
	    value 0 indicates to read and decode as much as possible.
	    The chunksize is intended to prevent having to decode huge
	    files in one step.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...

    def flush(self):

	""" Flushes the codec buffers used for keeping state.

	    Return values are not defined. Implementations are free to
	    return None, raise an exception (in case there is pending
	    data in the buffers which could not be decoded) or
	    return any remaining data from the state buffers used.

	"""

In addition to the above methods, the StreamWriter and StreamReader
instances should also provide access to all other methods defined for
the stream object.

Stream codecs are free to combine the StreamWriter and StreamReader
interfaces into one class.
-----------------------------------------------------------------------
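
The default write and read implementations mentioned above could be
sketched as follows. This is only an illustration under assumptions:
the base Codec here is hard-wired to Latin-1 so that the example runs,
and __getattr__ is one possible way to expose the remaining methods of
the stream object:

```python
import io

class Codec:
    # Latin-1 is used only so the sketch is runnable.
    def __init__(self, errors='strict'):
        self.errors = errors
    def encode(self, u, slice=None):
        if slice is not None:
            u = u[slice]
        return u.encode('latin-1', self.errors)
    def decode(self, s, offset=0):
        data = s[offset:]
        return data.decode('latin-1', self.errors), len(data)

class StreamWriter(Codec):
    def __init__(self, stream, errors='strict'):
        Codec.__init__(self, errors)
        self.stream = stream
    def write(self, u, slice=None):
        # Default implementation built on self.encode.
        data = self.encode(u, slice)
        self.stream.write(data)
        return len(data)
    def __getattr__(self, name):
        # Any method not defined here is looked up on the stream.
        return getattr(self.stream, name)

class StreamReader(Codec):
    def __init__(self, stream, errors='strict'):
        Codec.__init__(self, errors)
        self.stream = stream
    def read(self, chunksize=0):
        # Default implementation built on self.decode; 0 means read it all.
        if chunksize > 0:
            data = self.stream.read(chunksize)
        else:
            data = self.stream.read()
        return self.decode(data)
    def __getattr__(self, name):
        return getattr(self.stream, name)

writer = StreamWriter(io.BytesIO())
written = writer.write('h\xe9llo')
reader = StreamReader(io.BytesIO(b'h\xe9llo'))
first = reader.read(2)
```

With a single-byte encoding a chunk can never end in the middle of a
character. A multi-byte encoding would have to buffer incomplete
sequences between reads, which is where the state and flush() come in.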

> > I don't see a problem with the registry though -- the encodings
> > package can take care of the registration process without any
> > user interaction. There would only have to be an API for
> > looking up an encoding published by the encodings package for
> > the Unicode implementation to use. The magic behind that API
> > is left to the encodings package...
> 
> I think that the collection of encodings will eventually grow large
> enough to make it a requirement to avoid doing work proportional to
> the number of supported encodings at startup (or even when an encoding
> is referenced for the first time).  Any "lazy" mechanism (of which
> module search is an example) will do.

Right. The list of search functions should provide this kind
of laziness. It also provides ways to implement other strategies
to look for codecs, e.g. PIL could provide such a search function
for its codecs, mxCrypto for the included ciphers, etc.
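
A lazy search function of this kind might be sketched as below. It is an
illustration only: module_search is a hypothetical name, lowercasing the
encoding name is an assumption, and the standard library's encodings
package (whose codec modules expose getregentry()) is used as a stand-in
for the package being discussed:

```python
import importlib
import re

def module_search(encoding):
    # Normalize 'UTF-8' to 'utf_8'; the lowercasing is an assumption.
    name = re.sub(r'[^A-Za-z0-9_]+', '_', encoding).lower()
    try:
        # The module is imported only when the encoding is first requested,
        # so no work proportional to the number of encodings is done up front.
        module = importlib.import_module('encodings.' + name)
    except ImportError:
        return None
    getregentry = getattr(module, 'getregentry', None)
    if getregentry is None:
        # The module exists but does not describe a codec.
        return None
    return getregentry()

info = module_search('UTF-8')
```

Returning None for an unknown name lets the next search function in the
list have a go.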
 
> > BTW, nothing's wrong with your idea :-) In fact, I like it
> > a lot because it keeps the encoding modules out of the
> > top-level scope which is good.
> 
> Yes.
> 
> > PS: we could probably even take the whole codec idea one step
> > further and also allow other input/output formats to be registered,
> > e.g. stream ciphers or pickle mechanisms. The step in that
> > direction is not a big one: we'd only have to drop the specification
> > of the Unicode object in the spec and replace it with an arbitrary
> > object. Of course, this will still have to be a Unicode object
> > for use by the Unicode implementation.
> 
> This is a step towards Java's architecture of stackable streams.
> 
> But I'm always in favor of tackling what we know we need before
> tackling the most generalized version of the problem.

Well, I just wanted to mention the possibility... might be
something to look into next year. I find it rather thrilling
to be able to create encrypted streams by just hooking together
a few stream codecs...

f = open('myfile.txt','wb')

CipherWriter = sys.codec('rc5-cipher')[2]
sf = CipherWriter(f,key='xxxxxxxx')

UTF8Writer = sys.codec('utf-8')[2]
sfx = UTF8Writer(sf)

sfx.write('asdfasdfasdfasdf')
sfx.close()

Hmm, we should probably define the additional constructor
arguments to be keyword arguments... writers/readers other
than Unicode ones will probably need different kinds of
parameters (such as the key in the above example).
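
A runnable sketch of such stacking with keyword-only extras, under loud
assumptions: XorWriter is a toy stand-in for a real cipher (it is not
secure and has nothing to do with RC5), and both class names are
hypothetical:

```python
import io

class UTF8Writer:
    def __init__(self, stream, errors='strict'):
        self.stream = stream
        self.errors = errors
    def write(self, u):
        # Encode the text and hand the bytes to the next writer down.
        return self.stream.write(u.encode('utf-8', self.errors))

class XorWriter:
    """Toy 'cipher' writer: XORs each byte with a repeating key."""
    def __init__(self, stream, errors='strict', key=b'\x00'):
        self.stream = stream
        self.errors = errors
        self.key = key      # extra parameter passed by keyword
        self.pos = 0        # state: position in the key stream
    def write(self, data):
        key = self.key
        out = bytes(b ^ key[(self.pos + i) % len(key)]
                    for i, b in enumerate(data))
        self.pos += len(data)
        return self.stream.write(out)

raw = io.BytesIO()
sf = XorWriter(raw, key=b'xxxxxxxx')
sfx = UTF8Writer(sf)
sfx.write('asdfasdfasdfasdf')
```

Because every writer only needs a write() method on the object below
it, the order of stacking decides the order of the transformations.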

Ahem, ...I'm getting distracted here :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/