[Python-Dev] Stateful codecs [Was: str object going in Py3K]

Sat Feb 18 18:24:46 CET 2006

Walter Dörwald wrote:
> M.-A. Lemburg wrote:
>> Walter Dörwald wrote:
>>>>>> I'd suggest we keep codecs.lookup() the way it is and
>>>>>> instead add new functions to the codecs module, e.g.
>>>>>> codecs.getencoderobject() and codecs.getdecoderobject().
>>>>>>
>>>>>> Changing the codec registration is not much of a problem:
>>>>>> we could simply allow 6-tuples to be passed into the
>>>>>> registry.
>>>>> OK, so codecs.lookup() returns 4-tuples, but the registry stores 6-tuples and the search functions must return 6-tuples.
>>>>> And we add codecs.getencoderobject() and codecs.getdecoderobject() as well as new classes codecs.StatefulEncoder and
>>>>> codecs.StatefulDecoder. What about old search functions that return 4-tuples?
>>>> The registry should then simply set the missing entries to None and the getencoderobject()/getdecoderobject() would then
>>>> have to raise an error.
>>> Sounds simple enough and we don't lose backwards compatibility.
>>>
>>>> Perhaps we should also deprecate codecs.lookup() in Py 2.5 ?!
>>> +1, but I'd like to have a replacement for this, i.e. a function that returns all info the registry has about an encoding:
>>>
>>> 1. Name
>>> 2. Encoder function
>>> 3. Decoder function
>>> 4. Stateful encoder factory
>>> 5. Stateful decoder factory
>>> 6. Stream writer factory
>>> 7. Stream reader factory
>>>
>>> and if this is an object with attributes, we won't have any problems if we extend it in the future.
>> Shouldn't be a problem: just expose the registry dictionary
>> via the _codecs module.
>>
>> The rest can then be done in a Python function defined in
>> codecs.py using a CodecInfo class.
> 
> This would require the Python code to call codecs.lookup() and then look into the codecs dictionary (normalizing the encoding
> name again). Maybe we should make a version of _PyCodec_Lookup() that accepts 4- and 6-tuples available to Python and use that?
> The official PyCodec_Lookup() would then have to downgrade the 6-tuples to 4-tuples.

Hmm, you're right: the dictionary may not have the requested codec
info yet (it's only used as cache) and only a call to _PyCodec_Lookup()
would fill it.

>>> BTW, if we change the API, can we fix the return value of the stateless functions? As the stateless function always
>>> encodes/decodes the complete string, returning the length of the string doesn't make sense.
>>> codecs.getencoder() and codecs.getdecoder() would have to continue to return the old variant of the functions, but
>>> codecs.getinfo("latin-1").encoder would be the new encoding function.
>> No: you can still write stateless encoders or decoders that do
>> not process the whole input string. Just because we don't have
>> any of those in Python, doesn't mean that they can't be written
>> and used. A stateless codec might want to leave the work
>> of buffering bytes at the end of the input data which cannot
>> be processed to the caller.
> 
> But what would the caller do with that info? It can't retry encoding/decoding the rejected input, because the state of the codec
> has been thrown away already.

This depends a lot on the nature of the codec. It may well be
possible to work on chunks of input data in a stateless way,
e.g. say you have a string of 4-byte hex values, then the decode
function would be able to work on 4 bytes each and let the caller
buffer any remaining bytes for the next call. There'd be no need for
keeping state in the decoder function.
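
A minimal sketch of that idea, assuming each group of 4 hex digits encodes
one code point. hex4_decode() and decode_chunks() are hypothetical names,
not existing codecs:

```python
def hex4_decode(data):
    """Stateless: decode only complete 4-byte groups, report bytes consumed."""
    usable = len(data) - len(data) % 4
    chars = [chr(int(data[i:i + 4], 16)) for i in range(0, usable, 4)]
    return "".join(chars), usable


def decode_chunks(chunks):
    """Caller-side buffering: carry unconsumed bytes into the next call."""
    pending = b""
    out = []
    for chunk in chunks:
        pending += chunk
        text, consumed = hex4_decode(pending)
        out.append(text)
        pending = pending[consumed:]
    if pending:
        raise ValueError("truncated input: %r" % pending)
    return "".join(out)
```

Here the returned length is what tells the caller how much to keep for the
next call, e.g. hex4_decode(b"004100") consumes 4 of the 6 bytes and the
caller holds on to the trailing b"00".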

>> It is also possible to write
>> stateful codecs on top of such stateless encoding and decoding
>> functions.
> 
> That's what the codec helper functions from Python/_codecs.c are for.

I'm not sure what you mean here.

> Anyway, I've started implementing a patch that just adds codecs.StatefulEncoder/codecs.StatefulDecoder. UTF-8, UTF-8-Sig, UTF-16,
> UTF-16-LE and UTF-16-BE are already working.

Nice :-)

-- 
Marc-Andre Lemburg
eGenix.com
