[Python-Dev] Stateful codecs [Was: str object going in Py3K]

Sat Feb 18 22:08:19 CET 2006

M.-A. Lemburg wrote:
> Walter Dörwald wrote:
>> M.-A. Lemburg wrote:
>>> Walter Dörwald wrote:
>>>> [...]
>>>>> Perhaps we should also deprecate codecs.lookup() in Py 2.5 ?!
>>>> +1, but I'd like to have a replacement for this, i.e. a function that returns all info the registry has about an encoding:
>>>>
>>>> 1. Name
>>>> 2. Encoder function
>>>> 3. Decoder function
>>>> 4. Stateful encoder factory
>>>> 5. Stateful decoder factory
>>>> 6. Stream writer factory
>>>> 7. Stream reader factory
>>>>
>>>> and if this is an object with attributes, we won't have any problems if we extend it in the future.
>>> Shouldn't be a problem: just expose the registry dictionary
>>> via the _codecs module.
>>>
>>> The rest can then be done in a Python function defined in
>>> codecs.py using a CodecInfo class.
>>
>> This would require the Python code to call codecs.lookup() and then look into the codecs dictionary (normalizing the
>> encoding name again). Maybe we should expose a version of _PyCodec_Lookup() to Python that allows both 4- and 6-tuples,
>> and use that? The official PyCodec_Lookup() would then have to downgrade the 6-tuples to 4-tuples.
>
> Hmm, you're right: the dictionary may not have the requested codec info yet (it's only used as cache) and only a call to
> _PyCodec_Lookup() would fill it.

I'm now trying a different approach: codecs.lookup() returns a subclass of tuple. We could deprecate calling __getitem__() in
2.5/2.6 and then remove the tuple subclassing later.
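
To make the tuple-subclass idea concrete, here is a minimal sketch in present-day Python 3. The class name CodecInfoSketch and the attribute names are assumptions made for illustration only; they are not taken from the patch.

```python
import codecs


class CodecInfoSketch(tuple):
    """Hypothetical lookup result: still a 4-tuple, but with named attributes."""

    def __new__(cls, name, encoder, decoder, stream_reader, stream_writer,
                stateful_encoder=None, stateful_decoder=None):
        # The tuple part keeps the old 4-item layout, so existing callers
        # that unpack or index the result keep working.
        self = super().__new__(
            cls, (encoder, decoder, stream_reader, stream_writer))
        # New information lives in attributes only, so more can be added
        # later without changing the tuple length.
        self.name = name
        self.encoder = encoder
        self.decoder = decoder
        self.stream_reader = stream_reader
        self.stream_writer = stream_writer
        self.stateful_encoder = stateful_encoder
        self.stateful_decoder = stateful_decoder
        return self


info = CodecInfoSketch(
    "latin-1",
    codecs.latin_1_encode,
    codecs.latin_1_decode,
    codecs.StreamReader,
    codecs.StreamWriter,
)

# Old style: unpack the 4-tuple.
encoder, decoder, reader, writer = info
# New style: attribute access.
encoded, length = info.encoder("\xe4")
```

Old code that unpacks the result is unaffected, while new code can move to attribute access before __getitem__() is deprecated.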

>>>> BTW, if we change the API, can we fix the return value of the stateless functions? As the stateless function always
>>>> encodes/decodes the complete string, returning the length of the string doesn't make sense. codecs.getencoder() and
>>>> codecs.getdecoder() would have to continue to return the old variant of the functions, but
>>>> codecs.getinfo("latin-1").encoder would be the new encoding function.
>>> No: you can still write stateless encoders or decoders that do
>>> not process the whole input string. Just because we don't have
>>> any of those in Python, doesn't mean that they can't be written
>>> and used. A stateless codec might want to leave the work of
>>> buffering bytes at the end of the input data which cannot
>>> be processed to the caller.
>>
>> But what would the caller do with that info? It can't retry encoding/decoding the rejected input, because the state of the
>> codec has been thrown away already.
>
> This depends a lot on the nature of the codec. It may well be
> possible to work on chunks of input data in a stateless way,
> e.g. say you have a string of 4-byte hex values, then the decode
> function would be able to work on 4 bytes each and let the caller
> buffer any remaining bytes for the next call. There'd be no need
> for keeping state in the decoder function.

So an incomplete byte sequence at the end of the input would be silently ignored.
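
For reference, the 4-byte hex example could look roughly like the sketch below. The function names hex4_decode and decode_chunks are made up for illustration, and the input is assumed to be ASCII hex digits where every group of four encodes one code point.

```python
def hex4_decode(data, errors="strict"):
    """Hypothetical stateless decoder: 4 ASCII hex digits per character.

    Returns (text, consumed). Trailing bytes that do not form a complete
    4-byte group are not consumed and not reported as an error.
    The errors argument is accepted but unused in this sketch.
    """
    usable = len(data) - len(data) % 4
    chars = [
        chr(int(data[i:i + 4].decode("ascii"), 16))
        for i in range(0, usable, 4)
    ]
    return "".join(chars), usable


def decode_chunks(chunks):
    """Caller-side buffering: keep whatever the decoder did not consume."""
    pending = b""
    parts = []
    for chunk in chunks:
        pending += chunk
        text, consumed = hex4_decode(pending)
        parts.append(text)
        pending = pending[consumed:]
    # Anything left in pending is an incomplete group; the decoder never
    # complains about it, so detecting it is up to the caller.
    return "".join(parts), pending


text, leftover = decode_chunks([b"00410", b"0420043"])
```

Here the returned count is what lets the caller carry the unconsumed tail into the next call. Whether leftover bytes after the last chunk count as an error is something only the caller can decide, since the function itself has no notion of "final".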

>>> It is also possible to write
>>> stateful codecs on top of such stateless encoding and decoding
>>> functions.
>>
>> That's what the codec helper functions from Python/_codecs.c are for.
>
> I'm not sure what you mean here.

_codecs.utf_8_decode() etc. use (result, count) tuples as the return value, because those functions are the building blocks of
the codecs themselves.
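
A short illustration of that return value, as it behaves in current CPython (the input bytes are just an example):

```python
import codecs

# "\xc3\xa4" is a complete UTF-8 sequence; the trailing "\xc3" is the
# first byte of a sequence that has not been completed yet.
data = b"\xc3\xa4\xc3"

# With final=False the helper decodes what it can and reports how many
# bytes it consumed, leaving the incomplete tail for the next call.
text, consumed = codecs.utf_8_decode(data, "strict", False)

# With final=True the same input is an error, because no more data
# is expected to complete the sequence.
try:
    codecs.utf_8_decode(data, "strict", True)
    failed = False
except UnicodeDecodeError:
    failed = True
```

The count only carries information because the helper is allowed to stop before the end of the input.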

>> Anyway, I've started implementing a patch that just adds codecs.StatefulEncoder/codecs.StatefulDecoder. UTF8, UTF8-Sig,
>> UTF-16, UTF-16-LE and UTF-16-BE are already working.
>
> Nice :-)

gencodec.py is updated now too. The rest should be manageable as well. I'll leave updating the CJKV codecs to Hye-Shik though.
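
To give an idea of the shape of a stateful decoder built on a (result, count) helper, here is a minimal sketch in present-day Python 3. It is not the code from the patch; the class name and the decode(data, final) signature are assumptions for illustration.

```python
import codecs


class StatefulUTF8DecoderSketch:
    """Hypothetical stateful decoder layered on codecs.utf_8_decode()."""

    def __init__(self, errors="strict"):
        self.errors = errors
        self.buffer = b""  # undecoded bytes carried over between calls

    def decode(self, data, final=False):
        # Prepend the bytes left over from the previous call.
        data = self.buffer + data
        text, consumed = codecs.utf_8_decode(data, self.errors, final)
        # Keep whatever the helper could not decode yet.
        self.buffer = data[consumed:]
        return text


decoder = StatefulUTF8DecoderSketch()
# The two bytes of one character arrive in separate calls.
parts = [
    decoder.decode(b"\xc3"),
    decoder.decode(b"\xa4"),
    decoder.decode(b"", final=True),
]
```

The only state kept between calls is the undecoded tail, so the caller no longer has to do any buffering.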

Bye,
   Walter Dörwald