[Python-Dev] Add transform() and untransform() methods

Nick Coghlan ncoghlan at gmail.com
Fri Nov 15 08:13:34 CET 2013


On 15 November 2013 11:10, Terry Reedy <tjreedy at udel.edu> wrote:
> On 11/14/2013 5:32 PM, Victor Stinner wrote:
>
>> I don't like the functions codecs.encode() and codecs.decode() because
>> the type of the result depends on the encoding (second parameter). We
>> try to avoid this in Python.
>
>
> Such dependence is common with arithmetic.
>
>>>> 1 + 2
> 3
>>>> 1 + 2.0
> 3.0
>>>> 1 + 2+0j
> (3+0j)
>
>>>> sum((1,2,3), 0)
> 6
>>>> sum((1,2,3), 0.0)
> 6.0
>>>> sum((1,2,3), 0.0+0j)
> (6+0j)
>
> for f in (compile, eval, getattr, iter, max, min, next, open, pow, round,
> type, vars):
>   type(f(*args)) # depends on the inputs
> That is a large fraction of the non-class builtin functions.

*Type* dependence between inputs and outputs is common (and completely
non-controversial). The codecs system is different, since the
supported input and output types are *value* dependent, driven by the
name of the codec.
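
For illustration (a Python 3.4 session; "rot_13" is a str-to-str
codec, "zlib_codec" a bytes-to-bytes codec, and "utf-8" an ordinary
text encoding, so the output type follows the codec name rather than
any fixed signature):

    >>> import codecs
    >>> codecs.encode("hello", "utf-8")        # str in, bytes out
    b'hello'
    >>> codecs.encode("hello", "rot_13")       # str in, str out
    'uryyb'
    >>> codecs.encode(b"hello", "zlib_codec")  # bytes in, bytes out
    b'x\x9c\xcbH\xcd\xc9\xc9\x07\x00\x06,\x02\x15'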

That's the part which makes the codec machinery interesting in
general: it combines a value driven lazy loading mechanism (based on
the codec name) with the subsequent invocation of the loaded codec.
The default codec search algorithm goes hunting in the "encodings"
package (or the alias dictionary), but you can register custom search
functions and provide encodings any way you want. It does mean,
however, that the most you can claim for the type signature of
codecs.encode and codecs.decode is that they accept an object and
return an object; beyond that, the behaviour is completely driven by
the value of the codec name.
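
As a minimal sketch of that machinery (the "reverse" codec here is
invented purely for illustration):

    import codecs

    def _search(name):
        # Value driven lazy loading: the codec is only built when its
        # name is actually requested by a lookup.
        if name != "reverse":
            return None
        def encode(input, errors="strict"):
            return input[::-1], len(input)
        def decode(input, errors="strict"):
            return input[::-1], len(input)
        return codecs.CodecInfo(encode=encode, decode=decode, name="reverse")

    codecs.register(_search)
    print(codecs.encode("stressed", "reverse"))  # -> desserts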

In Python 2.x, the type constraints imposed by the str and unicode
convenience methods are "basestring in, basestring out". As it
happens, all of the standard library codecs abide by that restriction,
so it was easy to interpret the codecs module itself as having the
same "basestring in, basestring out" limitation, especially given the
heavy focus on text encodings in the way it was documented. In
practice, the codecs weren't that open ended: some of them only
accepted 8 bit strings, some only accepted unicode, and some accepted
both (perhaps relying on implicit decoding to unicode).
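
For example, under Python 2.7 every line below stays within the
"basestring in, basestring out" contract, while the concrete types
still vary with the codec:

    >>> "text".encode("base64")        # 8 bit str in, 8 bit str out
    'dGV4dA==\n'
    >>> "dGV4dA==\n".decode("base64")  # 8 bit str in, 8 bit str out
    'text'
    >>> "text".decode("utf-8")         # 8 bit str in, unicode out
    u'text'
    >>> u"text".encode("utf-8")        # unicode in, 8 bit str out
    'text'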

The migration to Python 3 made the contrast between the two far more
stark, however, hence the long and involved discussion on issue 7475,
and the fact that the non-Unicode codecs are currently still missing
their shorthand aliases.

The proposal I posted to issue 7475 back in April (and, in the absence
of any objections, finally implemented over the past few weeks) was to
take advantage of the fact that the codecs.encode and codecs.decode
convenience functions exist (and have been covered by the regression
test suite) as far back as Python 2.4. I did this merely by
documenting the existence of the functions for Python 2.7, 3.3 and
3.4, changing the exception messages thrown for codec output type
errors on the convenience methods to reference them, and by updating
the Python 3.4 What's New document to explain the changes.
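
For instance, the updated message for a non-text codec on the str
convenience method now reads along these lines (wording as given in
the Python 3.4 What's New):

    >>> "hello".encode("rot13")
    Traceback (most recent call last):
      ...
    LookupError: 'rot13' is not a text encoding; use codecs.encode() to handle arbitrary codecs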

This approach provides a Python 2/3 compatible solution for usage of
non-Unicode encodings: users simply need to call the existing module
level functions in the codecs module, rather than using the methods on
specific builtin types. It also means that the binary codecs can be
used with any bytes-like object (including memoryview and
array.array), rather than being limited to types that implement a new
method (like "transform"), and can be used in Python 2/3 source
compatible APIs (the data driven nature of the problem makes 2to3
unusable as a solution, and 2to3 doesn't help single code base
projects anyway).
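
A quick sketch of that usage, assuming the standard "zlib_codec" (the
module level calls themselves are spelled identically under 2.7 and
3.x; exact bytes-like object support varies by codec and version):

    import codecs

    data = bytearray(b"payload")
    compressed = codecs.encode(data, "zlib_codec")              # bytearray accepted
    compressed = codecs.encode(memoryview(data), "zlib_codec")  # memoryview too
    restored = codecs.decode(compressed, "zlib_codec")          # -> b'payload'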

From my point of view, this is now just a matter of better documenting
the status quo, and nudging people in the right direction when it
comes to using the appropriate API for non-Unicode codecs. Since we
now realise these functions have existed since Python 2.4, it doesn't
make sense to try to fundamentally change direction; instead we should
work on making the existing approach better.

A few things I noticed while implementing the recent updates:

- as you noted in your other email, while MAL is on record as saying
the codecs module is intended for arbitrary codecs, not just Unicode
encodings, readers of the current docs can definitely be forgiven for
not realising that. We really need to better separate the codecs
module docs from the text model docs (two new sections in the language
reference, one for the codecs machinery and one for the text model,
would likely be appropriate; the io module docs and those for the
builtin open function may also be affected)
- a mechanism for annotating frames would help avoid the need for
nasty hacks like the exception wrapping that aims to make codec
failures easier to debug
- if codecs exposed a way to separate the input type check from the
invocation of the codec, we could redirect users to the module API for
bad input types as well (e.g. calling "input str".encode("bz2"))
- if we want something that doesn't need to be imported, then encode()
and decode() builtins make more sense than new methods on str, bytes
and bytearray objects (since builtins would support memoryview and
array.array as well, and it avoids ambiguity regarding the direction
of the operation)
- an ABC for "ByteSequence" might be a good idea (ducktyped to check
if memoryview can retrieve a 1-D byte array)
- an ABC for "String" might be a good idea (but opens a big can of worms)
- the codecs module should offer a way to register a new alias for an
existing codec
- the codecs module should offer a way to map a name to a CodecInfo
object without registering a new search function (rough sketches of
both of these ideas follow below)
- encodings should be converted to a namespace package
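
To make the two suggestions about aliases and CodecInfo registration
concrete, here's a rough sketch of what such helpers might look like
if written against today's public codecs API (register_codec and
register_alias are invented names, and codec names are assumed to be
given in already-normalized form, i.e. lowercase with underscores):

    import codecs

    def register_codec(name, codec_info):
        # Map a single name to a single CodecInfo object, without
        # exposing a general purpose search function to callers.
        def _search(requested):
            return codec_info if requested == name else None
        codecs.register(_search)

    def register_alias(alias, existing_name):
        # Resolve the target codec once, then serve it under the alias.
        register_codec(alias, codecs.lookup(existing_name))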

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
