[Python-Dev] Add transform() and untransform() methods

M.-A. Lemburg mal at egenix.com
Fri Nov 15 09:37:09 CET 2013


On 15.11.2013 08:13, Nick Coghlan wrote:
> On 15 November 2013 11:10, Terry Reedy <tjreedy at udel.edu> wrote:
>> On 11/14/2013 5:32 PM, Victor Stinner wrote:
>>
>>> I don't like the functions codecs.encode() and codecs.decode() because
>>> the type of the result depends on the encoding (second parameter). We
>>> try to avoid this in Python.
>>
>>
>> Such dependence is common with arithmetic.
>>
>>>>> 1 + 2
>> 3
>>>>> 1 + 2.0
>> 3.0
>>>>> 1 + 2+0j
>> (3+0j)
>>
>>>>> sum((1,2,3), 0)
>> 6
>>>>> sum((1,2,3), 0.0)
>> 6.0
>>>>> sum((1,2,3), 0.0+0j)
>> (6+0j)
>>
>> for f in (compile, eval, getattr, iter, max, min, next, open, pow, round,
>> type, vars):
>>   type(f(*args)) # depends on the inputs
>> That is a large fraction of the non-class builtin functions.
> 
> *Type* dependence between inputs and outputs is common (and completely
> non-controversial). The codecs system is different, since the
> supported input and output types are *value* dependent, driven by the
> name of the codec.
> 
> That's the part which makes the codec machinery interesting in
> general, since it combines a value driven lazy loading mechanism
> (based on the codec name) with the subsequent invocation of that
> mechanism: the default codec search algorithm goes hunting in the
> "encodings" package (or the alias dictionary), but you can register
> custom search algorithms and provide encodings any way you want. It
> does mean, however, that the most you can claim for the type signature
> of codecs.encode and codecs.decode is that they accept an object and
> return an object. Beyond that, it's completely driven by the value of
> the codec.
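[A quick sketch of the value-driven typing Nick describes, using only stock codecs that ship with Python 3: the input/output types change with the codec *name*, not just with the input type.]

```python
import codecs

# The result type depends on the *value* of the codec name:
codecs.encode("hello", "rot_13")       # str -> str   (text transform)
codecs.encode("hello", "utf-8")        # str -> bytes (text encoding)
codecs.encode(b"hello", "zlib_codec")  # bytes -> bytes (binary transform)
```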

Indeed. You have to think of the codec registry as a mere
lookup mechanism - very much like an import. The implementation
of the imported module defines which types are supported and
how the encode/decode steps work.
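[A minimal illustration of that lookup step: codecs.lookup() resolves the name to a CodecInfo object, and it is the functions on that object which define the supported types.]

```python
import codecs

# codecs.lookup() is the "import" step: the codec name resolves
# to a CodecInfo object whose functions do the actual work.
info = codecs.lookup("utf-8")
print(info.name)                   # utf-8
encoded, consumed = info.encode("hello")
print(encoded, consumed)           # b'hello' 5
```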

> In Python 2.x, the type constraints imposed by the str and unicode
> convenience methods are "basestring in, basestring out". As it happens,
> all of the standard library codecs abide by that restriction, so it
> was easy to interpret the codecs module itself as having the same
> "basestring in, basestring out" limitation, especially given the heavy
> focus on text encodings in the way it was documented. In practice, the
> codecs weren't that open ended - some of them only accepted 8 bit
> strings, some only accepted unicode, and some accepted both (perhaps
> relying on implicit decoding to unicode).
> 
> The migration to Python 3 made the contrast between the two far more
> stark however, hence the long and involved discussion on issue 7475,
> and the fact that the non-Unicode codecs are currently still missing
> their shorthand aliases.
> 
> The proposal I posted to issue 7475 back in April (and, in the absence
> of any objections to the proposal, finally implemented over the past
> few weeks) was to take advantage of the fact that the codecs.encode
> and codecs.decode convenience functions exist (and have been covered
> by the regression test suite) as far back as Python 2.4. I did this
> merely by documenting the existence of the functions for Python 2.7,
> 3.3 and 3.4, changing the exception messages thrown for codec output
> type errors on the convenience methods to reference them, and by
> updating the Python 3.4 What's New document to explain the changes.
> 
> This approach provides a Python 2/3 compatible solution for usage of
> non-Unicode encodings: users simply need to call the existing module
> level functions in the codecs module, rather than using the methods on
> specific builtin types. This approach also means that the binary
> codecs can be used with any bytes-like object (including memoryview
> and array.array), rather than being limited to types that implement a
> new method (like "transform"), and can also be used in Python 2/3
> source compatible APIs (since the data driven nature of the problem
> makes 2to3 unusable as a solution, and that doesn't help single code
> base projects anyway).

Right, and that was the main point in making codecs flexible
in this respect. There are many other types which can serve
as input and output - in the stdlib and interpreter as well as
in extension modules that implement their own types.
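[As a concrete example of the bytes-like flexibility mentioned above, the module-level functions accept any object supporting the buffer protocol, not just bytes and bytearray:]

```python
import codecs
from array import array

# Any bytes-like object works with the module-level functions:
data = codecs.encode(memoryview(b"spam"), "zlib_codec")
data2 = codecs.encode(array("B", b"spam"), "zlib_codec")
print(codecs.decode(data, "zlib_codec"))  # b'spam'
```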

> From my point of view, this is now just a matter of better documenting
> the status quo, and nudging people in the right direction when it
> comes to using the appropriate API for non-Unicode codecs. Since we
> now realise these functions have existed since Python 2.4, it doesn't
> make sense to try to fundamentally change direction, but instead to
> work on making it better.
> 
> A few things I noticed while implementing the recent updates:
> 
> - as you noted in your other email, while MAL is on record as saying
> the codecs module is intended for arbitrary codecs, not just Unicode
> encodings, readers of the current docs can definitely be forgiven for
> not realising that. We really need to better separate the codecs
> module docs from the text model docs (two new sections in the language
> reference, one for the codecs machinery and one for the text model,
> would likely be appropriate; the io module docs and those for the
> builtin open function may also be affected)

Agreed.

> - a mechanism for annotating frames would help avoid the need for
> nasty hacks like the exception wrapping that aims to make codec
> failures easier to debug
> - if codecs exposed a way to separate the input type check from the
> invocation of the codec, we could redirect users to the module API for
> bad input types as well (e.g. calling "input str".encode("bz2"))

This is one feature that's still missing from the codec design:
there's currently no way to do input/output type introspection.
It would be great to have codecs expose mapping of input to
output types in their CodecInfo structure.
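[Purely a hypothetical sketch of what such introspection *could* look like - CodecInfo exposes no type information today, and the `input_type`/`output_type` attributes below are invented for illustration:]

```python
import codecs

# Hypothetical: a CodecInfo subclass carrying declared types.
class TypedCodecInfo(codecs.CodecInfo):
    def __new__(cls, encode, decode, *, input_type=object,
                output_type=object, **kwargs):
        self = super().__new__(cls, encode, decode, **kwargs)
        self.input_type = input_type    # invented attribute
        self.output_type = output_type  # invented attribute
        return self

utf8 = codecs.lookup("utf-8")
typed = TypedCodecInfo(utf8.encode, utf8.decode,
                       input_type=str, output_type=bytes,
                       name="utf-8")
```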

> - if we want something that doesn't need to be imported, then encode()
> and decode() builtins make more sense than new methods on str, bytes
> and bytearray objects (since builtins would support memoryview and
> array.array as well, and it avoids ambiguity regarding the direction
> of the operation)

I'm not sure I understand this part.

> - an ABC for "ByteSequence" might be a good idea (ducktyped to check
> if memoryview can retrieve a 1-D byte array)
> - an ABC for "String" might be a good idea (but opens a big can of worms)
> - the codecs module should offer a way to register a new alias for an
> existing codec

For the builtin codecs, this is already possible via the
encodings.aliases module, but I agree: making the registry more
flexible in this respect would probably be better.
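[For reference, the existing encodings.aliases mechanism mentioned above works like this - the aliases dict maps normalized names to codec module names in the encodings package (the "my_utf8" name is just an example):]

```python
import codecs
import encodings.aliases

# Register an alias for a built-in codec via the aliases dict:
encodings.aliases.aliases["my_utf8"] = "utf_8"
print(codecs.lookup("my_utf8").name)  # utf-8
```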

I must admit that the codecs module is somewhat over-engineered with
respect to the search function idea. The motivation was to abstract
the lookup mechanism, in order to make it possible to e.g.
dynamically load codecs in other ways, to have codecs that
implement a whole set of encodings with a single implementation,
or to implement new ways of aliasing encoding names.

At the time I designed this, it wasn't clear which approach
would be used. Today, I guess most people simply rely on the
search function that is implemented in the encodings package.

It may be a good idea to make the encodings package search
function the preferred search function for codecs in Python
and then slowly deprecate the search function registry and
replace it with a more direct codec registry built into the
encodings package.

> - the codecs module should offer a way to map a name to a CodecInfo
> object without registering a new search function

Yep. See above.
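[Today the only way to map a new name to an existing CodecInfo object is the indirect route of registering a search function; the "my_rot13" name below is just an example of that workaround:]

```python
import codecs

_rot13 = codecs.lookup("rot_13")

def _search(name):
    # Search functions receive the normalized codec name and
    # return a CodecInfo, or None to defer to other functions.
    if name == "my_rot13":
        return _rot13
    return None

codecs.register(_search)
print(codecs.encode("hello", "my_rot13"))  # uryyb
```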

> - encodings should be converted to a namespace package

Agreed.

-- 
Marc-Andre Lemburg
eGenix.com


