[Python-Dev] Add transform() and untransform() methods

Nick Coghlan ncoghlan at gmail.com
Sat Nov 16 10:44:51 CET 2013

On 16 Nov 2013 10:47, "Victor Stinner" <victor.stinner at gmail.com> wrote:
> 2013/11/16 Nick Coghlan <ncoghlan at gmail.com>:
> > To address Serhiy's security concerns with the compression codecs (which are
> > technically independent of the question of restoring the aliases), I also
> > plan to document how to systematically blacklist particular codecs in an
> > application by setting attributes on the encodings module and/or appropriate
> > entries in sys.modules.
> It would be simpler and safer to blacklist bytes=>bytes and str=>str
> codecs from bytes.decode() and str.encode() directly. Marc-Andre
> Lemburg proposed to add new attributes to CodecInfo to specify input
> and output types.

Yes, but that type compatibility introspection is a change for 3.5 at
the earliest (although I commented on
http://bugs.python.org/issue19619 with two alternate suggestions that
I think would be reasonable to implement for 3.4).
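As a concrete illustration of the sys.modules blacklisting approach mentioned above (a sketch, not a documented recipe; the choice of bz2_codec is arbitrary):

```python
import sys
import codecs

# Poison the codec module's entry in sys.modules so the encodings
# package's search function cannot import it; codecs.lookup() then
# fails with LookupError for that codec name.
sys.modules["encodings.bz2_codec"] = None

try:
    codecs.lookup("bz2_codec")
    blocked = False
except LookupError:
    blocked = True

# Other codecs are unaffected by the blacklist entry.
assert codecs.lookup("utf-8") is not None
```

Note this only works if the codec has not already been looked up (and cached) earlier in the process.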

Everything codec related that I am doing at the moment is about
improving the state of 3.4 and source compatible 2/3 code. Proposals
for further 3.5+ only improvements are relevant only in the sense that
I don't want to lock us out from future improvements (which is why my
main aim is to clarify the status quo, with the only functional
changes related to restoring feature parity with Python 2 for
non-Unicode codecs).

> > The only functional *change* I'd still like to make for 3.4 is to restore
> > the shorthand aliases for the non-Unicode codecs (to ease the migration for
> > folks coming from Python 2), but this thread has convinced me I likely need
> > to write the PEP *before* doing that, and I still have to integrate
> > ensurepip into pyvenv before the beta 1 deadline.
> >
> > So unless you and Victor are prepared to +1 the restoration of the codec
> > aliases (closing issue 7475) in anticipation of that codecs infrastructure
> > documentation PEP, the change to restore the aliases probably won't be in
> > 3.4. (I *might* get the PEP written in time regardless, but I'm not betting
> > on it at this point).
> Using the StackOverflow search engine, I found some posts where
> people ask for the "hex" codec on Python 3. There are two answers:
> use the binascii module or use codecs.encode(). So even though
> codecs.encode() was never documented, it looks like it is used. So I
> now agree that documenting it would not make the situation worse.

Aye, that was my conclusion (hence my proposal on issue 7475 back in April).
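For reference, the two answers Victor mentions boil down to the following, and both produce the same bytes:

```python
import binascii
import codecs

data = b"hello"

# The binascii route, documented on Python 3:
assert binascii.hexlify(data) == b"68656c6c6f"

# The codecs.encode() route that the StackOverflow answers point to;
# it works because the bytes-to-bytes "hex_codec" codec is still
# registered even though str.encode() rejects it:
assert codecs.encode(data, "hex_codec") == b"68656c6c6f"
```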

Can I take that observation as a +1 for restoring the aliases as well?
(That and more efficiently rejecting the non-Unicode codecs from
str.encode, bytes.decode and bytearray.decode are the only aspects of
this that are subject to the beta 1 deadline - we can be a bit more
leisurely when it comes to working out the details of the docs
updates.)

> Adding transform()/untransform() methods to bytes and str is a
> non-trivial change and not everybody likes them. Anyway, it's too
> late for Python 3.4.
> In my opinion, the best option is to add the new
> input_type/output_type attributes to CodecInfo right now, and modify
> the codecs so that "abc".encode("hex") raises a LookupError (instead
> of a tricky error message produced with some evil low-level hacks on
> the traceback and the exception, which was my initial concern in
> this mail thread). It also fixes the security vulnerability.

The C level code for catching the input type errors only looks evil because:

- the C level equivalent of "except Exception as Y: raise X from Y"
is just plain ugly in the first place
- the chaining includes a *lot* of checks on the original exception to
ensure that no data is lost by raising a new instance of the same
exception type and chaining it
- it chains ValueError, AttributeError and any other currently
stateless (aside from a str description) error the codec might throw,
not just input type validation errors (it deliberately doesn't chain
stateful errors, as doing so might be backwards incompatible with
existing error handling).

However, the ugliness of that code is the reason I'm intrigued by the
possibility of traceback annotations as a potentially cleaner solution
than trying to seamlessly wrap exceptions with a new one that adds
more context information. While I think the gain in codec
debuggability is worth it in this case, my concern over the complexity
and the current limitations are the reason I didn't make it a public C

> To keep backward compatibility (even with custom codecs registered
> manually), if input_type/output_type is not defined, we should
> consider that the codec is a classical text encoding (encode
> str=>bytes, decode bytes=>str).

Without an already existing ByteSequence ABC, it isn't feasible to
propose and implement this completely in the 3.4 time frame (since you
would need such an ABC to express the input type accepted by our
Unicode and binary codecs - the only one that wouldn't need it is
rot_13, since that's str->str).
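A hedged sketch of Victor's proposed default, using the *hypothetical* input_type/output_type attribute names from the proposal (real CodecInfo objects have neither attribute, so today this check always falls back to the default):

```python
import codecs

def is_text_encoding(encoding):
    """Apply the proposed backward-compatibility rule: codecs that
    declare no input/output types are assumed to be classic text
    encodings (encode str=>bytes, decode bytes=>str)."""
    info = codecs.lookup(encoding)
    # Hypothetical attributes from the proposal; absent on current
    # CodecInfo objects, so getattr() falls back to None.
    input_type = getattr(info, "input_type", None)
    output_type = getattr(info, "output_type", None)
    if input_type is None and output_type is None:
        # Undeclared: assume a classic text encoding.
        return True
    return input_type is str and output_type is bytes

print(is_text_encoding("utf-8"))  # True
```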

However, the output types could be expressed solely as concrete types,
and that's all we need for the blacklist (since we could replace the
current instance check on the result with a subclass check on the
specified output type (if any) prior to decoding).
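That distinction is easy to demonstrate with the hex codec: the codecs module functions run it happily, while bytes.decode() rejects it. Today the rejection happens by checking the codec (or its result); a declared concrete output type would allow the same rejection with a simple subclass check before running the codec at all.

```python
import codecs

# The codecs-module functions run the bytes-to-bytes codec:
assert codecs.decode(b"68656c6c6f", "hex_codec") == b"hello"

# bytes.decode() rejects it, since it can never yield str:
try:
    b"68656c6c6f".decode("hex_codec")
    rejected = False
except LookupError:
    rejected = True
```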
