[Python-Dev] Add transform() and untransform() methods

Sat Nov 16 01:47:38 CET 2013

2013/11/16 Nick Coghlan <ncoghlan at gmail.com>:
> To address Serhiy's security concerns with the compression codecs (which are
> technically independent of the question of restoring the aliases), I also
> plan to document how to systematically blacklist particular codecs in an
> application by setting attributes on the encodings module and/or appropriate
> entries in sys.modules.

It would be simpler and safer to blacklist bytes=>bytes and str=>str
codecs from bytes.decode() and str.encode() directly. Marc-Andre
Lemburg proposed adding new attributes to CodecInfo to specify the input
and output types.

> The only functional *change* I'd still like to make for 3.4 is to restore
> the shorthand aliases for the non-Unicode codecs (to ease the migration for
> folks coming from Python 2), but this thread has convinced me I likely need
> to write the PEP *before* doing that, and I still have to integrate
> ensurepip into pyvenv before the beta 1 deadline.
>
> So unless you and Victor are prepared to +1 the restoration of the codec
> aliases (closing issue 7475) in anticipation of that codecs infrastructure
> documentation PEP, the change to restore the aliases probably won't be in
> 3.4. (I *might* get the PEP written in time regardless, but I'm not betting
> on it at this point).

Using the Stack Overflow search engine, I found some posts where people
ask for a "hex" codec on Python 3. There are two answers: use the binascii
module or use codecs.encode(). So even if codecs.encode() was never
documented, it looks like it is used. So I now agree that documenting
it would not make the situation worse.
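
For reference, the two answers look roughly like this on Python 3. This is
only a minimal sketch: the variable names are illustrative, and the full
codec name "hex_codec" is used rather than the "hex" shorthand, since
restoring that alias is exactly what is under discussion here.

```python
import binascii
import codecs

data = b"abc"

# Answer 1: use the binascii module directly.
hex_via_binascii = binascii.hexlify(data)

# Answer 2: use the module-level codecs.encode() function with the
# bytes=>bytes codec, spelled with its full name.
hex_via_codecs = codecs.encode(data, "hex_codec")

# Both directions work through the codecs module functions.
round_trip = codecs.decode(hex_via_codecs, "hex_codec")

print(hex_via_binascii, hex_via_codecs, round_trip)
```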

Adding transform()/untransform() methods to bytes and str is a
non-trivial change and not everybody likes them. Anyway, it's too late for
Python 3.4.

In my opinion, the best option is to add new input_type/output_type
attributes to CodecInfo right now, and modify the codecs so
"abc".encode("hex") raises a LookupError (instead of a tricky error
message with some evil low-level hacks on the traceback and the
exception, which was my initial concern in this mail thread). It also
fixes the security vulnerability.

To keep backward compatibility (even with custom codecs registered
manually), if input_type/output_type is not defined, we should
consider that the codec is a classical text encoding (encode
str=>bytes, decode bytes=>str).
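
The idea can be sketched in pure Python as follows. This is an illustration
under stated assumptions, not the actual patch: input_type/output_type are
the proposed attribute names and do not exist on CodecInfo today, and
is_text_encoding(), text_encode() and the "demo_hex" codec are hypothetical
helpers invented for this example.

```python
import binascii
import codecs


def is_text_encoding(info):
    # Backward-compatible default: a codec that does not declare the
    # proposed attributes is assumed to be a classical text encoding.
    input_type = getattr(info, "input_type", str)
    output_type = getattr(info, "output_type", bytes)
    return input_type is str and output_type is bytes


def text_encode(text, info):
    # Hypothetical stand-in for the check str.encode() would perform.
    if not is_text_encoding(info):
        raise LookupError("%r is not a text encoding" % info.name)
    return info.encode(text)[0]


def hex_encode(data, errors="strict"):
    return binascii.hexlify(data), len(data)


def hex_decode(data, errors="strict"):
    return binascii.unhexlify(data), len(data)


# A custom bytes=>bytes codec that declares its types (hypothetical).
demo_hex = codecs.CodecInfo(hex_encode, hex_decode, name="demo_hex")
demo_hex.input_type = bytes
demo_hex.output_type = bytes

# An existing codec without the new attributes keeps working as before.
utf8 = codecs.lookup("utf-8")
encoded = text_encode("abc", utf8)

try:
    text_encode("abc", demo_hex)
    rejected = False
except LookupError:
    rejected = True
```

With this default, codecs registered manually before the change behave
exactly as they do now; only codecs that explicitly declare non-text types
are rejected by the str/bytes methods.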

The type of codecs.encode() result is my least concern in this topic.

I created the following issue to implement my idea:
http://bugs.python.org/issue19619

Victor