[Python-Dev] Add transform() and untransform() methods

Victor Stinner victor.stinner at gmail.com
Sat Nov 16 11:45:08 CET 2013


Why not use the str type for str and str subtypes, and the bytes type for
bytes and bytes-like objects (bytearray, memoryview)? I don't think that we
need an ABC here.

Victor
On 16 Nov 2013 10:44, "Nick Coghlan" <ncoghlan at gmail.com> wrote:

> On 16 Nov 2013 10:47, "Victor Stinner" <victor.stinner at gmail.com> wrote:
> >
> > 2013/11/16 Nick Coghlan <ncoghlan at gmail.com>:
> > > To address Serhiy's security concerns with the compression codecs
> > > (which are technically independent of the question of restoring the
> > > aliases), I also plan to document how to systematically blacklist
> > > particular codecs in an application by setting attributes on the
> > > encodings module and/or appropriate entries in sys.modules.
> >
> > It would be simpler and safer to blacklist bytes=>bytes and str=>str
> > codecs from bytes.decode() and str.encode() directly. Marc-Andre
> > Lemburg proposed adding new attributes to CodecInfo to specify input
> > and output types.
>
> Yes, but that type compatibility introspection is a change for 3.5 at
> the earliest (although I commented on
> http://bugs.python.org/issue19619 with two alternate suggestions that
> I think would be reasonable to implement for 3.4).
>
> Everything codec related that I am doing at the moment is about
> improving the state of 3.4 and source compatible 2/3 code. Proposals
> for further 3.5+ only improvements are relevant only in the sense that
> I don't want to lock us out from future improvements (which is why my
> main aim is to clarify the status quo, with the only functional
> changes related to restoring feature parity with Python 2 for
> non-Unicode codecs).
>
> > > The only functional *change* I'd still like to make for 3.4 is to
> > > restore the shorthand aliases for the non-Unicode codecs (to ease the
> > > migration for folks coming from Python 2), but this thread has
> > > convinced me I likely need to write the PEP *before* doing that, and I
> > > still have to integrate ensurepip into pyvenv before the beta 1
> > > deadline.
> > >
> > > So unless you and Victor are prepared to +1 the restoration of the
> > > codec aliases (closing issue 7475) in anticipation of that codecs
> > > infrastructure documentation PEP, the change to restore the aliases
> > > probably won't be in 3.4. (I *might* get the PEP written in time
> > > regardless, but I'm not betting on it at this point).
> >
> > Using the Stack Overflow search engine, I found some posts where
> > people ask for the "hex" codec on Python 3. There are two answers: use
> > the binascii module or use codecs.encode(). So even though
> > codecs.encode() was never documented, it looks like it is used. So I
> > now agree that documenting it would not make the situation worse.
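For reference, the two answers mentioned above can be sketched as follows.
This is only a minimal illustration, assuming the long-form codec name
"hex_codec" (the short "hex" alias is the one this thread proposes to
restore); the variable names are made up for the example.

```python
import binascii
import codecs

data = b"abc"

# Answer 1: the binascii module.
hex_via_binascii = binascii.hexlify(data)

# Answer 2: the codecs module-level function, using the long codec name.
hex_via_codecs = codecs.encode(data, "hex_codec")

# Both routes give the same bytes => bytes result, and it round-trips.
assert hex_via_binascii == hex_via_codecs
assert codecs.decode(hex_via_codecs, "hex_codec") == data
```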
>
> Aye, that was my conclusion (hence my proposal on issue 7475 back in
> April).
>
> Can I take that observation as a +1 for restoring the aliases as well?
> (That and more efficiently rejecting the non-Unicode codecs from
> str.encode, bytes.decode and bytearray.decode are the only aspects of
> this subject to the beta 1 deadline - we can be a bit more leisurely
> when it comes to working out the details of the docs updates)
>
> > Adding transform()/untransform() methods to bytes and str is a
> > non-trivial change and not everybody likes them. Anyway, it's too late
> > for Python 3.4.
> >
> > In my opinion, the best option is to add new input_type/output_type
> > attributes to CodecInfo right now, and modify the codecs so
> > "abc".encode("hex") raises a LookupError (instead of a tricky error
> > message with some evil low-level hacks on the traceback and the
> > exception, which was my initial concern in this mail thread). It also
> > fixes the security vulnerability.
>
> The C level code for catching the input type errors only looks evil
> because:
>
> - the C level equivalent of "except Exception as Y: raise X from Y"
> is just plain ugly in the first place
> - the chaining includes a *lot* of checks of the original exception to
> ensure that no data is lost by raising a new instance of the same
> exception type and chaining
> - it chains ValueError, AttributeError and any other currently
> stateless (aside from a str description) error the codec might throw,
> not just input type validation errors (it deliberately doesn't chain
> stateful errors as doing so might be backwards incompatible with
> existing error handling).
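At the Python level, the chaining described in the list above looks roughly
like the sketch below. The helper names chain_with_context and
decode_with_context, and the exact "stateless" test, are assumptions made
for illustration; this is not the actual C implementation.

```python
import codecs


def chain_with_context(exc, message):
    """Return a new exception of exc's type chained to exc, or None.

    Hypothetical helper: only exceptions whose state is at most a single
    str argument are wrapped, so no data is lost by re-creating them.
    """
    stateless = (
        len(exc.args) <= 1
        and all(isinstance(arg, str) for arg in exc.args)
        and not getattr(exc, "__dict__", None)
    )
    if not stateless:
        return None
    new_exc = type(exc)("{} ({}: {})".format(message, type(exc).__name__, exc))
    new_exc.__cause__ = exc  # same link that "raise new_exc from exc" sets
    return new_exc


def decode_with_context(data, encoding):
    """Decode data, adding the codec name to stateless errors."""
    try:
        return codecs.decode(data, encoding)
    except Exception as exc:
        wrapped = chain_with_context(
            exc, "decoding with %r codec failed" % encoding)
        if wrapped is None:
            raise  # stateful error: leave it untouched
        raise wrapped from exc
```

Stateful errors such as UnicodeDecodeError carry several arguments, so the
helper returns None for them and the original exception propagates as is.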
>
> However, the ugliness of that code is the reason I'm intrigued by the
> possibility of traceback annotations as a potentially cleaner solution
> than trying to seamlessly wrap exceptions with a new one that adds
> more context information. While I think the gain in codec
> debuggability is worth it in this case, my concerns over the complexity
> and the current limitations are the reason I didn't make it a public C
> API.
>
> > To keep backward compatibility (even with custom codecs registered
> > manually), if input_type/output_type is not defined, we should
> > consider that the codec is a classical text encoding (encode
> > str=>bytes, decode bytes=>str).
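The backward-compatible default described above could be sketched like this.
The attribute names input_type and output_type come from the proposal and do
not exist on CodecInfo today; the helper encode_types and the "demo-hex"
codec are hypothetical and exist only for this example.

```python
import codecs


def encode_types(info):
    """Return (input_type, output_type) for info.encode().

    input_type and output_type are the proposed (hypothetical) attributes;
    a codec that does not define them is treated as a text encoding.
    """
    return (getattr(info, "input_type", str),
            getattr(info, "output_type", bytes))


# An existing codec without the attributes falls back to str => bytes.
legacy = codecs.lookup("utf-8")

# A fresh CodecInfo built for illustration and declared as bytes => bytes.
_hex = codecs.lookup("hex_codec")
declared = codecs.CodecInfo(_hex.encode, _hex.decode, name="demo-hex")
declared.input_type = bytes
declared.output_type = bytes

assert encode_types(legacy) == (str, bytes)
assert encode_types(declared) == (bytes, bytes)
```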
>
> Without an already existing ByteSequence ABC, it isn't feasible to
> propose and implement this completely in the 3.4 time frame (since you
> would need such an ABC to express the input type accepted by our
> Unicode and binary codecs - the only one that wouldn't need it is
> rot_13, since that's str->str).
>
> However, the output types could be expressed solely as concrete types,
> and that's all we need for the blacklist (since we could replace the
> current instance check on the result with a subclass check on the
> specified output type (if any) prior to decoding).
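A rough sketch of such a subclass check prior to decoding, assuming the
declared output types live in a plain dict rather than on CodecInfo itself.
The names DECODE_OUTPUT_TYPES and text_decode are hypothetical, and the real
check would sit inside bytes.decode() rather than in a wrapper function.

```python
import codecs

# Hypothetical stand-in for an output_type attribute on CodecInfo: maps a
# codec's canonical name to the type its decode() returns.
DECODE_OUTPUT_TYPES = {
    codecs.lookup(alias).name: bytes
    for alias in ("hex_codec", "base64_codec")
}


def text_decode(data, encoding):
    """Decode data to str, rejecting codecs declared to return non-str."""
    info = codecs.lookup(encoding)
    # Codecs with no declared output type are assumed to be text encodings.
    output_type = DECODE_OUTPUT_TYPES.get(info.name, str)
    if not issubclass(output_type, str):
        # Rejected before any decoding work is done.
        raise LookupError("%r is not a text encoding" % encoding)
    text, _consumed = info.decode(data)
    return text
```

Because the check happens before the codec runs, a bytes=>bytes codec is
never invoked on the input at all.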
>
> Cheers,
> Nick.
>