[issue19619] Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()

Fri Nov 22 13:09:27 CET 2013

Marc-Andre Lemburg added the comment:

On 22.11.2013 12:43, STINNER Victor wrote:
> 
> STINNER Victor added the comment:
> 
> If _is_text_encoding may change in Python 3.5, you should add a comment to warn users to not use it and explain its purpose, maybe with a reference to this issue.

+1

> --
> 
> We have talking about a very few codecs:
> 
> * base64: bytes => bytes
> * bz2: bytes => bytes
> * hex: bytes => bytes; decode supports also ASCII string (str) => bytes
> * quopri: bytes => bytes
> * rot_13: str => str
> * uu: bytes => bytes
> * zlib: bytes => bytes
> 
> I suppose that supporting ASCII string input to the hex decoder is a border effect of its implementation. I don't know if it is expected *for the codec*.
> 
> If we simplify the hex decoder to reject str types, all these codecs would have simply one type: same input and output type. Anyway, if you want something based on types, the special case for the hex decoder cannot be expressed with a type nor ABC. "ASCII string" is not a type.
> 
> So instead of  _is_text_encoding=False could be transform=bytes or transform=str. (I don't care of the name: transform_type, type, codec_type, data_type, etc.)
> 
> I know that bytes is not exact: bytearray, memoryview and any bytes-like object is accepted, but it is a probably enough for now.

I think it's better to go with something that's explicitly internal
now than to fix a public API in form of a constructor parameter
this late in the release process.

For 3.5 it may make sense to declare a few codec feature flags which
would then make lookups such as the one done for the blacklist easier
to implement and faster to check as well.

Such flags could provide introspection at a higher level than what
would be possible with type mappings (even though I still like the
idea of adding those to CodecInfo at some point).

One possible use for such flags would be to declare whether a
codec is reversible or not - in other words, whether .decode(.encode(x))
works for all possible inputs x. This flag could then be used to
quickly check whether a codec would fail on a Unicode str which
has non-Latin-1 code points or to create a list of valid encodings
for certain applications, e.g. a list which only contains reversible
Unicode encodings such as the UTF ones.

Anyway: Thanks to Nick for implementing this, to Serhiy for the black
list idea and Victor for the attribute idea :-)

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue19619>
_______________________________________