[Python-Dev] accept string in a2b and base64?

Stephen J. Turnbull stephen at xemacs.org
Wed Feb 22 08:37:55 CET 2012


R. David Murray writes:

 > If most people agree with Antoine I won't fight it, but it seems to me
 > that accepting unicode in the binascii and base64 APIs is a bad
 > idea.

First, I agree with David that this change should have been brought up
on python-dev before committing it.  The distinctions Python 3 has
made between APIs for bytes and those for str are both obviously
controversial and genuinely delicate.

Second, if Unicode is to be accepted in these APIs, there is a doc
issue (which I haven't checked).  It must be made clear that the
"printable ASCII" in question is the set represented by the *integers*
33 to 126, *not* the ASCII characters ! to ~.  Those characters are
present in the Unicode repertoire in many other places (specifically
the "full-width ASCII" compatibility character set around U+FF20, but
also several Greek and Cyrillic characters, and possibly others.)
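
To illustrate the distinction, here is a minimal sketch, assuming
Python 3.3 or later (where the decoding functions accept ASCII-only
str).  The helper names are hypothetical, made up for this example.
The full-width letters look like "QUJD" but lie outside the 33..126
range, so the decoder refuses them.

```python
import base64

ascii_text = "QUJD"                          # Base64 of b"ABC", plain ASCII
fullwidth_text = "\uff31\uff35\uff2a\uff24"  # full-width look-alikes of Q, U, J, D


def is_printable_ascii(text):
    """True if every character is one of the integers 33 to 126."""
    return all(33 <= ord(ch) <= 126 for ch in text)


def decodes_cleanly(text):
    """True if base64.b64decode accepts the argument (hypothetical helper)."""
    try:
        base64.b64decode(text)
    except ValueError:  # non-ASCII str is rejected with ValueError
        return False
    return True


print(is_printable_ascii(ascii_text), decodes_cleanly(ascii_text))
print(is_printable_ascii(fullwidth_text), decodes_cleanly(fullwidth_text))
```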

I'm going to side with Antoine and Nick on these particular changes
because in practice (except maybe in the email module :-( ) the
BASE-encoded "text" to be decoded is going to be consistently defined
by the client as either str or bytes, but not both.  The fact that the
repr of the encoded text is identical (except for the presence or
absence of a leading "b") is very suggestive here.  I do harbor a
slight niggle that I think there is more room for confusion here than
in Nick's urllib work.

However, once we clarify that confusion in *our* minds, I don't think
there's much potential for dangerous confusion for API clients.  (I
agree with Antoine on that point.)  The BASE## decoding APIs in the
abstract are "text" to bytes.  Pedantically in Python that suggests a
str -> bytes signature, but RFC 4648 doesn't anywhere require a 1-byte
representation of ASCII, only that the representation be interpreted
as integers in the ASCII coding.  However, an RFC-4648-conforming
implementation MUST reject any string containing characters not
allowed in the representation, so it's actually stricter than
requiring ASCII.  I see no problem with allowing str-or-bytes -> bytes
polymorphism here.
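
A short sketch of what that polymorphism looks like from the client
side, again assuming Python 3.3 or later.  The strict_decode wrapper
is a hypothetical name; validate=True is the documented way to ask
b64decode to reject characters outside the Base64 alphabet.

```python
import base64
import binascii

# The same encoded "text", once as str and once as bytes.
from_str = base64.b64decode("aGVsbG8=")
from_bytes = base64.b64decode(b"aGVsbG8=")


def strict_decode(data):
    """Decode, rejecting any character outside the Base64 alphabet."""
    return base64.b64decode(data, validate=True)


try:
    strict_decode("aGVs*bG8=")  # '*' is printable ASCII but not in the alphabet
    rejected = False
except binascii.Error:
    rejected = True

print(from_str == from_bytes, rejected)
```

Note that validate defaults to False, in which case the documentation
says non-alphabet characters are discarded rather than rejected.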

The remaining issue to my mind is that we'd also like bytes -> str-or-bytes
polymorphism for symmetry, but this is not Haskell, so we can't have it.
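
Concretely, the encoder has one return type, so a client that wants
str has to say so.  A minimal sketch; b64encode_str is a hypothetical
convenience wrapper, not something the base64 module provides.

```python
import base64

encoded_bytes = base64.b64encode(b"hello")   # the encoder always returns bytes
encoded_str = encoded_bytes.decode("ascii")  # the caller opts into str explicitly


def b64encode_str(data):
    """Hypothetical wrapper for clients that treat encoded text as str."""
    return base64.b64encode(data).decode("ascii")


# The two reprs differ only by the leading "b".
print(repr(encoded_bytes), repr(encoded_str))
```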

The same is true for binascii, I suppose -- assuming that the module
is specified (as the name suggests) to produce and consume only ASCII
text as a representation of bytes.
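
The binascii case can be sketched the same way, assuming Python 3.3
or later, where the a2b_* functions accept ASCII-only str while the
b2a_* functions still return bytes.

```python
import binascii

hex_bytes = binascii.b2a_hex(b"hello")          # bytes in, ASCII bytes out

decoded_from_str = binascii.a2b_hex("68656c6c6f")      # str accepted
decoded_from_bytes = binascii.a2b_hex(b"68656c6c6f")   # bytes accepted

print(hex_bytes, decoded_from_str == decoded_from_bytes)
```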

