[Python-ideas] Fwd: Python Convert

Andrew Barnert abarnert at yahoo.com
Fri Jul 12 08:47:24 CEST 2013


From: Daniel Rode <dth4h95 at gmail.com>
Sent: Thursday, July 11, 2013 9:10 PM


>I don't think things through very well sometimes.

That's exactly what this list is for. Except for trivial cases (which just get filed as bugs and fixed), nobody ever thinks through all the issues of a cool idea he just had. But a bunch of people together often can, and, if we're lucky, can even find solutions to them.

>I guess the bottom line for me is what you said:
> 
>>In most cases, there are other ways to do it--base64.encode(), binascii.hexlify(), etc. It would be nice if there was a convenient and consistent way to do all of them instead of having to hunt around the stdlib...


Yeah, I understand the motivation, and would love a good solution.

>So have you thought of a solution for this problem? 


I was going to say "If I had, I would have already written a proposal", but then I thought of a solution. Or maybe two.

---

First, how about this:


    b'abc'.b2a('hex') == b'616263'
    b'616263'.a2b('hex') == b'abc'

It works with four binary transfer encodings: 'quopri'/'qp', 'hex', 'b64'/'base64'/'base_64', and 'uu'.

That avoids the confusion with encode/decode, and it implies the implementation pretty nicely, which is basically:

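    def a2b(self, encoding):
        # ASCII -> binary: decode self from the named transfer encoding
        encoding = binascii.aliases.get(encoding, encoding)
        return getattr(binascii, 'a2b_' + encoding)(self)

    def b2a(self, encoding):
        # binary -> ASCII: encode self with the named transfer encoding
        encoding = binascii.aliases.get(encoding, encoding)
        return getattr(binascii, 'b2a_' + encoding)(self)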

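Where binascii.aliases is something like {'quopri': 'qp', 'b64': 'base64', 'base_64': 'base64'}, mapping each friendly name to the suffix binascii uses in its function names (qp, hex, base64, uu). The exact set of aliases can be copied from the 2.7 encodings.aliases.aliases dict.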
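
As a rough sketch of the idea, the same dispatch can be tried today as plain functions, since bytes cannot be given new methods from Python code. The ALIASES table and the free functions a2b/b2a are hypothetical stand-ins for the proposed binascii.aliases and bytes methods. The direction follows binascii's own naming: b2a_* goes from binary to ASCII, a2b_* goes back.

```python
import binascii

# Hypothetical alias table: binascii has no `aliases` attribute today.
# Values are the suffixes binascii really uses: qp, hex, base64, uu.
ALIASES = {
    'quopri': 'qp',
    'b64': 'base64',
    'base_64': 'base64',
}


def b2a(data, encoding):
    """Binary -> ASCII: apply the named transfer encoding to data."""
    name = ALIASES.get(encoding, encoding)
    return getattr(binascii, 'b2a_' + name)(data)


def a2b(data, encoding):
    """ASCII -> binary: undo the named transfer encoding."""
    name = ALIASES.get(encoding, encoding)
    return getattr(binascii, 'a2b_' + name)(data)


hexed = b2a(b'abc', 'hex')      # b'616263'
plain = a2b(hexed, 'hex')       # b'abc'
```

An unknown name simply surfaces as an AttributeError from getattr; a real implementation would presumably want a friendlier LookupError.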

I think these four are all we need, because they're almost all we lost going from 2.x to 3.x.

All of the encodings are still there. However, the ones that can't encode str->bytes and decode bytes->str can't be used with the encode and decode methods, and they've had their friendly aliases removed for safety. So, the real charsets and the text transfer encodings are fine; just the toy rot13, the binary transformations bz2 and zlib, and the binary transfer encodings hex, base64, quopri, and uu can no longer be used with encode/decode. That's it. I don't think anyone cares about rot13, and we can live without bz2 and zlib. So it's really just these four.
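
A small illustration of that situation, assuming Python 3.2 or later, where the bytes-to-bytes codecs are registered under their full *_codec names. The variable names are only for the example.

```python
import codecs

# bytes.decode only accepts text encodings, so this is refused.
try:
    b'616263'.decode('hex')
    refused = False
except LookupError:
    refused = True

# The codecs themselves are still there, reachable via the codecs module.
hexed = codecs.encode(b'abc', 'hex_codec')            # b'616263'
unhexed = codecs.decode(hexed, 'hex_codec')           # b'abc'

b64 = codecs.encode(b'abc', 'base64_codec')
unb64 = codecs.decode(b64, 'base64_codec')            # b'abc'
```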

---

Alternatively, maybe we don't need _any_ language change.

The "right" way to do these four encodings today is to use the binascii, base64, quopri, and uu modules, respectively. However, they have different APIs. (And most of the docs still refer to "strings" instead of bytes, which hints at how rarely people are finding and using these modules.)
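
To make "different APIs" concrete, here is a short comparison of the per-module entry points. It leaves out uu, whose API was file-to-file only (uu.encode(in_file, out_file)) and which was removed from the stdlib in Python 3.13.

```python
import base64
import binascii
import io
import quopri

data = b'a=b'

# Three modules, three naming schemes for the same kind of operation.
hex_text = binascii.hexlify(data)
b64_text = base64.b64encode(data)
qp_text = quopri.encodestring(data)      # "string" in the name, takes bytes

# quopri also has a file-to-file form with a required quotetabs argument.
src, dst = io.BytesIO(data), io.BytesIO()
quopri.encode(src, dst, quotetabs=False)
qp_from_files = dst.getvalue()

# And the inverse functions are named just as inconsistently.
assert binascii.unhexlify(hex_text) == data
assert base64.b64decode(b64_text) == data
assert quopri.decodestring(qp_text) == data
```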

But there's a perfectly consistent API to base64, quopri, and uu sitting in the binascii module. The only problem is that the documentation says "Normally, you will not use these functions directly but use wrapper modules like uu, base64, or binhex instead." For one thing, there is no wrapper module for hexlify. For another, some of these modules aren't actually wrappers around binascii. And as of 3.3, the "low level" functions are actually _more_ usable than the wrappers, because the a2b_* functions can take ASCII-only str arguments. I suspect the reason for that note is that, years ago, you used encode if you wanted the trivial use case, and dug into the specific modules (which offer things like filesystem-safe base64, encoding a whole file at once, etc.) when you needed more. But in 3.x, encode no longer covers the trivial use case.
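
For comparison, a minimal sketch of how uniform the binascii functions are: one loop covers all four encodings, including the str input accepted by a2b_* since 3.3. The results dict is just bookkeeping for the example.

```python
import binascii

data = b'a=b'
results = {}

for name in ('hex', 'base64', 'qp', 'uu'):
    encode = getattr(binascii, 'b2a_' + name)   # binary -> ASCII
    decode = getattr(binascii, 'a2b_' + name)   # ASCII -> binary
    text = encode(data)
    # Round-trip from bytes, and from the equivalent ASCII-only str.
    results[name] = (
        decode(text) == data
        and decode(text.decode('ascii')) == data
    )
```

Note that b2a_base64 and b2a_uu append a newline to their output, which the wrapper modules' one-shot functions such as base64.b64encode do not.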

So, just strike that line from the docs, maybe reference binascii in the codecs docs and the 2->3 transition guide, and we're done.

Yes, binascii is a bit ugly, with abbreviated names for everything, hexlify sitting alongside the "low-level" functions, and a bunch of helper functions for the long-obsolete binhex module, but so what? Is "binascii.b2a_hex(b'abc')" really any worse than "b'abc'.b2a('hex')"?

---

Meanwhile, the left-over bytes-bytes encodings are still theoretically usable in some cases, but is anyone actually using them? Besides str.encode/bytes.decode, you also can't use them with the io module, or any of the higher-level stuff in codecs (which is mostly unnecessary now too, for that matter). So, maybe it's time to deprecate them and eventually remove them?

