Why can't I encode/decode base64 without importing a module?

Hi everyone,

Take a look at this question: http://stackoverflow.com/questions/16122435/python-3-how-do-i-use-bytes-to-b...

Is there really no way to use base64 that's as short as b'whatever'.encode('base64')? Because doing this:

    import codecs
    codecs.decode(b"whatever", "base64_codec")

or this:

    import base64
    encoded = base64.b64encode(b'whatever')

is cumbersome! Why can't I do something like b'whatever'.encode('base64')? Or maybe using a different method than `encode`?

Thanks, Ram.
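[For reference, both import-based spellings from the question do work in Python 3; a minimal runnable sketch of each:]

```python
# The two standard-library spellings mentioned in the question.
import base64
import codecs

data = b"whatever"

# The documented way: the base64 module.
encoded = base64.b64encode(data)
assert encoded == b"d2hhdGV2ZXI="
assert base64.b64decode(encoded) == data

# Via the codecs machinery; note this variant appends a trailing
# newline, because it wraps binascii.b2a_base64.
assert codecs.encode(data, "base64_codec") == b"d2hhdGV2ZXI=\n"
assert codecs.decode(b"d2hhdGV2ZXI=\n", "base64_codec") == data
```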

Hi,

Your question has been discussed for 4 years now in the following issue: http://bugs.python.org/issue7475

The latest proposal is to add transform() and untransform() methods to the bytes and str types. But nobody has implemented the idea. If I remember correctly, the missing piece is how to define which types are supported by a codec (e.g. only bytes for the bz2 codec, bytes and str for rot13).

Victor

2013/4/22 Ram Rachum <ram@rachum.com>:

Victor Stinner wrote:
Also, for any given codec, which direction is "transform" and which is "untransform"? Also also, what's so special about base64 et al that they deserve an ultra-special way of invoking them, instead of having to import a class or function like you do for *every* *other* piece of library functionality? -- Greg

On Tue, 23 Apr 2013 11:16:20 +1200, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
You transform *into* the encoding, and untransform *out* of the encoding. Do you have an example where that would be ambiguous?
You can ask the same question about all the other codecs. (And that question has indeed been asked in the past.) (One answer is that they used to work in Python2...but the longer we go without restoring the functionality to Python3, the weaker that particular argument becomes.) --David

--Guido van Rossum (sent from Android phone) On Apr 22, 2013 6:09 PM, "R. David Murray" <rdmurray@bitdance.com> wrote:
On Tue, 23 Apr 2013 11:16:20 +1200, Greg Ewing <
greg.ewing@canterbury.ac.nz> wrote:
Except for rot13. :-)

On Apr 22, 2013, at 06:22 PM, Guido van Rossum wrote:
The fact that you can do this instead *is* a bit odd. ;)

    from codecs import getencoder
    encoder = getencoder('rot-13')
    r13 = encoder('hello world')[0]

-Barry

On 23.04.2013 17:15, Barry Warsaw wrote:
Just as a reminder: we have the general purpose encode()/decode() functions in the codecs module:

    import codecs
    r13 = codecs.encode('hello world', 'rot-13')

These interface directly to the codec interfaces, without enforcing type restrictions. The codec defines the supported input and output types. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 23 2013)
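[A short runnable illustration of the point above: the codecs-level functions accept whatever types the codec itself defines, so str-to-str and bytes-to-bytes codecs both work.]

```python
# codecs.encode/decode pass straight through to the codec, so the
# input and output types are whatever that codec defines.
import codecs

# rot-13 is a str-to-str codec in Python 3 (and its own inverse).
assert codecs.encode("hello world", "rot-13") == "uryyb jbeyq"
assert codecs.decode("uryyb jbeyq", "rot-13") == "hello world"

# base64 is bytes-to-bytes; the trailing newline comes from
# binascii.b2a_base64, which the codec wraps.
assert codecs.encode(b"whatever", "base64") == b"d2hhdGV2ZXI=\n"
assert codecs.decode(b"d2hhdGV2ZXI=\n", "base64") == b"whatever"
```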
2013-04-17: Released eGenix mx Base 3.2.6 ... http://egenix.com/go43 ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On Tue, Apr 23, 2013 at 8:22 AM, M.-A. Lemburg <mal@egenix.com> wrote:
As an implementation mechanism I see nothing wrong with this. I hope the codecs module lets you introspect the input and output types of a codec given by name? -- --Guido van Rossum (python.org/~guido)

On 23.04.2013 17:47, Guido van Rossum wrote:
At the moment there is no standard interface to access supported input and output types... but then: regular Python functions or methods also don't provide such functionality, so no surprise there ;-) It's mostly a matter of specifying the supported type combinations in the codec documentation.

BTW: What would be a use case where you'd want to programmatically access such information before calling the codec?

-- Marc-Andre Lemburg

On Tue, Apr 23, 2013 at 9:04 AM, M.-A. Lemburg <mal@egenix.com> wrote:
Not quite the same though. Each function has its own unique behavior. But codecs support a standard interface, *except* that the input and output types sometimes vary.
As you know, in Python 3, most code working with bytes doesn't also work with strings, and vice versa (except for a few cases where we've gone out of our way to write polymorphic code -- but users rarely do so, and any time you use a string or bytes literal you basically limit yourself to that type).

Suppose I write a command-line utility that reads a file, runs it through a codec, and writes the result to another file. Suppose the name of the codec is a command-line argument (as well as the filenames). I need to know whether to open the files in text or binary mode based on the name of the codec. -- --Guido van Rossum (python.org/~guido)
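[A sketch of the hypothetical utility described above. The `encode_file` function and its hand-maintained codec table are inventions for illustration: because the codec registry exposes no input/output type information, the tool must hard-code which codecs are bytes-to-bytes, which is exactly the gap being discussed.]

```python
# Sketch: a tool that must guess file modes from the codec name,
# since codecs.lookup() reports no input/output type information.
import codecs

# Hand-maintained table of known bytes-to-bytes codecs (an
# assumption; there is no way to ask the registry for this).
BINARY_CODECS = {"base64", "base64_codec", "hex", "hex_codec",
                 "zlib", "zlib_codec", "bz2", "bz2_codec"}

def encode_file(codec_name, infile, outfile):
    # Binary codecs need bytes input; text encodings take str input.
    canonical = codecs.lookup(codec_name).name
    read_mode = "rb" if canonical in BINARY_CODECS else "r"
    with open(infile, read_mode) as f:
        data = f.read()
    # Either way, the encoded result is bytes: write in binary mode.
    with open(outfile, "wb") as f:
        f.write(codecs.encode(data, codec_name))
```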

On 23.04.2013 19:24, Guido van Rossum wrote:
The codec system itself
Ok, so you need to know which codecs your tool can support, and which of those need text input and which bytes input.

I've been thinking about this some more: I think that type information alone is not flexible enough to cover such use cases. In your use case you'd want to only permit use of a certain set of codecs, not simply all of them, since some might not implement what you actually want to achieve with the tool, e.g. a user might have installed a codec set that adds support for reading and writing image data, but your intended use was to only support text data.

So what we need is a way to allow the codecs to say e.g. "I work on text", "I support encoding bytes and text", "I encode to bytes", "I'm reversible", "I transform input data", "I support bytes and text, and will create same type output", "I work on image data", "I work on X509 certificates", "I work on XML data", etc.

In other words, we need a form of tagging system, with a set of standard tags that each codec can publish, and which also allows non-standard tags (which can then at some point be made standard, if there's agreement on them).

Given a codec name you could then ask the codec registry for the codec tags and verify that the chosen codec handles text data, needs bytes or text encoding input and creates bytes as encoding output. If the registry returns codec tags that don't include the "I work on text" tag, the tool could then raise an error.

-- Marc-Andre Lemburg

On 4/24/2013 1:22 AM, M.-A. Lemburg wrote:
Maybe MIME type and encoding would be sufficient type information, but probably not str vs. bytes.
MIME type supports this sort of concept, with the two-level hierarchy of naming the type... text/xml text/plain image/jpeg
Guess what I think you are re-inventing here.... Nope, guess again.... Yep, MIME types _plus_ encodings.
Hmm. Sounds just like the registry for, um, you guessed it: MIME types.
For just doing text encoding transformations, text/plain would work as a MIME type, and the encodings of interest for the encodings. Seems like "str" always means "Unicode" but the MIME type can vary; "bytes" might mean encoded text, and the MIME type can also vary.

For non-textual transformations, "encoding" might mean Base 64, BinHex, or other such representations... but those can also be applied to text, so it might be a 3rd dimension, or it might just be a list of encodings rather than a single encoding. Compression could be another dimension, or perhaps another encoding.

But really, then, a transformation needs to be a list of steps; a codec can sign up to perform one or more of the steps, a sequence of codecs would have to be found, capable of performing a subsequence of the steps, and then run in the appropriate order.

This all sounds so general that probably the Python compiler could be implemented as a codec :) Or any compiler. Probably a web server could be implemented as a codec too :) Well, maybe not; codecs have limited error handling and reporting abilities.

On 24 Apr 2013 01:25, "M.-A. Lemburg" <mal@egenix.com> wrote:
If we already have those, why aren't they documented? If they exist, they should be the first thing in the codecs module docs and the porting guide should list them as the replacement for the method versions when using encodings that aren't directly related to the text model, or when the input buffer for decoding isn't a bytes or bytearray object. Regards, Nick.

On 23.04.2013 23:37, Nick Coghlan wrote:
Good question. I added them in 2004 and probably just forgot to add the documentation: http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598

I guess the doc-strings could be used as a basis for the documentation.

-- Marc-Andre Lemburg

R. David Murray writes:
You transform *into* the encoding, and untransform *out* of the encoding. Do you have an example where that would be ambiguous?
In the bytes-to-bytes case, any pair of character encodings (eg, UTF-8 and ISO-8859-15) would do. Or how about in text, ReST to HTML?

BASE64 itself is ambiguous. By RFC specification, BASE64 is a *textual* representation of arbitrary binary data. (Cf. URIs.) The natural interpretation of .encode('base64') in that context would be as a bytes-to-text encoder. However, this has several problems. In practice, we invariably use an ASCII octet stream to carry BASE64-encoded data. So web developers would almost certainly expect a bytes-to-bytes encoder. Such a bytes-to-bytes encoder can't be duck-typed. Double-encoding bugs wouldn't be detected until the stream arrives at the user. And the RFC-based signature of .encode('base64') as bytes-to-text is precisely opposite to that of .encode('utf-8') (text-to-bytes).

It is certainly true that there are many unambiguous cases. In the case of a true text processing facility (eg, Emacs buffers or Python 3 str) where there is an unambiguous text type with a constant and opaque internal representation, it makes a lot of sense to treat the text type as special/central, and use the terminology "encode [from text]" and "decode [to text]". It's easy to remember, which one is special is obvious, and the difference in input and output types means that mistaken use of the API will be detected by duck-typing.

However, in the case of bytes-bytes or text-text transformations, it's not the presence of unambiguous cases that should drive API design IMO. It's the presence of the ambiguous cases that we should cater to. I don't see easy solutions to this issue.

Steve

On Tue, 23 Apr 2013 22:29:33 +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
If I write: bytestring.transform('ISO-8859-15') that would indeed be ambiguous, but only because I haven't named the source encoding of the bytestring. So the above is obviously nonsense, and the easiest "fix" is to have the things that are currently bytes-to-text or text-to-bytes character set transformations *only* work with encode/decode, and not transform/untransform.
I believe that after much discussion we have settled on these transformations (in their respective modules) accepting either bytes or strings as input for decoding, only bytes as input for encoding, and *always* producing bytes as output. (Note that the base64 docs need some clarification about this.) Given this, the possible valid transformations would be:

    bytestring.transform('base64')
    bytestring.untransform('base64')
    string.untransform('base64')

and all would produce a byte string. That byte string would be in base64 for the first one, and a decoded binary string for the second two.

Given our existing API, I don't think we want string.encode('base64') to work (taking an ascii-only unicode string and returning bytes), and we've already agreed that adding a 'decode' method to string is not going to happen. We could, however, and quite possibly should, disallow string.untransform('base64') even though the underlying module supports it. Thus we would only have bytes-to-bytes transformations for 'base64' and its siblings, and you would write the unicode-ascii-to-bytes transformation as:

    string.encode('ascii').untransform('base64')

which has some pedagogical value :).

If you do transform('base64') on a bytestring already encoded as base64 you get a double encoding, yes. I don't see that it is our responsibility to try to protect you from this mistake. The module functions certainly don't.

Given that, is there anything ambiguous about the proposed API? (Note: if you would like to argue that, eg, base64.b64encode or binascii.b2a_base64 should return a string, it is too late for that argument for backward compatibility reasons.)
When I asked about ambiguous cases, I was asking for cases where the meaning of "transform('somecodec')" was ambiguous. Sure, it is possible to feed the wrong input into that transformation, but I consider that a programming error, not an ambiguity in the API. After all, you have exactly the same problem if you use the module functions directly, which is currently the only option. --David
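[Since transform()/untransform() were never added, the proposed spellings map onto today's codecs functions; a small sketch, with the hypothetical method names from the proposal shown in comments:]

```python
# The proposed transform API, spelled with today's codecs functions.
import codecs

payload = b"some binary \x00 data"

# bytestring.transform('base64') -> base64-encoded bytes
encoded = codecs.encode(payload, "base64")

# bytestring.untransform('base64') -> decoded bytes
assert codecs.decode(encoded, "base64") == payload

# string.encode('ascii').untransform('base64'): the
# unicode-ascii-to-bytes pipeline from the message above.
text = encoded.decode("ascii")
assert codecs.decode(text.encode("ascii"), "base64") == payload
```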

On Wed, Apr 24, 2013 at 12:16 AM, R. David Murray <rdmurray@bitdance.com> wrote:
And that's where it all falls down - to make that work, you need to engineer a complex system into the codecs module to say "this codec can be used with that API, but not with this one". I designed such a beast in http://bugs.python.org/issue7475 and I now think it's a *bad idea*. By contrast, the convenience function approach dispenses with all that, and simply says:

1. If you just want to deal with text encodings, use str.encode (which always produces bytes), along with bytes.decode and bytearray.decode (which always produce str)

2. If you want to use arbitrary codecs without any additional type constraints, do "from codecs import encode, decode"

I think there's value in hiding the arbitrary codec support behind an import barrier (as they definitely have potential to be an attractive nuisance that makes it harder to grasp the nature of Unicode and text encodings, particularly for those coming from Python 2.x), but I'm not hugely opposed to providing them as builtins either.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
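[Both halves of that rule fit in a few lines; the zlib codec here stands in for any "arbitrary codec without type constraints":]

```python
# Rule 1: the method forms stay within the text model.
assert "hi".encode("utf-8") == b"hi"
assert b"hi".decode("utf-8") == "hi"

# Rule 2: arbitrary codecs live behind an explicit import.
from codecs import encode, decode

blob = b"x" * 1000
packed = encode(blob, "zlib_codec")   # bytes-to-bytes compression
assert len(packed) < len(blob)        # highly compressible input
assert decode(packed, "zlib_codec") == blob
```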

R. David Murray writes:
I think you're completely missing my point here. The problem is that in the cases I mention, what is encoded data and what is decoded data can only be decided by asking the user.
Which, of course, is quite broken from the point of view of the RFC! Of course, the RFC be damned[1], for the purposes of the Python stdlib, the specific codecs used for Content-Transfer-Encoding have a clear intuitive directionality, and their encoding methods should turn bytes into bytes (and str or bytes into bytes on decoding). Nevertheless, it's not TOOWTDI, it's a careful compromise.
Which is an obnoxious API, since (1) you've now made it impossible to use "transform" for

    bytestring.transform(from='utf-8', to='iso-8859-1')
    bytestring.transform(from='ulaw', to='mp3')
    textstring.transform(from='rest', to='html')

without confusion, and (2) the whole world is going to wonder why you don't use .encode and .decode instead of .transform and .untransform.

The idea in the examples is that we could generalize the codec registry to look up codecs by pairs of media-types. I'm not sure this makes sense ... much of the codec API presumes a stream, especially the incremental methods. But many MIME media types are streams only because they're serializations; incremental en/decoding is nonsense. So I suppose I would want to write

    bytestring.transform(from='octet-stream', to='BASE64')

for this hypothetical API. (I suspect that in practice the 'application/octet-stream' media type would be spelled 'bytes', of course.)

This kind of API could be used to improve the security of composition of transforms. In the case of BASE64, it would make sense to match anything at all as the other type (as long as it's represented in Python by a bytes object). So it would be possible to do

    object = bytestring.transform(from='BASE64', to='PNG')

giving object a media_type attribute such that object.decode('iso-8859-1') would fail. (This would require changes to the charset codecs, to pay heed to the media_type attribute, so it's not immediately feasible.)
No, we don't, but for reasons that have little to do with "ASCII-only". The problem with supporting that idiom is that *people can't read strs* [in the Python 3 internal representation] -- they can only read a str that has been encoded implicitly into the PYTHONIOENCODING or explicitly to an explicitly requested encoding. So the usage above is clearly ambiguous. Even if it is ASCII-only, in theory the user could want EBCDIC.
Not for BASE64. But what's so special about BASE64 that it deserves a new method name for the same old idiom, using a word that's an obvious candidate for naming a more general idiom?
Even if it weren't too late, the byte-shoveling lobby is way too strong; that's not a winnable argument.
When I asked about ambiguous cases, I was asking for cases where the meaning of "transform('somecodec')" was ambiguous.
If "transform('somecodec')" isn't ambiguous, you really really want to spell it "encode" instead of "transform" IMO. Even though I don't see how to do that without generating more confusion than it's worth at the moment, I still harbor the hope that somebody will come up with a way to do it so everything still fits together. Footnotes: [1] I am *not* one to damn RFCs lightly!

On Wed, 24 Apr 2013 01:49:39 +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
I think I understood that. I don't understand why that's a problem. (But see below.)
I've been trying to explain what I thought the transform/untransform proposal was: a minimalist extension of the encode/decode semantic (under a different name) so that functionality that was lost from Python2 encode/decode could be restored to Python3 in a reasonably understandable way. This would be a *limited* convenience function, just as encode/decode are limited convenience functions with respect to the full power of the codecs module. I myself don't have any real investment in the proposal, or I would have long since tried to push the tracker issue forward. People (at least you and Nick, and maybe Guido) seem to be more interested in a more general/powerful mechanism. I'm fine with that :) --David

R. David Murray writes:
It's a problem because in that case it's hard for users to remember the directionality of the codec based only on a single name; the API needs to indicate what is being transformed to what else.
I think that the intention of the proposal is reasonably understandable, and reasonable. I just don't think the API proposed is understandable, and therefore it's not reasonable.<wink/>
I can't speak to the opinions of people who actually know about language design. For myself, I'm sympathetic to the proposal of a specific API limited to cases where the directionality is clear as a generality. I just don't think the "transform" proposal helps much, partly because the actual applications are few, and partly because "transform" is more ambiguous (to be unambiguous in English, you need both the direct object ("from media type") and the indirect object ("to media type") specified). It is quite possible to say "transform encoded text to raw text" or similar. At least for me, "encode transformed text to raw text" raises a WTFAssertion.

I know that I've experienced worlds of pain in the character coding sphere from Emacs APIs and UIs that don't indicate directionality clearly. This is very delicate; GNU Emacs had an ugly bug that regressed multiple times over more than a decade merely because they exposed the internal representation of text to Lisp. XEmacs has never experienced that bug (to be precise, the presence of that kind of bug resulted in an immediate assertion, so it was eliminated early in development).

Surprisingly to me, the fact that XEmacs uses the internal representation of *text* to also represent "byte streams" (with bytes of variable width!) has never caused me confusion. It does cause others confusion, though, so although the XEmacs model of text is easier to work with than Emacs's, I tend to think Python 3's (which never confounds text with bytes) is better.

I suspect that delicacy extends to non-character transformations, so I am pretty demanding about proposals in this area. Specifically, I insist on EIBTI and TOOWTDI.

On 4/23/2013 12:49 PM, Stephen J. Turnbull wrote:
I think the unambiguous solution is to get rid of the notion of 'untransform' (which only means 'transform in the other direction'), since it requires and presumes an asymmetry that is not always present. It is precisely the lack of asymmetry in examples like the above that makes the transform/untransform pair ambiguous as to which is which.

.transform should be explicit and always take two args, no implicit defaults: the 'from' form and the 'to' form. They can be labelled by position in the natural order (from, to) or by keyword, as in your examples. For text, the plain undifferentiated form which one might think of as default could be called 'text', and that for bytes 'bytes' (as you suggest) or 'ascii' as appropriate. str.transform would always be unicode to unicode and bytes.transform always bytes to bytes.
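[Nothing like this exists in the stdlib; purely to illustrate the two-argument shape described above, one could sketch it on top of today's codecs. The `transform` function and its registry are inventions, and since `from` is a Python keyword, a real API would need positional arguments or different names.]

```python
# Hypothetical two-argument transform(), for illustration only.
import codecs

# (from_form, to_form) -> conversion; 'bytes'/'text' are the plain
# undifferentiated forms, per the proposal above.
_TRANSFORMS = {
    ("bytes", "base64"): lambda b: codecs.encode(b, "base64"),
    ("base64", "bytes"): lambda b: codecs.decode(b, "base64"),
    ("text", "rot-13"): lambda s: codecs.encode(s, "rot-13"),
    ("rot-13", "text"): lambda s: codecs.decode(s, "rot-13"),
}

def transform(data, from_form, to_form):
    """Explicit and directional: both forms must always be named."""
    try:
        step = _TRANSFORMS[(from_form, to_form)]
    except KeyError:
        raise LookupError(f"no transform {from_form!r} -> {to_form!r}")
    return step(data)

assert transform(b"whatever", "bytes", "base64") == b"d2hhdGV2ZXI=\n"
assert transform("hello", "text", "rot-13") == "uryyb"
```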

Terry Jan Reedy writes:
Not natural to escaped-from-C programmers, though. I hesitate to say "make it keywords-only", but using keywords should be *strongly* encouraged.
str.transform would always be unicode to unicode and bytes.transform always bytes to bytes.
Which leaves the salient cases (MIME content transfer encodings) out in the cold, although I guess string.encode('ascii').transform(from='base64', to='bytes') isn't too horrible.

Stephen J. Turnbull wrote:
As an aside, if we'd had the flexible string representation sooner, this needn't have been such a big problem. With it, the base64 encoder could return a unicode string with 8-bit representation, which could then be turned into an ascii byte string with negligible overhead. Web developers might grumble about the need for an extra call, but they can no longer claim it would kill the performance of their web server. -- Greg

Greg Ewing writes:
Of course they can. There never was any performance measurement that supported that claim in the first place. I don't see how PEP 393 makes a difference to them. The real problem for them is that conceptually they think ASCII in byte form *is* text, and they want to do text processing on it. They'll use any flimsy excuse to avoid a transform to str, because it's just unbearably ugly given their givens. I have sympathy for their position, I just (even today) think it's the wrong thing for Python. However, I've long since been overruled, and I have no evidence to justify saying "I told you so".<wink/>

On 04/23/2013 09:29 AM, Stephen J. Turnbull wrote:
By RFC specification, BASE64 is a *textual* representation of arbitrary binary data.
It isn't "text" in the sense Py3k means: it is a representation for transmission on-the-wire for protocols which require 7-bit-safe data. Nobody working with base64-encoded data is going to expect to do "normal" string processing on that data: the closest thing to that is splitting it into 72-byte chunks for transmission via e-mail.

Tres.

-- Tres Seaver tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

Tres Seaver writes:
RFC 4648 repeatedly refers to *characters*, without specifying an encoding for them. In fact, if you copy accurately, you can write BASE64 on a napkin and that napkin will accurately transmit the data (assuming it doesn't run into sleet or gloom of night). What else is that but "text in the sense of Py3k"?

My point is not that Python's base64 codec *should* be bytes-to-str and back. My point is that, both in the formal spec and in historical evolution, that is a plausible interpretation of ".encode('base64')" which happens to be the reverse of the normal codec convention, where ".encode(codec)" is a *string* method, and ".decode(codec)" is a *bytes* method. This is not harder to learn for people (for BASE64 encoding or for coded character sets), because in each case there's a natural sense of direction for *en*coding vs. *de*coding. But it does break duck-typing, as does the web developer bytes-to-bytes usage of BASE64.

What I'm groping toward is an idea of a "variable method", so that we could use .encode and .decode where they are TOOWTDI for people even though a purely formal interpretation of duck-typing would say "but why is that blue whale quacking, waddling, and flying?" In other words (although I have no idea how best to implement it), I would like "somestring.encode('base64')" to fail with "I don't know how to do that" (an attribute lookup error?), the same way that "somebytes.encode('utf-8')" does in Python 3 today.

On Thu, Apr 25, 2013 at 3:54 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Or Mrs Cake.
What else is that but "text in the sense of Py3k"?
Text in the sense of Py3k is Unicode. That an 8-bit character stream (or in this case 6-bit) fits in the 31-bit character space of Unicode doesn't make it Unicode, and hence not text. (Napkins of course have even higher bit density than 31 bits per character, unless you write very small.) From the viewpoint of Py3k, bytes data is not text. This is a very useful way to deal with Unicode. See also http://regebro.wordpress.com/2011/03/23/unconfusing-unicode-what-is-unicode/
My point is not that Python's base64 codec *should* be bytes-to-str and back.
Base64 does not convert between a Unicode character stream and an 8-bit byte stream. It converts between an 8-bit byte stream and an 8-bit byte stream. It therefore should be bytes to bytes. To fit Unicode text into Base64 you have to first use an encoding on that Unicode text to convert it to bytes.
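[That two-step pipeline, as runnable code:]

```python
# Base64 is bytes-to-bytes: Unicode text must be encoded first.
import base64

text = "h\u00e9llo"               # str with a non-ASCII character
raw = text.encode("utf-8")        # step one: text -> bytes
wire = base64.b64encode(raw)      # step two: bytes -> bytes

# The round trip reverses both steps.
assert base64.b64decode(wire).decode("utf-8") == text

# Skipping step one fails, as it should.
try:
    base64.b64encode(text)
except TypeError:
    pass
else:
    raise AssertionError("str input should be rejected")
```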
There are only two options there. Either you get a "LookupError: unknown encoding: base64", which is what you get now, or you get a UnicodeEncodeError if the text is not ASCII. We don't want the latter, because it means that code that looks fine to the developer breaks in real life because the developer was American and didn't think of this, but his client happens to have an accent in their name.

Base64 is an encoding that transforms between 8-bit streams. Let it be that. Don't try to shoehorn it into a completely different kind of encoding.

//Lennart

On Thu, 25 Apr 2013 04:19:36 +0200 Lennart Regebro <regebro@gmail.com> wrote:
No, it isn't. What Stephen wrote above.
That's bogus. By the same argument, we should suppress any encoding which isn't able to represent all possible unicode strings. That's almost all encodings provided by Python (including utf-8, if you consider lone surrogates). I'm sorry for Americans, but they *still* must know about character encodings, and be ready to handle UnicodeErrors, when using Python 3 for encoding/decoding bytestrings. There's no way around it. Regards Antoine.

On Thu, Apr 25, 2013 at 7:43 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Yes it is. Base64 takes 8-bit bytes and transforms them into another 8-bit stream that can be safely transmitted over various channels that would mangle an unencoded 8-bit stream, such as email etc. http://en.wikipedia.org/wiki/Base64
No, that's real life.
By the same argument, we should suppress any encoding which isn't able to represent all possible unicode strings.
No, if you explicitly use such an encoding it is because you need to because you are transferring data to a system that needs the encoding in question. Unicode errors are unavoidable at that point, not an unexpected surprise because a conversion happened implicitly that you didn't know about. //Lennart

Le Thu, 25 Apr 2013 08:38:12 +0200, Lennart Regebro <regebro@gmail.com> a écrit :
I don't see anything in that Wikipedia page that validates your opinion. The Wikipedia page does talk about *text* and *characters* for the result of base64 encoding. Besides, I would consider a RFC more authoritative than a Wikipedia definition.
I don't know what "implicit conversion" you are talking about. There's no "implicit conversion" in a scheme where the result of base64 encoding is a text string. Regards Antoine.

On Thu, Apr 25, 2013 at 11:25 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
OK, quote me the exact page text from the Wikipedia article or RFC that explains how you map the 31-bit character space of Unicode to Base64.
The Wikipedia page does talk about *text* and *characters* for the result of base64 encoding.
So you are saying that you want the Python implementation of base64 encoding to take 8-bit binary data in bytes format and return a Unicode string containing the Base64-encoded data? I think that would surprise most people, and be of significantly less use than a base64 encoding that returns bytes.

Python 3 still views text as Unicode only. Everything else is not text, but binary data. This makes sense, is consistent, and makes things easier to handle. This is the whole point of making str into Unicode in Python 3.
I'm sorry, I thought you were arguing for a base64 encoding taking Unicode strings and returning 8-bit bytes. That position I can understand, although I disagree with it.

The position that a base64 encoding should take 8-bit bytes and return Unicode strings is incomprehensible to me. I have no idea why you would want that, how you would use it, how you would implement that API in a reasonable way, nor how you would explain why it is like that. I can't think of any usecase where you would want base64 encoded data unless you intend to transmit it over an 8-bit channel, so why it should return a Unicode string instead of 8-bit bytes is completely beyond my comprehension. Sorry.

//Lennart

Le Thu, 25 Apr 2013 12:05:01 +0200, Lennart Regebro <regebro@gmail.com> a écrit :
I'm not wanting anything here, since that would clearly break backwards compatibility. But I think binascii should have gone that way in Python 3, indeed. binascii.b2a_hex(), for example, would be much more practical if it returned str, rather than bytes.
Python 3 still views text as Unicode only.
Python 3 doesn't *view* text as unicode, it *represents* it as unicode. That is, unicode is the character set that Python 3 is able to represent in the canonical text type, str. If you ever encounter a hypothetical text that uses characters outside of Unicode (obviously it will be encoded using a non-unicode encoding :-)), then you can't represent it as a str. And base64 is clearly representable as unicode, since it's representable using the ASCII character set (which is a subset of the unicode character set).
I can think of many usecases where I want to *embed* base64-encoded data in a larger text *before* encoding that text and transmitting it over a 8-bit channel. (GPG signatures, binary data embedded in JSON objects, etc.) Regards Antoine.
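Antoine's use case is easy to make concrete. A minimal sketch (the payload and field name are invented for illustration): base64-encode the binary data, then explicitly decode the result as ASCII so it can be embedded in a JSON text.

```python
import base64
import json

raw = b"\x00\x01binary payload\xff"

# b64encode returns bytes; the explicit ASCII decode is the str/bytes
# boundary this thread is arguing about.
b64_text = base64.b64encode(raw).decode("ascii")

document = json.dumps({"payload": b64_text})  # JSON is text, so this works

# Round-trip: parse the JSON and recover the original bytes.
recovered = base64.b64decode(json.loads(document)["payload"])
assert recovered == raw
```

In current Python 3, b64decode conveniently accepts str as well as bytes, so no re-encode to ASCII is needed on the way back.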

On Thu, Apr 25, 2013 at 2:57 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
That still doesn't mean that this should be the default behavior. Just because you *can* represent base64 as Unicode text doesn't mean that it should be.
(GPG signatures, binary data embedded in JSON objects, etc.)
Is the GPG signature calculated on the *Unicode* data? How is that done? Isn't it done on the encoded message? As I understand it a GPG signature is done on any sort of document. Either you or I have completely misunderstood how GPG works, I think. :-)

In the case of JSON objects, they are intended for data exchange, and hence in the end need to be byte strings. So if you have a byte string you want to base64 encode before transmitting it with JSON, you would just end up transforming it to a unicode string and then back. That doesn't seem useful.

One use case where you clearly *do* want the base64-encoded data to be unicode strings is because you want to embed it in a text discussing base64 strings, for a blog or a book or something. That doesn't seem to be a very common use case. For the most part you base64 encode things because they are going to be transmitted, and hence the natural result of a base64 encoding should be data that is ready to be transmitted, hence byte strings, and not Unicode strings.
Python 3 doesn't *view* text as unicode, it *represents* it as unicode.
I don't agree that there is a significant difference between those wordings in this context. The end result is the same: things intended to be handled/seen as textual should be unicode strings, things intended for data exchange should be byte strings. Something that is base64 encoded is primarily intended for data exchange. A base64 encoding should therefore return byte strings, especially since most APIs that perform this transmission will take byte strings as input. If you want to include this in textual data, for whatever reason, like printing it in a book, then the conversion is trivial, but that is clearly the less common use case, and should therefore not be the default behavior. //Lennart

On Apr 25, 2013, at 03:34 PM, Lennart Regebro wrote:
In the case of JSON objects, they are intended for data exchange, and hence in the end need to be byte strings.
Except that they're not. http://bugs.python.org/issue10976 -Barry

On Thu, Apr 25, 2013 at 10:07 AM, Barry Warsaw <barry@python.org> wrote:
What am I doing wrong in this JSON crypto signature verification snippet that features many conversions between binary and text?

    recipients = jwsjs["recipients"]
    encoded_payload = binary(jwsjs["payload"])
    headers = []
    for recipient in recipients:
        h = binary(recipient["header"])
        s = binary(recipient["signature"])
        header = json.loads(native(urlsafe_b64decode(h)))
        vk = urlsafe_b64decode(binary(header["jwk"]["vk"]))
        secured_input = b".".join((h, encoded_payload))
        sig = urlsafe_b64decode(s)
        sig_msg = sig + secured_input
        verified_input = native(ed25519ll.crypto_sign_open(sig_msg, vk))
        verified_header, verified_payload = verified_input.split('.')
        verified_header = binary(verified_header)
        decoded_header = native(urlsafe_b64decode(verified_header))
        headers.append(json.loads(decoded_header))
    verified_payload = binary(verified_payload)
    # only return header, payload that have passed through the crypto library.
    payload = json.loads(native(urlsafe_b64decode(verified_payload)))
    return headers, payload

On 25/04/2013 14:34, Lennart Regebro wrote:
The JSON specification says that it's text. Its string literals can contain Unicode codepoints. It needs to be encoded to bytes for transmission and storage, but JSON itself is not a bytestring format.
base64 is a way of encoding binary data as text. The problem is that traditionally text has been encoded with one byte per character, except in those locales where there were too many characters in the character set for that to be possible. In Python 3 we're trying to stop mixing binary data (bytestrings) with text (Unicode strings).

On Thu, Apr 25, 2013 at 4:22 PM, MRAB <python@mrabarnett.plus.com> wrote:
OK, fair enough.
base64 is a way of encoding binary data as text.
It's a way of encoding binary data using ASCII. There is a subtle but important difference.
In Python 3 we're trying to stop mixing binary data (bytestrings) with text (Unicode strings).
Yup. And that's why a base64 encoding shouldn't return Unicode strings. //Lennart

On Thu, 25 Apr 2013, Lennart Regebro wrote:
It is a way of encoding arrays of 8-bit bytes as arrays of characters that are part of the printable, non-whitespace subset of the ASCII repertoire. Since the ASCII repertoire is now simply the first 128 code points in the Unicode repertoire, it is equally correct to say that base64 is a way of encoding binary data as Unicode text.
That is exactly why it should return Unicode strings. What bytes should get sent if base64 is used to send a byte array over an EBCDIC link? [*] Having said that, there may be other reasons for base64 encoding to return bytes - I can conceive of arguments involving efficiency, or practicality, or the most common use cases. So I can't say for sure what base64 encoding actually ought to return in Python. But the purist stance should be that base64 encoding should return text, i.e. a string, i.e. unicode. [*] I apologize to anybody who just ate. Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist

Lennart Regebro writes:
Yes, there is a difference, but I think you're wrong. RFC 4648 explicitly states that Base-n encodings are intended for "human handling" and even makes reference to character glyphs (the rationale for excluding confusable digits from the Base32 alphabet). That's text. Even if it is a rather restricted subset of text, those restrictions are much stronger than merely to ASCII, and they are based on aspects of text that go well beyond merely an encoding with a small code unit.
That's inaccurate. Antoine has presented several examples of why *some* base64 encoders might return Unicode strings, precisely because their output will be embedded in Unicode streams. Debugging the MIME composition functions in the email module is another. An accurate statement is that these use cases are relatively unusual. The common use case is feeding a binary stream directly into a wire protocol. Supporting that use case demands a base64 encoder with a bytes-to-bytes signature in the stdlib, for both convenience and to some extent efficiency. I don't really care if the stdlib supports the specialized use cases with a separate base64 encoder (Antoine suggested the binascii module), or if it leaves that up to the user (it's just an occasional use of ".decode('ascii')", after all).

On 25/04/2013 15:22, MRAB wrote:
RFC 4648 says """Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII [1] data.""". To me, "US-ASCII" is an encoding, so it appears to be talking about encoding binary data (bytestrings) to ASCII-encoded text (bytestrings).

MRAB writes:
I think that's a misreading, inconsistent with the rest of the RFC. The references to US-ASCII are not clearly normative, as the value-character mappings are given in tables, and are self-contained. (The one you quote is clearly informative, since it describes a use-case.) The term "subset of US-ASCII" suggests repertoire, not encoding, as does the use of "alphabet" to refer to these subsets.

*Every* (other?) normative statement is very careful to say that input of a Base-n encoder is "octets" (with two uses of "bytes" in the definition of Base32), and the output is "characters". There are no exceptions, and there are *no* references to encoding of characters or the corresponding character codes (except the possible implicit reference via "US-ASCII").

I can make no sense of those facts if the intent of the RFC is to restrict the output of a Base-n encoder to characters encoded in (8-bit) US-ASCII. Why not just say so, and use "octets" and their ASCII codes throughout, with the corresponding characters used as informative commentary? I think it much more likely that "subset of the character repertoire of US-ASCII" was intended, but abbreviated to "subset of US-ASCII". This kind of abbreviation is very common in informal discussion of coded character sets.

I admit it's a little surprising that the author would be so incautious in his use of "US-ASCII", but if he really meant US-ASCII-the-encoding, I find the style of the rest of the RFC astonishing!

Le Thu, 25 Apr 2013 15:34:45 +0200, Lennart Regebro <regebro@gmail.com> a écrit :
I don't think this distinction is meaningful at all. In the end, everything is a byte string on a classical computer (including unicode strings displayed on your monitor, obviously). If you think the technicalities of an operation should never be hidden or abstracted away, then you're better off with C than Python ;-) Regards Antoine.

On Thu, Apr 25, 2013 at 5:27 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
OK, then I think we have found the core of the problem, and the end of the discussion (from my side, that is).
Yes of course. Especially since my monitor is an output device. ;-)
If you think the technicalities of an operation should never be hidden or abstracted away, then you're better off with C than Python ;-)
The whole point is that Python *does* abstract it away. It abstracts the internals of Unicode strings in such a way that they are no longer, conceptually, 8-bit data. This *is* a distinction Python makes, and it is a useful distinction. I do not see any reason to remove it. http://regebro.wordpress.com/2011/03/23/unconfusing-unicode-what-is-unicode/ //Lennart

On 2013-04-25, at 11:25 , Antoine Pitrou wrote:
Besides, I would consider a RFC more authoritative than a Wikipedia definition.
so the output is US-ASCII data, a byte stream. Stephen is correct that you could decide you don't care about those semantics, and implement base64 encoding as a bytes -> str decoding then requiring a re-encoding (to ascii) before wire transmission. The clarity of the interface (or lack thereof) would probably make users want to send a strongly worded letter to whoever implemented it though, I don't think `data.decode('base64').encode('ascii')` would fit the "obviousness" or "readability" expectations of most users.

Le Thu, 25 Apr 2013 12:46:43 +0200, Xavier Morel <catch-all@masklinn.net> a écrit :
Well, depending on the context, US-ASCII can be a character set or a character encoding. If some specification is talking about text and characters, then it is something that can reasonably be a str in Python land. Similarly, we have chosen to make filesystem paths str by default in Python 3, even though many Unix-heads would claim that filesystem paths are "bytes only". The reason is that while they are technically bytes (under Unix), they are functionally text. Now, if the base64-encoded data is your entire payload, this clearly amounts to nitpicking. But when you want to *embed* that data in some larger chunk of text (e.g. a JSON object), then it makes a lot of sense to consider the base64-encoded data a piece of *text*, not bytes. Regards Antoine.

On 04/25/2013 01:43 AM, Antoine Pitrou wrote:
Stephen was incorrect: the base64 standard is about encoding a binary stream (8-bit bytes) onto another binary stream (6-bit bytes), but one which can be safely transmitted over a 7-bit-only medium. Text in Py3k's sense is irrelevant.
What does that snark have to do with this discussion? base64 has no more to do with character set encodings than it does the moon. It would be a "transform" (bytes -> bytes), not an "encoding". Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

Lennart Regebro writes:
By "completely different kind of encoding" do you mean "codec"? I think that would be an unfortunate result. These operations on streams are theoretically nicely composable. It would be nice if practice reflected that by having a uniform API for all of these operations (charset translation, encoded text to internal, content transfer encoding, compression ...). I think it would be useful, too, though I can't prove that. Anyway, this discussion belongs on python-ideas at this point. Or would, if I had an idea about implementation. I'll take it there when I do have something to say about implementation.

On Thu, Apr 25, 2013 at 8:57 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
But the translation to and from Unicode to some 8-bit encoding is different from the others. It makes sense that they have a different API. If you have a Unicode string you can go: Unicode text -> UTF8 -> ZIP -> BASE64. Or you can go Unicode text -> UTF8 -> BASE64 -> ZIP Although admittedly that makes much less sense. :-) But you can not go: Unicode text -> BASE64 -> ZIP -> UTF8 The str/bytes encoding/decoding is not like the others. //Lennart
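Lennart's pipelines can be written out with stdlib pieces (zlib standing in for "ZIP" here, as an illustrative choice): the str/bytes boundary is crossed exactly once, and every stage after it is bytes-to-bytes.

```python
import base64
import zlib

text = "Unicode text: héllo wörld"

utf8 = text.encode("utf-8")      # text -> bytes: the one str/bytes crossing
zipped = zlib.compress(utf8)     # bytes -> bytes
b64 = base64.b64encode(zipped)   # bytes -> bytes

# Undo the pipeline by running the inverse stages in reverse order.
back = zlib.decompress(base64.b64decode(b64)).decode("utf-8")
assert back == text
```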

On Thu, Apr 25, 2013 at 4:57 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Bringing the mailing list thread up to date with the state of the relevant tracker issues:

I created http://bugs.python.org/issue17827 to cover adding the missing documentation for "codecs.encode" and "codecs.decode" as the officially supported solutions for easy use of the codec infrastructure *without* the additional text model specific input and output type restrictions imposed by the str.encode, bytes.decode and bytearray.decode methods.

I created http://bugs.python.org/issue17828 to cover emitting more meaningful exceptions when a codec throws TypeError or ValueError, as well as when the additional type checking fails for str.encode, bytes.decode and bytearray.decode.

I created http://bugs.python.org/issue17839 to cover the fact that part of the problem here is that the base64 module currently only accepts bytes and bytearray as inputs, rather than anything that supports the PEP 3118 buffer interface.

http://bugs.python.org/issue7475 (linked earlier in the thread) is now strictly about restoring the shorthand aliases for "base64_codec", "bz2_codec" et al that were removed in http://bugs.python.org/issue10807.

Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 23/04/13 09:16, Greg Ewing wrote:
As others have pointed out in the past, repeatedly, the codec system is completely general and can transform bytes->bytes and text->text just as easily as bytes<->text. Or indeed any bijection, as the docs for 2.7 point out. The question isn't "What's so special about base64?" The questions should be:

- What's so special about exotic legacy transformations like ISO-8859-10 and MacRoman that they deserve a string method for invoking them?
- Why have common transformations like base64, which worked in 2.x, been relegated to second-class status in 3.x?
- If it is no burden to have to import a module and call an external function for some transformations, why have encode and decode methods at all?

If you haven't read this, you should: http://lucumr.pocoo.org/2012/8/11/codec-confusion/ -- Steven

On Apr 22, 2013, at 10:04 PM, Steven D'Aprano <steve@pearwood.info> wrote:
I may be dull, but it wasn't until I started using Python 3 that it really clicked in my head what encode/decode did exactly. In Python 2 I just sort of sprinkled one or the other when there were errors until the pain stopped. I mostly attribute this to str.decode and bytes.encode not existing.
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Apr 22, 2013, at 10:30 PM, Donald Stufft wrote:
This is a key observation. It's also now much easier to *explain* what's going on and recommend correct code in Python 3, so overall it's a win. That's not to downplay the inconvenience of not being able to easily do bytes->bytes or str->str transformations as easily as was possible in Python 2. I've not thought about it much, but placing those types of transformations on a different set of functions (methods or builtins) seems like the right direction. IOW, don't mess with encode/decode. -Barry

On Mon, Apr 22, 2013 at 7:04 PM, Steven D'Aprano <steve@pearwood.info> wrote:
There are good answers to all of these, and your rhetoric is not appreciated. The special status is for the translation between bytes and Unicode characters (code points). There are many contexts where a byte stream is labeled (either separately or in-line) as being encoded using some specific encoding. -- --Guido van Rossum (python.org/~guido)

On Tue, Apr 23, 2013 at 4:04 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Yes, but the encode()/decode() methods are not, and the fact that you now know what goes in and what comes out means that people get much fewer Decode/EncodeErrors. Which is a good thing. //Lennart

Using decode() and encode() would break that predictability. But someone suggested the use of transform() and untransform() instead. That would clarify that the transformation is bytes -> bytes and Unicode string -> Unicode string. On 23 Apr 2013 05:50, "Lennart Regebro" <regebro@gmail.com> wrote:

Steven D'Aprano wrote:
Now that all text strings are unicode, the unicode codecs are in a sense special, in that you can't do any string I/O at all without using them at some stage. So arguably it makes sense to have a very easy way of invoking them. I suspect that without this, the idea of all strings being unicode would have been even harder to sell than it was. -- Greg

On 22 April 2013 12:39, Calvin Spealman <ironfroggy@gmail.com> wrote:
if two lines is cumbersome, you're in for a cumbersome life as a programmer.
One of which is essentially Python's equivalent of a declaration... Paul

On Mon, Apr 22, 2013 at 7:39 AM, Calvin Spealman <ironfroggy@gmail.com> wrote:
if two lines is cumbersome, you're in for a cumbersome life as a programmer.
Other encodings are either missing completely from the stdlib, or have corrupted behavior. For example, string_escape is gone, and unicode_escape doesn't make any sense anymore -- python code is text, not bytes, so why does 'abc'.encode('unicode_escape') return bytes? I don't think this change was thought through completely before it was implemented. I agree base64 is a bad place to pick at the encode/decode changes, though. :( -- Devin
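Devin's observation is easy to check: 'unicode_escape' is registered as a str-to-bytes codec in Python 3, so even though the escaped form is conceptually text, it comes back as bytes.

```python
escaped = "abc\ndéf".encode("unicode_escape")
print(escaped)  # a bytes object: b'abc\nd\xe9f', with literal backslashes

# The reverse direction is bytes -> str:
assert b"abc\\nd\\xe9f".decode("unicode_escape") == "abc\ndéf"
```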

On Mon, Apr 22, 2013 at 09:50:14AM -0400, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
unicode_escape doesn't make any sense anymore -- python code is text, not bytes, so why does 'abc'.encode('unicode_escape') return bytes?
AFAIU the situation is simple: unicode.encode(encoding) returns bytes, bytes.decode(encoding) returns unicode, and neither unicode.decode() nor bytes.encode() exists. Transformations like base64 and bz2 are not encoding/decoding -- they are bytes/bytes or unicode/unicode transformations. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
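Oleg's summary can be checked directly in Python 3:

```python
# str has encode() but no decode(); bytes has decode() but no encode().
data = "héllo".encode("utf-8")   # str -> bytes
text = data.decode("utf-8")      # bytes -> str

assert text == "héllo"
assert not hasattr(text, "decode")  # str.decode() is gone in Python 3
assert not hasattr(data, "encode")  # bytes.encode() is gone in Python 3
```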

On Mon, 22 Apr 2013 09:50:14 -0400, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
We use unicode_escape (actually raw_unicode_escape) in the email package, and there we are converting between string and bytes. It is used as an encoder when we are supposed to have ASCII input but have other stuff, and need ASCII output and don't want to lose information. So yes, that encoder does still make sense. It would also be useful as a transform function, but as someone has pointed out there's an issue for that. --David

--Guido van Rossum (sent from Android phone) On Apr 22, 2013 6:09 PM, "R. David Murray" <rdmurray@bitdance.com> wrote:
On Tue, 23 Apr 2013 11:16:20 +1200, Greg Ewing <
greg.ewing@canterbury.ac.nz> wrote:
Except for rot13. :-)

On Apr 22, 2013, at 06:22 PM, Guido van Rossum wrote:
The fact that you can do this instead *is* a bit odd. ;)

    from codecs import getencoder
    encoder = getencoder('rot-13')
    r13 = encoder('hello world')[0]

-Barry

On 23.04.2013 17:15, Barry Warsaw wrote:
Just as a reminder: we have the general purpose encode()/decode() functions in the codecs module:

    import codecs
    r13 = codecs.encode('hello world', 'rot-13')

These interface directly to the codec interfaces, without enforcing type restrictions. The codec defines the supported input and output types. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 23 2013)
2013-04-17: Released eGenix mx Base 3.2.6 ... http://egenix.com/go43 ::::: Try our mxODBC.Connect Python Database Interface for free ! :::::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On Tue, Apr 23, 2013 at 8:22 AM, M.-A. Lemburg <mal@egenix.com> wrote:
As an implementation mechanism I see nothing wrong with this. I hope the codecs module lets you introspect the input and output types of a codec given by name? -- --Guido van Rossum (python.org/~guido)

On 23.04.2013 17:47, Guido van Rossum wrote:
At the moment there is no standard interface to access supported input and output types... but then: regular Python functions or methods also don't provide such functionality, so no surprise there ;-) It's mostly a matter of specifying the supported type combinations in the codec documentation. BTW: What would be a use case where you'd want to programmatically access such information before calling the codec ? -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 23 2013)

On Tue, Apr 23, 2013 at 9:04 AM, M.-A. Lemburg <mal@egenix.com> wrote:
Not quite the same though. Each function has its own unique behavior. But codecs support a standard interface, *except* that the input and output types sometimes vary.
As you know, in Python 3, most code working with bytes doesn't also work with strings, and vice versa (except for a few cases where we've gone out of our way to write polymorphic code -- but users rarely do so, and any time you use a string or bytes literal you basically limit yourself to that type). Suppose I write a command-line utility that reads a file, runs it through a codec, and writes the result to another file. Suppose the name of the codec is a command-line argument (as well as the filenames). I need to know whether to open the files in text or binary mode based on the name of the codec. -- --Guido van Rossum (python.org/~guido)

On 23.04.2013 19:24, Guido van Rossum wrote:
The codec system itself
Ok, so you need to know which codecs your tool can support and which of those need text input and which bytes input.

I've been thinking about this some more: I think that type information alone is not flexible enough to cover such use cases. In your use case you'd want to only permit use of a certain set of codecs, not simply all of them, since some might not implement what you actually want to achieve with the tool, e.g. a user might have installed a codec set that adds support for reading and writing image data, but your intended use was to only support text data.

So what we need is a way to allow the codecs to say e.g. "I work on text", "I support encoding bytes and text", "I encode to bytes", "I'm reversible", "I transform input data", "I support bytes and text, and will create same type output", "I work on image data", "I work on X509 certificates", "I work on XML data", etc.

In other words, we need a form of tagging system, with a set of standard tags that each codec can publish and which also allows non-standard tags (which can then at some point be made standard, if there's agreement on them).

Given a codec name you could then ask the codec registry for the codec tags and verify that the chosen codec handles text data, needs bytes or text encoding input and creates bytes as encoding output. If the registry returns codec tags that don't include the "I work on text" tag, the tool could then raise an error.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 24 2013)

On 4/24/2013 1:22 AM, M.-A. Lemburg wrote:
Maybe MIME type and encoding would be sufficient type information, but probably not str vs. bytes.
MIME type supports this sort of concept, with the two-level hierarchy of naming the type... text/xml text/plain image/jpeg
Guess what I think you are re-inventing here.... Nope, guess again.... Yep, MIME types _plus_ encodings.
Hmm. Sounds just like the registry for, um, you guessed it: MIME types.
For just doing text encoding transformations, text/plain would work as a MIME type, and the encodings of interest for the encodings. Seems like "str" always means "Unicode" but the MIME type can vary; "bytes" might mean encoded text, and the MIME type can also vary.

For non-textual transformations, "encoding" might mean Base 64, BinHex, or other such representations... but those can also be applied to text, so it might be a 3rd dimension, or it might just be a list of encodings rather than a single encoding. Compression could be another dimension, or perhaps another encoding.

But really, then, a transformation needs to be a list of steps; a codec can sign up to perform one or more of the steps, a sequence of codecs would have to be found, capable of performing a subsequence of the steps, and then run in the appropriate order.

This all sounds so general, that probably the Python compiler could be implemented as a codec :) Or any compiler. Probably a web server could be implemented as a codec too :) Well, maybe not, codecs have limited error handling and reporting abilities.

On 24 Apr 2013 01:25, "M.-A. Lemburg" <mal@egenix.com> wrote:
If we already have those, why aren't they documented? If they exist, they should be the first thing in the codecs module docs and the porting guide should list them as the replacement for the method versions when using encodings that aren't directly related to the text model, or when the input buffer for decoding isn't a bytes or bytearray object. Regards, Nick.

On 23.04.2013 23:37, Nick Coghlan wrote:
Good question. I added them in 2004 and probably just forgot to add the documentation: http://hg.python.org/cpython-fullhistory/rev/8ea2cb1ec598 I guess the doc-strings could be used as basis for the documentation. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 24 2013)

R. David Murray writes:
You transform *into* the encoding, and untransform *out* of the encoding. Do you have an example where that would be ambiguous?
In the bytes-to-bytes case, any pair of character encodings (eg, UTF-8 and ISO-8859-15) would do. Or how about in text, ReST to HTML?

BASE64 itself is ambiguous. By RFC specification, BASE64 is a *textual* representation of arbitrary binary data. (Cf. URIs.) The natural interpretation of .encode('base64') in that context would be as a bytes-to-text encoder. However, this has several problems. In practice, we invariably use an ASCII octet stream to carry BASE64-encoded data. So web developers would almost certainly expect a bytes-to-bytes encoder. Such a bytes-to-bytes encoder can't be duck-typed. Double-encoding bugs wouldn't be detected until the stream arrives at the user. And the RFC-based signature of .encode('base64') as bytes-to-text is precisely opposite to that of .encode('utf-8') (text-to-bytes).

It is certainly true that there are many unambiguous cases. In the case of a true text processing facility (eg, Emacs buffers or Python 3 str) where there is an unambiguous text type with a constant and opaque internal representation, it makes a lot of sense to treat the text type as special/central, and use the terminology "encode [from text]" and "decode [to text]". It's easy to remember, which one is special is obvious, and the difference in input and output types means that mistaken use of the API will be detected by duck-typing.

However, in the case of bytes-bytes or text-text transformations, it's not the presence of unambiguous cases that should drive API design IMO. It's the presence of the ambiguous cases that we should cater to. I don't see easy solutions to this issue. Steve
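Stephen's point about duck-typing can be demonstrated: with a bytes-to-bytes base64 codec, accidentally encoding twice raises no error anywhere near the mistake.

```python
import codecs

data = b"payload"
once = codecs.encode(data, "base64_codec")
twice = codecs.encode(once, "base64_codec")  # accepted silently: bytes in, bytes out

# The bug only surfaces when a consumer decodes and finds base64 text
# where the payload should be.
assert codecs.decode(twice, "base64_codec") == once
assert codecs.decode(once, "base64_codec") == data
```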

On Tue, 23 Apr 2013 22:29:33 +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
If I write: bytestring.transform('ISO-8859-15') that would indeed be ambiguous, but only because I haven't named the source encoding of the bytestring. So the above is obviously nonsense, and the easiest "fix" is to have the things that are currently bytes-to-text or text-to-bytes character set transformations *only* work with encode/decode, and not transform/untransform.
I believe that after much discussion we have settled on these transformations (in their respective modules) accepting either bytes or strings as input for decoding, only bytes as input for encoding, and *always* producing bytes as output. (Note that the base64 docs need some clarification about this.) Given this, the possible valid transformations would be:

    bytestring.transform('base64')
    bytestring.untransform('base64')
    string.untransform('base64')

and all would produce a byte string. That byte string would be in base64 for the first one, and a decoded binary string for the second two.

Given our existing API, I don't think we want string.encode('base64') to work (taking an ascii-only unicode string and returning bytes), and we've already agreed that adding a 'decode' method to string is not going to happen. We could, however, and quite possibly should, disallow string.untransform('base64') even though the underlying module supports it. Thus we would only have bytes-to-bytes transformations for 'base64' and its siblings, and you would write the unicode-ascii-to-bytes transformation as:

    string.encode('ascii').untransform('base64')

which has some pedagogical value :).

If you do transform('base64') on a bytestring already encoded as base64 you get a double encoding, yes. I don't see that it is our responsibility to try to protect you from this mistake. The module functions certainly don't. Given that, is there anything ambiguous about the proposed API? (Note: if you would like to argue that, eg, base64.b64encode or binascii.b2a_base64 should return a string, it is too late for that argument for backward compatibility reasons.)
When I asked about ambiguous cases, I was asking for cases where the meaning of "transform('somecodec')" was ambiguous. Sure, it is possible to feed the wrong input into that transformation, but I consider that a programming error, not an ambiguity in the API. After all, you have exactly the same problem if you use the module functions directly, which is currently the only option. --David

On Wed, Apr 24, 2013 at 12:16 AM, R. David Murray <rdmurray@bitdance.com> wrote:
And that's where it all falls down - to make that work, you need to engineer a complex system into the codecs module to say "this codec can be used with that API, but not with this one". I designed such a beast in http://bugs.python.org/issue7475 and I now think it's a *bad idea*. By contrast, the convenience function approach dispenses with all that, and simply says:

1. If you just want to deal with text encodings, use str.encode (which always produces bytes), along with bytes.decode and bytearray.decode (which always produce str)
2. If you want to use arbitrary codecs without any additional type constraints, do "from codecs import encode, decode"

I think there's value in hiding the arbitrary codec support behind an import barrier (as they definitely have the potential to be an attractive nuisance that makes it harder to grasp the nature of Unicode and text encodings, particularly for those coming from Python 2.x), but I'm not hugely opposed to providing them as builtins either.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
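A quick sketch of the two options as they behave in Python 3 (using the long codec names such as "base64_codec", since the short aliases were removed by issue 10807):

```python
import codecs

# Option 1: the type-restricted text-model methods.
text = "héllo"
data = text.encode("utf-8")          # str -> bytes only
assert data.decode("utf-8") == text  # bytes -> str only

# Option 2: the unrestricted codec functions.
b64 = codecs.encode(b"whatever", "base64_codec")    # bytes -> bytes
assert codecs.decode(b64, "base64_codec") == b"whatever"
assert codecs.encode("abc", "rot_13") == "nop"      # str -> str
```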

R. David Murray writes:
I think you're completely missing my point here. The problem is that in the cases I mention, what is encoded data and what is decoded data can only be decided by asking the user.
Which, of course, is quite broken from the point of view of the RFC! Of course, the RFC be damned[1], for the purposes of the Python stdlib, the specific codecs used for Content-Transfer-Encoding have a clear intuitive directionality, and their encoding methods should turn bytes into bytes (and str or bytes into bytes on decoding). Nevertheless, it's not TOOWTDI, it's a careful compromise.
Which is an obnoxious API, since (1) you've now made it impossible to use "transform" for

    bytestring.transform(from='utf-8', to='iso-8859-1')
    bytestring.transform(from='ulaw', to='mp3')
    textstring.transform(from='rest', to='html')

without confusion, and (2) the whole world is going to wonder why you don't use .encode and .decode instead of .transform and .untransform.

The idea in the examples is that we could generalize the codec registry to look up codecs by pairs of media-types. I'm not sure this makes sense ... much of the codec API presumes a stream, especially the incremental methods. But many MIME media types are streams only because they're serializations; incremental en/decoding is nonsense for them. So I suppose I would want to write

    bytestring.transform(from='octet-stream', to='BASE64')

for this hypothetical API. (I suspect that in practice the 'application/octet-stream' media type would be spelled 'bytes', of course.)

This kind of API could be used to improve the security of composition of transforms. In the case of BASE64, it would make sense to match anything at all as the other type (as long as it's represented in Python by a bytes object). So it would be possible to do

    object = bytestring.transform(from='BASE64', to='PNG')

giving object a media_type attribute such that object.decode('iso-8859-1') would fail. (This would require changes to the charset codecs, to pay heed to the media_type attribute, so it's not immediately feasible.)
No, we don't, but for reasons that have little to do with "ASCII-only". The problem with supporting that idiom is that *people can't read strs* [in the Python 3 internal representation] -- they can only read a str that has been encoded implicitly into the PYTHONIOENCODING or explicitly to an explicitly requested encoding. So the usage above is clearly ambiguous. Even if it is ASCII-only, in theory the user could want EBCDIC.
Not for BASE64. But what's so special about BASE64 that it deserves a new method name for the same old idiom, using a word that's an obvious candidate for naming a more general idiom?
Even if it weren't too late, the byte-shoveling lobby is way too strong; that's not a winnable argument.
When I asked about ambiguous cases, I was asking for cases where the meaning of "transform('somecodec')" was ambiguous.
If "transform('somecodec')" isn't ambiguous, you really really want to spell it "encode" instead of "transform" IMO. Even though I don't see how to do that without generating more confusion than it's worth at the moment, I still harbor the hope that somebody will come up with a way to do it so everything still fits together. Footnotes: [1] I am *not* one to damn RFCs lightly!

On Wed, 24 Apr 2013 01:49:39 +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
I think I understood that. I don't understand why that's a problem. (But see below.)
I've been trying to explain what I thought the transform/untransform proposal was: a minimalist extension of the encode/decode semantic (under a different name) so that functionality that was lost from Python2 encode/decode could be restored to Python3 in a reasonably understandable way. This would be a *limited* convenience function, just as encode/decode are limited convenience functions with respect to the full power of the codecs module. I myself don't have any real investment in the proposal, or I would have long since tried to push the tracker issue forward. People (at least you and Nick, and maybe Guido) seem to be more interested in a more general/powerful mechanism. I'm fine with that :) --David

R. David Murray writes:
It's a problem because in that case it's hard for users to remember the directionality of the codec based only on a single name; the API needs to indicate what is being transformed to what else.
I think that the intention of the proposal is reasonably understandable, and reasonable. I just don't think the API proposed is understandable, and therefore it's not reasonable.<wink/>
I can't speak to the opinions of people who actually know about language design. For myself, I'm sympathetic to the proposal of a specific API limited to cases where the directionality is clear as a generality. I just don't think the "transform" proposal helps much, partly because the actual applications are few, and partly because "transform" is more ambiguous (to be unambiguous in English, you need both the direct object ("from media type") and the indirect object ("to media type") specified). It is quite possible to say "transform encoded text to raw text" or similar. At least for me, "encode transformed text to raw text" raises a WTFAssertion.

I know that I've experienced worlds of pain in the character coding sphere from Emacs APIs and UIs that don't indicate directionality clearly. This is very delicate; GNU Emacs had an ugly bug that regressed multiple times over more than a decade merely because they exposed the internal representation of text to Lisp. XEmacs has never experienced that bug (to be precise, the presence of that kind of bug resulted in an immediate assertion, so it was eliminated early in development).

Surprisingly to me, the fact that XEmacs uses the internal representation of *text* to also represent "byte streams" (with bytes of variable width!) has never caused me confusion. It does cause others confusion, though, so although the XEmacs model of text is easier to work with than Emacs's, I tend to think Python 3's (which never confounds text with bytes) is better. I suspect that delicacy extends to non-character transformations, so I am pretty demanding about proposals in this area. Specifically, I insist on EIBTI and TOOWTDI.

On 4/23/2013 12:49 PM, Stephen J. Turnbull wrote:
I think the unambiguous solution is to get rid of the notion of 'untransform' (which only means 'transform in the other direction'), since it requires and presumes an asymmetry that is not always present. It is precisely the lack of asymmetry in examples like the above that makes the transform/untransform pair ambiguous as to which is which.

.transform should be explicit and always take two args, no implicit defaults: the 'from' form and the 'to' form. They can be labelled by position in the natural order (from, to) or by keyword, as in your examples. For text, the plain undifferentiated form which one might think of as a default could be called 'text', and that for bytes 'bytes' (as you suggest) or 'ascii' as appropriate. str.transform would always be unicode to unicode and bytes.transform always bytes to bytes.
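A rough illustration of what such an explicit two-argument API might look like. This is entirely hypothetical: note that 'from' is a Python keyword, so a real method couldn't use it as a keyword argument name; the sketch uses 'src' and 'dst' instead, and a toy registry in place of the codec machinery.

```python
import base64
import binascii

# Toy registry of bytes-to-bytes transforms, keyed by (src, dst) form names.
_TRANSFORMS = {
    ("bytes", "base64"): base64.b64encode,
    ("base64", "bytes"): base64.b64decode,
    ("bytes", "hex"): binascii.hexlify,
    ("hex", "bytes"): binascii.unhexlify,
}

def transform(data, src, dst):
    # Both directions are spelled the same way; the (src, dst) pair
    # carries the directionality, so no 'untransform' is needed.
    try:
        func = _TRANSFORMS[(src, dst)]
    except KeyError:
        raise LookupError("no transform from %r to %r" % (src, dst))
    return func(data)

assert transform(b"whatever", "bytes", "base64") == b"d2hhdGV2ZXI="
assert transform(b"d2hhdGV2ZXI=", "base64", "bytes") == b"whatever"
```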

Terry Jan Reedy writes:
Not natural to escaped-from-C programmers, though. I hesitate to say "make it keywords-only", but using keywords should be *strongly* encouraged.
str.transform would always be unicode to unicode and bytes.transform always bytes to bytes.
Which leaves the salient cases (MIME content transfer encodings) out in the cold, although I guess string.encode('ascii').transform(from='base64', to='bytes') isn't too horrible.

Stephen J. Turnbull wrote:
As an aside, if we'd had the flexible string representation sooner, this needn't have been such a big problem. With it, the base64 encoder could return a unicode string with 8-bit representation, which could then be turned into an ascii byte string with negligible overhead. Web developers might grumble about the need for an extra call, but they can no longer claim it would kill the performance of their web server. -- Greg

Greg Ewing writes:
Of course they can. There never was any performance measurement that supported that claim in the first place. I don't see how PEP 393 makes a difference to them. The real problem for them is that conceptually they think ASCII in byte form *is* text, and they want to do text processing on it. They'll use any flimsy excuse to avoid a transform to str, because it's just unbearably ugly given their givens. I have sympathy for their position, I just (even today) think it's the wrong thing for Python. However, I've long since been overruled, and I have no evidence to justify saying "I told you so".<wink/>

On 04/23/2013 09:29 AM, Stephen J. Turnbull wrote:
By RFC specification, BASE64 is a *textual* representation of arbitrary binary data.
It isn't "text" in the sense Py3k means: it is a representation for transmission on-the-wire for protocols which require 7-bit-safe data. Nobody working with base64-encoded data is going to expect to do "normal" string processing on that data: the closest thing to that is splitting it into 72-byte chunks for transmission via e-mail.

Tres.

-- =================================================================== Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

Tres Seaver writes:
RFC 4648 repeatedly refers to *characters*, without specifying an encoding for them. In fact, if you copy accurately, you can write BASE64 on a napkin and that napkin will accurately transmit the data (assuming it doesn't run into sleet or gloom of night). What else is that but "text in the sense of Py3k"?

My point is not that Python's base64 codec *should* be bytes-to-str and back. My point is that, both in the formal spec and in historical evolution, that is a plausible interpretation of ".encode('base64')", which happens to be the reverse of the normal codec convention, where ".encode(codec)" is a *string* method, and ".decode(codec)" is a *bytes* method. This is not harder to learn for people (for BASE64 encoding or for coded character sets), because in each case there's a natural sense of direction for *en*coding vs. *de*coding. But it does break duck-typing, as does the web developer bytes-to-bytes usage of BASE64.

What I'm groping toward is an idea of a "variable method", so that we could use .encode and .decode where they are TOOWTDI for people, even though a purely formal interpretation of duck-typing would say "but why is that blue whale quacking, waddling, and flying?" In other words (although I have no idea how best to implement it), I would like "somestring.encode('base64')" to fail with "I don't know how to do that" (an attribute lookup error?), the same way that "somebytes.encode('utf-8')" does in Python 3 today.

On Thu, Apr 25, 2013 at 3:54 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Or Mrs Cake.
What else is that but "text in the sense of Py3k"?
Text in the sense of Py3k is Unicode. That an 8-bit character stream (or in this case 6-bit) fits in the 31-bit character space of Unicode doesn't make it Unicode, and hence not text. (Napkins of course have even higher bit density than 31 bits per character, unless you write very small.) From the viewpoint of Py3k, bytes data is not text. This is a very useful way to deal with Unicode. See also http://regebro.wordpress.com/2011/03/23/unconfusing-unicode-what-is-unicode/
My point is not that Python's base64 codec *should* be bytes-to-str and back.
Base64 does not convert between a Unicode character stream and an 8-bit byte stream. It converts between an 8-bit byte stream and an 8-bit byte stream. It therefore should be bytes to bytes. To fit Unicode text into Base64 you have to first use an encoding on that Unicode text to convert it to bytes.
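In today's API that two-step structure is explicit (a minimal illustration):

```python
import base64

text = "héllo"               # Unicode text
raw = text.encode("utf-8")   # step 1: pick a character encoding, str -> bytes
b64 = base64.b64encode(raw)  # step 2: base64 itself is bytes -> bytes
assert base64.b64decode(b64).decode("utf-8") == text
```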
There are only two options there. Either you get a "LookupError: unknown encoding: base64", which is what you get now, or you get a UnicodeEncodeError if the text is not ASCII. We don't want the latter, because it means that code that looks fine to the developer breaks in real life, because the developer was American and didn't think of this, but his client happens to have an accent in the name.

Base64 is an encoding that transforms between 8-bit streams. Let it be that. Don't try to shoehorn it into a completely different kind of encoding.

//Lennart

On Thu, 25 Apr 2013 04:19:36 +0200 Lennart Regebro <regebro@gmail.com> wrote:
No, it isn't. What Stephen wrote above.
That's bogus. By the same argument, we should suppress any encoding which isn't able to represent all possible unicode strings. That's almost all encodings provided by Python (including utf-8, if you consider lone surrogates). I'm sorry for Americans, but they *still* must know about character encodings, and be ready to handle UnicodeErrors, when using Python 3 for encoding/decoding bytestrings. There's no way around it. Regards Antoine.

On Thu, Apr 25, 2013 at 7:43 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Yes it is. Base64 takes 8-bit bytes and transforms them into another 8-bit stream that can be safely transmitted over various channels that would mangle an unencoded 8-bit stream, such as email etc. http://en.wikipedia.org/wiki/Base64
No, that's real life.
By the same argument, we should suppress any encoding which isn't able to represent all possible unicode strings.
No, if you explicitly use such an encoding it is because you need to because you are transferring data to a system that needs the encoding in question. Unicode errors are unavoidable at that point, not an unexpected surprise because a conversion happened implicitly that you didn't know about. //Lennart

Le Thu, 25 Apr 2013 08:38:12 +0200, Lennart Regebro <regebro@gmail.com> a écrit :
I don't see anything in that Wikipedia page that validates your opinion. The Wikipedia page does talk about *text* and *characters* for the result of base64 encoding. Besides, I would consider a RFC more authoritative than a Wikipedia definition.
I don't know what "implicit conversion" you are talking about. There's no "implicit conversion" in a scheme where the result of base64 encoding is a text string. Regards Antoine.

On Thu, Apr 25, 2013 at 11:25 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
OK, quote me the exact page text from the Wikipedia article or RFC that explains how you map the 31-bit character space of Unicode to Base64.
The Wikipedia page does talk about *text* and *characters* for the result of base64 encoding.
So you are saying that you want the Python implementation of base64 encoding to take 8-bit binary data in bytes format and return a Unicode string containing the Base64-encoded data? I think that would surprise most people, and be of significantly less use than a base64 encoding that returns bytes.

Python 3 still views text as Unicode only. Everything else is not text, but binary data. This makes sense, is consistent, and makes things easier to handle. This is the whole point of making str into Unicode in Python 3.
I'm sorry, I thought you were arguing for a base64 encoding taking Unicode strings and returning 8-bit bytes. That position I can understand, although I disagree with it.

The position that a base64 encoding should take 8-bit bytes and return Unicode strings is incomprehensible to me. I have no idea why you would want that, how you would use it, how you would implement that API in a reasonable way, nor how you would explain why it is like that. I can't think of any use case where you would want base64-encoded data unless you intend to transmit it over an 8-bit channel, so why it should return a Unicode string instead of 8-bit bytes is completely beyond my comprehension. Sorry.

//Lennart

Le Thu, 25 Apr 2013 12:05:01 +0200, Lennart Regebro <regebro@gmail.com> a écrit :
I'm not wanting anything here, since that would clearly break backwards compatibility. But I think binascii should have gone that way in Python 3, indeed. binascii.b2a_hex(), for example, would be much more practical if it returned str, rather than bytes.
Python 3 still views text as Unicode only.
Python 3 doesn't *view* text as unicode, it *represents* it as unicode. That is, unicode is the character set that Python 3 is able to represent in the canonical text type, str. If you ever encounter a hypothetical text that uses characters outside of Unicode (obviously it will be encoded using a non-unicode encoding :-)), then you can't represent it as a str. And base64 is clearly representable as unicode, since it's representable using the ASCII character set (which is a subset of the unicode character set).
I can think of many usecases where I want to *embed* base64-encoded data in a larger text *before* encoding that text and transmitting it over a 8-bit channel. (GPG signatures, binary data embedded in JSON objects, etc.) Regards Antoine.
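The embedding pattern looks like this with today's bytes-returning API: the extra .decode('ascii') step is exactly what a str-returning encoder would save (illustrative only).

```python
import base64
import json

blob = b"\x00\x01binary\xff"
# json.dumps produces text, so the base64 bytes must first become a str:
doc = json.dumps({"payload": base64.b64encode(blob).decode("ascii")})
# The whole document is encoded to bytes only at the I/O boundary:
wire = doc.encode("utf-8")

assert base64.b64decode(json.loads(wire.decode("utf-8"))["payload"]) == blob
```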

On Thu, Apr 25, 2013 at 2:57 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
That still doesn't mean that this should be the default behavior. Just because you *can* represent base64 as Unicode text doesn't mean that it should be.
(GPG signatures, binary data embedded in JSON objects, etc.)
Is the GPG signature calculated on the *Unicode* data? How is that done? Isn't it done on the encoded message? As I understand it a GPG signature is done on any sort of document. Either you or I have completely misunderstood how GPG works, I think. :-)

In the case of JSON objects, they are intended for data exchange, and hence in the end need to be byte strings. So if you have a byte string you want to base64 encode before transmitting it with JSON, you would just end up transforming it to a unicode string and then back. That doesn't seem useful.

One use case where you clearly *do* want the base64-encoded data to be unicode strings is because you want to embed it in a text discussing base64 strings, for a blog or a book or something. That doesn't seem to be a very common use case. For the most part you base64 encode things because they are going to be transmitted, and hence the natural result of a base64 encoding should be data that is ready to be transmitted; hence byte strings, and not Unicode strings.
Python 3 doesn't *view* text as unicode, it *represents* it as unicode.
I don't agree that there is a significant difference between those wordings in this context. The end result is the same: things intended to be handled/seen as textual should be unicode strings; things intended for data exchange should be byte strings.

Something that is base64 encoded is primarily intended for data exchange. A base64 encoding should therefore return byte strings, especially since most APIs that perform this transmission will take byte strings as input. If you want to include this in textual data, for whatever reason, like printing it in a book, then the conversion is trivial, but that is clearly the less common use case, and should therefore not be the default behavior.

//Lennart

On Apr 25, 2013, at 03:34 PM, Lennart Regebro wrote:
In the case of JSON objects, they are intended for data exchange, and hence in the end need to be byte strings.
Except that they're not. http://bugs.python.org/issue10976 -Barry

On Thu, Apr 25, 2013 at 10:07 AM, Barry Warsaw <barry@python.org> wrote:
What am I doing wrong in this JSON crypto signature verification snippet that features many conversions between binary and text?

    recipients = jwsjs["recipients"]
    encoded_payload = binary(jwsjs["payload"])
    headers = []
    for recipient in recipients:
        h = binary(recipient["header"])
        s = binary(recipient["signature"])
        header = json.loads(native(urlsafe_b64decode(h)))
        vk = urlsafe_b64decode(binary(header["jwk"]["vk"]))
        secured_input = b".".join((h, encoded_payload))
        sig = urlsafe_b64decode(s)
        sig_msg = sig + secured_input
        verified_input = native(ed25519ll.crypto_sign_open(sig_msg, vk))
        verified_header, verified_payload = verified_input.split('.')
        verified_header = binary(verified_header)
        decoded_header = native(urlsafe_b64decode(verified_header))
        headers.append(json.loads(decoded_header))

    verified_payload = binary(verified_payload)
    # only return header, payload that have passed through the crypto library.
    payload = json.loads(native(urlsafe_b64decode(verified_payload)))
    return headers, payload

On 25/04/2013 14:34, Lennart Regebro wrote:
The JSON specification says that it's text. Its string literals can contain Unicode codepoints. It needs to be encoded to bytes for transmission and storage, but JSON itself is not a bytestring format.
base64 is a way of encoding binary data as text. The problem is that traditionally text has been encoded with one byte per character, except in those locales where there were too many characters in the character set for that to be possible. In Python 3 we're trying to stop mixing binary data (bytestrings) with text (Unicode strings).

On Thu, Apr 25, 2013 at 4:22 PM, MRAB <python@mrabarnett.plus.com> wrote:
OK, fair enough.
base64 is a way of encoding binary data as text.
It's a way of encoding binary data using ASCII. There is a subtle but important difference.
In Python 3 we're trying to stop mixing binary data (bytestrings) with text (Unicode strings).
Yup. And that's why a byte64 encoding shouldn't return Unicode strings. //Lennart

On Thu, 25 Apr 2013, Lennart Regebro wrote:
It is a way of encoding arrays of 8-bit bytes as arrays of characters that are part of the printable, non-whitespace subset of the ASCII repertoire. Since the ASCII repertoire is now simply the first 128 code points in the Unicode repertoire, it is equally correct to say that base64 is a way of encoding binary data as Unicode text.
That is exactly why it should return Unicode strings. What bytes should get sent if base64 is used to send a byte array over an EBCDIC link? [*] Having said that, there may be other reasons for base64 encoding to return bytes - I can conceive of arguments involving efficiency, or practicality, or the most common use cases. So I can't say for sure what base64 encoding actually ought to return in Python. But the purist stance should be that base64 encoding should return text, i.e. a string, i.e. unicode. [*] I apologize to anybody who just ate. Isaac Morland CSCF Web Guru DC 2554C, x36650 WWW Software Specialist
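The EBCDIC point can be made concrete: if the base64 output is treated as text, re-encoding it for a different wire character set is just another .encode() call. This sketch uses cp500 as one EBCDIC codepage; the "treat the output as text" step is the position being argued for, not current stdlib behavior.

```python
import base64

b64_text = base64.b64encode(b"whatever").decode("ascii")  # treat the output as text
ascii_wire = b64_text.encode("ascii")
ebcdic_wire = b64_text.encode("cp500")   # same characters, EBCDIC byte values

assert ascii_wire != ebcdic_wire                # different bytes on the wire...
assert ebcdic_wire.decode("cp500") == b64_text  # ...but the same base64 text
```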

Lennart Regebro writes:
Yes, there is a difference, but I think you're wrong. RFC 4648 explicitly states that Base-n encodings are intended for "human handling" and even makes reference to character glyphs (the rationale for excluding confusable digits from the Base32 alphabet). That's text. Even if it is a rather restricted subset of text, those restrictions are much stronger than merely to ASCII, and they are based on aspects of text that go well beyond merely an encoding with a small code unit.
That's inaccurate. Antoine has presented several examples of why *some* base64 encoders might return Unicode strings, precisely because their output will be embedded in Unicode streams. Debugging the MIME composition functions in the email module is another. An accurate statement is that these use cases are relatively unusual. The common use case is feeding a binary stream directly into a wire protocol. Supporting that use case demands a base64 encoder with a bytes-to-bytes signature in the stdlib, for both convenience and to some extent efficiency. I don't really care if the stdlib supports the specialized use cases with a separate base64 encoder (Antoine suggested the binascii module), or if it leaves that up to the user (it's just an occasional use of ".decode('ascii')", after all).

On 25/04/2013 15:22, MRAB wrote:
RFC 4648 says """Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII [1] data.""". To me, "US-ASCII" is an encoding, so it appears to be talking about encoding binary data (bytestrings) to ASCII-encoded text (bytestrings).

MRAB writes:
I think that's a misreading, inconsistent with the rest of the RFC. The references to US-ASCII are not clearly normative, as the value-character mappings are given in tables, and are self-contained. (The one you quote is clearly informative, since it describes a use-case.) The term "subset of US-ASCII" suggests repertoire, not encoding, as does the use of "alphabet" to refer to these subsets.

*Every* (other?) normative statement is very careful to say that the input of a Base-n encoder is "octets" (with two uses of "bytes" in the definition of Base32), and the output is "characters". There are no exceptions, and there are *no* references to encoding of characters or the corresponding character codes (except the possible implicit reference via "US-ASCII"). I can make no sense of those facts if the intent of the RFC is to restrict the output of a Base-n encoder to characters encoded in (8-bit) US-ASCII. Why not just say so, and use "octets" and their ASCII codes throughout, with the corresponding characters used as informative commentary?

I think it much more likely that "subset of the character repertoire of US-ASCII" was intended, but abbreviated to "subset of US-ASCII". This kind of abbreviation is very common in informal discussion of coded character sets. I admit it's a little surprising that the author would be so incautious in his use of "US-ASCII", but if he really meant US-ASCII-the-encoding, I find the style of the rest of the RFC astonishing!

Le Thu, 25 Apr 2013 15:34:45 +0200, Lennart Regebro <regebro@gmail.com> a écrit :
I don't think this distinction is meaningful at all. In the end, everything is a byte string on a classical computer (including unicode strings displayed on your monitor, obviously). If you think the technicalities of an operation should never be hidden or abstracted away, then you're better off with C than Python ;-) Regards Antoine.

On Thu, Apr 25, 2013 at 5:27 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
OK, then I think we have found the core of the problem, and the end of the discussion (from my side, that is).
Yes of course. Especially since my monitor is an output device. ;-)
If you think the technicalities of an operation should never be hidden or abstracted away, then you're better off with C than Python ;-)
The whole point is that Python *does* abstract it away. It abstract the internals of Unicode strings in such a way that they are no longer, conceptually, 8-bit data. This *is* a distinction Python does, and it is a useful distinction. I do not see any reason to remove it. http://regebro.wordpress.com/2011/03/23/unconfusing-unicode-what-is-unicode/ //Lennart

On 2013-04-25, at 11:25 , Antoine Pitrou wrote:
Besides, I would consider a RFC more authoritative than a Wikipedia definition.
so the output is US-ASCII data, a byte stream. Stephen is correct that you could decide you don't care about those semantics, and implement base64 encoding as a bytes -> str decoding then requiring a re-encoding (to ascii) before wire transmission. The clarity of the interface (or lack thereof) would probably make users want to send a strongly worded letter to whoever implemented it though, I don't think `data.decode('base64').encode('ascii')` would fit the "obviousness" or "readability" expectations of most users.

Le Thu, 25 Apr 2013 12:46:43 +0200, Xavier Morel <catch-all@masklinn.net> a écrit :
Well, depending on the context, US-ASCII can be a character set or a character encoding. If some specification is talking about text and characters, then it is something that can reasonably be a str in Python land. Similarly, we have chosen to make filesystem paths str by default in Python 3, even though many Unix-heads would claim that filesystem paths are "bytes only". The reason is that while they are technically bytes (under Unix), they are functionally text. Now, if the base64-encoded data is your entire payload, this clearly amounts to nitpicking. But when you want to *embed* that data in some larger chunk of text (e.g. a JSON object), then it makes a lot of sense to consider the base64-encoded data a piece of *text*, not bytes. Regards Antoine.

On 04/25/2013 01:43 AM, Antoine Pitrou wrote:
Stephen was incorrect: the base64 standard is about encoding a binary stream (8-bit bytes) onto another binary stream (6-bit bytes), but one which can be safely transmitted over a 7-bit-only medium. Text in Py3k's sense is irrelevant.
What does that snark have to do with this discussion? base64 has no more to do with character set encodings than it does the moon. It would be a "transform" (bytes -> bytes), not an "encoding".

Tres.

Lennart Regebro writes:
By "completely different kind of encoding" do you mean "codec"? I think that would be an unfortunate result. These operations on streams are theoretically nicely composable. It would be nice if practice reflected that by having a uniform API for all of these operations (charset translation, encoded text to internal, content transfer encoding, compression ...). I think it would be useful, too, though I can't prove that. Anyway, this discussion belongs on python-ideas at this point. Or would, if I had an idea about implementation. I'll take it there when I do have something to say about implementation.

On Thu, Apr 25, 2013 at 8:57 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
But the translation between Unicode and some 8-bit encoding is different from the others. It makes sense that they have a different API. If you have a Unicode string you can go:

    Unicode text -> UTF8 -> ZIP -> BASE64

Or you can go:

    Unicode text -> UTF8 -> BASE64 -> ZIP

although admittedly that makes much less sense. :-) But you can not go:

    Unicode text -> BASE64 -> ZIP -> UTF8

The str/bytes encoding/decoding is not like the others.

//Lennart
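The chains above can be checked directly (a small sketch using zlib for the "ZIP" step):

```python
import base64
import zlib

text = "Unicode text: héllo"
# Unicode text -> UTF8 -> ZIP -> BASE64: only the first arrow is str -> bytes.
wire = base64.b64encode(zlib.compress(text.encode("utf-8")))
# And only the final decode crosses back from bytes to str.
assert zlib.decompress(base64.b64decode(wire)).decode("utf-8") == text

# The str step cannot be moved to the middle of the chain:
try:
    base64.b64encode(text)   # str is not bytes-like
except TypeError:
    pass
```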

On Thu, Apr 25, 2013 at 4:57 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Bringing the mailing list thread up to date with the state of the relevant tracker issues: I created http://bugs.python.org/issue17827 to cover adding the missing documentation for "codecs.encode" and "codecs.decode" as the officially supported solutions for easy use of the codec infrastructure *without* the additional text model specific input and output type restrictions imposed by the str.encode, bytes.decode and bytearray.decode methods. I created http://bugs.python.org/issue17828 to cover emitting more meaningful exceptions when a codec throws TypeError or ValueError, as well as when the additional type checking fails for str.encode, bytes.decode and bytearray.decode. I created http://bugs.python.org/issue17839 to cover the fact that part of the problem here is that the base64 module currently only accepts bytes and bytearray as inputs, rather than anything that supports the PEP 3118 buffer interface. http://bugs.python.org/issue7475 (linked earlier in the thread) is now strictly about restoring the shorthand aliases for "base64_codec", "bz2_codec" et al that were removed in http://bugs.python.org/issue10807. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
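For readers following along, the `codecs.encode`/`codecs.decode` route Nick mentions in issue17827 works today without the per-codec type restrictions of the methods; the `"base64_codec"` name is the long spelling that survived the alias removal in issue10807:

```python
import codecs

# Module-level functions bypass the str.encode/bytes.decode type checks,
# so bytes->bytes codecs are reachable without importing base64 directly.
encoded = codecs.encode(b"whatever", "base64_codec")   # bytes -> bytes
original = codecs.decode(encoded, "base64_codec")      # bytes -> bytes
assert original == b"whatever"
```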

On 23/04/13 09:16, Greg Ewing wrote:
As others have pointed out in the past, repeatedly, the codec system is completely general and can transform bytes->bytes and text->text just as easily as bytes<->text. Or indeed any bijection, as the docs for 2.7 point out. The question isn't "What's so special about base64?" The questions should be:
- What's so special about exotic legacy transformations like ISO-8859-10 and MacRoman that they deserve a string method for invoking them?
- Why have common transformations like base64, which worked in 2.x, been relegated to second-class status in 3.x?
- If it is no burden to have to import a module and call an external function for some transformations, why have encode and decode methods at all?
If you haven't read this, you should: http://lucumr.pocoo.org/2012/8/11/codec-confusion/ -- Steven

On Apr 22, 2013, at 10:04 PM, Steven D'Aprano <steve@pearwood.info> wrote:
I may be dull, but it wasn't until I started using Python 3 that it really clicked in my head what encode/decode did exactly. In Python 2 I just sort of sprinkled one or the other wherever there were errors until the pain stopped. I mostly attribute this to str.decode and bytes.encode not existing.
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

On Apr 22, 2013, at 10:30 PM, Donald Stufft wrote:
This is a key observation. It's also now much easier to *explain* what's going on and recommend correct code in Python 3, so overall it's a win. That's not to downplay the inconvenience of not being able to easily do bytes->bytes or str->str transformations as easily as was possible in Python 2. I've not thought about it much, but placing those types of transformations on a different set of functions (methods or builtins) seems like the right direction. IOW, don't mess with encode/decode. -Barry

On Mon, Apr 22, 2013 at 7:04 PM, Steven D'Aprano <steve@pearwood.info> wrote:
There are good answers to all of these, and your rhetoric is not appreciated. The special status is for the translation between bytes and Unicode characters (code points). There are many contexts where a byte stream is labeled (either separately or in-line) as being encoded using some specific encoding. -- --Guido van Rossum (python.org/~guido)

On Tue, Apr 23, 2013 at 4:04 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Yes, but the encode()/decode() methods are not, and the fact that you now know what goes in and what comes out means that people get much fewer Decode/EncodeErrors. Which is a good thing. //Lennart

Using decode() and encode() would break that predictability. But someone suggested the use of transform() and untransform() instead. That would clarify that the transformation is bytes -> bytes and Unicode string -> Unicode string. On 23 Apr 2013 05:50, "Lennart Regebro" <regebro@gmail.com> wrote:
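The transform()/untransform() methods discussed in issue7475 were never implemented. A rough sketch of the idea, written as plain functions on top of today's `codecs` machinery (the names `transform`/`untransform` and the same-type restriction are hypothetical, taken from the proposal, not from any shipped Python API):

```python
import codecs

def transform(data, name):
    """Hypothetical transform(): apply a same-type codec
    (bytes -> bytes or str -> str), rejecting cross-type
    codecs like utf-8."""
    result = codecs.encode(data, name)
    if type(result) is not type(data):
        raise TypeError(f"{name!r} is not a same-type transform")
    return result

def untransform(data, name):
    """Hypothetical inverse of transform()."""
    result = codecs.decode(data, name)
    if type(result) is not type(data):
        raise TypeError(f"{name!r} is not a same-type transform")
    return result

# bytes -> bytes and str -> str both work; str -> bytes codecs do not.
assert untransform(transform(b"whatever", "base64_codec"), "base64_codec") == b"whatever"
assert transform("abc", "rot13") == "nop"
```

This keeps encode()/decode() reserved for the str/bytes boundary, which is exactly the predictability being argued for.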

Steven D'Aprano wrote:
Now that all text strings are unicode, the unicode codecs are in a sense special, in that you can't do any string I/O at all without using them at some stage. So arguably it makes sense to have a very easy way of invoking them. I suspect that without this, the idea of all strings being unicode would have been even harder to sell than it was. -- Greg
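Greg's point that no string I/O happens without a codec is easy to make concrete: every trip between str and the filesystem goes through an encoding, and `open()` merely makes the choice explicit. A small illustration (file name is arbitrary):

```python
import os
import tempfile

# Writing text always encodes; reading the same file in binary mode
# shows the codec's output directly.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("h\u00e9llo")          # str -> bytes happens inside the file object
with open(path, "rb") as f:
    raw = f.read()                 # the same data, seen as encoded bytes
assert raw == "h\u00e9llo".encode("utf-8")
```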

On 22 April 2013 12:39, Calvin Spealman <ironfroggy@gmail.com> wrote:
if two lines is cumbersome, you're in for a cumbersome life as a programmer.
One of which is essentially Python's equivalent of a declaration... Paul

On Mon, Apr 22, 2013 at 7:39 AM, Calvin Spealman <ironfroggy@gmail.com> wrote:
if two lines is cumbersome, you're in for a cumbersome life as a programmer.
Other encodings are either missing completely from the stdlib, or have corrupted behavior. For example, string_escape is gone, and unicode_escape doesn't make any sense anymore -- python code is text, not bytes, so why does 'abc'.encode('unicode_escape') return bytes? I don't think this change was thought through completely before it was implemented. I agree base64 is a bad place to pick at the encode/decode changes, though. :( -- Devin
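Devin's complaint about unicode_escape can be shown concretely. In Python 3 it is a str -> bytes codec, even though escaped Python source is itself text:

```python
# unicode_escape: str in, bytes out -- the asymmetry being objected to.
escaped = "caf\u00e9\n".encode("unicode_escape")
assert escaped == b"caf\\xe9\\n"                 # bytes, not str
assert isinstance(escaped, bytes)

# The round trip goes back through bytes.decode.
assert escaped.decode("unicode_escape") == "caf\u00e9\n"
```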

On Mon, Apr 22, 2013 at 09:50:14AM -0400, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
unicode_escape doesn't make any sense anymore -- python code is text, not bytes, so why does 'abc'.encode('unicode_escape') return bytes?
AFAIU the situation is simple: unicode.encode(encoding) returns bytes, bytes.decode(encoding) returns unicode, and neither unicode.decode() nor bytes.encode() exist. Transformations like base64 and bz2 are not encoding/decoding -- they are bytes/bytes or unicode/unicode transformations. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
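Oleg's summary of the Python 3 type model, stated as code:

```python
# str.encode goes to bytes; bytes.decode comes back to str.
text = "whatever"
data = text.encode("utf-8")            # str -> bytes
assert isinstance(data, bytes)
assert data.decode("utf-8") == text    # bytes -> str

# The inverse pairings simply do not exist in Python 3.
assert not hasattr(bytes, "encode")
assert not hasattr(str, "decode")
```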

On Mon, 22 Apr 2013 09:50:14 -0400, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
We use unicode_escape (actually raw_unicode_escape) in the email package, and there we are converting between string and bytes. It is used as an encoder when we are supposed to have ASCII input but have other stuff, and need ASCII output and don't want to lose information. So yes, that encoder does still make sense. It would also be useful as a transform function, but as someone has pointed out there's an issue for that. --David
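A sketch of the lossless property David describes, using raw_unicode_escape directly (the email-package plumbing around it is more involved than this): characters in the Latin-1 range pass through as raw bytes, while anything above U+00FF becomes a backslash escape, so nothing is lost on the way to bytes.

```python
# raw_unicode_escape: code points < U+0100 become raw bytes,
# higher ones become \uXXXX escapes -- an information-preserving
# str -> bytes step.
s = "caf\u00e9 \u03bb"
b = s.encode("raw_unicode_escape")
assert b == b"caf\xe9 \\u03bb"
assert b.decode("raw_unicode_escape") == s     # lossless round trip
```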
participants (25)
- Antoine Pitrou
- Barry Warsaw
- Calvin Spealman
- Daniel Holth
- Devin Jeanpierre
- Donald Stufft
- Fábio Santos
- Glenn Linderman
- Greg Ewing
- Guido van Rossum
- Isaac Morland
- Lennart Regebro
- M.-A. Lemburg
- MRAB
- Nick Coghlan
- Oleg Broytman
- Paul Moore
- R. David Murray
- Ram Rachum
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Jan Reedy
- Tres Seaver
- Victor Stinner
- Xavier Morel