Re: [Python-Dev] bytes.from_hex()

Bill Janssen wrote:
bytes -> base64 -> text
text -> de-base64 -> bytes
It's nice to hear I'm not out of step with the entire world on this. :-)

--
Greg Ewing, Computer Science Dept, University of Canterbury,
Christchurch, New Zealand | greg.ewing@canterbury.ac.nz
"Carpe post meridiam!"  (I'm not a morning person.)

On 2/28/06, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Bill Janssen wrote:
bytes -> base64 -> text
text -> de-base64 -> bytes
It's nice to hear I'm not out of step with the entire world on this. :-)
What Bill proposes makes sense to me.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

Greg Ewing wrote:
Bill Janssen wrote:
bytes -> base64 -> text
text -> de-base64 -> bytes
It's nice to hear I'm not out of step with the entire world on this. :-)
Well, I can certainly understand the bytes->base64->bytes side of things too. The "text" produced is specified as using "a 65-character subset of US-ASCII", so that's really bytes.

Bill

On Tue, 2006-02-28 at 15:23 -0800, Bill Janssen wrote:
Greg Ewing wrote:
Bill Janssen wrote:
bytes -> base64 -> text
text -> de-base64 -> bytes
It's nice to hear I'm not out of step with the entire world on this. :-)
Well, I can certainly understand the bytes->base64->bytes side of things too. The "text" produced is specified as using "a 65-character subset of US-ASCII", so that's really bytes.
Huh... just joining here but surely you don't mean a text string that doesn't use every character available in a particular encoding is "really bytes"... it's still a text string.

If you base64-encode some bytes, you get a string. If you then want to access that base64 string as if it was a bunch of bytes, cast it to bytes.

Be careful not to confuse "(type)cast" with "(type)convert". A "convert" transforms the data from one type/class to another, modifying it to be a valid equivalent instance of the other type/class; e.g. int -> float. A "cast" does not modify the data in any way; it just changes its type/class to be the other type, and assumes that the data is a valid instance of the other type; e.g. int32 -> bytes[4]. Minor data munging under the hood to cleanly switch the type/class is acceptable (i.e. adding array length info etc.) provided you keep to the spirit of the "cast".

Keep these two concepts separate and you should be right :-)

--
Donovan Baarda <abo@minkirri.apana.org.au>
http://minkirri.apana.org.au/~abo/
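
A rough Python sketch of the cast/convert distinction drawn above (an illustrative example added for context, not from the original message; struct is used as the closest thing Python has to a cast):

    import struct

    x = 3
    converted = float(x)              # "convert": the value is re-expressed as an
                                      # equivalent instance of another type (3 -> 3.0)
    cast_like = struct.pack('<i', x)  # "cast"-ish: the same 32-bit pattern simply
                                      # viewed as 4 raw bytes (int32 -> bytes[4])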

Huh... just joining here but surely you don't mean a text string that doesn't use every character available in a particular encoding is "really bytes"... it's still a text string...
No, once it's in a particular encoding it's bytes, no longer text. As you say,
Keep these two concepts separate and you should be right :-)
Bill

Bill Janssen wrote:
Greg Ewing wrote:
Bill Janssen wrote:
bytes -> base64 -> text
text -> de-base64 -> bytes
It's nice to hear I'm not out of step with the entire world on this. :-)
Well, I can certainly understand the bytes->base64->bytes side of things too. The "text" produced is specified as using "a 65-character subset of US-ASCII", so that's really bytes.
If the base64 codec was a text<->bytes codec, and bytes did not have an encode method, then if you wanted to convert your original bytes to ascii bytes, you would do:

    ascii_bytes = orig_bytes.decode("base64").encode("ascii")

"Use base64 to convert my byte sequence to characters, then give me the corresponding ascii byte sequence."

To reverse the process:

    orig_bytes = ascii_bytes.decode("ascii").encode("base64")

"Use ascii to convert my byte sequence to characters, then use base64 to convert those characters back to the original byte sequence."

The only slightly odd aspect is that this inverts the conventional meaning of base64 encoding and decoding, where you expect to encode from bytes to characters and decode from characters to bytes.

As strings currently have both methods, the existing codec is able to use the conventional sense for base64: encode goes from "str-as-bytes" to "str-as-text" (giving a longer string with characters that fit in the base64 subset) and decode goes from "str-as-text" to "str-as-bytes" (giving back the original string).

All the unicode codecs, on the other hand, use encode to get from characters to bytes and decode to get from bytes to characters. So if bytes objects *did* have an encode method, it should still result in a unicode object, just the same as a decode method does (because you are encoding bytes as characters), and unicode objects would acquire a corresponding decode method (that decodes from a character format such as base64 to the original byte sequence).

In the name of TOOWTDI, I'd suggest that we just eat the slight terminology glitch in the rare cases like base64, hex and oct (where the character format is technically the encoded format), and leave it so that there is a single method pair (bytes.decode to go from bytes to characters, and text.encode to go from characters to bytes).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
http://www.boredomandlaziness.org
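
For reference, this is roughly how the existing str/unicode codec machinery that Nick describes behaves in Python 2.x (an illustrative interpreter session added for context, not from the original message):

    >>> 'abc'.encode('base64')     # str -> str: conventional base64 "encode"
    'YWJj\n'
    >>> 'YWJj\n'.decode('base64')  # str -> str: back to the original bytes
    'abc'
    >>> u'abc'.encode('utf-8')     # unicode -> str: characters to bytes
    'abc'
    >>> 'abc'.decode('utf-8')      # str -> unicode: bytes to characters
    u'abc'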

Nick Coghlan wrote:
All the unicode codecs, on the other hand, use encode to get from characters to bytes and decode to get from bytes to characters.
So if bytes objects *did* have an encode method, it should still result in a unicode object, just the same as a decode method does (because you are encoding bytes as characters), and unicode objects would acquire a corresponding decode method (that decodes from a character format such as base64 to the original byte sequence).
In the name of TOOWTDI, I'd suggest that we just eat the slight terminology glitch in the rare cases like base64, hex and oct (where the character format is technically the encoded format), and leave it so that there is a single method pair (bytes.decode to go from bytes to characters, and text.encode to go from characters to bytes).
I think you have it pretty straight here.

While playing around with the example bytes class I noticed code reads much better when I use methods called tounicode and tostring.

    b64ustring = b.tounicode('base64')
    b = bytes(b64ustring, 'base64')

The bytes could then *not* ignore the string decode codec but use it for string to string decoding.

    b64string = b.tostring('base64')
    b = bytes(b64string, 'base64')

    b = bytes(hexstring, 'hex')
    hexstring = b.tostring('hex')
    hexstring = b.tounicode('hex')

An exception could be raised if the codec does not support the input or output type, depending on the situation. This would allow for different types of codecs to live together without as much confusion, I think.

I'm not suggesting we start using to-type everywhere, just where it might make things clearer over decode and encode.

Expecting it not to fly, but just maybe it could?

Ron
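
A minimal sketch of what such a class might look like (the class name, its codec handling, and the use of binascii are all invented here for illustration; only 'base64' and 'hex' are wired up):

    import binascii

    class bytes_(object):
        def __init__(self, data):
            self.data = data          # a raw byte string

        def tounicode(self, codec):
            # return the encoded form as a unicode/text string
            return self.tostring(codec).decode('ascii')

        def tostring(self, codec):
            # return the encoded form as a plain byte string
            if codec == 'base64':
                return binascii.b2a_base64(self.data)
            if codec == 'hex':
                return binascii.hexlify(self.data)
            raise LookupError('unsupported codec: %r' % (codec,))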

Ron Adam wrote:
While playing around with the example bytes class I noticed code reads much better when I use methods called tounicode and tostring.
b64ustring = b.tounicode('base64')
b = bytes(b64ustring, 'base64')
I don't like that, because it creates a dependency (conceptually, at least) between the bytes type and the unicode type. And why unicode in particular? Why should it have a tounicode() method, but not a toint() or tofloat() or tolist() etc.?
I'm not suggesting we start using to-type everywhere, just where it might make things clearer over decode and encode.
Another thing is that it only works if the codec transforms between two different types. If you have a bytes-to-bytes transformation, for example, then

    b2 = b1.tobytes('some-weird-encoding')

is ambiguous.

--
Greg Ewing, Computer Science Dept, University of Canterbury,
Christchurch, New Zealand | greg.ewing@canterbury.ac.nz
"Carpe post meridiam!"  (I'm not a morning person.)

Greg Ewing wrote:
Ron Adam wrote:
While playing around with the example bytes class I noticed code reads much better when I use methods called tounicode and tostring.
b64ustring = b.tounicode('base64')
b = bytes(b64ustring, 'base64')
I don't like that, because it creates a dependency (conceptually, at least) between the bytes type and the unicode type. And why unicode in particular? Why should it have a tounicode() method, but not a toint() or tofloat() or tolist() etc.?
I don't think it creates a dependency between the types, but it does create a stronger relationship between them when a method that returns a fixed type is used.

There's no reason not to have the others, other than avoiding methods that really aren't needed. But if it makes sense to have them, sure. If a codec isn't needed, a regular constructor should probably be used instead.
I'm not suggesting we start using to-type everywhere, just where it might make things clearer over decode and encode.
Another thing is that it only works if the codec transforms between two different types. If you have a bytes-to-bytes transformation, for example, then
b2 = b1.tobytes('some-weird-encoding')
is ambiguous.
Are you asking if it's decoding or encoding?

    bytes to unicode  -> decoding
    unicode to bytes  -> encoding
    bytes to bytes    -> ?

Good point; I think this defines part of the difficulty.

1. We can specify the operation and not be sure of the resulting type.

*or*

2. We can specify the type and not always be sure of the operation.

Maybe there's a way to specify both so it's unambiguous?

Ron

Ron Adam wrote:
1. We can specify the operation and not be sure of the resulting type.
*or*
2. We can specify the type and not always be sure of the operation.
maybe there's a way to specify both so it's unambiguous?
Here's another take on the matter.

When we're doing Unicode encoding or decoding, we're performing a type conversion. The natural way to write a type conversion in Python is with a constructor. But we can't just say

    u = unicode(b)

because that doesn't give enough information. We want to say that b is really of type, e.g., "bytes containing utf8 encoded text":

    u = unicode(b, 'utf8')

Here we're not thinking of the 'utf8' as selecting an encoder or decoder, but of giving extra information about the type of b that isn't carried by b itself.

Now, going in the other direction, we might think to write

    b = bytes(u, 'utf8')

But that wouldn't be right, because if we interpret this consistently it would mean we're saying that u contains utf8-encoded information, which is nonsense. What we need is a way of saying "construct me something of type 'bytes containing utf8-encoded text'":

    b = bytes['utf8'](u)

Here I've coined the notation t[enc], which evaluates to a callable object that constructs an object of type t by encoding its argument according to enc.

Now let's consider base64. Here, the roles of bytes and unicode are reversed, because the bytes are just bytes without any further interpretation, whereas the unicode is really "unicode containing base64 encoded data". So we write

    u = unicode['base64'](b)   # encoding
    b = bytes(u, 'base64')     # decoding

Note that this scheme is reasonably idiot-proof, e.g.

    u = unicode(b, 'base64')

results in a type error, because this specifies a decoding operation, and the base64 decoder takes text as input, not bytes.

What happens with transformations where the input and output types are the same? In this scheme, they're not really the same any more, because we're providing extra type information. Suppose we had a codec called 'piglatin' which goes from unicode to unicode. The types involved are really "text" and "piglatin-encoded text", so we write

    u2 = unicode['piglatin'](u1)   # encoding
    u1 = unicode(u2, 'piglatin')   # decoding

Here you won't get any type error if you get things backwards, but there's not much that can be done about that. You just have to keep straight which of your strings contain piglatin and which don't.

Is this scheme any better than having encode and decode methods/functions? I'm not sure, but it shows that a suitably enhanced notion of "data type" can be used to replace the notions of encoding and decoding, and maybe reduce potential confusion about which direction is which.

--
Greg Ewing, Computer Science Dept, University of Canterbury,
Christchurch, New Zealand | greg.ewing@canterbury.ac.nz
"Carpe post meridiam!"  (I'm not a morning person.)
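
The t[enc] notation can't be hung off the real built-in types without changing them, but a small wrapper gives the flavour of it (an illustrative sketch added for context: the helper name 'by' is invented, and codecs.encode is used as the underlying machinery):

    import codecs

    class by(object):
        """by(sometype)['enc'] -> a callable that encodes its argument
        with 'enc' and passes the result to sometype."""
        def __init__(self, constructor):
            self.constructor = constructor
        def __getitem__(self, encoding):
            def construct(value):
                return self.constructor(codecs.encode(value, encoding))
            return construct

    # e.g. "construct me bytes containing utf-8 encoded text":
    utf8_bytes = by(bytearray)['utf-8'](u'caf\xe9')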

Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
u = unicode(b)
u = unicode(b, 'utf8')
b = bytes['utf8'](u)
u = unicode['base64'](b)        # encoding
b = bytes(u, 'base64')          # decoding
u2 = unicode['piglatin'](u1)    # encoding
u1 = unicode(u2, 'piglatin')    # decoding
Your provided semantics feel cumbersome and confusing to me, as compared with str/unicode.encode/decode().

 - Josiah

Josiah Carlson wrote:
Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
u = unicode(b)
u = unicode(b, 'utf8')
b = bytes['utf8'](u)
u = unicode['base64'](b)        # encoding
b = bytes(u, 'base64')          # decoding
u2 = unicode['piglatin'](u1)    # encoding
u1 = unicode(u2, 'piglatin')    # decoding
Your provided semantics feel cumbersome and confusing to me, as compared with str/unicode.encode/decode() .
- Josiah
This uses syntax to determine the direction of encoding. It would be easier and clearer to just require two arguments or a tuple.

    u = unicode(b, 'encode', 'base64')
    b = bytes(u, 'decode', 'base64')

    b = bytes(u, 'encode', 'utf-8')
    u = unicode(b, 'decode', 'utf-8')

    u2 = unicode(u1, 'encode', 'piglatin')
    u1 = unicode(u2, 'decode', 'piglatin')

It looks somewhat cleaner if you combine them in a path style string.

    b = bytes(u, 'encode/utf-8')
    u = unicode(b, 'decode/utf-8')

Ron

Ron Adam wrote:
Josiah Carlson wrote:
Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
u = unicode(b)
u = unicode(b, 'utf8')
b = bytes['utf8'](u)
u = unicode['base64'](b)        # encoding
b = bytes(u, 'base64')          # decoding
u2 = unicode['piglatin'](u1)    # encoding
u1 = unicode(u2, 'piglatin')    # decoding
Your provided semantics feel cumbersome and confusing to me, as compared with str/unicode.encode/decode() .
- Josiah
This uses syntax to determine the direction of encoding. It would be easier and clearer to just require two arguments or a tuple.
u = unicode(b, 'encode', 'base64')
b = bytes(u, 'decode', 'base64')

b = bytes(u, 'encode', 'utf-8')
u = unicode(b, 'decode', 'utf-8')

u2 = unicode(u1, 'encode', 'piglatin')
u1 = unicode(u2, 'decode', 'piglatin')
It looks somewhat cleaner if you combine them in a path style string.
b = bytes(u, 'encode/utf-8')
u = unicode(b, 'decode/utf-8')
It gets from bad to worse :(

I always liked the asymmetry between

    u = unicode(s, "utf8")

and

    s = u.encode("utf8")

which I think was the original design of the unicode API. Kudos to whoever came up with that.

When I saw

    b = bytes(u, "utf8")

mentioned for the first time, I thought: why on earth must the bytes constructor be coupled to the unicode API?!?! It makes no sense to me whatsoever. Bytes have so many more uses besides encoded text.

I believe (please correct me if I'm wrong) that the encoding argument of bytes() was invented to make it easier to write byte literals. Perhaps a true bytes literal notation is in order after all?

My preference for bytes -> unicode -> bytes API would be this:

    u = unicode(b, "utf8")   # just like we have now
    b = u.tobytes("utf8")    # like u.encode(), but being explicit
                             # about the resulting type

As to base64, while it works as a codec ("Why a base64 codec? Because we can!"), I don't find it a natural API at all, for such conversions. (I do however agree with Greg Ewing that base64 encoded data is text, not ascii-encoded bytes ;-)

Just-my-2-cts
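
A trivial sketch of that spelling (an illustrative example added for context, assuming Python 2.x; the subclass name 'text' is invented):

    class text(unicode):
        def tobytes(self, encoding):
            # like .encode(), but the name makes the resulting type explicit
            return self.encode(encoding)

    b = text(u'caf\xe9').tobytes('utf8')   # -> 'caf\xc3\xa9'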

Just van Rossum <just@letterror.com> wrote:
Ron Adam wrote:
Josiah Carlson wrote:
Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
u = unicode(b)
u = unicode(b, 'utf8')
b = bytes['utf8'](u)
u = unicode['base64'](b)        # encoding
b = bytes(u, 'base64')          # decoding
u2 = unicode['piglatin'](u1)    # encoding
u1 = unicode(u2, 'piglatin')    # decoding
Your provided semantics feel cumbersome and confusing to me, as compared with str/unicode.encode/decode() .
- Josiah
This uses syntax to determine the direction of encoding. It would be easier and clearer to just require two arguments or a tuple.
u = unicode(b, 'encode', 'base64')
b = bytes(u, 'decode', 'base64')

b = bytes(u, 'encode', 'utf-8')
u = unicode(b, 'decode', 'utf-8')

u2 = unicode(u1, 'encode', 'piglatin')
u1 = unicode(u2, 'decode', 'piglatin')
It looks somewhat cleaner if you combine them in a path style string.
b = bytes(u, 'encode/utf-8')
u = unicode(b, 'decode/utf-8')
It gets from bad to worse :(
I always liked the asymmetry between
u = unicode(s, "utf8")
and
s = u.encode("utf8")
which I think was the original design of the unicode API. Kudos to whoever came up with that.
I personally have never used that mechanism. I always used s.decode('utf8') and u.encode('utf8'). I prefer the symmetry that .encode() and .decode() offer.
When I saw
b = bytes(u, "utf8")
mentioned for the first time, I thought: why on earth must the bytes constructor be coupled to the unicode API?!?! It makes no sense to me whatsoever.
It's not a 'unicode API'. See integers for another example where a second argument to a type object defines how to interpret the other argument, or even arrays/structs where the first argument defines the interpretation.
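
For instance (an illustrative interpreter snippet, not part of the original message):

    >>> int('ff', 16)    # the second argument says how to interpret the first
    255
    >>> int('11', 2)
    3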
Bytes have so many more uses besides encoded text.
Agreed.
I believe (please correct me if I'm wrong) that the encoding argument of bytes() was invented to make it easier to write byte literals. Perhaps a true bytes literal notation is in order after all?
Maybe, but I think the other earlier use-case was for using:

    s2 = bytes(s1, 'base64')

if bytes objects received an .encode() method, or even a .tobytes() method. I could be misremembering.
My preference for bytes -> unicode -> bytes API would be this:
u = unicode(b, "utf8") # just like we have now b = u.tobytes("utf8") # like u.encode(), but being explicit # about the resulting type
As to base64, while it works as a codec ("Why a base64 codec? Because we can!"), I don't find it a natural API at all, for such conversions.
Depending on whose definition of codec you listen to (is it a compressor/decompressor, or a coder/decoder?), either very little of what we have as 'codecs' are actual codecs (only zlib, etc.), or all of them are. I would imagine that base64, etc., were made into codecs, or really encodings, because base64 is an 'encoding' of binary data in base64 format, similar to the way you can think of utf8 as an 'encoding' of textual data in utf8 format.

I would argue, due to the "one obvious way to do it", that using encodings/codecs should be preferred to one-shot encoding/decoding functions in various modules (with some exceptions). These exceptions are things like pickle, marshal, struct, etc., which may take a non-basestring object and convert it into a byte string, which is arguably an encoding of the object in a particular format.

 - Josiah

Ron Adam wrote:
This uses syntax to determine the direction of encoding. It would be easier and clearer to just require two arguments or a tuple.
u = unicode(b, 'encode', 'base64')
b = bytes(u, 'decode', 'base64')
The point of the exercise was to avoid using the terms 'encode' and 'decode' entirely, since some people claim to be confused by them. While I succeeded in that, I concede that the result isn't particularly intuitive and is arguably even more confusing.

If we're going to continue to use 'encode' and 'decode', why not just make them functions:

    b = encode(u, 'utf-8')
    u = decode(b, 'utf-8')

In the case of Unicode encodings, if you get them backwards you'll get a type error.

The advantage of using functions over methods or constructor arguments is that they can be applied uniformly to any input and output types.

--
Greg
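
A minimal sketch of such free functions in terms of the existing codec registry (the function bodies here are illustrative, not from the original message; as the follow-up notes, codecs.decode and codecs.encode already provide essentially this):

    import codecs

    def encode(obj, encoding, errors='strict'):
        # look up the codec and apply its encoder; [0] drops the length
        return codecs.lookup(encoding).encode(obj, errors)[0]

    def decode(obj, encoding, errors='strict'):
        # look up the codec and apply its decoder; [0] drops the length
        return codecs.lookup(encoding).decode(obj, errors)[0]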

Greg Ewing wrote:
Ron Adam wrote:
This uses syntax to determine the direction of encoding. It would be easier and clearer to just require two arguments or a tuple.
u = unicode(b, 'encode', 'base64')
b = bytes(u, 'decode', 'base64')
The point of the exercise was to avoid using the terms 'encode' and 'decode' entirely, since some people claim to be confused by them.
Yes, that was what I was trying for with the tounicode and tostring (tobytes) suggestion, but the direction could become ambiguous, as you pointed out.

The constructors above have 4 data items implied:

    1. The source object, which includes the source type and data
    2. The codec to use
    3. The direction of the operation
    4. The destination type (determined by the constructor used)

There isn't any ambiguity other than when to use encode or decode, but in this case that really is a documentation problem because there are no ambiguities in this form. Everything is explicit.

Another version of the above was pointed out to me off-line that might be preferable.

    u = unicode(b, encode='base64')
    b = bytes(u, decode='base64')

Which would also work with the tostring and tounicode methods.

    u = b.tounicode(decode='base64')
    b = u.tobytes(encode='base64')
If we're going to continue to use 'encode' and 'decode', why not just make them functions:
b = encode(u, 'utf-8')
u = decode(b, 'utf-8')
>>> import codecs
>>> codecs.decode('abc', 'ascii')
u'abc'
There's that time machine again. ;-)
In the case of Unicode encodings, if you get them backwards you'll get a type error.
The advantage of using functions over methods or constructor arguments is that they can be applied uniformly to any input and output types.
If codecs are to be more general, then there may be times when the returned type needs to be specified. This would apply to codecs that could return either bytes or strings, or strings or unicode, or bytes or unicode. Some inputs may equally work with more than one output type.

Of course, the answer in these cases may be to just 'know' what you will get, and then convert it to what you want.

Cheers,
Ron

Ron Adam wrote:
This would apply to codecs that could return either bytes or strings, or strings or unicode, or bytes or unicode.
I'd need to see some concrete examples of such codecs before being convinced that they exist, or that they couldn't just as well return a fixed type that you then transform to what you want.

I suspect that said transformation would involve some further encoding or decoding, in which case you really have more than one codec.

--
Greg

Greg Ewing wrote:
Ron Adam wrote:
This would apply to codecs that could return either bytes or strings, or strings or unicode, or bytes or unicode.
I'd need to see some concrete examples of such codecs before being convinced that they exist, or that they couldn't just as well return a fixed type that you then transform to what you want.
I think some codecs that currently return 'ascii' encoded text would be candidates. If you use u'abc'.encode('rot13') you get an ascii string back and not a unicode string. And if you use decode to get back, you don't get the original unicode back, but an ascii representation of the original that you then need to decode to unicode.
I suspect that said transformation would involve some further encoding or decoding, in which case you really have more than one codec.
Yes, I can see that.

So the following are probably better reasons to specify the type.

Codecs are very close to types, and they quite often result in a type change; having the change visible in the code adds to overall readability. This is probably my main desire for this.

There is another reason for being explicit about types with codecs. If you store the codecs with a tuple of attributes as the keys, (name, in_type, out_type), then it makes it possible to look up the codec with the correct behavior and then just do it. The alternative is to test the input, try it, then test the output. The lookup doesn't add much overhead, but does add safety.

Codecs don't seem to be the type of thing where you will want to be able to pass a wide variety of objects into, so a narrow slot is probably preferable to a wide one here. In cases where a codec might be useful in more than one combination of types, it could have an entry for each valid combination in the lookup table.

The codec lookup also validates the desired operation for nearly free. Of course, the data will need to be valid as well. ;-)

Cheers,
Ron
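
A toy version of that lookup table (an illustrative sketch added for context, assuming Python 2.x types with str for raw bytes; the table contents and the transform() helper are invented):

    import binascii

    _codecs = {
        ('hex', str, str):     binascii.hexlify,
        ('hex', str, unicode): lambda data: binascii.hexlify(data).decode('ascii'),
    }

    def transform(data, name, out_type):
        # look up the codec by (name, input type, output type) and apply it
        try:
            func = _codecs[(name, type(data), out_type)]
        except KeyError:
            raise LookupError('no %r codec from %s to %s'
                              % (name, type(data).__name__, out_type.__name__))
        return func(data)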

Nick Coghlan wrote:
ascii_bytes = orig_bytes.decode("base64").encode("ascii")
orig_bytes = ascii_bytes.decode("ascii").encode("base64")
The only slightly odd aspect is that this inverts the conventional meaning of base64 encoding and decoding,
-1. Whatever we do, we shouldn't design things so that it's necessary to write anything as unintuitive as that.

We need to make up our minds whether the .encode() and .decode() methods are only meant for Unicode encodings, or whether they are for general transformations between bytes and characters.

If they're only meant for Unicode, then bytes should only have .decode(), unicode strings should only have .encode(), and only Unicode codecs should be available that way. Things like base64 would need to have a different interface.

If they're for general transformations, then both types should have both methods, with the return type depending on the codec you're using, and it's the programmer's responsibility to use codecs that make sense for what he's doing.

But if they're for general transformations, why limit them to just bytes and characters? Following that through leads to giving *every* object .encode() and .decode() methods. I don't think we should go that far, but it's hard to see where to draw the line. Are bytes and strings special enough to justify them having their own peculiar methods for codec access?

--
Greg Ewing, Computer Science Dept, University of Canterbury,
Christchurch, New Zealand | greg.ewing@canterbury.ac.nz
"Carpe post meridiam!"  (I'm not a morning person.)
participants (8)
- Bill Janssen
- Donovan Baarda
- Greg Ewing
- Guido van Rossum
- Josiah Carlson
- Just van Rossum
- Nick Coghlan
- Ron Adam