The bytes type in Python 3 does not feel very consistent. For example:

--> some_var = 'abcdef'
--> some_var
'abcdef'
--> some_var[3]
'd'
--> some_other_var = b'abcdef'
--> some_other_var
b'abcdef'
--> some_other_var[3]
100

On the one hand we have the 'bytes are ascii data' type interface, and on the other we have the 'bytes are a list of integers between 0 - 256' interface. And trying to use the two is not intuitive:

--> some_other_var[3] == b'd'
False

When I'm parsing a .dbf file and extracting field types from the byte stream, I'm not thinking, "okay, 67 is a Character field" -- what I'm thinking is, "b'C' is a Character field". Considering that ord() still works fine, I'm not sure why it was done this way. Is there code out there that is using this "list of int's" interface, or is there time to make changes to bytes? ~Ethan~
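For readers who want to reproduce the behaviour described above, here is a minimal sketch. The variable names are illustrative and not taken from the original post.

```python
data = b'abcdef'

# Indexing a bytes object yields an int, not a length-1 bytes object.
item = data[3]

# Slicing yields bytes, so a one-element slice can be compared to a literal.
piece = data[3:4]

print(item)               # 100
print(piece)              # b'd'
print(item == b'd')       # False: an int never equals a bytes object
print(item == ord(b'd'))  # True
print(piece == b'd')      # True
```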
2011/5/17 Ethan Furman <ethan@stoneleaf.us>:
Considering that ord() still works fine, I'm not sure why it was done this way.
I agree that this change was unfortunate and not too useful in practice.
Is there code out there that is using this "list of int's" interface, or is there time to make changes to bytes?
I don't doubt there is, and I'm afraid it's far too late to change this. -- Regards, Benjamin
On May 17, 2011, at 5:27 PM, Ethan Furman wrote:
The bytes type in Python 3 does not feel very consistent.
For example:
--> some_var = 'abcdef'
--> some_var
'abcdef'
--> some_var[3]
'd'
--> some_other_var = b'abcdef'
--> some_other_var
b'abcdef'
--> some_other_var[3]
100
On the one hand we have the 'bytes are ascii data' type interface,
This is incidental. Bytes can and often do contain data with non-ascii encoded text, plain binary data, or structs, or raw data read off a disk, etc.
and on the other we have the 'bytes are a list of integers between 0 - 256' interface. And trying to use the two is not intuitive:
--> some_other_var[3] == b'd' False
When I'm parsing a .dbf file and extracting field types from the byte stream, I'm not thinking, "okay, 67 is a Character field" -- what I'm thinking is, "b'C' is a Character field".
Considering that ord() still works fine, I'm not sure why it was done this way.
Is there code out there that is using this "list of int's" interface,
Yes.
or is there time to make changes to bytes?
No. Raymond
On Wed, May 18, 2011 at 8:27 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
On the one hand we have the 'bytes are ascii data' type interface, and on the other we have the 'bytes are a list of integers between 0 - 256' interface.
No. Bytes are a list of integers between 0-256. End of story. Using them to represent text as well was precisely the problem with 2.x 8-bit strings, since the boundaries got blurred. However, as a matter of practicality, many byte-oriented protocols use ASCII to make elements of the protocol readable by humans. The "text-like" elements of the bytes and bytearray types are a concession to the existence of those protocols. However, that doesn't make them text - they're still binary data streams. If you want to treat them as text, convert them to "str" objects first (e.g. that's what urllib.parse does internally in order to operate on bytes and bytearray instances). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
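As a rough illustration of the decode-first approach, the sketch below uses the standard library URL parser (urllib.parse in Python 3). The URL is a made-up example, and the internal decoding is only what the message above describes; the snippet shows the observable behaviour, not the implementation.

```python
from urllib.parse import urlparse

raw = b'http://example.com/path?q=1'

# urlparse accepts bytes and hands back bytes components.
parts = urlparse(raw)
print(parts.netloc)   # b'example.com'
print(parts.path)     # b'/path'

# The explicit route: decode to str, work on text, encode again if needed.
text_parts = urlparse(raw.decode('ascii'))
print(text_parts.netloc)                   # example.com
print(text_parts.netloc.encode('ascii'))   # b'example.com'
```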
On Wed, May 18, 2011 at 3:13 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On Wed, May 18, 2011 at 8:27 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
On the one hand we have the 'bytes are ascii data' type interface, and on the other we have the 'bytes are a list of integers between 0 - 256' interface.
No. Bytes are a list of integers between 0-256. End of story. Using them to represent text as well was precisely the problem with 2.x 8-bit strings, since the boundaries got blurred.
However, as a matter of practicality, many byte-oriented protocols use ASCII to make elements of the protocol readable by humans. The "text-like" elements of the bytes and bytearray types are a concession to the existence of those protocols. However, that doesn't make them text - they're still binary data streams. If you want to treat them as text, convert them to "str" objects first (e.g. that's what urllib.parse does internally in order to operate on bytes and bytearray instances).
This is not a useful argument - it's an implementation choice in Python 3, and urlparse converting bytes to 'str' to operate on them is at best a kludge - you're forcing 5 times the storage (the original bytes + 4 bytes-per-byte when it's decoded into unicode) to work on something which is defined as a BNF *that uses ascii*. The Python 2 confusion was deplorable, but it doesn't make the Python 3 situation better: it's different, but still very awkward for people to write code that is correct and fast in. It's probably too late to change, but please don't try to argue that it's correct: the continued confusion of folk running into this is evidence that confusion *is happening*. Treat that as evidence and think about how to fix it going forward. _Rob
On Wed, May 18, 2011 at 1:23 PM, Robert Collins <robertc@robertcollins.net> wrote:
The Python 2 confusion was deplorable, but it doesn't make the Python 3 situation better: it's different, but still very awkward for people to write code that is correct and fast in.
When Python 3 goes wrong, it raises exceptions or executes the wrong control flow. That's a vast improvement over silently corrupting the data stream the way that 2.x does. If it really bothers anyone, they should feel free to implement and promote their own "ascii" data type on PyPI. If it is explicitly restricted to 7 bit characters, it may even avoid many of the problems of silent corruption that the 2.x str had. Speculation on python-dev isn't going to be convincing here, though: only code in real use will be effective on that front. As far as the memory and runtime overhead goes, yes, that's a real problem (indeed, that overhead is *why* bytes and bytearray have as many str-like features as they do). PEP 393 is intended to at least alleviate the memory burden of the Unicode text. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
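The "raises exceptions or executes the wrong control flow" point can be seen in a few lines. This is only a sketch with made-up sample data.

```python
payload = b'caf\xc3\xa9'   # UTF-8 encoded bytes for 'caf\u00e9'

# Mixing bytes and str raises instead of coercing silently.
try:
    payload + ' au lait'
    mixing_raised = False
except TypeError as exc:
    mixing_raised = True
    print('TypeError:', exc)

# bytes and str never compare equal, so a mistake shows up as a wrong
# branch rather than as corrupted data.
print(payload == 'caf\u00e9')                    # False
print(payload.decode('utf-8') == 'caf\u00e9')    # True
```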
Robert Collins wrote:
urlparse converting bytes to 'str' to operate on them is at best a kludge - you're forcing 5 times the storage (the original bytes + 4 bytes-per-byte when it's decoded into unicode)
That is itself an implementation detail of current Python, though, due to it only having one internal representation of unicode. In principle there could be a form of str that keeps its data encoded in latin1, in which case constructing it from a byte string could simply involve storing a pointer to the original bytes data. -- Greg
Robert Collins writes:
It's probably too late to change, but please don't try to argue that it's correct: the continued confusion of folk running into this is evidence that confusion *is happening*. Treat that as evidence and think about how to fix it going forward.
Sorry, Rob, but you're just wrong here, and Nick is right. It's possible to improve Python 3, but not to "fix" it in this respect. The Python 3 solution is correct, the Python 2 approach is not. There's no way to avoid discontinuity and confusion here. Confusion is indeed happening, but it's real confusion in the way people think about the problem space, not a language design cockup. The problem can't be solved by embedding ASCII in Unicode, because non-ASCII bytes don't have a canonical embedding in Unicode. Ie, the situation is inherently confusing. You can't wish it away, you can only choose to impose more or less of it on particular constituencies. Now, it's quite possible that there are other correct approaches that allow straightforward manipulation of non-ASCII text, but I don't know what they are, and I don't know anybody else who does.
On Thu, 19 May 2011 01:16:44 +0900, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:
Robert Collins writes:
It's probably too late to change, but please don't try to argue that it's correct: the continued confusion of folk running into this is evidence that confusion *is happening*. Treat that as evidence and think about how to fix it going forward.
Sorry, Rob, but you're just wrong here, and Nick is right. It's possible to improve Python 3, but not to "fix" it in this respect. The Python 3 solution is correct, the Python 2 approach is not. There's no way to avoid discontinuity and confusion here.
Confusion is indeed happening, but it's real confusion in the way people think about the problem space, not a language design cockup.
Note that the more common idiom (not that I can measure it, mind) when dealing with byte strings is something analogous to

if my_byte_string[i:i+1] == b'x':

rather than

if my_byte_string[i] == 120:

and the former is a lot more readable than the latter, even though you have to stare at the slice for a couple seconds the first time you encounter it to realize what is going on. So *something* is wrong with Python3's approach. Python2 was wronger, though :) --David
Note that the more common idiom (not that I can measure it, mind) when dealing with byte strings is something analogous to
if my_byte_string[i:i+1] == b'x':
rather than
if my_byte_string[i] == 120:
FWIW, Another spelling of this is if my_byte_string[i] == ord(b'x')
From a readability point, it's in the same category as the first one, but less twisted.
Regards, Martin
On 18.05.2011 21:06, "Martin v. Löwis" wrote:
Note that the more common idiom (not that I can measure it, mind) when dealing with byte strings is something analogous to
if my_byte_string[i:i+1] == b'x':
rather than
if my_byte_string[i] == 120:
FWIW, Another spelling of this is
if my_byte_string[i] == ord(b'x')
From a readability point, it's in the same category as the first one, but less twisted.
Probably more twisted: if my_byte_string[i] == b'x'[0]: :) Georg
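The spellings discussed so far can be placed side by side. In this sketch the sample data and index are invented for illustration; the integer value of b'x' is 120.

```python
my_byte_string = b'0x1F'
i = 1

by_slice = my_byte_string[i:i+1] == b'x'   # compare a length-1 slice
by_ord = my_byte_string[i] == ord(b'x')    # compare ints via ord()
by_index = my_byte_string[i] == b'x'[0]    # compare ints via indexing
by_number = my_byte_string[i] == 120       # compare against the raw value

print(by_slice, by_ord, by_index, by_number)  # True True True True
```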
On 05/18/2011 12:16 PM, Stephen J. Turnbull wrote:
Robert Collins writes:
It's probably too late to change, but please don't try to argue that it's correct: the continued confusion of folk running into this is evidence that confusion *is happening*. Treat that as evidence and think about how to fix it going forward.
Sorry, Rob, but you're just wrong here, and Nick is right. It's possible to improve Python 3, but not to "fix" it in this respect. The Python 3 solution is correct, the Python 2 approach is not. There's no way to avoid discontinuity and confusion here.
I don't think there's any connection between the way 2.x confused text strings and binary data (which certainly needed addressing) with the way that 3.x returns a different type for byte_str[i] than it does for byte_str[i:i+1]. I think it's the latter that's confusing to people. There's no particular requirement for different types that's needed to fix the byte/str problem. And of course it's too late to make any change to this. Eric.
On 5/18/2011 6:32 PM, Greg Ewing wrote:
Eric Smith wrote:
And of course it's too late to make any change to this.
It's too late to change the meaning of b'...', but is it really too late to introduce an x'...' literal and change the repr() to produce it?
My "this" was the different types returned by b[i] and b[i:i+1]. Eric.
On Thu, May 19, 2011 at 5:10 AM, Eric Smith <eric@trueblade.com> wrote:
On 05/18/2011 12:16 PM, Stephen J. Turnbull wrote:
Robert Collins writes:
It's probably too late to change, but please don't try to argue that it's correct: the continued confusion of folk running into this is evidence that confusion *is happening*. Treat that as evidence and think about how to fix it going forward.
Sorry, Rob, but you're just wrong here, and Nick is right. It's possible to improve Python 3, but not to "fix" it in this respect. The Python 3 solution is correct, the Python 2 approach is not. There's no way to avoid discontinuity and confusion here.
I don't think there's any connection between the way 2.x confused text strings and binary data (which certainly needed addressing) with the way that 3.x returns a different type for byte_str[i] than it does for byte_str[i:i+1]. I think it's the latter that's confusing to people. There's no particular requirement for different types that's needed to fix the byte/str problem.
It's a mental model problem. People try to think of bytes as equivalent to 2.x str and that's just wrong, wrong, wrong. It's far closer to array.array('c'). Strings are basically *unique* in returning a length 1 instance of themselves for indexing operations. For every other sequence type, including tuples, lists and arrays, slicing returns a new instance of the same type, while indexing will typically return something different. Now, we definitely didn't *help* matters by keeping so many of the default behaviours of bytes() and bytearray() coupled to ASCII-encoded text, but that was a matter of practicality beating purity: there really *are* a lot of wire protocols out there that are ASCII based. In hindsight, perhaps we should have gone further in breaking things to try to make the point about the mental model shift more forcefully. (However, that idea carries with it its own problems). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
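The claim that str is the odd one out among sequence types can be checked directly. This is a small sketch; array.array('B', ...) stands in for the array type mentioned above, since the 'c' typecode is an assumption that does not carry over to Python 3.

```python
from array import array

samples = [
    [10, 20, 30],
    (10, 20, 30),
    array('B', [10, 20, 30]),
    bytes([10, 20, 30]),
    bytearray([10, 20, 30]),
    'abc',
]

# For each sequence, compare the type of an indexed item and of a slice
# with the type of the sequence itself.
for seq in samples:
    item, piece = seq[1], seq[1:2]
    print(type(seq).__name__, type(item).__name__, type(piece).__name__)

# Types whose indexing returns an instance of the same type.
self_similar = [type(s).__name__ for s in samples if type(s[1]) is type(s)]
print(self_similar)  # ['str']
```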
On 2011-05-19, at 09:49 , Nick Coghlan wrote:
On Thu, May 19, 2011 at 5:10 AM, Eric Smith <eric@trueblade.com> wrote:
On 05/18/2011 12:16 PM, Stephen J. Turnbull wrote:
Robert Collins writes:
It's probably too late to change, but please don't try to argue that it's correct: the continued confusion of folk running into this is evidence that confusion *is happening*. Treat that as evidence and think about how to fix it going forward.
Sorry, Rob, but you're just wrong here, and Nick is right. It's possible to improve Python 3, but not to "fix" it in this respect. The Python 3 solution is correct, the Python 2 approach is not. There's no way to avoid discontinuity and confusion here.
I don't think there's any connection between the way 2.x confused text strings and binary data (which certainly needed addressing) with the way that 3.x returns a different type for byte_str[i] than it does for byte_str[i:i+1]. I think it's the latter that's confusing to people. There's no particular requirement for different types that's needed to fix the byte/str problem.
It's a mental model problem. People try to think of bytes as equivalent to 2.x str and that's just wrong, wrong, wrong. It's far closer to array.array('c'). Strings are basically *unique* in returning a length 1 instance of themselves for indexing operations. For every other sequence type, including tuples, lists and arrays, slicing returns a new instance of the same type, while indexing will typically return something different.
Now, we definitely didn't *help* matters by keeping so many of the default behaviours of bytes() and bytearray() coupled to ASCII-encoded text, but that was a matter of practicality beating purity: there really *are* a lot of wire protocols out there that are ASCII based. In hindsight, perhaps we should have gone further in breaking things to try to make the point about the mental model shift more forcefully. (However, that idea carries with it its own problems).
For what it's worth, Erlang's approach to the subject is — in my opinion — excellent: binaries (whose literals are called "bit syntax" there) are quite distinct from strings in both syntax and API, but you can put chunks of strings within binaries (the bit syntax acts as a container, in which you can put a literal or non-literal string). This simultaneously impresses upon the user that binaries are *not* strings and that they can still easily create binaries from strings.
On Thu, 19 May 2011 17:49:47 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
It's a mental model problem. People try to think of bytes as equivalent to 2.x str and that's just wrong, wrong, wrong. It's far closer to array.array('c'). Strings are basically *unique* in returning a length 1 instance of themselves for indexing operations. For every other sequence type, including tuples, lists and arrays, slicing returns a new instance of the same type, while indexing will typically return something different.
Now, we definitely didn't *help* matters by keeping so many of the default behaviours of bytes() and bytearray() coupled to ASCII-encoded text, but that was a matter of practicality beating purity: there really *are* a lot of wire protocols out there that are ASCII based.
I think "practicality beating purity" should have been extended to __getitem__ as well. I have almost never had a use for treating a bytestring as a sequence of integers, while treating a bytestring as a sequence of one-byte strings is *very* common. (and, as you say, if you want a sequence of integers you can already use array.array() which gives you more flexibility as to the width and signedness of integers) Regards Antoine.
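Treating a bytestring as a sequence of one-byte strings is still possible with slicing. The helper below is a hypothetical name used only for this sketch.

```python
def iter_byte_strings(data):
    """Yield each byte of data as a length-1 bytes object (hypothetical helper)."""
    for i in range(len(data)):
        yield data[i:i+1]

record = b'CND'

# Plain iteration gives ints; the helper gives one-byte bytes objects.
print(list(record))                     # [67, 78, 68]
print(list(iter_byte_strings(record)))  # [b'C', b'N', b'D']
```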
On 5/19/2011 3:49 AM, Nick Coghlan wrote:
It's a mental model problem. People try to think of bytes as equivalent to 2.x str and that's just wrong, wrong, wrong. It's far closer to array.array('c').
Or like C char arrays
Strings are basically *unique* in returning a length 1 instance of themselves for indexing operations.
I still remember having to work that out and get used to it. -- Terry Jan Reedy
On Thu, May 19, 2011 at 4:16 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Robert Collins writes:
It's probably too late to change, but please don't try to argue that it's correct: the continued confusion of folk running into this is evidence that confusion *is happening*. Treat that as evidence and think about how to fix it going forward.
Sorry, Rob, but you're just wrong here, and Nick is right. It's possible to improve Python 3, but not to "fix" it in this respect. The Python 3 solution is correct, the Python 2 approach is not. There's no way to avoid discontinuity and confusion here.
The top level description: 'bytes is a different type to text[unicode] and casting between them must be explicit' is completely correct in Python 3: I didn't (and have never AFAIK) quibbled about that. That's separate to the implementation issues I have mentioned in this thread and previous. Arguing that implicit casting is a good idea isn't what I was doing, nor what Nick was rebutting, AFAICT. -Rob
Robert Collins writes:
That's separate to the implementation issues I have mentioned in this thread and previous.
Oops, sorry. Nevertheless, I personally think that b'a'[0] == 97 is a good idea, and consistent with everything else in Python. It's Unicode (str) that is weird; str is surprising when first encountered by a C or Lisp programmer, but not enough to cause a heart attack given how weird natural language is. But I don't see why that weirdness (an element of LIST of TYPE is a LIST of TYPE, hey, young man, you're very smart but *it's turtles all the way down!*) should be replicated elsewhere.

If you want your bytes object to behave like a str, it's very easy to get that (.decode('latin1')), and nobody has yet demonstrated that this is too time-inefficient for real work, given the other overhead imposed by Python. The space inefficiency could be dealt with as Greg points out (by internally having a Unicode representation using 1 byte instead of 2 or 4).

But if you want your bytes object to *be* a string, then you're confused. It isn't (any more). Even if it's just a matter of flipping one bit in the type field, a str-with-unibyte-representation is not equal to a bytes object with the same bytes. For example, you write:
urlparse converting bytes to 'str' to operate on them is at best a kludge - you're forcing 5 times the storage (the original bytes + 4 bytes-per-byte when it's decoded into unicode) to work on something which is defined as a BNF *that uses ascii*.
Indeed it (RFC 3986) does *use* ASCII. But I think there is confusion in your words. This is what the RFC says about that use of ASCII:

2. Characters

The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. [...]

The ABNF notation defines its terminal values to be non-negative integers (codepoints) based on the US-ASCII coded character set [ASCII]. Because a URI is a sequence of characters, we must invert that relation in order to understand the URI syntax. Therefore, the integer values used by the ABNF must be mapped back to their corresponding characters via US-ASCII in order to complete the syntax rules.

Ie, ASCII is *irrelevant* to (the modern definition of) URLs except as it is a convenient and familiar way to refer to a certain familiar and rather small set of *characters*. There are reasons for this (that I'm not going to rehash here), and they are the *same* reasons why Python 3's behavior is "correct" IMHO (modulo the issue about the type of a list element, which I discuss above).

It is true that one might like there to be a literal that expresses `ord(bytes-object-of-length-one)', ie, something like o'a' == 97. (This is different from Greg's x'6465616462656566' == b'deadbeef', which I don't think helps solve the confusion problem although it would definitely be convenient.)
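The .decode('latin1') route mentioned in this message works for arbitrary bytes because Latin-1 maps each byte value 0..255 to the code point with the same number. A minimal sketch of that round trip:

```python
all_bytes = bytes(range(256))

# Every byte value maps to the code point with the same number.
as_text = all_bytes.decode('latin1')
print(len(as_text))     # 256
print(as_text[0x41])    # A  (a length-1 str, not an int)

# Encoding with the same codec restores the original bytes exactly.
round_trip = as_text.encode('latin1')
print(round_trip == all_bytes)  # True

# The decoded str is still not equal to the bytes it came from.
print(as_text == all_bytes)     # False
```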
Ethan Furman wrote:
On the one hand we have the 'bytes are ascii data' type interface, and on the other we have the 'bytes are a list of integers between 0 - 256' interface.
I think the weird part is that there exists a literal for writing a byte array as an ascii string, and furthermore that it's the *only* kind of literal available for bytes. Personally I think that the default literal syntax for bytes, and also the form produced by repr(), should have been something more neutral, such as hex, with the ascii form available for use when it makes sense. Currently if you want to write a bytes literal in hex, you have to say something like

some_var = b'\xde\xad\xbe\xef'

which is ugly and unreadable. Much nicer would be

some_var = x'deadbeef'

As for
--> some_other_var[3] == b'd'
there ought to be a literal for specifying an integer using an ascii character, so you could say something like

if some_other_var[3] == c'd':

which would be equivalent to

if some_other_var[3] == ord(b'd')

but without the overhead of computing the value each time at run time. -- Greg
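Without a c'...' literal, one way to avoid calling ord() on every comparison is to compute the values once as module-level constants. The names and the toy data below are hypothetical and do not describe a real .dbf layout.

```python
# Hypothetical constants, computed once at import time.
CHARACTER = ord(b'C')   # 67
NUMERIC = ord(b'N')     # 78

def field_kind(type_byte):
    """Map an int taken from a bytes object to a made-up field-kind label."""
    if type_byte == CHARACTER:
        return 'character'
    if type_byte == NUMERIC:
        return 'numeric'
    return 'unknown'

descriptor = b'NAME\x00' + b'C'   # toy data: the type code sits at index 5
print(field_kind(descriptor[5]))  # character
```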
On 5/17/2011 10:39 PM, Greg Ewing wrote:
Personally I think that the default literal syntax for bytes, and also the form produced by repr(), should have been something more neutral, such as hex, with the ascii form available for use when it makes sense.
Much nicer would be
some_var = x'deadbeef'
As for
--> some_other_var[3] == b'd'
there ought to be a literal for specifying an integer using an ascii character, so you could say something like
if some_other_var[3] == c'd':
which would be equivalent to
if some_other_var[3] == ord(b'd')
but without the overhead of computing the value each time at run time.
+1 Seems this could be added compatibly?
On 18.05.2011 07:39, Greg Ewing wrote:
Ethan Furman wrote:
On the one hand we have the 'bytes are ascii data' type interface, and on the other we have the 'bytes are a list of integers between 0 - 256' interface.
I think the weird part is that there exists a literal for writing a byte array as an ascii string, and furthermore that it's the *only* kind of literal available for bytes.
Personally I think that the default literal syntax for bytes, and also the form produced by repr(), should have been something more neutral, such as hex, with the ascii form available for use when it makes sense. Currently if you want to write a bytes literal in hex, you have to say something like
some_var = b'\xde\xad\xbe\xef'
which is ugly and unreadable. Much nicer would be
some_var = x'deadbeef'
We do have bytes.fromhex('deadbeef') Georg
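A quick check of bytes.fromhex against the escaped literal from the quoted message. The reverse direction uses bytes.hex(), which needs Python 3.5 or later.

```python
some_var = bytes.fromhex('deadbeef')

# Same value as the escaped literal.
print(some_var == b'\xde\xad\xbe\xef')   # True

# Whitespace between byte pairs is allowed.
print(bytes.fromhex('de ad be ef') == some_var)   # True

# Going back to a hex string (Python 3.5+).
print(some_var.hex())   # deadbeef
```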
Greg Ewing wrote:
Ethan Furman wrote:
On the one hand we have the 'bytes are ascii data' type interface, and on the other we have the 'bytes are a list of integers between 0 - 255' interface.
I think the weird part is that there exists a literal for writing a byte array as an ascii string, and furthermore that it's the *only* kind of literal available for bytes.
That is the point I was trying to make -- thank you for stating it more clearly than I managed to. :)
Personally I think that the default literal syntax for bytes, and also the form produced by repr(), should have been something more neutral, such as hex,
Agreed. It is surprising to extract an element out of bytes, and not end up with bytes, but with an int -- if the repr used something besides the plain ascii representation, this would not be an expectation. For comparison, when one extracts an element out of a str one gets a str -- not the int representing the unicode code point.
with the ascii form available for use when it makes sense.
As for
--> some_other_var[3] == b'd'
there ought to be a literal for specifying an integer using an ascii character, so you could say something like
if some_other_var[3] == c'd':
which would be equivalent to
if some_other_var[3] == ord(b'd')
but without the overhead of computing the value each time at run time.
Given that we can't change the behavior of b'abc'[1], that would be better than what we have. +1 ~Ethan~
Ethan Furman wrote:
Greg Ewing wrote:
As for
--> some_other_var[3] == b'd'
there ought to be a literal for specifying an integer using an ascii character, so you could say something like
if some_other_var[3] == c'd':
which would be equivalent to
if some_other_var[3] == ord(b'd')
but without the overhead of computing the value each time at run time.
Given that we can't change the behavior of b'abc'[1], that would be better than what we have.
+1
Here's another thought, that perhaps is not backwards-incompatible...

some_var[3] == b'd'

At some point, the bytes class' __eq__ will be called -- is there a reason why we cannot have

1) a check to see if the bytes instance is length 1
2) a check to see if i) the other object is an int, and ii) 0 <= other_obj < 256
3) if 1 and 2, make the comparison instead of returning NotImplemented?

This makes sense to me -- after all, the bytes class is an array of ints in range(256); it is a special case, but doesn't feel any more special than passing an int into bytes() giving a string of that many null bytes; and it would get rid of the, in my opinion ugly, idiom of

some_var[i:i+1] == b'd'

It would also not require a new literal syntax. ~Ethan~
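To make the three steps concrete, here is a sketch of the proposed rule as a bytes subclass. The class name LooseBytes is hypothetical; this is not how bytes behaves, and nothing like it exists in the standard library.

```python
class LooseBytes(bytes):
    """Hypothetical bytes subclass sketching the proposed comparison rule."""

    def __eq__(self, other):
        # Steps 1 and 2: length-1 instance compared with an int in range(256).
        if len(self) == 1 and isinstance(other, int) and 0 <= other < 256:
            return self[0] == other          # step 3: compare as ints
        return super().__eq__(other)         # otherwise defer to bytes

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    # Defining __eq__ disables the inherited hash, so restore it explicitly.
    __hash__ = bytes.__hash__

some_var = b'abcdef'
print(some_var[3] == LooseBytes(b'd'))   # True
print(LooseBytes(b'd') == 100)           # True
print(LooseBytes(b'de') == 100)          # False
print(LooseBytes(b'd') == b'd')          # True
```

The hash is left unchanged here, so an instance that compares equal to 100 does not necessarily hash like 100.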
Here's another thought, that perhaps is not backwards-incompatible...
some_var[3] == b'd'
At some point, the bytes class' __eq__ will be called -- is there a reason why we cannot have
1) a check to see if the bytes instance is length 1
2) a check to see if i) the other object is an int, and ii) 0 <= other_obj < 256
3) if 1 and 2, make the comparison instead of returning NotImplemented?
Immutable objects that compare equal should hash equal; so we would also have to change the hashing of byte strings. Not sure whether that, in turn, has undesirable consequences. In addition, equality should be transitive, so b'A' == 65.0. Regards, Martin
Martin v. Löwis wrote:
Here's another thought, that perhaps is not backwards-incompatible...
some_var[3] == b'd'
At some point, the bytes class' __eq__ will be called -- is there a reason why we cannot have
1) a check to see if the bytes instance is length 1
2) a check to see if i) the other object is an int, and ii) 0 <= other_obj < 256
3) if 1 and 2, make the comparison instead of returning NotImplemented?
Immutable objects that compare equal should hash equal; so we would also have to change the hashing of byte strings. Not sure whether that, in turn, has undesirable consequences.
I thought it was the other-way-round -- if they hash equal, they should compare equal? Or is this just for immutables?
In addition, equality should be transitive, so b'A' == 65.0.
I'm not sure what you're getting at... we could certainly have step 2 check for a number instead of an int, and then step 3 could extract the one element, giving an int, and then let that int compare itself with the other number, whether it be int, float, fraction, what-have-you. ~Ethan~
Immutable objects that compare equal should hash equal; so we would also have to change the hashing of byte strings. Not sure whether that, in turn, has undesirable consequences.
I thought it was the other-way-round -- if they hash equal, they should compare equal?
No no no. If they hash equal, it could just be a hash collision - objects of a class could all hash to 42, if they wanted to. Dictionaries require the property I mentioned. If they compare equal, but hash differently, a dictionary lookup would fail to find the key.
In addition, equality should be transitive, so b'A' == 65.0.
I'm not sure what you're getting at...
That it is counter-intuitive to have a bytes object compare equal to a floating-point number. Regards, Martin
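Both points in this reply can be demonstrated with small, deliberately artificial classes. The class names are hypothetical. The first class breaks the "equal objects hash equal" rule and loses a dictionary lookup; the second hashes everything to 42 and still works.

```python
class BrokenKey:
    """Compares by name but hashes by salt: equal objects, different hashes."""
    def __init__(self, name, salt):
        self.name, self.salt = name, salt
    def __eq__(self, other):
        return isinstance(other, BrokenKey) and self.name == other.name
    def __hash__(self):
        return self.salt   # broken on purpose

class CollidingKey:
    """Every instance hashes to 42; equality still decides lookups."""
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return isinstance(other, CollidingKey) and self.name == other.name
    def __hash__(self):
        return 42

a, b = BrokenKey('field', 1), BrokenKey('field', 2)
print(a == b)             # True
broken_lookup = b in {a: 'found'}
print(broken_lookup)      # False: the dict never finds the equal key

colliding_lookup = CollidingKey('field') in {CollidingKey('field'): 'found'}
print(colliding_lookup)   # True: a collision alone is harmless

# Ints and floats that compare equal also hash equal; bytes joins neither.
print(65 == 65.0, hash(65) == hash(65.0))   # True True
print(b'A' == 65)                           # False
```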
Ethan Furman wrote:
some_var[3] == b'd'
1) a check to see if the bytes instance is length 1
2) a check to see if i) the other object is an int, and ii) 0 <= other_obj < 256
3) if 1 and 2, make the comparison instead of returning NotImplemented?
It might seem convenient, but I'd worry that it would lead to even more confusion in other ways. If someone sees that some_var[3] == b'd' is true, and that some_var[3] == 100 is also true, they might expect to be able to do things like n = b'd' + 1 and get 101... or maybe b'e'... -- Greg
On 19.05.2011 00:39, Greg Ewing wrote:
Ethan Furman wrote:
some_var[3] == b'd'
1) a check to see if the bytes instance is length 1
2) a check to see if i) the other object is an int, and ii) 0 <= other_obj < 256
3) if 1 and 2, make the comparison instead of returning NotImplemented?
It might seem convenient, but I'd worry that it would lead to even more confusion in other ways. If someone sees that
some_var[3] == b'd'
is true, and that
some_var[3] == 100
is also true, they might expect to be able to do things like
n = b'd' + 1
and get 101... or maybe b'e'...
Maybe they should :) Georg
On 2011-05-19, at 07:28 , Georg Brandl wrote:
On 19.05.2011 00:39, Greg Ewing wrote:
Ethan Furman wrote:
some_var[3] == b'd'
1) a check to see if the bytes instance is length 1
2) a check to see if i) the other object is an int, and ii) 0 <= other_obj < 256
3) if 1 and 2, make the comparison instead of returning NotImplemented?
It might seem convenient, but I'd worry that it would lead to even more confusion in other ways. If someone sees that
some_var[3] == b'd'
is true, and that
some_var[3] == 100
is also true, they might expect to be able to do things like
n = b'd' + 1
and get 101... or maybe b'e'...
Maybe they should :)
But why wouldn't "they" expect `b'de' + 1` to work as well in this case? If a 1-byte bytes is equivalent to an integer, why not an arbitrary one as well?
Xavier Morel, 19.05.2011 09:41:
On 2011-05-19, at 07:28 , Georg Brandl wrote:
On 19.05.2011 00:39, Greg Ewing wrote:
If someone sees that
some_var[3] == b'd'
is true, and that
some_var[3] == 100
is also true, they might expect to be able to do things like
n = b'd' + 1
and get 101... or maybe b'e'...
Maybe they should :)
But why wouldn't "they" expect `b'de' + 1` to work as well in this case? If a 1-byte bytes is equivalent to an integer, why not an arbitrary one as well?
The result of this must obviously be b"de1". Stefan
Message written by Stefan Behnel on 2011-05-19, at 10:37:
But why wouldn't "they" expect `b'de' + 1` to work as well in this case? If a 1-byte bytes is equivalent to an integer, why not an arbitrary one as well?
The result of this must obviously be b"de1".
I hope you're joking. At best, the result should be b"de\x01". But I don't think such construct should be allowed. Just like you can't do `[1, 2, 3] + 4`. I wouldn't ever expect that a single byte behaves like a sequence of bytes. In the case of bytes b'a' is obviously still a sequence of bytes, just happening to store a single one. Indexing should return a byte so I'm not surprised it returns a number. Slicing on the other hand returns a sub-sequence. However inconvenient, I find the current behaviour logical and predictable. A shortcut for b'a'[0] would obviously be nice but that's for python-ideas. -- Best regards, Łukasz Langa Senior Systems Architecture Engineer IT Infrastructure Department Grupa Allegro Sp. z o.o.
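The current behaviour described here is easy to confirm: adding an int to a sequence is rejected for lists and bytes alike, while indexing and slicing a bytes object give an int and a sub-sequence respectively. The helper name try_add is made up for this sketch.

```python
def try_add(left, right):
    """Return left + right, or the string 'TypeError' if the types do not mix."""
    try:
        return left + right
    except TypeError:
        return 'TypeError'

print(try_add([1, 2, 3], 4))        # TypeError
print(try_add(b'de', 1))            # TypeError
print(try_add(b'de', bytes([1])))   # b'de\x01'

# Indexing returns a single byte as an int; slicing returns a sub-sequence.
print(b'a'[0])     # 97
print(b'a'[0:1])   # b'a'
```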
Łukasz Langa, 19.05.2011 11:25:
Message written by Stefan Behnel on 2011-05-19, at 10:37:
But why wouldn't "they" expect `b'de' + 1` to work as well in this case? If a 1-byte bytes is equivalent to an integer, why not an arbitrary one as well?
The result of this must obviously be b"de1".
I hope you're joking.
I "obviously" was. My point is that expectations and "obvious behaviour" may not be obvious to everyone. Nick summed it up very nicely IMHO. Stefan
On 2011-05-19, at 11:25 , Łukasz Langa wrote:
Message written by Stefan Behnel on 2011-05-19, at 10:37:
But why wouldn't "they" expect `b'de' + 1` to work as well in this case? If a 1-byte bytes is equivalent to an integer, why not an arbitrary one as well?
The result of this must obviously be b"de1". I hope you're joking. At best, the result should be b"de\x01".
Actually, if `b'd'+1` returns `b'e'` an equivalent behavior should be that `b'de'+1` returns `b'df'`.
On 19/05/2011 10:25, Łukasz Langa wrote:
Message written by Stefan Behnel on 2011-05-19, at 10:37:
But why wouldn't "they" expect `b'de' + 1` to work as well in this case? If a 1-byte bytes is equivalent to an integer, why not an arbitrary one as well?
The result of this must obviously be b"de1".
I hope you're joking. At best, the result should be b"de\x01".
The behaviour Stefan suggests is what some "weakly typed" languages like perl (and possibly php?) do, which masks errors and is rightly abhorred by Python programmers (although semantically not *so* different from 1 + 1.0 == 2.0). I think it's safe to say that Stefan was joking. Michael
But I don't think such construct should be allowed. Just like you can't do `[1, 2, 3] + 4`. I wouldn't ever expect that a single byte behaves like a sequence of bytes. In the case of bytes b'a' is obviously still a sequence of bytes, just happening to store a single one. Indexing should return a byte so I'm not surprised it returns a number. Slicing on the other hand returns a sub-sequence.
However inconvenient, I find the current behaviour logical and predictable. A shortcut for b'a'[0] would obviously be nice but that's for python-ideas.
-- http://www.voidspace.org.uk/ May you do good and not evil May you find forgiveness for yourself and forgive others May you share freely, never taking more than you give. -- the sqlite blessing http://www.sqlite.org/different.html
On 19.05.2011 10:37, Stefan Behnel wrote:
Xavier Morel, 19.05.2011 09:41:
On 2011-05-19, at 07:28 , Georg Brandl wrote:
On 19.05.2011 00:39, Greg Ewing wrote:
If someone sees that
some_var[3] == b'd'
is true, and that
some_var[3] == 100
is also true, they might expect to be able to do things like
n = b'd' + 1
and get 101... or maybe b'e'...
Maybe they should :)
But why wouldn't "they" expect `b'de' + 1` to work as well in this case? If a 1-byte bytes is equivalent to an integer, why not an arbitrary one as well?
The result of this must obviously be b"de1".
To clarify my original one-liner: if bytes objects (but only one-char bytes objects) equal integers, you should rightly expect to treat them as integers. This is obviously *not* desirable from a strong-typing POV. Georg
OK, summarising the thread so far from my point of view.
1. There are some aspects of the behavior of bytes() objects that tempt people to think of them as string-like objects (primarily the b'' literals and their use in repr(), along with the fact that they fill roles that were filled by str in its "arbitrary binary data" incarnation in Python 2.x). The mental model this creates in the reader is incorrect, as bytes() are far closer to array.array('c') in their underlying behaviour (and deliberately so - cf. PEP 358, 3112, 3137).
One proposal for addressing this is to add a x'deadbeef' literal and use that in repr() rather than the bytestring. Another would be to escape all characters, even printable ASCII, in the bytes() representation. Both of these are undesirable, as they miss the original purpose of this behaviour: making it easier to work with the many ASCII based wire protocols that are in widespread use.
To be honest, I don't think there is a lot we can do here except to further emphasise in the documentation and elsewhere that *bytes is not a string type* (regardless of any API similarities retained to ease transition from the 2.x series). For example, if we have any lingering references to "byte strings" they should be replaced with "byte sequences" or "bytes objects" (depending on context, as the former phrasing also encompasses bytearray objects).
2. As a concrete usability issue, it is awkward to programmatically check the value of a specific byte when working with an ASCII based protocol:
    data[i] == b'a'      # Intuitive, but always False due to type mismatch
    data[i:i+1] == b'a'  # Works, but clumsy
    data[i] == b'a'[0]   # Ditto (but at least susceptible to compiler const-expression optimisation)
    data[i] == ord('a')  # Clumsy and slow
    data[i] == 97        # Hard to read
Proposals to address this include:
- introduce a "character" literal to allow c'a' as an alternative to ord('a')
  Potentially workable, but leaves the intuitive answer above silently producing an unexpected answer
- allow 1-element byte sequences to compare equal to the corresponding integer values.
  - would require reworking of bytes.__hash__ to use the hash of the contained element when the data length is exactly 1
  - transitivity of equality would recommend also supporting equivalences such as b'a' == 97.0
  - backwards compatibility concerns arise due to introduction of new key collisions in dictionaries and sets and other value based containers
  - yet more string-like behaviour in a type that is *not* a string (further reinforcing the mistaken impression from point 1)
  - One thing that *isn't* a concern from my point of view is the fact that we have ample precedent in decimal.Decimal for supporting implicit coercion in comparison operations while disallowing them in arithmetic operations (Decimal("1") == 1.0 is allowed, but Decimal("1") + 1.0 will raise TypeError).
For point 2, I'm personally +0 on the idea of having 1-element bytes and bytearray objects delegate hashing and comparison operations to the corresponding integer object. We have the power to make the obvious code correct code, so let's do that. However, the implications of the additional key collisions in value based containers may need to be explored further.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
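The five spellings listed in point 2, and the decimal.Decimal precedent mentioned at the end, can be tried out as follows. This is only a sketch, assuming Python 3.2 or later (where Decimal and float compare by value); the sample value `data = b'banana'` and the names `spellings`, `compares_equal` and `arithmetic_allowed` are made up for illustration.

```python
from decimal import Decimal

data = b'banana'
i = 1  # data[1] is the byte for 'a', i.e. 97

# Each spelling from the summary, evaluated against the same byte.
spellings = {
    "data[i] == b'a'": data[i] == b'a',          # int vs bytes
    "data[i:i+1] == b'a'": data[i:i+1] == b'a',  # bytes vs bytes
    "data[i] == b'a'[0]": data[i] == b'a'[0],    # int vs int
    "data[i] == ord('a')": data[i] == ord('a'),  # int vs int
    "data[i] == 97": data[i] == 97,              # int vs int
}

# Decimal allows mixed-type comparison but refuses mixed-type arithmetic.
compares_equal = Decimal("1") == 1.0
try:
    Decimal("1") + 1.0
    arithmetic_allowed = True
except TypeError:
    arithmetic_allowed = False
```

Only the first spelling evaluates to False; the other four agree with each other.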
On Thu, May 19, 2011 at 6:43 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
For point 2, I'm personally +0 on the idea of having 1-element bytes and bytearray objects delegate hashing and comparison operations to the corresponding integer object. We have the power to make the obvious code correct code, so let's do that. However, the implications of the additional key collisions in value based containers may need to be explored further.
On further reflection, the key collision and semantics blurring problems mean I am at best -0 on this particular solution to the problem (and heading fairly rapidly in the direction of -1). Best to just go with b'a'[0] and let the optimiser sort it out (PyPy should handle it automatically, CPython would need work). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan wrote:
On Thu, May 19, 2011 at 6:43 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
For point 2, I'm personally +0 on the idea of having 1-element bytes and bytearray objects delegate hashing and comparison operations to the corresponding integer object. We have the power to make the obvious code correct code, so let's do that. However, the implications of the additional key collisions in value based containers may need to be explored further.
Several folk have said that objects that compare equal must hash equal... Why? It's an honest question. Here's what I have tried:
--> class Wierd():
...     def __init__(self, value):
...         self.value = value
...     def __eq__(self, other):
...         return self.value == other
...     def __hash__(self):
...         return hash((self.value + 13) ** 3)
...
--> one = Wierd(1)
--> two = Wierd(2)
--> three = Wierd(3)
--> one
<Wierd object at 0x00BFE710>
--> one == 1
True
--> one == 2
False
--> two == 2
True
--> three == 3
True
--> d = dict()
--> d[one] = '1'
--> d[two] = '2'
--> d[three] = '3'
--> d
{<Wierd object at 0x00BFE710>: '1', <Wierd object at 0x00BFE870>: '3', <Wierd object at 0x00BFE830>: '2'}
--> d[1] = '1.0'
--> d[2] = '2.0'
--> d[3] = '3.0'
--> d
{<Wierd object at 0x00BFE870>: '3', 1: '1.0', 2: '2.0', 3: '3.0', <Wierd object at 0x00BFE830>: '2', <Wierd object at 0x00BFE710>: '1'}
--> d[2]
'2.0'
--> d[two]
'2'
This behavior matches what I was imagining for having b'a' == 97. They compare equal, yet remain distinct objects for all other purposes. If anybody has a link to or an explanation why equal values must be equal hashes I'm all ears. My apologies in advance if this is an incredibly naive question. ~Ethan~
2011/5/19 Ethan Furman <ethan@stoneleaf.us>:
If anybody has a link to or an explanation why equal values must be equal hashes I'm all ears. My apologies in advance if this is an incredibly naive question.
https://secure.wikimedia.org/wikipedia/en/wiki/Hash_table -- Regards, Benjamin
On May 19, 2011, at 7:40 PM, Ethan Furman wrote:
Several folk have said that objects that compare equal must hash equal...
And so do the docs: http://docs.python.org/dev/reference/datamodel.html#object.__hash__ , "the only required property is that objects which compare equal have the same hash value". Raymond
On Fri, May 20, 2011 at 10:40 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
This behavior matches what I was imagining for having b'a' == 97. They compare equal, yet remain distinct objects for all other purposes.
If anybody has a link to or an explanation why equal values must be equal hashes I'm all ears. My apologies in advance if this is an incredibly naive question.
Because whether or not two objects can coexist in the same hash table should *not* depend on their hash values - it should depend on whether or not they compare equal to each other. The use of hashing should just be an optimisation, not fundamentally change the nature of the comparison operation. (i.e. "hash(a) == hash(b) and a == b" is meant to be a fast alternative to "a == b", not a completely different check). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
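Nick's point can be illustrated with two small wrapper classes. This is a sketch under stated assumptions: `Inconsistent` mirrors the shape of the class posted earlier in the thread, while `Consistent` and all variable names are hypothetical additions for comparison.

```python
class Inconsistent:
    """Compares equal to its wrapped value but hashes differently."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other

    def __hash__(self):
        return hash((self.value + 13) ** 3)


class Consistent(Inconsistent):
    """Same equality, but the hash agrees with the wrapped value."""

    def __hash__(self):
        return hash(self.value)


bad_key = Inconsistent(1)
good_key = Consistent(1)

# Both wrappers compare equal to the plain integer...
assert bad_key == 1 and good_key == 1

# ...but only the one with a matching hash is found by a dict lookup.
found_with_bad_hash = 1 in {bad_key: 'x'}
found_with_good_hash = 1 in {good_key: 'x'}
```

With mismatched hashes, `a == b` no longer predicts whether `a` and `b` are treated as the same key, which is the situation the data model rule is meant to prevent.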
On Thu, May 19, 2011 at 1:43 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
OK, summarising the thread so far from my point of view.
1. There are some aspects of the behavior of bytes() objects that tempt people to think of them as string-like objects (primarily the b'' literals and their use in repr(), along with the fact that they fill roles that were filled by str in its "arbitrary binary data" incarnation in Python 2.x). The mental model this creates in the reader is incorrect, as bytes() are far closer to array.array('c') in their underlying behaviour (and deliberately so - cf. PEP 358, 3112, 3137).
I think most of this "wrong mental model" is actually due to people not having completely internalized the Python 3 way.
One proposal for addressing this is to add a x'deadbeef' literal and using that in repr() rather than the bytestring. Another would be to escape all characters, even printable ASCII, in the bytes() representation. Both of these are undesirable, as they miss the original purpose of this behaviour: making it easier to work with the many ASCII based wire protocols that are in widespread use.
Indeed, -1 on both.
To be honest, I don't think there is a lot we can do here except to further emphasise in the documentation and elsewhere that *bytes is not a string type* (regardless of any API similarities retained to ease transition from the 2.x series). For example, if we have any lingering references to "byte strings" they should be replaced with "byte sequences" or "bytes objects" (depending on context, as the former phrasing also encompasses bytearray objects).
+1
2. As a concrete usability issue, it is awkward to programmatically check the value of a specific byte when working with an ASCII based protocol:
    data[i] == b'a'      # Intuitive, but always False due to type mismatch
    data[i:i+1] == b'a'  # Works, but clumsy
    data[i] == b'a'[0]   # Ditto (but at least susceptible to compiler const-expression optimisation)
    data[i] == ord('a')  # Clumsy and slow
    data[i] == 97        # Hard to read
Proposals to address this include: - introduce a "character" literal to allow c'a' as an alternative to ord('a')
-1; the result is not a *character* but an integer. I'm personally favoring using b'a'[0] and possibly hiding this in a constant definition.
Potentially workable, but leaves the intuitive answer above silently producing an unexpected answer
I'm not convinced that that problem is any worse than other comparison-related problems. E.g. b'a' == 'a' also always returns False (most likely it'll be disguised by at least one operand being a variable of course.)
- allow 1-element byte sequences to compare equal to the corresponding integer values.
  - would require reworking of bytes.__hash__ to use the hash of the contained element when the data length is exactly 1
  - transitivity of equality would recommend also supporting equivalences such as b'a' == 97.0
  - backwards compatibility concerns arise due to introduction of new key collisions in dictionaries and sets and other value based containers
  - yet more string-like behaviour in a type that is *not* a string (further reinforcing the mistaken impression from point 1)
  - One thing that *isn't* a concern from my point of view is the fact that we have ample precedent in decimal.Decimal for supporting implicit coercion in comparison operations while disallowing them in arithmetic operations (Decimal("1") == 1.0 is allowed, but Decimal("1") + 1.0 will raise TypeError).
For point 2, I'm personally +0 on the idea of having 1-element bytes and bytearray objects delegate hashing and comparison operations to the corresponding integer object. We have the power to make the obvious code correct code, so let's do that. However, the implications of the additional key collisions in value based containers may need to be explored further.
My gut feeling about this is that this will probably introduce some confusing or unintended side effect elsewhere, and I am -1 on this change. -- --Guido van Rossum (python.org/~guido)
On May 19, 2011, at 1:43 PM, Guido van Rossum wrote:
-1; the result is not a *character* but an integer.
Well, really the result ought to be an octet, but I suppose adding an 'octet' type is beyond the scope of even this sprawling discussion :).
I'm personally favoring using b'a'[0] and possibly hiding this in a constant definition.
As someone who spends a frankly unfortunate amount of time handling protocols where things like this are necessary, I agree with this recommendation. In protocols where one needs to compare network data with one-byte type identifiers or packet prefixes, more (documented) constants and less inscrutable junk like
    if p == 'c':
        ...
    elif p == 'j':
        ...
    elif p == 'J': # for compatibility
        ...
would definitely be a good thing. Of course, I realize that this sort of programmer will most likely replace those constants with 99, 106, 74 rather than take a moment to document what they mean, but at least they'll have to pause for a moment and realize that they have now lost _all_ mnemonics...
In fact, I feel like I would want to push in the opposite direction: don't treat one-byte bytes slices less like integers; I wish I could more easily treat n-byte sequences _more_ like integers! :). More protocols have 2-byte or 4-byte network-endian packed integers embedded in them than have individual tag bytes that I want to examine. For the typical ASCII-ish protocol where you want to look at command names and CRLF-separated messages, you'd never want to look at an individual octet; stringish operations like split() will give you what you want.
Glyph Lefkowitz wrote:
In fact, I feel like I would want to push in the opposite direction: don't treat one-byte bytes slices less like integers; I wish I could more easily treat n-byte sequences _more_ like integers! :). More protocols have 2-byte or 4-byte network-endian packed integers embedded in them than have individual tag bytes that I want to examine.
So are you thinking that bytes([01,56])[:2] == 120 ? Or more along the lines of a .to_int() method? ~Ethan~
On 5/23/2011 1:20 PM, Ethan Furman wrote:
Glyph Lefkowitz wrote:
In fact, I feel like I would want to push in the opposite direction: don't treat one-byte bytes slices less like integers; I wish I could more easily treat n-byte sequences _more_ like integers! :). More protocols have 2-byte or 4-byte network-endian packed integers embedded in them than have individual tag bytes that I want to examine.
So are you thinking that bytes([01,56])[:2] == 120 ? Or more along the lines of a .to_int() method?
I believe that such things can be handled by the struct module. -- Terry Jan Reedy
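As a sketch of what Terry's suggestion might look like, the standard struct module (and, from Python 3.2, int.from_bytes) can decode network-endian integers embedded in a byte stream. The sample `payload` and `packet` values and the variable names below are invented for illustration; they do not come from any real protocol.

```python
import struct

# Two bytes interpreted as one big-endian (network order) unsigned 16-bit integer.
payload = bytes([1, 56])
(value,) = struct.unpack('>H', payload)

# int.from_bytes gives the same result without a format string.
same_value = int.from_bytes(payload, byteorder='big')

# A hypothetical packet: a 4-byte big-endian length followed by a 1-byte tag.
packet = b'\x00\x00\x01\x00' + b'C'
length, tag = struct.unpack('>IB', packet)
```

Here `tag` comes back as an int, so it compares against constants such as `b'C'[0]` rather than `b'C'`.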
Guido van Rossum wrote:
On Thu, May 19, 2011 at 1:43 AM, Nick Coghlan wrote:
Proposals to address this include: - introduce a "character" literal to allow c'a' as an alternative to ord('a')
-1; the result is not a *character* but an integer. I'm personally favoring using b'a'[0] and possibly hiding this in a constant definition.
Using this method, my code now looks like:
# constants
EOH = b'\r'[0]
CHAR = b'C'[0]
DATE = b'D'[0]
FLOAT = b'F'[0]
INT = b'I'[0]
LOGICAL = b'L'[0]
MEMO = b'M'[0]
NUMBER = b'N'[0]
This is not beautiful code. ~Ethan~
# constants
EOH = b'\r'[0]
CHAR = b'C'[0]
DATE = b'D'[0]
FLOAT = b'F'[0]
INT = b'I'[0]
LOGICAL = b'L'[0]
MEMO = b'M'[0]
NUMBER = b'N'[0]
This is not beautiful code.
In this case, I think the intent would be better captured with
def ASCII(c):
    return c.encode('ascii')
EOH = ASCII('\r') # 0D
CHAR = ASCII('C') # 43
DATE = ASCII('D') # 44
FLOAT = ASCII('F') # 46
INT = ASCII('I') # 49
LOGICAL = ASCII('L') # 4C
MEMO = ASCII('M') # 4D
NUMBER = ASCII('N') # 4E
This expresses the intent that a) these are really byte values, not characters, and b) the specific choice of byte values was motivated by ASCII. Regards, Martin
Guido van Rossum wrote:
On Thu, May 19, 2011 at 1:43 AM, Nick Coghlan wrote:
Proposals to address this include: - introduce a "character" literal to allow c'a' as an alternative to ord('a')
-1; the result is not a *character* but an integer.
Would you be happier if it were spelled i'a' instead? -- Greg
Ethan Furman writes:
Using this method, my code now looks like:
# constants
[...]
This is not beautiful code.
Put mascara on a pig, and you have a pig with mascara on, not Bette Davis. I don't necessarily think you're doing anybody a service by making the hack of using ASCII bytes as mnemonics more beautiful. I think Martin's version is as beautiful as this code should get.
On Mon, Jun 13, 2011 at 3:18 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
This is not beautiful code.
Agreed, but:
EOH, CHAR, DATE, FLOAT, INT, LOGICAL, MEMO, NUMBER = b'\rCDFILMN'
is a shorter way to write the same thing. Going two per line makes it easier to mentally map the characters:
EOH, CHAR = b'\rC'
DATE, FLOAT = b'DF'
INT, LOGICAL = b'IL'
MEMO, NUMBER = b'MN'
Or, as a variant on Martin's solution:
FORMAT_CHARS = dict(
    EOH = '\r',
    CHAR = 'C',
    DATE = 'D',
    FLOAT = 'F',
    INT = 'I',
    LOGICAL = 'L',
    MEMO = 'M',
    NUMBER = 'N'
)
FORMAT_CODES = {name : char.encode('ascii') for name, char in FORMAT_CHARS.items()}
globals().update(FORMAT_CODES)
Sure, there's no "one obvious way" at this stage, but that's because we don't know yet if there even *should* be an obvious way to do this (as conflating text and binary data is a bad idea in principle). By not blessing any one way of handling the situation, we give alternative solutions time to evolve naturally. If one turns out to be clearly superior to the decode/process/encode cycle then hopefully that will become clear at some point in the future.
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
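The unpacking trick works because iterating over a bytes object yields ints. A minimal sketch, assuming Python 3; the `record` value and the name `encoded` are illustrative only. It also shows that the encode-based spelling produces a 1-byte bytes object rather than an int, so each kind of constant pairs with a different style of comparison.

```python
# Unpacking a bytes literal assigns one int per name.
EOH, CHAR, DATE, FLOAT, INT, LOGICAL, MEMO, NUMBER = b'\rCDFILMN'

# The encode-based spelling yields a 1-byte bytes object instead.
encoded = 'C'.encode('ascii')

# A made-up record whose third byte is the type code.
record = b'xxC'

int_constant_matches_index = record[2] == CHAR       # int vs int
bytes_constant_matches_slice = record[2:3] == encoded  # bytes vs bytes
bytes_constant_matches_index = record[2] == encoded    # int vs bytes
```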
Thank you all for the responses. Rather than reply to each, I just made one big summary. :)
----------------------------------------------------------------
Martin v. Löwis wrote:
Ethan Furman wrote:
# constants
EOH = b'\r'[0]
CHAR = b'C'[0]
DATE = b'D'[0]
FLOAT = b'F'[0]
INT = b'I'[0]
LOGICAL = b'L'[0]
MEMO = b'M'[0]
NUMBER = b'N'[0]
This is not beautiful code.
In this case, I think the intent would be better captured with
def ASCII(c): return c.encode('ascii')
EOH = ASCII('\r') # 0D
CHAR = ASCII('C') # 43
DATE = ASCII('D') # 44
FLOAT = ASCII('F') # 46
INT = ASCII('I') # 49
LOGICAL = ASCII('L') # 4C
MEMO = ASCII('M') # 4D
NUMBER = ASCII('N') # 4E
This expresses the intent that a) these are really byte values, not characters, and b) the specific choice of byte values was motivated by ASCII.
Definitely easier to read. If I go this route I'll probably use ord(), though, since ASCII and Unicode are the same for the first 128 code points, and there will be plenty of places to error out with a more appropriate message if I get garbage. Since I really don't care what the actual integer values are, I'll skip those comments, too.
----------------------------------------------------------------
Hagen Fürstenau wrote:
You still have the alternative
EOH = ord('\r')
CHAR = ord('C')
...
which looks fine to me.
Yes it does. I just dislike the (to me unnecessary) extra function call. For those tuning in late to this thread, these are workarounds for this not working:
field_type = header[11] # field_type is now an int, not a 1-byte bstr
if field_type == b'C': # b'C' is a 1-byte bstr, so this always fails
----------------------------------------------------------------
Greg Ewing wrote:
Guido van Rossum wrote:
On Thu, May 19, 2011 at 1:43 AM, Nick Coghlan wrote:
Proposals to address this include: - introduce a "character" literal to allow c'a' as an alternative to ord('a')
-1; the result is not a *character* but an integer.
Would you be happier if it were spelled i'a' instead?
That would work for me, although I would prefer a'a' (for ASCII). :)
----------------------------------------------------------------
Stephen J. Turnbull wrote:
Put mascara on a pig, and you have a pig with mascara on, not Bette Davis. I don't necessarily think you're doing anybody a service by making the hack of using ASCII bytes as mnemonics more beautiful. I think Martin's version is as beautiful as this code should get.
I'll either use Martin's or Nick's. The point of beauty here is the ease of readability. I think less readable is worse, and we shouldn't have to write ugly, hard-to-read code or inefficient code just because we have to deal with byte streams that aren't unicode.
----------------------------------------------------------------
Nick Coghlan wrote:
Agreed, but:
EOH, CHAR, DATE, FLOAT, INT, LOGICAL, MEMO, NUMBER = b'\rCDFILMN'
is a shorter way to write the same thing.
Going two per line makes it easier to mentally map the characters:
EOH, CHAR = b'\rC'
DATE, FLOAT = b'DF'
INT, LOGICAL = b'IL'
MEMO, NUMBER = b'MN'
Wow. I didn't realize that could be done. That very nearly makes up for not being able to do it one char at a time. Thanks, Nick!
----------------------------------------------------------------
~Ethan~
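Putting the workarounds from this summary together, a field-type check might be sketched as below. Everything here is an assumption made for illustration: the 32-byte `descriptor` is fabricated, offset 11 is taken only from the `header[11]` line quoted above, and `FIELD_TYPE_NAMES` and `field_type_name` are hypothetical names, not part of any real .dbf library.

```python
# Integer constants obtained by unpacking a bytes literal.
CHAR, DATE, NUMBER = b'CDN'

# Hypothetical lookup table from type code to a readable name.
FIELD_TYPE_NAMES = {
    CHAR: 'Character',
    DATE: 'Date',
    NUMBER: 'Numeric',
}


def field_type_name(descriptor):
    """Return a readable name for the type byte at offset 11 (assumed layout)."""
    field_type = descriptor[11]  # an int, because indexing bytes yields an int
    return FIELD_TYPE_NAMES.get(field_type, 'unknown')


# A fabricated descriptor: name padded to 11 bytes, one type byte, zero filler.
descriptor = b'NAME'.ljust(11, b'\x00') + b'C' + bytes(20)

type_name = field_type_name(descriptor)
naive_check = descriptor[11] == b'C'  # the intuitive comparison, int vs bytes
```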
At 03:11 PM 6/13/2011 -0700, Ethan Furman wrote:
Nick Coghlan wrote:
Agreed, but:
EOH, CHAR, DATE, FLOAT, INT, LOGICAL, MEMO, NUMBER = b'\rCDFILMN'
is a shorter way to write the same thing.
Going two per line makes it easier to mentally map the characters:
EOH, CHAR = b'\rC'
DATE, FLOAT = b'DF'
INT, LOGICAL = b'IL'
MEMO, NUMBER = b'MN'
Wow. I didn't realize that could be done. That very nearly makes up for not being able to do it one char at a time.
You can still do it one at a time:
CHAR, = b'C'
INT, = b'I'
...
etc. I just tried it with Python 3.1 and it works there.
On Tue, Jun 14, 2011 at 9:40 AM, P.J. Eby <pje@telecommunity.com> wrote:
You can still do it one at a time:
CHAR, = b'C'
INT, = b'I'
...
etc. I just tried it with Python 3.1 and it works there.
I almost mentioned that, although it does violate one of the "unwritten rules of the Zen" (in this case, "syntax shall not look like grit on Tim's monitor") Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 12:35 am, ncoghlan@gmail.com wrote:
On Tue, Jun 14, 2011 at 9:40 AM, P.J. Eby <pje@telecommunity.com> wrote:
You can still do it one at a time:
CHAR, = b'C'
INT, = b'I'
...
etc. I just tried it with Python 3.1 and it works there.
I almost mentioned that, although it does violate one of the "unwritten rules of the Zen" (in this case, "syntax shall not look like grit on Tim's monitor")
[CHAR] = b'C'
[INT] = b'I'
...
Jean-Paul
Nick Coghlan wrote:
OK, summarising the thread so far from my point of view.
[snip]
To be honest, I don't think there is a lot we can do here except to further emphasise in the documentation and elsewhere that *bytes is not a string type* (regardless of any API similarities retained to ease transition from the 2.x series). For example, if we have any lingering references to "byte strings" they should be replaced with "byte sequences" or "bytes objects" (depending on context, as the former phrasing also encompasses bytearray objects).
I think this would be a big help.
2. As a concrete usability issue, it is awkward to programmatically check the value of a specific byte when working with an ASCII based protocol:
    data[i] == b'a'      # Intuitive, but always False due to type mismatch
    data[i:i+1] == b'a'  # Works, but clumsy
    data[i] == b'a'[0]   # Ditto (but at least susceptible to compiler const-expression optimisation)
    data[i] == ord('a')  # Clumsy and slow
    data[i] == 97        # Hard to read
Proposals to address this include: - introduce a "character" literal to allow c'a' as an alternative to ord('a') Potentially workable, but leaves the intuitive answer above silently producing an unexpected answer
[snip]
For point 2, I'm personally +0 on the idea of having 1-element bytes and bytearray objects delegate hashing and comparison operations to the corresponding integer object. We have the power to make the obvious code correct code, so let's do that. However, the implications of the additional key collisions in value based containers may need to be explored further.
Nick Coghlan also wrote:
On further reflection, the key collision and semantics blurring problems mean I am at best -0 on this particular solution to the problem (and heading fairly rapidly in the direction of -1).
Last thought I have for a possible 'solution' -- when a bytes object is tested for equality against an int raise TypeError. Precedent being sum() raising a TypeError when passed a list of strings because performance is so poor. Reason here being that the intuitive behavior will never work and will always produce silent bugs. ~Ethan~
On Thu, May 19, 2011 at 10:50 AM, Ethan Furman <ethan@stoneleaf.us> wrote:
Last thought I have for a possible 'solution' -- when a bytes object is tested for equality against an int raise TypeError. Precedent being sum() raising a TypeError when passed a list of strings because performance is so poor. Reason here being that the intuitive behavior will never work and will always produce silent bugs.
Not the same thing at all. The == operator is special, and should not raise exceptions; too many things would start randomly failing (e.g. membership tests for a dict that has both ints and bytes as keys, or for a list containing a variety of types). -- --Guido van Rossum (python.org/~guido)
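The kind of code Guido has in mind can be sketched as follows. This assumes default interpreter settings (no -b or -bb flag); the container contents and names are invented for illustration. Membership tests and dict lookups compare the probe against elements of other types, and they rely on == quietly returning False for mismatched types.

```python
# A list holding a variety of types, as in the example above.
items = [b'a', 'a', 97]

# Each membership test compares the probe with elements of other types first.
int_found = 97 in items
bytes_found = b'a' in items
missing = 98 in items

# A dict with both ints and bytes as keys keeps them as distinct entries.
mixed = {b'a': 'bytes key', 97: 'int key'}
```

If bytes-versus-int equality raised TypeError, the first membership test would fail as soon as it compared 97 with b'a'.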
participants (25)
-
"Martin v. Löwis"
-
Antoine Pitrou
-
Benjamin Peterson
-
Bill Janssen
-
Eric Smith
-
Ethan Furman
-
exarkun@twistedmatrix.com
-
Georg Brandl
-
Glenn Linderman
-
Glyph Lefkowitz
-
Greg Ewing
-
Guido van Rossum
-
Hagen Fürstenau
-
Michael Foord
-
Nick Coghlan
-
P.J. Eby
-
R. David Murray
-
Raymond Hettinger
-
Robert Collins
-
Stefan Behnel
-
Stephen J. Turnbull
-
Terry Reedy
-
Xavier Morel
-
Xavier Morel
-
Łukasz Langa