Python 3.x and bytes
As those who have to work with byte strings know, when retrieving a single character from a byte string, what you get back is not a byte string but an int -- a rather important distinction from unicode strings (str). This has the frustrating side effect of b'abc'[2] == b'c' being False.

It is far too late to change that particular behavior of the byte string (returning ints, that is) -- however, it may not be too late for a backwards-compatible change: have the bytes class' __eq__ method be modified so that it

1) checks to see if the bytes instance is length 1
2) checks to see if a) the other object is an int, and b) 0 <= other_obj < 256
3) if 1 and 2, makes the comparison between the int and its single element instead of returning NotImplemented?

This makes sense to me -- after all, the bytes class is an array of ints in range(256); it is a special case, but doesn't feel any more special than passing an int into bytes() giving a string of that many null bytes; and it would get rid of the, in my opinion ugly, idiom of

    some_var[i:i+1] == b'd'

It would also not require a new literal syntax.

Thoughts?

~Ethan~
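A minimal sketch of the behavior under discussion, using current Python 3 semantics (the variable names are only illustrative): indexing a bytes object yields an int, while a one-element slice yields a bytes object.

```python
data = b"abc"

# Indexing a bytes object yields an int, so comparing it with a bytes
# literal is False.
index_result = data[2]
index_matches_literal = (data[2] == b"c")

# A one-element slice yields a length-1 bytes object, which compares as expected.
slice_matches_literal = (data[2:3] == b"c")

# Comparing the int against the byte's code also works.
index_matches_int = (data[2] == ord("c"))

print(index_result, index_matches_literal, slice_matches_literal, index_matches_int)
```

The slice form is the idiom the post calls ugly; the int comparison is the alternative most replies end up recommending.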
On 5/18/2011 4:10 PM, Ethan Furman wrote:
For all sequences, slicing (if it works at all) returns a subsequence (possibly of length 0, which is why slicing can work with out-of-bounds slice points). For all (built-in) sequences except for strings, indexing returns a member of the sequence (which is why it raises an exception for out-of-bounds indexes). Leaving aside extension and user-defined sequences, strings are unique in instead returning a length-1 subsequence. So bytes are normal while strings are anomalous!

Why that anomaly? The immediate reason is that Python does not have a separate character type. Why not? Guido might best answer (but he might say 'my gut instinct'), but I can think of a few reasons.

1. That is how it is in the (math) theory of strings. 'A' is both a char and a string of length one. There is no separate 'char' type that cannot be added (concatenated) to other strings of whatever length.

2. (Related) This pragmatically works best for Python.

3. Python follows Occam's principle by not introducing types without necessity. And a separate char type is not *necessary*.

4. Text strings are homogeneous arrays (like the arrays in the array module), unlike heterogeneous tuples and lists. So they need not be sequences of Python objects, and for efficiency, would not be even if there were a character type. Like other arrays, they contain the information needed to produce Python objects on demand without actually containing such objects in the way tuples, lists, and dicts do.

I do, however, understand the tendency to think of bytes as strings because of both Python's history and the remnant string interface.

For people using non-Latin (non-ascii) alphabets, the 'convenience' of replacing some bytes with ascii-chars might be less convenient.

-- Terry Jan Reedy
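The slicing-versus-indexing distinction above can be checked directly. This is a small sketch; the helper name `describe` is made up for illustration.

```python
def describe(seq):
    """Return the type names of seq[0] (indexing) and seq[0:1] (slicing)."""
    return type(seq[0]).__name__, type(seq[0:1]).__name__

results = {
    "list": describe([10, 20, 30]),
    "bytes": describe(b"abc"),
    "str": describe("abc"),
}

# Slicing tolerates out-of-bounds slice points and returns an empty subsequence.
empty_slice = b"abc"[10:20]

# Indexing out of bounds raises IndexError instead.
try:
    b"abc"[10]
    index_failed = False
except IndexError:
    index_failed = True

print(results, empty_slice, index_failed)
```

For list and bytes, indexing yields a member (an int here) and slicing yields the container type; only str returns its own type from both operations.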
On Wed, May 18, 2011 at 11:10 PM, Terry Reedy <tjreedy@udel.edu> wrote:
I don't see the necessity of saying that length-1 strings aren't members of strings. For all definitions I can think of for "member of the sequence", they are. You get them when you iterate over them, you get them when you use index access, they work with .index(). They have a sort of infinite regress / cycle to them ("it's strings all the way down"), but you can get that with lists too (x = []; x.append(x); y = x + x -- compare with x = 'a'; y = x + x).
At least in the context of formal language theory (e.g. Sipser's Introduction to the Theory of Computation), characters (symbols) are a separate thing from strings. You have your alphabet, Sigma, which is an arbitrary set, and strings are finite sequences of elements from Sigma. In Python's case, it's chosen an alphabet where all elements are length-1 strings in the alphabet. I don't think that's really well-formed using this definition of string and ZFC, and the usual definitions of finite sequences (functions or linked-lists). It doesn't really matter, you can model it in something else.
I do, however, understand the tendency to think of bytes as strings because of both Python's history and the remnant string interface.
I would add the syntax of bytes literals to the list of similarities. br'\foo' versus r'\foo' makes them very similar.
For people using non-Latin (non-ascii) alphabets, the 'convenience' of replacing some bytes with ascii-chars might be less convenient.
Eh, actually I think what was suggested was having e.g. b'\x42' == 0x42 by making singleton bytes objects equal to the appropriate integer. This would work for all bytes, not just those smaller than 128.

Devin Jeanpierre
Terry Reedy writes: <aside>
3. Python follows Occam's principle by not introducing types without necessity. And a separate char type is not *necessary*.
Well, neither are floats and integers; Decimal should do, no? </aside>
For people using non-Latin (non-ascii) alphabets, the 'convenience' of replacing some bytes with ascii-chars might be less convenient.
For us, the convenience remains. Japanese mail is transmitted via SMTP, and the control function "hello" is still spelled "EHLO" in Japanese mail. Farsi web pages are formatted by HTML, and the control function "new line" is spelled "<BR>" in Farsi, of course. It's the pain that comes from the inevitable mixing of binary protocol that looks like text with real text, turning the whole into an unintelligible garble, that hurts so much harder for people who can't properly write their names in ASCII. ターンブル・スティーヴェンです-ly y'rs,
On 5/20/2011 1:44 AM, Stephen J. Turnbull wrote:
I understood the thrust of this thread being that doing text manipulation with bytes sometimes bites -- because bytes are not text. Someone writing email or html bodies in Japanese or Farsi will not even try that, but will use str (unicode) and encode to bytes only when done, most likely transparently. As far as I noticed, Ethan did not explain why he was extracting single bytes and comparing to a constant, so it is hard to know if he was even using them properly.
I am not familiar with that control function, but if it is part of the SMTP protocol, it has nothing to do with the language of the payload. For programming a wire protocol that encodes abstract functions in ascii chars, the ascii char representation of bytes is convenient. That is why it was chosen as the default.
Farsi web pages are formatted by HTML, and the control function "new line" is spelled "<BR>" in Farsi, of course.
When writing the html *text* body, sure. But I presume browsers decode encoded bytes to unicode *before* parsing the text. If so, it does not really matter that '<br>' gets encoded to b'<br>'.
-- Terry Jan Reedy
Terry Reedy wrote:
The header of a .dbf file details the field composition such as name, size, type, etc. The type is C for character, L for logical, etc., and the end of the field definition block is signaled by a CR byte.

So in one spot of my code I (used to) have a comparison

    if hdr[0] == b'\x0d': # end of fields

which I have changed to

    if hdr[0] == 0x0d:

and elsewhere:

    field_type = hdr[11]

which is now

    field_type = chr(hdr[11])

since the first 128 positions of unicode are ASCII.

However, I can see this silently producing errors for values between 128 and 255 -- consider:

    --> chr(0xa1)
    '¡'
    --> b'\xa1'.decode('cp1251')
    '\u040e'

So because my single element access to the byte string lost its bytes type, I may no longer get the correct result.

~Ethan~
On 20 May 2011 14:05, Ethan Furman <ethan@stoneleaf.us> wrote:
This seems to me to be an improvement, regardless...
That seems reasonable, if you have a fixed set of known-ASCII values that are field types. If you care about detecting invalid files, then do a field_type in 'CL...' test to validate and you're fine.
But those aren't valid field codes, so why do you care? And why are you using cp1251? I thought you said they were ASCII? As I said, if you're checking for error values, just start with either a check for specific values, or simply check the field type is <128.
So because my single element access to the byte string lost its bytes type, I may no longer get the correct result.
I still don't see your problem here... Paul.
On Fri, May 20, 2011 at 11:05 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
which is now
field_type = chr(hdr[11])
This is definitely a modelling problem, and exactly the kind of thinking that the bytes model in Py3k is intended to combat. Bytes are not text, even when you're dealing primarily with ASCII. The world where that mindset worked consistently and reliably is ancient history (and many non-English speakers still suffer annoying software glitches due to the fact that English speakers have been able to get by with only ASCII for so long). If you want a subscript on a bytes object to create another bytes object, then slice it, just as you would a list. If you want the integer value, index it.
So because my single element access to the byte string lost its bytes type, I may no longer get the correct result.
Umm, no. You may not get the correct result because you're telling Python to interpret a value as a Unicode code point when it is actually no such thing (given your example, I assume it is actually cp1251 encoded text).

Therefore, instead of:

    chr(hdr[11]) # Only makes sense for a sequence of Unicode code points

you want something like:

    hdr[11:12].decode('cp1251') # Makes sense for a cp1251 encoded byte sequence

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
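As a sketch of the slice-then-decode approach, the helper below wraps it in a function. The name `field_type_text`, the default offset of 11, and the cp1251 default are assumptions taken from the examples in this thread, not a statement about the .dbf format itself.

```python
def field_type_text(hdr: bytes, offset: int = 11, encoding: str = "cp1251") -> str:
    """Decode the single byte at `offset` using the data's actual encoding."""
    # Slicing keeps the bytes type, so .decode() is available.
    return hdr[offset:offset + 1].decode(encoding)

# Hypothetical header: 11 padding bytes followed by the byte 0xA1.
hdr = bytes(11) + b"\xa1"

via_chr = chr(hdr[11])          # treats 0xA1 as a Unicode code point
via_decode = field_type_text(hdr)  # treats 0xA1 as cp1251-encoded text

# For pure-ASCII bytes the two approaches agree.
ascii_hdr = bytes(11) + b"C"
ascii_agrees = chr(ascii_hdr[11]) == field_type_text(ascii_hdr)

print(repr(via_chr), repr(via_decode), ascii_agrees)
```

The two results differ only for bytes at or above 0x80, which is exactly the range Ethan was worried about.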
Nick Coghlan <ncoghlan@gmail.com> wrote:
To me, that's the crux of this issue, and that's the reason this will keep coming up again and again, and that's the reason people will continue to want to "improve" the 'bytes' type to be more 'string-like'. The problem, of course, is that bytes often *are* text, in the sense that the byte sequence contains an encoded string, and the programmer both knows that and wants that. Even for non-ASCII strings. Because Python is widely used for processing encoded strings of various kinds, and programmers hate to decode/encode just to work on them *as* strings. Mind you, that's exactly the wrong thing to do, in my opinion. It just gets us back to the bad old days of Python 2, where strings were often kept in a sequence of bytes which had no way of indicating what encoding it had. But changing the mindset of programmers? Hard to do, very hard to do. Personally, I think a more realistic approach might be to (a) improve the implementation of 'str()' so that it avoids unnecessary decode/encode operations, decoding only when necessary (yes, that means there would be multiple C-level representations for a 'str'), and then (b) making 'bytes' less useful as strings. Bill
On 5/20/2011 9:05 AM, Ethan Furman wrote:
At the level of bytes, these are small int codes. For English speakers, it is convenient that most map to ascii chars that are the first letters of an English name of the type. This convenience is somewhat lost for non-English non-latin-alphabet speakers who cannot do the same.

Some people dislike magic constants in code and would suggest defining them at the top of the file (or even in a separate module) with comments that define and explain the protocol.

    # Field type codes
    T_log = ...  # Logical field with T or F <or whatever>
    T_char= ...  # Variable length char field <or whatever>
    T_efdb= 0x0d # End of field definition block

Take your pick of how to define the constants:

    >>> 0x0d == 13 == 0o15 == 0b1101 == ord(b'\r') == ord('\r') == b'\r'[0]
    True

In 3.x, the identifiers and comments can use any characters and language, so this works for everyone.

-- Terry Jan Reedy
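A minimal sketch of the named-constant style, assuming the C and L type codes and the CR terminator that Ethan described earlier in the thread. The constant names and the helper `is_end_of_fields` are illustrative only.

```python
# Field type codes, defined once at module level.
# C and L follow Ethan's description; the names are hypothetical.
T_CHAR = ord("C")           # character field
T_LOGICAL = ord("L")        # logical field
T_END_OF_FIELDS = 0x0D      # CR byte that ends the field definition block

def is_end_of_fields(hdr: bytes) -> bool:
    """True if the first byte of hdr is the end-of-fields marker."""
    # hdr[0] is an int, so it compares directly against the int constant.
    return hdr[0] == T_END_OF_FIELDS

# All of these spellings denote the same integer.
same_value = (0x0D == 13 == 0o15 == 0b1101
              == ord(b"\r") == ord("\r") == b"\r"[0])

print(is_end_of_fields(b"\r\x00"), is_end_of_fields(b"C\x00"), same_value)
```

Because the constants are plain ints, comparisons against indexed bytes need no slicing or decoding.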
Terry Reedy writes:
It doesn't really matter whether Ethan is using them properly. It's clear there are such uses, though I don't know how important they are, so we may as well assume Ethan's is one such.
Precisely my point. Therefore a payload represented as bytes should be treated as *uninterpreted* bytes, except where interpretations are defined for those bytes. This works for SMTP, because RFC 822 *deliberately* specifies headers to be encoded in ASCII (not "ASCII-compatible") in order that the payload (header) manipulations specified by RFC 821 and friends be guaranteed correct. Nevertheless, people frequently request mail processing features that require manipulations of MIME part bodies and even plain RFC 822 message bodies. These cannot be guaranteed correct unless done by decoding and reencoding, but bytes-oriented manipulations generally "work" in monolingual contexts (or seem to, and any problems can always be blamed on MS Outlook). There are several such features that come up over and over again on Mailman lists and sometimes in the Python Email SIG, and I'm sure the same is true for web protocols.
HTML is not exclusively processed by browsers. It is often processed by servers and middleware that don't know they're speaking HTML, and according to several experts' testimony, they're in a freakin' hurry to push bytes out the door, there's no time for Unicode (decoding and encoding, OMG how inefficient!) Such developers want to write their libraries using bytes *and* literals that can be used both for binary protocols and for text protocols (urlparse seems to be the canonical example). The convenience of using bytes in a string-like way (eg, the b'' literal) in manipulating many binary protocols is clear. That convenience is just as great for people who are at substantial risk of mojibake if bytes are used to do text manipulations on the encoded form, as well as for people who face little risk (eg, those who use only American English). The question is how far to go with polymorphism, etc. I think that Nick's urlparse work gets the balance about right, and see only danger in more stringlike bytes (eg, by returning b'b' for b'bytes'[0]). OTOH, there are some changes that might be useful but seem very low-risk, such as a c'b' literal that means 98, not b'b'.
On Mon, May 23, 2011 at 1:46 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
If we did go with an ord() literal, I would actually favour something more like 0'b'. However, as Maciej pointed out off-list, adding a new literal type because calls to builtin functions have a relatively high overhead in CPython even with constant arguments probably isn't a good idea. Better to just write "ord('b')" and use PyPy to make it fast (Alternative for use with -O rather than PyPy: "ordb = 98; assert ordb == ord('b')"). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Mon, May 23, 2011 at 5:33 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Not a nuisance enough to warrant a syntax change, IMO. Note that one of the proposed alternatives, 0'b', is visually very similar to b'x'[0].

There are plenty of other options available to users. My own favorite is probably,

    if bytesdata[i] == 98: # ord('b')
        ..

In some cases, when single-byte values have protocol mnemonics, it may be more appropriate to give them descriptive names:

    quit_code = ord('q')
    if bytesdata[i] == quit_code:
        ..

Finally, I find it rare to have single-byte codes at fixed positions in protocols. More often such codes are found after splitting the bytes data on some kind of separator.
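The last point can be sketched as follows. The `|` separator and the message contents are invented for illustration; they do not come from any real protocol.

```python
QUIT_CODE = ord("q")  # hypothetical protocol mnemonic

def is_quit(bytesdata: bytes, i: int) -> bool:
    """Compare the int at position i against a named constant."""
    return bytesdata[i] == QUIT_CODE

# Hypothetical message format: fields separated by b"|".
message = b"q|session-7"
command, argument = message.split(b"|")

# After splitting, each piece is a bytes object, so bytes literals
# compare directly and no indexing is involved.
command_is_quit = (command == b"q")

print(is_quit(message, 0), command_is_quit, argument)
```

When codes arrive as the result of a split, the int-versus-bytes question does not arise, since every piece is already a bytes object.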
I like c'x'. It's easy to read and very explicitly constant and clear what the value is 'x'. (Some other letter instead of 'c' would be fine as well.) I don't like this:
if bytesdata[i] == 121: # ord('x')
because it looks a heck of a lot like:
if bytesdata[i] == 120: # ord('x')
and only one of those is correct. That's a very easy bug to miss. I like it even less without the comment. I don't care for:
if bytesdata[i] == ord('x'):
because while ord is a builtin, it's not invulnerable to being changed. In contrast, string constants and numbers are truly constant. I recognize that the compiler can optimize:
if bytesdata[i] == b'x'[0]:
but that looks like chicken scratches to me.

Someone suggested using 0'x' which I don't quite get. It looks too much like 0x to me and I've always read the leading zero to mean 'this is a number'.

Also, this was raised in the context of bytes and not all characters fit in a byte. So c'Δ' and ord('Δ') work but b'Δ'[0] won't.

Is there a learning curve? Yes, but minor IMHO and if you don't know it, it's obvious when you see it that you don't know it.

--- Bruce
2011/5/23 Bruce Leban <bruce@leapyear.org>:
I like c'x'. It's easy to read and very explicitly constant and clear what the value is 'x'. (Some other letter instead of 'c' would be fine as well.)
-0 from me Mainly because unlike b'..' or r'..' constructs, no meaning is proposed for c'xyz'. BTW, is it too soon to assign new meaning to back-quotes? In py3k they no longer stand for repr(), so we can probably reuse them for ord()? On the other hand, this is likely to be a bad idea for the same reasons as syntax for repr() was.
2011/5/23 Bruce Leban <bruce@leapyear.org>:
I like c'x'. It's easy to read and very explicitly constant and clear what the value is 'x'. (Some other letter instead of 'c' would be fine as well.)
We shouldn't add any new notation to create integers from characters to the language. It's too small a use case for adding new syntax. I would focus on agreeing on the notation that is most readable; personally I vote for ord('x'). -- --Guido van Rossum (python.org/~guido)
Bruce Leban writes:
Using named constants should fix that, and is better style anyway.
Someone suggested using 0'x' which I don't quite get. It looks too much like 0x to me
True but minor, IMO YMMV.
and the I've always read the leading zero to mean 'this is a number'.
That's precisely Nick's point in suggesting it!
On Tue, May 24, 2011 at 12:40 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Indeed :) Still, I've come around to the point of view that the simplest and clearest way to write it is simply "ord('x')", and if that is in a time-critical inner loop, save the value in a named variable. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/e2594/e259423d3f20857071589262f2cb6e7688fbc5bf" alt=""
On 5/18/2011 4:10 PM, Ethan Furman wrote:
For all sequences, slicing (if it works at all) returns a subsequence (possibly of length 0, which is why slicing can work with out-of-bounds slice points). For all (built-in) sequences except for strings, indexing returns a member of the sequence (which is why it raises an exception for out-of-bounds indexes). Leaving aside extension and user-defined sequences, strings are unique in instead returning a length-1 subsequence So bytes are normal while strings are anomolous! Why that anomaly? The immediate reason is that Python does not have a separate character type. Why not? Guido might best answer (but he might say 'my gut instinct'), but I can think of a few reasons. 1. That is how it is in the (math) theory of strings. 'A' is both a char and a string of length one. There is no separate 'char' type that cannot be added (concatenated) to other strings of whatever length. 2. (Related) This pragmatically works best for Python. 3. Python follows Occam's principle by not introducing types without necessity. And a separate char type is not *necessary*. 4. Text strings are homegeneous arrays (like the arrays in the array module), unlike heterogeneous tuples and lists. So they need not be sequences of Python objects, and for efficiency, would not be even if there were a character type. Like other arrays, they contain the information needed to produce Python objects on demand without actually containing such objects in the way tuples, lists, and dicts do. I do, however, understand the tendency to think of bytes as strings because of both Python's history and the remnant string interface. For people using non-Latin (non-ascii) alphabets, the 'convenience' of replacing some bytes with ascii-chars might be less convenient. -- Terry Jan Reedy
data:image/s3,"s3://crabby-images/600af/600af0bbcc432b8ca2fa4d01f09c63633eb2f1a7" alt=""
On Wed, May 18, 2011 at 11:10 PM, Terry Reedy <tjreedy@udel.edu> wrote:
I don't see the necessity of saying that length-1 strings aren't members of strings. For all definitions I can think of for "member of the sequence", they are. You get them when you iterate over them, you get them when you use index access, they work with .index(). They have a sort of infinite regress / cycle to them ("it's strings all the way down"), but you can get that with lists too (x = []; x.append(x); y = x + x -- compare with x = 'a'; y = x + x).
At least in the context of formal language theory (e.g. Sipser's Introduction to the Theory of Computation), characters (symbols) are a separate thing from strings. You have your alphabet, Sigma, which is an arbitrary set, and strings are finite sequences of elements from Sigma. In Python's case, it's chosen an alphabet where all elements are length-1 strings in the alphabet. I don't think that's really well-formed using this definition of string and ZFC, and the usual definitions of finite sequences (functions or linked-lists). It doesn't really matter, you can model it in something else.
I do, however, understand the tendency to think of bytes as strings because of both Python's history and the remnant string interface.
I would add the syntax of bytes literals to the list of similarities. br'\foo' versus r'\foo' makes them very similar.
For people using non-Latin (non-ascii) alphabets, the 'convenience' of replacing some bytes with ascii-chars might be less convenient.
Eh, actually I think what was suggested was having w.g. b'\x42' == 0x42 by making singleton bytes objects equal to the appropriate integer. This would work for all bytes, not just those smaller than 128. Devin Jeanpierre
data:image/s3,"s3://crabby-images/b96f7/b96f788b988da8930539f76bf56bada135c1ba88" alt=""
Terry Reedy writes: <aside>
3. Python follows Occam's principle by not introducing types without necessity. And a separate char type is not *necessary*.
Well, neither are floats and integers; Decimal should do, no? </aside>
For people using non-Latin (non-ascii) alphabets, the 'convenience' of replacing some bytes with ascii-chars might be less convenient.
For us, the convenience remains. Japanese mail is transmitted via SMTP, and the control function "hello" is still spelled "EHLO" in Japanese mail. Farsi web pages are formatted by HTML, and the control function "new line" is spelled "<BR>" in Farsi, of course. It's the pain that comes from the inevitable mixing of binary protocol that looks like text with real text, turning the whole into an unintelligible garble, that hurts so much harder for people who can't properly write their names in ASCII. ターンブル・スティーヴェンです-ly y'rs,
data:image/s3,"s3://crabby-images/e2594/e259423d3f20857071589262f2cb6e7688fbc5bf" alt=""
On 5/20/2011 1:44 AM, Stephen J. Turnbull wrote:
I understood the thrust of this thread being that doing text manipulation with bytes sometimes bites -- because bytes are not text. Someone writing email or html bodies in Japanese or Farsi will not even try that, but will use str (unicode) and encode to bytes only when done, most likely transparently.. As far as I noticed, Ethan did not explain why he was extracting single bytes and comparing to a constant, so it is hard to know if he was even using them properly.
I am not familiar with that control function, but if it is part of the SMTP protocol, it has nothing to do with the language of the payload. For programming a wire protocol that encodes abstract functions in ascii chars, then the ascii char representation of bytes in convenient. That is why it was chosen as the default.
Farsi web pages are formatted by HTML, and the control function "new line" is spelled "<BR>" in Farsi, of course.
When writing the html *text* body, sure. But I presume browsers decode encoded bytes to unicode *before* parsing the text. If so, it does not really matter that '<br>' gets encoded to b'<br>'.
-- Terry Jan Reedy
data:image/s3,"s3://crabby-images/dd81a/dd81a0b0c00ff19c165000e617f6182a8ea63313" alt=""
Terry Reedy wrote:
The header of a .dbf file details the field composition such as name, size, type, etc. The type is C for character, L for logical, etc, and the end of the field definition block is signaled by a CR byte. So in one spot of my code I (used to) have a comparison if hdr[0] == b'\x0d': # end of fields which I have changed to if hdr[0] == 0x0d: and elsewhere: field_type = hdr[11] which is now field_type = chr(hdr[11]) since the first 127 positions of unicode are ASCII. However, I can see this silently producing errors for values between 128 and 255 -- consider: --> chr(0xa1) '¡' --> b'\xa1'.decode('cp1251') '\u040e' So because my single element access to the byte string lost its bytes type, I may no longer get the correct result. ~Ethan~
data:image/s3,"s3://crabby-images/8e91b/8e91bd2597e9c25a0a8c3497599699707003a9e9" alt=""
On 20 May 2011 14:05, Ethan Furman <ethan@stoneleaf.us> wrote:
This seems to me to be an improvement, regardless...
That seems reasonable, if you have a fixed set of known-ASCII values that are field types. If you care about detecting invalid files, then do a field_type in 'CL...' test to validate and you're fine.
But those aren't valid field codes, so why do you care? And why are you using cp1251? I thought you said they were ASCII? As I said, if you're checking for error values, just start with either a check for specific values, or simply check the field type is <128.
So because my single element access to the byte string lost its bytes type, I may no longer get the correct result.
I still don't see your problem here... Paul.
data:image/s3,"s3://crabby-images/eac55/eac5591fe952105aa6b0a522d87a8e612b813b5f" alt=""
On Fri, May 20, 2011 at 11:05 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
which is now
field_type = chr(hdr[11])
This is definitely a modelling problem, and exactly the kind of thinking that the bytes model in Py3k is intended to combat. Bytes are not text, even when you're dealing primarily with ASCII. The world where that mindset worked consistently and reliably is ancient history (and many non-English speakers still suffer annoying software glitches due to the fact that English speakers have been able to get by with only ASCII for so long). If you want a subscript on a bytes object to create another bytes object, then slice it, just as you would a list. If you want the integer value, index it.
So because my single element access to the byte string lost its bytes type, I may no longer get the correct result.
Umm, no. You may not get the correct result because you're telling Python to interpret a value as a Unicode code point when it is actually no such thing (given your example, I assume it is actually cp1251 encoded text). Therefore, instead of: chr(hdr[11]) # Only makes sense for a sequence of Unicode code points you want something like: hdr[11:12].decode('cp1251') # Makes sense for a cp1251 encoded byte sequence Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
data:image/s3,"s3://crabby-images/b2012/b20127a966d99eea8598511fc82e29f8d180df6c" alt=""
Nick Coghlan <ncoghlan@gmail.com> wrote:
To me, that's the crux of this issue, and that's the reason this will keep coming up again and again, and that's the reason people will continue to want to "improve" the 'bytes' type to be more 'string-like'. The problem, of course, is that bytes often *are* text, in the sense that the byte sequence contains an encoded string, and the programmer both knows that and wants that. Even for non-ASCII strings. Because Python is widely used for processing encoded strings of various kinds, and programmers hate to decode/encode just to work on them *as* strings. Mind you, that's exactly the wrong thing to do, in my opinion. It just gets us back to the bad old days of Python 2, where strings were often kept in a sequence of bytes which had no way of indicating what encoding it had. But changing the mindset of programmers? Hard to do, very hard to do. Personally, I think a more realistic approach might be to (a) improve the implementation of 'str()' so that it avoids unnecessary decode/encode operations, decoding only when necessary (yes, that means there would be multiple C-level representations for a 'str'), and then (b) making 'bytes' less useful as strings. Bill
data:image/s3,"s3://crabby-images/e2594/e259423d3f20857071589262f2cb6e7688fbc5bf" alt=""
On 5/20/2011 9:05 AM, Ethan Furman wrote:
At the level of bytes, these are small int codes. For English speakers, it is convenient that most map to ascii chars that are the first letters of an English name of the type. This convinience is somewhat lost for non-English non-latin-alphabet speakers who cannot do the same.
Some people dislike magic constants in code and would suggest defining them at the top of the file (or even in a separate module) with comment that define and explain the protocol. # Field type codes T_log = ... # Logical field with T or F <or whatever> T_char= ... # Variable length char field <or whatever> T_efdb= 0x0d # End of field definition block Take your pick of how to define the constants:
0x0d == 13 == 0o15 == 0b1101 == ord(b'\r') == ord('\r') == b'\r'[0] True
In 3.x, the identifies and comments can use any characters and language, so this works for everyone. -- Terry Jan Reedy
data:image/s3,"s3://crabby-images/b96f7/b96f788b988da8930539f76bf56bada135c1ba88" alt=""
Terry Reedy writes:
It doesn't really matter whether Ethan is using them properly. It's clear there are such uses, though I don't know how important they are, so we may as well assume Ethan's is one such.
Precisely my point. Therefore a payload represented as bytes should be treated as *uninterpreted* bytes, except where interpretations are defined for those bytes. This works for SMTP, because RFC 822 *deliberately* specifies headers to be encoded in ASCII (not "ASCII-compatible") in order that the payload (header) manipulations specified by RFC 821 and friends be guaranteed correct. Nevertheless, people frequently request mail processing features that require manipulations of MIME part bodies and even plain RFC 822 message bodies. These cannot be guaranteed correct unless done by decoding and reencoding, but bytes-oriented manipulations generally "work" in monolingual contexts (or seem to, and any problems can always be blamed on MS Outlook). There are several such features that come up over and over again on Mailman lists and sometimes in the Python Email SIG, and I'm sure the same is true for web protocols.
HTML is not exclusively processed by browsers. It is often processed by servers and middleware that don't know they're speaking HTML, and according to several experts' testimony, they're in a freakin' hurry to push bytes out the door, there's no time for Unicode (decoding and encoding, OMG how inefficient!). Such developers want to write their libraries using bytes *and* literals that can be used both for binary protocols and for text protocols (urlparse seems to be the canonical example). The convenience of using bytes in a string-like way (eg, the b'' literal) in manipulating many binary protocols is clear. That convenience is just as great for people who are at substantial risk of mojibake if bytes are used to do text manipulations on the encoded form, as well as for people who face little risk (eg, those who use only American English). The question is how far to go with polymorphism, etc. I think that Nick's urlparse work gets the balance about right, and see only danger in more stringlike bytes (eg, by returning b'b' for b'bytes'[0]). OTOH, there are some changes that might be useful but seem very low-risk, such as a c'b' literal that means 98, not b'b'.
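For readers unfamiliar with the urlparse polymorphism mentioned above, here is a small sketch using the standard library's urllib.parse as it behaves in current Python 3 (I am assuming this is the behaviour the message refers to). The example URL is made up.

```python
from urllib.parse import urljoin, urlparse

# The same function accepts str or bytes and answers in kind.
text_result = urlparse("http://example.com/path?q=1")
bytes_result = urlparse(b"http://example.com/path?q=1")

print(type(text_result.netloc).__name__, text_result.netloc)
print(type(bytes_result.netloc).__name__, bytes_result.netloc)

# Mixing the two in one call is rejected rather than guessed at.
mixed_error = None
try:
    urljoin(b"http://example.com/", "path")
except TypeError as exc:
    mixed_error = exc
print("mixing str and bytes raised:", type(mixed_error).__name__)
```

The design choice on display is that bytes stay bytes and text stays text; the polymorphism lives in the function, not in making bytes behave more like str.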
On Mon, May 23, 2011 at 1:46 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
If we did go with an ord() literal, I would actually favour something more like 0'b'. However, as Maciej pointed out off-list, adding a new literal type because calls to builtin functions have a relatively high overhead in CPython even with constant arguments probably isn't a good idea. Better to just write "ord('b')" and use PyPy to make it fast (Alternative for use with -O rather than PyPy: "ordb = 98; assert ordb == ord('b')"). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
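A rough sketch of the "-O" alternative quoted above. The constant name ORD_B and the counting function are hypothetical; the point is only that the literal 98 is what runs in the loop, while the assert documents (and, without -O, checks) where the number came from.

```python
# Hard-coded value used at run time.
ORD_B = 98
# Sanity check and documentation; assert statements are skipped under -O.
assert ORD_B == ord("b")


def count_b(data: bytes) -> int:
    """Count the bytes in data equal to ORD_B."""
    # Each item from iterating bytes is an int, so no conversion is needed.
    return sum(1 for code in data if code == ORD_B)


print(count_b(b"bytes and bobs"))
```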
On Mon, May 23, 2011 at 5:33 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Not a nuisance enough to warrant a syntax change, IMO. Note that one of the proposed alternatives, 0'b', is visually very similar to b'x'[0]. There are plenty of other options available to users. My own favorite is probably,
if bytesdata[i] == 98: # ord('b')
    ..
In some cases, when single-byte values have protocol mnemonics, it may be more appropriate to give them descriptive names:
quit_code = ord('q')
if bytesdata[i] == quit_code:
    ..
Finally, I find it rare to have single-byte codes at fixed positions in protocols. More often such codes are found after splitting the bytes data on some kind of separator.
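To illustrate the last two points together, here is a sketch of a made-up protocol in which a one-byte command code precedes a b"|" separator. The separator, the message layout and the function names are all assumptions for the example; only the quit_code = ord('q') idea comes from the message.

```python
QUIT_CODE = ord("q")  # descriptive name for a single-byte protocol code


def is_quit_by_index(message: bytes) -> bool:
    """Check the command by indexing, which yields an int."""
    command = message.split(b"|")[0]
    return len(command) == 1 and command[0] == QUIT_CODE


def is_quit_by_field(message: bytes) -> bool:
    """Check the command by comparing the split field, which is still bytes."""
    return message.split(b"|")[0] == b"q"


for sample in (b"q|now", b"x|now", b""):
    print(sample, is_quit_by_index(sample), is_quit_by_field(sample))
```

Once the data has been split, the fields are bytes objects and compare directly against bytes literals, which is why the int-versus-bytes mismatch comes up less often than fixed-position indexing would suggest.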
I like c'x'. It's easy to read and very explicitly constant and clear what the value is 'x'. (Some other letter instead of 'c' would be fine as well.) I don't like this:
if bytesdata[i] == 121: # ord('x')
because it looks a heck of a lot like:
if bytesdata[i] == 120: # ord('x')
and only one of those is correct. That's a very easy bug to miss. I like it even less without the comment. I don't care for:
if bytesdata[i] == ord('x'):
because while ord is a builtin, it's not invulnerable to being changed. In contrast, string constants and numbers are truly constant. I recognize that the compiler can optimize:
if bytesdata[i] == b'x'[0]:
but that looks like chicken scratches to me. Someone suggested using 0'x' which I don't quite get. It looks too much like 0x to me and I've always read the leading zero to mean 'this is a number'. Also, this was raised in the context of bytes and not all characters fit in a byte. So c'Δ' and ord('Δ') work but b'Δ'[0] won't. Is there a learning curve? Yes, but minor IMHO and if you don't know it, it's obvious when you see it that you don't know it. --- Bruce
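A short check of the two claims in this message, run against current Python 3: which of 120 and 121 is really ord('x'), and what happens with a non-ASCII character. The use of compile() is just a way to show the bytes-literal error without breaking the script itself.

```python
# Which magic number is right? ord() settles it.
print(ord("x") == 120, ord("x") == 121)

# ord() works for any character, including ones that do not fit in a byte.
delta_code_point = ord("Δ")
print(delta_code_point)

# As bytes, the character's value depends on the encoding chosen.
delta_utf8 = list("Δ".encode("utf-8"))
print(delta_utf8)

# A bytes literal may only contain ASCII characters, so b'Δ' does not compile.
literal_error = None
try:
    compile("b'Δ'", "<example>", "eval")
except SyntaxError as exc:
    literal_error = exc
print("b'Δ' raised:", type(literal_error).__name__)
```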
2011/5/23 Bruce Leban <bruce@leapyear.org>:
I like c'x'. It's easy to read and very explicitly constant and clear what the value is 'x'. (Some other letter instead of 'c' would be fine as well.)
-0 from me, mainly because unlike b'..' or r'..' constructs, no meaning is proposed for c'xyz'. BTW, is it too soon to assign new meaning to back-quotes? In py3k they no longer stand for repr(), so we can probably reuse them for ord()? On the other hand, this is likely to be a bad idea for the same reasons as syntax for repr() was.
2011/5/23 Bruce Leban <bruce@leapyear.org>:
I like c'x'. It's easy to read and very explicitly constant and clear what the value is 'x'. (Some other letter instead of 'c' would be fine as well.)
We shouldn't add any new notation to create integers from characters to the language. It's too small a use case for adding new syntax. I would focus on agreeing on the notation that is most readable; personally I vote for ord('x'). -- --Guido van Rossum (python.org/~guido)
Bruce Leban writes:
Using named constants should fix that, and is better style anyway.
Someone suggested using 0'x' which I don't quite get. It looks too much like 0x to me
True but minor, IMO. YMMV.
and I've always read the leading zero to mean 'this is a number'.
That's precisely Nick's point in suggesting it!
On Tue, May 24, 2011 at 12:40 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Indeed :) Still, I've come around to the point of view that the simplest and clearest way to write it is simply "ord('x')", and if that is in a time-critical inner loop, save the value in a named variable. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (13)
- Alexander Belopolsky
- Bill Janssen
- Bruce Leban
- Darren Dale
- Devin Jeanpierre
- Ethan Furman
- Greg Ewing
- Guido van Rossum
- Nick Coghlan
- Paul Moore
- Stefan Behnel
- Stephen J. Turnbull
- Terry Reedy