bytes type discussion
I'm about to send 6 or 8 replies to various salient messages in the PEP 332 revival thread. That's probably a sign that there's still a lot to be sorted out. In the mean time, to save you reading through all those responses, here's a summary of where I believe I stand. Let's continue the discussion in this new thread unless there are specific hairs to be split in the other thread that aren't addressed below or by later posts.

Non-controversial (or almost):

- we need a new PEP; PEP 332 won't cut it
- no b"..." literal
- bytes objects are mutable
- bytes objects are composed of ints in range(256)
- you can pass any iterable of ints to the bytes constructor, as long as they are in range(256)
- longs or anything with an __index__ method should do, too
- when you index a bytes object, you get a plain int
- repr(bytes([10, 20, 30])) == 'bytes([10, 20, 30])'

Somewhat controversial:

- it's probably too big to attempt to rush this into 2.5
- bytes("abc") == bytes(map(ord, "abc"))
- bytes("\x80\xff") == bytes(map(ord, "\x80\xff")) == bytes([128, 256])

Very controversial:

- bytes("abc", "encoding") == bytes("abc")  # ignores the "encoding" argument
- bytes(u"abc") == bytes("abc")  # for ASCII at least
- bytes(u"\x80\xff") raises UnicodeError
- bytes(u"\x80\xff", "latin-1") == bytes("\x80\xff")

Martin von Loewis's alternative for the "very controversial" set is to disallow an encoding argument and (I believe) also to disallow Unicode arguments. In 3.0 this would leave us with s.encode(<encoding>) as the only way to convert a string (which is always unicode) to bytes. The problem with this is that there's no code that works in both 2.x and 3.0.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
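[Editor's note: the "non-controversial" list above is close to what eventually shipped as Python's bytearray type. A minimal sketch in a modern interpreter, using today's names rather than the API proposed in this thread:]

```python
# Sketch of the "non-controversial" points, shown with today's bytearray
# (the eventual form of the mutable bytes type; modern names, not 2006's).
ba = bytearray([10, 20, 30])   # any iterable of ints in range(256)
assert ba[0] == 10             # indexing yields a plain int
ba[0] = 255                    # mutable in place
assert list(ba) == [255, 20, 30]

# values outside range(256) are rejected
try:
    bytearray([256])
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for out-of-range byte")
```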
On Feb 14, 2006, at 3:13 PM, Guido van Rossum wrote:
I'm about to send 6 or 8 replies to various salient messages in the PEP 332 revival thread. That's probably a sign that there's still a lot to be sorted out. In the mean time, to save you reading through all those responses, here's a summary of where I believe I stand. Let's continue the discussion in this new thread unless there are specific hairs to be split in the other thread that aren't addressed below or by later posts.
Non-controversial (or almost):
- we need a new PEP; PEP 332 won't cut it
- no b"..." literal
- bytes objects are mutable
- bytes objects are composed of ints in range(256)
- you can pass any iterable of ints to the bytes constructor, as long as they are in range(256)
Sounds like array.array('B'). Will the bytes object support the buffer interface? Will it accept objects supporting the buffer interface in the constructor (or a class method)? If so, will it be a copy or a view? Current array.array behavior says copy.
- longs or anything with an __index__ method should do, too
- when you index a bytes object, you get a plain int
When slicing a bytes object, do you get another bytes object or a list? If it's a bytes object, is it a copy or a view? Current array.array behavior says copy.
- repr(bytes([10, 20, 30])) == 'bytes([10, 20, 30])'
Somewhat controversial:
- it's probably too big to attempt to rush this into 2.5
- bytes("abc") == bytes(map(ord, "abc"))
- bytes("\x80\xff") == bytes(map(ord, "\x80\xff")) == bytes([128, 256])
It would be VERY controversial if ord('\xff') == 256 ;)
Very controversial:
- bytes("abc", "encoding") == bytes("abc") # ignores the "encoding" argument
- bytes(u"abc") == bytes("abc") # for ASCII at least
- bytes(u"\x80\xff") raises UnicodeError
- bytes(u"\x80\xff", "latin-1") == bytes("\x80\xff")
Martin von Loewis's alternative for the "very controversial" set is to disallow an encoding argument and (I believe) also to disallow Unicode arguments. In 3.0 this would leave us with s.encode(<encoding>) as the only way to convert a string (which is always unicode) to bytes. The problem with this is that there's no code that works in both 2.x and 3.0.
Given a base64 or hex string, how do you get a bytes object out of it? Currently str.decode('base64') and str.decode('hex') are good solutions to this... but you get a str object back. -bob
On 2/14/06, Bob Ippolito <bob@redivi.com> wrote:
On Feb 14, 2006, at 3:13 PM, Guido van Rossum wrote:
- we need a new PEP; PEP 332 won't cut it
- no b"..." literal
- bytes objects are mutable
- bytes objects are composed of ints in range(256)
- you can pass any iterable of ints to the bytes constructor, as long as they are in range(256)
Sounds like array.array('B').
Sure.
Will the bytes object support the buffer interface?
Do you want them to? I suppose they should *not* support the *text* part of that API.
Will it accept objects supporting the buffer interface in the constructor (or a class method)? If so, will it be a copy or a view? Current array.array behavior says copy.
bytes() should always copy -- thanks for asking.
- longs or anything with an __index__ method should do, too
- when you index a bytes object, you get a plain int
When slicing a bytes object, do you get another bytes object or a list? If it's a bytes object, is it a copy or a view? Current array.array behavior says copy.
Another bytes object which is a copy. (Why would you even think about views here? They are evil.)
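[Editor's note: the copy-on-slice behavior decided here can be checked against today's bytearray, the eventual form of the mutable bytes type; modern API, an assumption relative to this 2006 thread:]

```python
ba = bytearray(b"abcdef")
piece = ba[1:4]                 # slicing returns a new bytearray...
assert isinstance(piece, bytearray)
piece[0] = ord("X")
assert ba == bytearray(b"abcdef")   # ...and it is a copy, not a view
```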
- repr(bytes([10, 20, 30])) == 'bytes([10, 20, 30])'
Somewhat controversial:
- it's probably too big to attempt to rush this into 2.5
- bytes("abc") == bytes(map(ord, "abc"))
- bytes("\x80\xff") == bytes(map(ord, "\x80\xff")) == bytes([128, 256])
It would be VERY controversial if ord('\xff') == 256 ;)
Oops. :-)
Very controversial:
- bytes("abc", "encoding") == bytes("abc") # ignores the "encoding" argument
- bytes(u"abc") == bytes("abc") # for ASCII at least
- bytes(u"\x80\xff") raises UnicodeError
- bytes(u"\x80\xff", "latin-1") == bytes("\x80\xff")
Martin von Loewis's alternative for the "very controversial" set is to disallow an encoding argument and (I believe) also to disallow Unicode arguments. In 3.0 this would leave us with s.encode(<encoding>) as the only way to convert a string (which is always unicode) to bytes. The problem with this is that there's no code that works in both 2.x and 3.0.
Given a base64 or hex string, how do you get a bytes object out of it? Currently str.decode('base64') and str.decode('hex') are good solutions to this... but you get a str object back.
I don't know -- you can propose an API you like here. base64 is as likely to encode text as binary data, so I don't think it's wrong for those things to return strings. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Feb 14, 2006, at 4:17 PM, Guido van Rossum wrote:
On 2/14/06, Bob Ippolito <bob@redivi.com> wrote:
On Feb 14, 2006, at 3:13 PM, Guido van Rossum wrote:
- we need a new PEP; PEP 332 won't cut it
- no b"..." literal
- bytes objects are mutable
- bytes objects are composed of ints in range(256)
- you can pass any iterable of ints to the bytes constructor, as long as they are in range(256)
Sounds like array.array('B').
Sure.
Will the bytes object support the buffer interface?
Do you want them to?
I suppose they should *not* support the *text* part of that API.
I would imagine that it'd be convenient for integrating with existing extensions... e.g. initializing an array or Numeric array with one.
Will it accept objects supporting the buffer interface in the constructor (or a class method)? If so, will it be a copy or a view? Current array.array behavior says copy.
bytes() should always copy -- thanks for asking.
I only really ask because it's worth fully specifying these things. Copy seems a lot more sensible given the rest of the interpreter and stdlib (e.g. buffer(x) seems to always return a read-only buffer).
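[Editor's note: as it later played out, copying stayed the default and views became a separate, explicit type. A sketch with today's memoryview, which post-dates this thread and is an assumption here:]

```python
ba = bytearray(b"abcd")
assert bytes(ba) == b"abcd"     # constructing bytes always copies

view = memoryview(ba)           # a view is a distinct, opt-in type
view[0] = ord("z")              # writes go through to the underlying buffer
assert ba == bytearray(b"zbcd")
```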
- longs or anything with an __index__ method should do, too
- when you index a bytes object, you get a plain int
When slicing a bytes object, do you get another bytes object or a list? If it's a bytes object, is it a copy or a view? Current array.array behavior says copy.
Another bytes object which is a copy.
(Why would you even think about views here? They are evil.)
I mention views because that's what numpy/Numeric/numarray/etc. do... It's certainly convenient at times to have that functionality, for example, to work with only the alpha channel in an RGBA image. Probably too magical for the bytes type.
>>> import numpy
>>> image = numpy.array(list('RGBARGBARGBA'))
>>> alpha = image[3::4]
>>> alpha
array([A, A, A], dtype=(string,1))
>>> alpha[:] = 'X'
>>> image
array([R, G, B, X, R, G, B, X, R, G, B, X], dtype=(string,1))
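[Editor's note: the same alpha-channel edit can be done on a mutable byte sequence without views, via extended-slice assignment. Shown with today's bytearray, an assumption relative to the 2006 proposal:]

```python
image = bytearray(b"RGBARGBARGBA")
image[3::4] = b"XXX"            # extended-slice assignment: copies in, no view
assert image == bytearray(b"RGBXRGBXRGBX")
```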
Very controversial:
- bytes("abc", "encoding") == bytes("abc") # ignores the "encoding" argument
- bytes(u"abc") == bytes("abc") # for ASCII at least
- bytes(u"\x80\xff") raises UnicodeError
- bytes(u"\x80\xff", "latin-1") == bytes("\x80\xff")
Martin von Loewis's alternative for the "very controversial" set is to disallow an encoding argument and (I believe) also to disallow Unicode arguments. In 3.0 this would leave us with s.encode(<encoding>) as the only way to convert a string (which is always unicode) to bytes. The problem with this is that there's no code that works in both 2.x and 3.0.
Given a base64 or hex string, how do you get a bytes object out of it? Currently str.decode('base64') and str.decode('hex') are good solutions to this... but you get a str object back.
I don't know -- you can propose an API you like here. base64 is as likely to encode text as binary data, so I don't think it's wrong for those things to return strings.
That's kinda true I guess -- but you'd still need an encoding in py3k to turn base64 -> text. A lot of the current codecs infrastructure doesn't make sense in py3k -- for example, the 'zlib' encoding, which is really a bytes transform, or 'unicode_escape', which is a text transform. I suppose there aren't too many different ways you'd want to encode or decode data to binary (beyond the text codecs); they should probably just live in a module -- something like the binascii we have now. I do find the codecs infrastructure convenient at times (maybe too convenient), but since you're not interested in adding functions to existing types, a module seems like the best approach. -bob
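[Editor's note: the module-based approach Bob sketches is roughly what survived. In a modern interpreter, binascii and zlib operate on bytes directly, with no codec machinery involved -- modern spellings, not anything specified in this thread:]

```python
import binascii
import zlib

# bytes <-> hex as plain module functions rather than codecs
assert binascii.hexlify(b"\x80\xff") == b"80ff"
assert binascii.unhexlify(b"80ff") == b"\x80\xff"

# zlib as a pure bytes-to-bytes transform, also a module function
payload = b"spam" * 100
assert zlib.decompress(zlib.compress(payload)) == payload
```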
Bob Ippolito wrote:
On Feb 14, 2006, at 4:17 PM, Guido van Rossum wrote:
(Why would you even think about views here? They are evil.)
I mention views because that's what numpy/Numeric/numarray/etc. do... It's certainly convenient at times to have that functionality, for example, to work with only the alpha channel in an RGBA image. Probably too magical for the bytes type.
The key difference between numpy arrays and normal sequences is that the length of a sequence can change, but the shape of a numpy array is essentially fixed. So view behaviour can be reserved for a dimensioned array type (if the numpy folks ever find the time to finish writing their PEP...).

Cheers, Nick.

-- 
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
http://www.boredomandlaziness.org
Bob Ippolito wrote:
Martin von Loewis's alternative for the "very controversial" set is to disallow an encoding argument and (I believe) also to disallow Unicode arguments. In 3.0 this would leave us with s.encode(<encoding>) as the only way to convert a string (which is always unicode) to bytes. The problem with this is that there's no code that works in both 2.x and 3.0.
Given a base64 or hex string, how do you get a bytes object out of it? Currently str.decode('base64') and str.decode('hex') are good solutions to this... but you get a str object back.
If s is a base64 string, bytes(s.decode("base64")) should work. In 2.x, it returns a str, which is then copied into bytes; in 3.x, .decode("base64") returns a byte string already (*), for which an extra copy is made. I would prefer to see base64.decodestring return bytes, though - perhaps even in 2.x already.

Regards, Martin

(*) Interestingly enough, the "base64" encoding will work reversed in terms of types, compared to all other encodings. Where .encode returns bytes normally, it will return a string for base64, and vice versa (assuming the bytes type has .decode/.encode methods).
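[Editor's note: Martin's preference is how things eventually settled. In the modern base64 module the decoder returns bytes and the encoder is bytes-to-bytes -- today's API, shown as an assumption relative to this thread:]

```python
import base64

assert base64.b64decode("aGVsbG8=") == b"hello"    # decoder returns bytes
assert base64.b64encode(b"hello") == b"aGVsbG8="   # encoder: bytes in, bytes out
```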
On Tue, Feb 14, 2006 at 03:13:25PM -0800, Guido van Rossum wrote:
Martin von Loewis's alternative for the "very controversial" set is to disallow an encoding argument and (I believe) also to disallow Unicode arguments. In 3.0 this would leave us with s.encode(<encoding>) as the only way to convert a string (which is always unicode) to bytes. The problem with this is that there's no code that works in both 2.x and 3.0.
Unless you only ever create (byte)strings by doing s.encode(), and only send them to code that is either byte/string-agnostic or -aware. Oh, and don't use indexing, only slicing (length-1 if you have to.) I guess it depends on how much code will accept a bytes-string where currently a string is the norm (and a unicode object is default-encoded.)

I'm still worried that all this is quite a big leap. Very few people understand the intricacies of unicode encodings. (Almost everyone understands unicode, except they don't know it yet; it's the encodings that are the problem.) By forcing everything to be unicode without a uniform encoding-detection scheme, we're forcing every programmer who opens a file or reads from the network to think about encodings. This will be a pretty big step for newbie programmers.

And it's not just that. The encoding of network streams or files may be entirely unknown beforehand, and depend on the content: a content-encoding, a <META HTTP-EQUIV> HTML tag. Will bytes-strings get string methods for easy searching of content descriptors? Will the 're' module accept bytes-strings? What would the literals you want to search for look like? Do I really do 'if bytes("Content-Type:") in data:' and such? Should data perhaps get read using the opentext() equivalent of 'decode('ascii', 'replace')' and then parsed the 'normal' way? What about data gotten from an extension? And never mind what the 'right way' for that is; what will *programmers* do? The 'right way' often escapes them.

It may well be that I'm thinking too conservatively, too stuck in the old ways, but I think we're being too hasty in dismissing the ol' string. Don't get me wrong, I really like the idea of as much of Python doing unicode as possible, and the idea of a mutable bytes type sounds good to me too. I just don't like the wide gap between the troublesome-to-get unicode object and the unreadable-repr, weird-indexing, hard-to-work-with bytes-string.
I don't think adding something in between is going to work (we basically have that now, the normal string), so I suggest the bytes-string becomes a bit more 'string' and a bit less 'sequence of bytes'. Perhaps in the form of:

- A bytes type that repr()'s to something readable
- A way to write byte literals that doesn't bleed the eyes, and isn't so fragile in the face of source-encoding (all the suggestions so far have you explicitly re-stating the source-encoding at each bytes("".encode())). If you have to wonder why that's fragile, just think about a recoding editor. Alternatively, get a short way to say 'encode in source-encoding'. (I can't think of anything better than b"..." for the above two... Except... hmm... didn't `` become available in Py3k? Too little visual distinction?)
- A way to manipulate the bytes as character-strings: pattern matching, splitting, finding, slicing, etc. Quite like current strings.
- Disallowing any interaction between bytes and real (meaning 'unicode') strings. Not "oh, let's assume ascii or the default encoding", either. If the user wants to explicitly decode using 'ascii', that's their choice, but they should consciously make it.
- Mutable or immutable, I don't know.

I fear that if the bytes type was easy enough to handle and mutable, and the normal (unicode) strings were immutable, people may end up using bytes all the time. In fact, they may do that anyway; I'm sure Python will grow entire subcults that prefer doing 'string("\xa1Python!")' where 'string' is 'bytes(arg.encode("iso-8859-1"))'. Bytes should be easy enough to manipulate 'as strings' to do the basic tasks, but not easy enough to encourage people to forget about that whole annoying 'encoding' business and just use them instead (which is basically what we have now.) On the other hand, if people don't want to deal with that whole encoding business, we should allow them to -- consciously.
We can offer a variety of hints and tips on how to figure out the encoding of something, but we can't do the thinking for them (trust me, I've tried.) When a file's encoding is specified in file metadata, that's great, really great. When a network connection is handled by a library that knows how to deal with the content (*cough*Twisted*cough*) and can decode it for you, that's really great too. But we're not there yet, not by a long shot.

And explaining encodings to an ADHD-infested teenager high on adrenalin and creative inspiration who just wants to connect to an IRC server to make his bot say "Hi!", well, that's hard. I'd rather they don't go and do PHP instead. Doing it right is hard, but it's even harder to do it all right the first time, and Python never really worried about that ;P

-- 
Thomas Wouters <thomas@xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Thomas Wouters wrote:
The encoding of network streams or files may be entirely unknown beforehand, and depend on the content: a content-encoding, a <META HTTP-EQUIV> HTML tag. Will bytes-strings get string methods for easy searching of content descriptors?
Seems to me this is a case where you want to be able to change encodings in the middle of reading the stream. You start off reading the data as ascii, and once you've figured out the encoding, you switch to that and carry on reading. Are there any plans to make it possible to change the encoding of a text file object on the fly like this?

If that would be awkward, maybe file objects themselves shouldn't be where the decoding occurs, but decoders should be separate objects that wrap byte streams. Under that model, opentext(filename, encoding) would be a factory function that did something like

    codecs.streamdecoder(encoding, openbinary(filename))

Having codecs be stream filters might be a good idea anyway, since then you could use them to wrap anything that can be treated as a stream of bytes (sockets, some custom object in your program, etc.), and you could create pipelines of encoders and decoders.

-- 
Greg Ewing, Computer Science Dept, | Carpe post meridiam!
University of Canterbury,          | (I'm not a morning person.)
Christchurch, New Zealand
greg.ewing@canterbury.ac.nz
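[Editor's note: Greg's model of a decoder object wrapping a byte stream is roughly how the later io module ended up working. A sketch with present-day names -- io.TextIOWrapper over a BytesIO -- which are modern assumptions, not anything proposed in this thread:]

```python
import io

# \xe9 and \xf6 are the characters é and ö
raw = io.BytesIO("h\xe9llo w\xf6rld".encode("utf-8"))  # any byte stream
text = io.TextIOWrapper(raw, encoding="utf-8")         # decoder wrapped around it
assert text.read() == "h\xe9llo w\xf6rld"
```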
On Tuesday 14 February 2006 22:34, Greg Ewing wrote:
Seems to me this is a case where you want to be able to change encodings in the middle of reading the stream. You start off reading the data as ascii, and once you've figured out the encoding, you switch to that and carry on reading.
Not quite. The proper response in this case is often to re-start decoding with the correct encoding, since some of the data extracted so far may have been decoded incorrectly. A very carefully constructed application may be able to go back and re-decode any data saved from the stream with the previous encoding, but that seems like it would be pretty fragile in practice. There may be cases where switching encoding on the fly makes sense, but I'm not aware of any actual examples of where that approach would be required. -Fred -- Fred L. Drake, Jr. <fdrake at acm.org>
Fred L. Drake, Jr. wrote:
The proper response in this case is often to re-start decoding with the correct encoding, since some of the data extracted so far may have been decoded incorrectly.
If the protocol has been sensibly designed, that shouldn't happen, since everything up to the coding marker should be ascii (or some other protocol-defined initial coding). For protocols that are not sensibly designed (or if you're just trying to guess) what you suggest may be needed. But it would be good to have a nicer way of going about it for when the protocol is sensible. Greg
On Wednesday 15 February 2006 01:44, Greg Ewing wrote:
If the protocol has been sensibly designed, that shouldn't happen, since everything up to the coding marker should be ascii (or some other protocol-defined initial coding).
Indeed.
For protocols that are not sensibly designed (or if you're just trying to guess) what you suggest may be needed. But it would be good to have a nicer way of going about it for when the protocol is sensible.
I agree in principle, but the example of using an HTML <meta> tag as a source of document encoding information isn't sensible. Unfortunately, it's still part of the HTML specification. :-( I'm not opposing a way to do a sensible thing, but wanted to note that it wasn't going to be right for all cases, with such an example having been mentioned already (though the issues with it had not been fully spelled out). -Fred -- Fred L. Drake, Jr. <fdrake at acm.org>
Greg Ewing wrote:
If the protocol has been sensibly designed, that shouldn't happen, since everything up to the coding marker should be ascii (or some other protocol-defined initial coding).
XML, for one protocol, requires you to start over. The initial sequence could be UTF-16, or it could be EBCDIC. You read a few bytes (up to four), then know which of these it is. Then you start over, reading further if it looks like an ASCII superset, to find out the real encoding. You normally then start over, although switching at that point could also work.
For protocols that are not sensibly designed (or if you're just trying to guess) what you suggest may be needed. But it would be good to have a nicer way of going about it for when the protocol is sensible.
There might be buffering of decoded strings already, (ie. beyond the point to which you have read), so you would need to unbuffer these, and reinterpret them. To support that, you really need to buffer both the original bytes, and the decoded ones, since the encoding might not roundtrip. Regards, Martin
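[Editor's note: Martin's XML example can be sketched as a toy classifier over the first few bytes, loosely following the autodetection appendix of the XML specification. The function name and the returned labels are hypothetical; a real detector handles more cases, such as the UTF-32 forms:]

```python
def sniff_xml_family(head):
    # Classify the first bytes of an XML document, then restart and
    # parse the declaration to learn the exact encoding.
    if head[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "utf-16"
    if head[:3] == b"\xef\xbb\xbf":
        return "utf-8"
    if head[:4] == b"\x4c\x6f\xa7\x94":   # '<?xm' in EBCDIC
        return "ebcdic"
    if head[:4] == b"<?xm":
        return "ascii-compatible"         # read the declaration next
    return "unknown"

assert sniff_xml_family(b"<?xml version='1.0'?>") == "ascii-compatible"
assert sniff_xml_family(b"\xfe\xff\x00<\x00?") == "utf-16"
assert sniff_xml_family(b"\xef\xbb\xbf<?xml") == "utf-8"
```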
On 2/14/06, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Fred L. Drake, Jr. wrote:
The proper response in this case is often to re-start decoding with the correct encoding, since some of the data extracted so far may have been decoded incorrectly.
If the protocol has been sensibly designed, that shouldn't happen, since everything up to the coding marker should be ascii (or some other protocol-defined initial coding).
For protocols that are not sensibly designed (or if you're just trying to guess) what you suggest may be needed. But it would be good to have a nicer way of going about it for when the protocol is sensible.
I think that the implementation of encoding-guessing or auto-encoding-upgrade techniques should be left out of the standard library design for now. I know that XML does something like this, but fortunately we employ dedicated C code to parse XML so that particular case should be taken care of without complicating the rest of the standard I/O library. As far as searching bytes objects, that shouldn't be a problem as long as the search 'string' is also specified as a bytes object. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
"Guido" == Guido van Rossum <guido@python.org> writes:
Guido> I think that the implementation of encoding-guessing or
Guido> auto-encoding-upgrade techniques should be left out of the
Guido> standard library design for now.

As far as I can see, little new design is needed. There's no reason why an encoding-guesser couldn't be written as a codec that detects the coding, then dispatches to the appropriate codec. The only real issue I know of is that if you ask such a codec "who are you?", there are two plausible answers: "autoguess" and the codec actually being used to translate the stream. If there's no API to ask for both of those, the API might want generalization.

Guido> As far as searching bytes objects, that shouldn't be a
Guido> problem as long as the search 'string' is also specified as
Guido> a bytes object.

You do need to be a little careful in implementation, as (for example) "case insensitive" should be meaningless for searching bytes objects. This would be especially important if searching and collation become more Unicode conformant.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business; ask what your business can "do for" free software.
"Fred" == Fred L Drake, <fdrake@acm.org> writes:
Fred> On Tuesday 14 February 2006 22:34, Greg Ewing wrote:
>> Seems to me this is a case where you want to be able to change
>> encodings in the middle of reading the stream. You start off
>> reading the data as ascii, and once you've figured out the
>> encoding, you switch to that and carry on reading.
Fred> Not quite. The proper response in this case is often to
Fred> re-start decoding with the correct encoding, since some of
Fred> the data extracted so far may have been decoded incorrectly.
Fred> A very carefully constructed application may be able to go
Fred> back and re-decode any data saved from the stream with the
Fred> previous encoding, but that seems like it would be pretty
Fred> fragile in practice.

I believe GNU Emacs is currently doing this. AIUI, they save annotations where the codec is known to be non-invertible (eg, two charset-changing escape sequences in a row). I do think this is fragile, and a robust application really should buffer everything it's not sure of decoding correctly.

Fred> There may be cases where switching encoding on the fly makes
Fred> sense, but I'm not aware of any actual examples of where
Fred> that approach would be required.

This is exactly what ISO 2022 formalizes: switching encodings on the fly. mboxes of Japanese mail often contain random and unsignaled encoding changes. A terminal emulator may need to switch when logging in to a remote system.
Raymond Hettinger wrote:
- bytes("abc") == bytes(map(ord, "abc"))
At first glance, this seems obvious and necessary, so if it's somewhat controversial, then I'm missing something. What's the issue?
There is an "implicit Latin-1" assumption in that code. Suppose you do

# -*- coding: koi8-r -*-
print bytes("Гвидо ван Россум")

in Python 2.x; then this means something (*). In Python 3, it gives you an exception, as the ordinals of this are suddenly above 256.

Or, perhaps worse, the code

# -*- coding: utf-8 -*-
print bytes("Martin v. Löwis")

will work in 2.x and 3.x, but produce different numbers (**).

Regards, Martin

(*) [231, 215, 201, 196, 207, 32, 215, 193, 206, 32, 242, 207, 211, 211, 213, 205]

(**) In 2.x, this will give [77, 97, 114, 116, 105, 110, 32, 118, 46, 32, 76, 195, 182, 119, 105, 115] whereas in 3.x, it will give [77, 97, 114, 116, 105, 110, 32, 118, 46, 32, 76, 246, 119, 105, 115]
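[Editor's note: the two footnoted byte sequences are easy to verify in a present-day interpreter, where the encoding is always explicit instead of inherited from the source declaration:]

```python
s = "Martin v. L\xf6wis"        # \xf6 is the character ö
assert list(s.encode("utf-8")) == [
    77, 97, 114, 116, 105, 110, 32, 118, 46, 32, 76, 195, 182, 119, 105, 115]
assert list(s.encode("latin-1")) == [
    77, 97, 114, 116, 105, 110, 32, 118, 46, 32, 76, 246, 119, 105, 115]
```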
On 2/14/06, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Raymond Hettinger wrote:
- bytes("abc") == bytes(map(ord, "abc"))
At first glance, this seems obvious and necessary, so if it's somewhat controversial, then I'm missing something. What's the issue?
There is an "implicit Latin-1" assumption in that code. Suppose you do
# -*- coding: koi8-r -*-
print bytes("Гвидо ван Россум")
in Python 2.x, then this means something (*). In Python 3, it gives you an exception, as the ordinals of this are suddenly above 256.
Or, perhaps worse, the code
# -*- coding: utf-8 -*-
print bytes("Martin v. Löwis")
will work in 2.x and 3.x, but produce different numbers (**).
My assumption is these would become errors in 3.x. bytes(str) is only needed so you can do bytes(u"abc".encode('utf-8')) and have it work in 2.x and 3.x. (I wonder if maybe they should be an error in 2.x as well. Source encoding is for unicode literals, not str literals.) -- Adam Olsen, aka Rhamphoryncus
Adam Olsen wrote:
My assumption is these would become errors in 3.x. bytes(str) is only needed so you can do bytes(u"abc".encode('utf-8')) and have it work in 2.x and 3.x.
I think the proposal for bytes(seq) to mean bytes(map(ord, seq)) was meant to be valid for both 2.x and 3.x, on the grounds that you should be able to write byte string constants in the same way in all versions.
(I wonder if maybe they should be an error in 2.x as well. Source encoding is for unicode literals, not str literals.)
Source encoding applies to the entire source code, including (byte) string literals, comments, identifiers, and keywords. IOW, if you declare your source encoding is utf-8, the keyword "print" must be represented with the bytes that represent the Unicode letters for "p","r","i","n", and "t" in UTF-8. Regards, Martin
On 2/15/06, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Adam Olsen wrote:
(I wonder if maybe they should be an error in 2.x as well. Source encoding is for unicode literals, not str literals.)
Source encoding applies to the entire source code, including (byte) string literals, comments, identifiers, and keywords. IOW, if you declare your source encoding is utf-8, the keyword "print" must be represented with the bytes that represent the Unicode letters for "p","r","i","n", and "t" in UTF-8.
Although it does apply to the entire source file, I think this is more for convenience (try telling an editor that only a single line is Shift_JIS!) than to allow 8-bit (or 16-bit?!) str literals. Indeed, you could have arbitrary 8-bit str literals long before the source encoding was added. Keywords and identifiers continue to be limited to ascii characters (even if they make a roundtrip through other encodings), and comments continue to be ignored. Source encoding exists so that you can write u"123" with the encoding stated once at the top of the file, rather than "123".decode('utf-8') with the encoding repeated everywhere. Making it an error to have 8-bit str literals in 2.x would help educate the user that they will change behavior in 3.0 and not be 8-bit str literals anymore. -- Adam Olsen, aka Rhamphoryncus
Adam Olsen wrote:
Making it an error to have 8-bit str literals in 2.x would help educate the user that they will change behavior in 3.0 and not be 8-bit str literals anymore.
You would like to ban string literals from the language? Remember: all string literals are currently 8-bit (byte) strings. Regards, Martin
On 2/15/06, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Adam Olsen wrote:
Making it an error to have 8-bit str literals in 2.x would help educate the user that they will change behavior in 3.0 and not be 8-bit str literals anymore.
You would like to ban string literals from the language? Remember: all string literals are currently 8-bit (byte) strings.
That's a rather literal interpretation of what I said. ;) What I meant was to only accept 7-bit characters, namely ascii. -- Adam Olsen, aka Rhamphoryncus
On Tue, 14 Feb 2006 19:41:07 -0500, "Raymond Hettinger" <python@rcn.com> wrote:
[Guido van Rossum]
Somewhat controversial:
- bytes("abc") == bytes(map(ord, "abc"))
At first glance, this seems obvious and necessary, so if it's somewhat controversial, then I'm missing something. What's the issue?
ord("x") gets the source encoding's ord value of "x", but if that is not unicode or latin-1, it will break when Py 3000 makes "x" unicode. This means until Py 3000, plain str string literals have to use ascii and escapes in order to preserve the meaning when "x" == u"x".

But the good news is that bytes(map(ord, u"x")) works fine for any source encoding now or after Py 3000. You just have to type characters into your editor between the quotes that look on the screen like any of the first 256 unicode characters (or use ascii escapes for unshowables). The u"x" translates x into unicode according to the *character* of x, whatever the source encoding, so all you have to do is choose characters of the first 256 unicodes. This happens to be latin-1, but you can ignore that unless you are interested in the actual byte values. If they have byte meaning, escapes are clearer anyway, and they work in a unicode string (where "x".decode(source_encoding) might fail on an illegal character).

The solution is to use u"x" for now or use ascii-only with escapes, and just map ord on either kind of string. This should work when u"x" becomes equivalent to "x". The unicode that comes from a current u"x" string defines a *character* sequence. If you use legal latin-1 *characters* in whatever source encoding your editor and coding cookie say, you will get the *characters* you see inside the quotes in the u"..." literal translated to unicode, and the first 256 characters of unicode happen to be the latin-1 set, so map ord just works.

With a unicode string you don't have to think about encoding, just use ord/unichr in range(0, 256). Hex escapes within unicode strings work as expected, so IMO it's pretty clean. I think I have shown this in a couple of other posts in the original thread (where I created and compiled source code in several encodings including utf-8, compiled with coding cookies, and exec'd the result). I could always have overlooked something, but I am hopeful.

Regards, Bengt Richter
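[Editor's note: Bengt's central claim -- that ord over a unicode literal is source-encoding-independent, and that for the first 256 code points the values coincide with Latin-1 bytes -- checks out directly in a modern interpreter:]

```python
s = u"\x80\xff"                  # code points 128 and 255, whatever the source encoding
assert [ord(c) for c in s] == [128, 255]
assert s.encode("latin-1") == b"\x80\xff"   # first 256 code points ARE Latin-1
```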
Guido van Rossum wrote:
I'm about to send 6 or 8 replies to various salient messages in the PEP 332 revival thread. That's probably a sign that there's still a lot to be sorted out. In the mean time, to save you reading through all those responses, here's a summary of where I believe I stand. Let's continue the discussion in this new thread unless there are specific hairs to be split in the other thread that aren't addressed below or by later posts.
I hope bytes objects will be pickle-able? If so, and they support the buffer protocol, then many NumPy users will be very happy. -Travis
On Tue, 14 Feb 2006 15:13:25 -0800, Guido van Rossum <guido@python.org> wrote:
I'm about to send 6 or 8 replies to various salient messages in the PEP 332 revival thread. That's probably a sign that there's still a lot to be sorted out. In the mean time, to save you reading through all those responses, here's a summary of where I believe I stand. Let's continue the discussion in this new thread unless there are specific hairs to be split in the other thread that aren't addressed below or by later posts.
Non-controversial (or almost):
- we need a new PEP; PEP 332 won't cut it
- no b"..." literal
- bytes objects are mutable
- bytes objects are composed of ints in range(256)
- you can pass any iterable of ints to the bytes constructor, as long as they are in range(256)
- longs or anything with an __index__ method should do, too
- when you index a bytes object, you get a plain int
- repr(bytes([10, 20, 30])) == 'bytes([10, 20, 30])'
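[Editorial aside: most of the "non-controversial" points above describe behavior that can be tried today with Python 3's bytearray — a rough analogue of the proposed type, though its repr differs from the one proposed:]

```python
b = bytearray([10, 20, 30])   # constructor takes any iterable of ints
b[0] = 11                     # bytearray objects are mutable
assert b[0] == 11             # indexing yields a plain int, not a length-1 slice
assert list(b) == [11, 20, 30]
try:
    bytearray([256])          # values must be in range(256)
except ValueError:
    pass
```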
Somewhat controversial:
- it's probably too big to attempt to rush this into 2.5
- bytes("abc") == bytes(map(ord, "abc"))
- bytes("\x80\xff") == bytes(map(ord, "\x80\xff")) == bytes([128, 255])
Very controversial:
Given that ord/unichr and ord/chr work as encoding-agnostic function pairs, symmetrically mapping between unicode and int, or str and int, please consider the effect of this API, as illustrated by how it works with the examples:
    def bytes(arg, encoding=None):
        if isinstance(arg, str):
            if encoding:
                b = map(ord, arg.decode(encoding))
            else:
                b = map(ord, arg)
        elif isinstance(arg, unicode):
            if encoding:
                raise ValueError(
                    'Use bytes(%r.encode(%r)) to avoid PY 3000 breakage' % (arg, encoding))
            b = map(ord, arg)
        else:
            b = map(int, arg)
        if sum(1 for x in b if x < 0 or x > 255) > 0:
            raise ValueError('byte out of range')
        return 'bytes(%r)' % b
Then
- bytes("abc", "encoding") == bytes("abc") # ignores the "encoding" argument

(Use the encoding; the only requirement is that all the resulting ord values be in range(0, 256).)
    >>> bytes("abc\xf6", 'latin-1')
    'bytes([97, 98, 99, 246])'
    >>> print unichr(246)
    ö
    >>> bytes("abc\xf6", 'cp437')
    'bytes([97, 98, 99, 247])'
    >>> print unichr(247)
    ÷
- bytes(u"abc") == bytes("abc") # for ASCII at least
    >>> bytes(u"abc")
    'bytes([97, 98, 99])'
- bytes(u"\x80\xff") raises UnicodeError
    >>> bytes(u"\x80\xff")
    'bytes([128, 255])'
- bytes(u"\x80\xff", "latin-1") == bytes("\x80\xff")
    >>> bytes(u"\x80\xff", "latin-1")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 6, in bytes
    ValueError: Use bytes(u'\x80\xff'.encode('latin-1')) to avoid PY 3000 breakage
    >>> bytes(u'\x80\xff'.encode('latin-1'))
    'bytes([128, 255])'
(If the characters exist in the specified encoding, it will work; otherwise it raises an exception. This assumes a Py 3000 string encode results in bytes, so it should work there too ;-) And of course,
    >>> bytes(u'\u1234')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 12, in bytes
    ValueError: byte out of range

and

    >>> bytes([1,2])
    'bytes([1, 2])'
    >>> bytes([1,-1])
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 12, in bytes
    ValueError: byte out of range
    >>> bytes([1,256])
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "<stdin>", line 12, in bytes
    ValueError: byte out of range
Interestingly, the internal map int on a sequence permits
    >>> bytes(["1", 2, 3L, True, 5.6])
    'bytes([1, 2, 3, 1, 5])'
IOW, any sequence of objects that will convert themselves to int in range(0,256) will do.
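[Editorial aside: for comparison, the bytes/bytearray types that eventually landed in Python 3 turned out stricter than this prototype — they require real integers (via __index__, per the bullet list above) rather than anything int()-convertible:]

```python
assert bytes([1, 2, 3]) == b'\x01\x02\x03'
assert bytes([True]) == b'\x01'   # bool is an int subclass, so it qualifies
try:
    bytes([5.6])                  # no implicit int() coercion of floats
except TypeError:
    pass
try:
    bytes(["1"])                  # and no str-to-int coercion either
except TypeError:
    pass
```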
Martin von Loewis's alternative for the "very controversial" set is to disallow an encoding argument and (I believe) also to disallow Unicode arguments. In 3.0 this would leave us with s.encode(<encoding>) as the only way to convert a string (which is always unicode) to bytes. The problem with this is that there's no code that works in both 2.x and 3.0.
I hope Martin will reconsider, considering ord/unichr as a symmetric pair of functions mapping 1:1 to unicode (and ignoring the fact that this also happens to be the latin-1 mapping ;-)

A test class should be easy, except for deciding on appropriate methods and how the type should be defined. It's the same peculiar problem as str, i.e., length one would be compatible with int, but not other lengths. How do we do that?

Regards,
Bengt Richter
Guido van Rossum wrote:
- it's probably too big to attempt to rush this into 2.5
After reading some of the discussion, and seeing some of the arguments, I'm beginning to feel that we need working code to get this right. It would be nice if we could get a bytes() type into the first alpha, so the design can get some real-world exposure in real-world apps/libs before 2.5 final. </F>
On Wed, Feb 15, 2006 at 11:28:59PM +0100, Fredrik Lundh wrote:
After reading some of the discussion, and seeing some of the arguments, I'm beginning to feel that we need working code to get this right.
It would be nice if we could get a bytes() type into the first alpha, so the design can get some real-world exposure in real-world apps/libs before 2.5 final.
I agree that working code would be nice, but I don't see why it should be in an alpha release. IMHO it shouldn't be in an alpha release until it at least looks good enough for the developers, and good enough to put in a PEP.

-- 
Thomas Wouters <thomas@xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Thomas Wouters wrote:
After reading some of the discussion, and seeing some of the arguments, I'm beginning to feel that we need working code to get this right.
It would be nice if we could get a bytes() type into the first alpha, so the design can get some real-world exposure in real-world apps/libs before 2.5 final.
I agree that working code would be nice, but I don't see why it should be in an alpha release. IMHO it shouldn't be in an alpha release until it at least looks good enough for the developers, and good enough to put in a PEP.
I'm not convinced that the PEP will be good enough without experience from using a bytes type in *real-world* (i.e. *existing*) byte-crunching applications. if we put it in an early alpha, we can use it with real code, fix any issues that arise, and even remove it if necessary, before 2.5 final. if it goes in late, we'll be stuck with whatever the PEP says. </F>
I'm actually assuming to put this off until 2.6 anyway. On 2/15/06, Fredrik Lundh <fredrik@pythonware.com> wrote:
Thomas Wouters wrote:
After reading some of the discussion, and seeing some of the arguments, I'm beginning to feel that we need working code to get this right.
It would be nice if we could get a bytes() type into the first alpha, so the design can get some real-world exposure in real-world apps/libs before 2.5 final.
I agree that working code would be nice, but I don't see why it should be in an alpha release. IMHO it shouldn't be in an alpha release until it at least looks good enough for the developers, and good enough to put in a PEP.
I'm not convinced that the PEP will be good enough without experience from using a bytes type in *real-world* (i.e. *existing*) byte-crunching applications.
if we put it in an early alpha, we can use it with real code, fix any issues that arise, and even remove it if necessary, before 2.5 final. if it goes in late, we'll be stuck with whatever the PEP says.
</F>
-- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Thu, Feb 16, 2006 at 06:13:53PM +0100, Fredrik Lundh wrote:
Barry Warsaw wrote:
We know at least there will never be a 2.10, so I think we still have time.
because there's no way to count to 10 if you only have one digit?
we used to think that back when the gas price was just below 10 SEK/L, but they found a way...
Of course they found a way. The alternative was cutting taxes. wish-I-was-winking, -Jack
Fredrik Lundh wrote:
Barry Warsaw wrote:
We know at least there will never be a 2.10, so I think we still have time.
because there's no way to count to 10 if you only have one digit?
we used to think that back when the gas price was just below 10 SEK/L, but they found a way...
IIRC Guido is on record as saying "There will be no Python 2.10 because I hate the ambiguity of double-digit minor release numbers", or words to that effect.

regards
Steve

-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC     www.holdenweb.com
PyCon TX 2006      www.python.org/pycon/
On Fri, 17 Feb 2006 00:43:50 -0500, Steve Holden <steve@holdenweb.com> wrote:
Fredrik Lundh wrote:
Barry Warsaw wrote:
We know at least there will never be a 2.10, so I think we still have time.
because there's no way to count to 10 if you only have one digit?
we used to think that back when the gas price was just below 10 SEK/L, but they found a way...
IIRC Guido is on record as saying "There will be no Python 2.10 because I hate the ambiguity of double-digit minor release numbers", or words to that effect.
Hex? Regards, Bengt Richter
Bengt Richter wrote:
because there's no way to count to 10 if you only have one digit?
we used to think that back when the gas price was just below 10 SEK/L, but they found a way...
IIRC Guido is on record as saying "There will be no Python 2.10 because I hate the ambiguity of double-digit minor release numbers", or words to that effect.
Hex?
or roman numerals. I've paid X.35 SEK/L for gas... </F>
On Fri, 2006-02-17 at 00:43 -0500, Steve Holden wrote:
Fredrik Lundh wrote:
Barry Warsaw wrote:
We know at least there will never be a 2.10, so I think we still have time.
because there's no way to count to 10 if you only have one digit?
we used to think that back when the gas price was just below 10 SEK/L, but they found a way...
IIRC Guido is on record as saying "There will be no Python 2.10 because I hate the ambiguity of double-digit minor release numbers", or words to that effect.
I heard the same quote, so that's what I was referring to! -Barry
On Wed, 15 Feb 2006 15:20:16 -0800, Guido van Rossum <guido@python.org> wrote:
I'm actually assuming to put this off until 2.6 anyway.
On 2/15/06, Fredrik Lundh <fredrik@pythonware.com> wrote:
Thomas Wouters wrote:
After reading some of the discussion, and seeing some of the arguments, I'm beginning to feel that we need working code to get this right.
It would be nice if we could get a bytes() type into the first alpha, so the design can get some real-world exposure in real-world apps/libs before 2.5 final.
I agree that working code would be nice, but I don't see why it should be in an alpha release. IMHO it shouldn't be in an alpha release until it at least looks good enough for the developers, and good enough to put in a PEP.
I'm not convinced that the PEP will be good enough without experience from using a bytes type in *real-world* (i.e. *existing*) byte-crunching applications.
if we put it in an early alpha, we can use it with real code, fix any issues that arise, and even remove it if necessary, before 2.5 final. if it goes in late, we'll be stuck with whatever the PEP says.
</F>
I could hardly keep up with reading, never mind trying some things and writing coherently, so if others had that experience, 2.6 sounds +1. I agree with Fredrik that an implementation to try in real-world use cases would probably yield valuable information.

As a step in that direction, could we have a sub-thread on what methods to implement for bytes? I.e., which str methods make sense, and which special methods? How many methods from list make sense, given that bytes will be mutable? How much of array.array('B') should be emulated? (A prototype hack could just wrap array.array for storage.)

Should the type really be a subclass of int? I think that might be hard for prototyping, since builtin types as bases seem to get priority subclass-bypass access from some builtin functions. At least I've had some frustrations with that. If it were a kind of int, would it be an int-string, where int(bytes([65])) would work like ord does with non-length-1?

BTW, bytes([1,2])[1] by analogy to str should then return bytes([2]), shouldn't it? I have a feeling a lot of str-like methods will bomb if that's not so.
    >>> int(bytes([1,2]))  # faked ;-)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: int() expected a byte, but bytes of length 2 found
I've hacked a few pieces, but I think further discussion, either in this thread or maybe in a bytes prototype spec thread, would be fruitful. By the time a prototype spec takes shape, someone will probably have beaten me to something workable, but that's ok ;-)

Then a PEP will mostly be writing and collecting rationale references etc. That's really not my favorite kind of work, frankly. But I like thinking and programming.

Regards,
Bengt Richter
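[Editorial aside: along the lines of the "prototype hack could just wrap array.array for storage" suggestion above, a minimal sketch — the class name and repr are hypothetical, the repr chosen to match the format proposed earlier in the thread:]

```python
import array

class Bytes(object):
    """Toy mutable bytes prototype backed by array.array('B') (hypothetical)."""

    def __init__(self, ints=()):
        # array.array('B', ...) itself rejects values outside range(256)
        self._a = array.array('B', ints)

    def __len__(self):
        return len(self._a)

    def __getitem__(self, i):
        return self._a[i]          # indexing yields a plain int

    def __setitem__(self, i, v):
        self._a[i] = v

    def __iter__(self):
        return iter(self._a)

    def __eq__(self, other):
        return isinstance(other, Bytes) and self._a == other._a

    def __repr__(self):
        return 'bytes(%r)' % list(self._a)

b = Bytes([10, 20, 30])
b[0] = 11
assert repr(b) == 'bytes([11, 20, 30])'
assert b == Bytes([11, 20, 30])
```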
participants (16)
-
"Martin v. Löwis"
-
Adam Olsen
-
Barry Warsaw
-
Bob Ippolito
-
bokr@oz.net
-
Fred L. Drake, Jr.
-
Fredrik Lundh
-
Greg Ewing
-
Guido van Rossum
-
Jack Diederich
-
Nick Coghlan
-
Raymond Hettinger
-
Stephen J. Turnbull
-
Steve Holden
-
Thomas Wouters
-
Travis E. Oliphant