RE: [Python-Dev] Unifying Long Integers and Integers: baseint
Is there a plan for implementing a base class for int and long (like basestring for str and unicode):
>>> issubclass(int, baseint) and issubclass(long, baseint)
True
?
I think this would be a good idea; maybe the name should be baseinteger?
I would like to urge caution before making this change. Despite what the PEP may say, I actually think that creating a 'baseint' type is the WRONG design choice for the long term. I envision an eventual Python which has just one type, called 'int'. The fact that an efficient implementation is used when the ints are small and an arbitrary-precision version when they get too big would be hidden from the user by automatic promotion of overflow. (By "hidden" I mean the user doesn't need to care, not that they can't find out if they want to.) We are almost there already, but if people start coding to 'baseinteger' it takes us down a different path entirely. 'basestring' is a completely different issue -- there will always be a need for both unicode and 8-bit-strings as separate types. -- Michael Chermside
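The automatic promotion described above can be observed directly. This is only a minimal sketch, assuming a Python 3 interpreter (which postdates this thread and has a single integer type); the names small and big are illustrative.

```python
import sys

# Assumption: Python 3, where 'int' is the only built-in integer type.
small = sys.maxsize        # the largest Py_ssize_t value on this platform
big = small + 1            # too large for a fixed-width machine word

# Both values report the same type; the switch to arbitrary precision
# is hidden unless the user goes looking for it.
same_type = type(small) is type(big) is int
print(same_type, big > small)
```

Code written against 'int' alone keeps working on both sides of the overflow boundary, which is the property a separate 'baseint' would have made unnecessary.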
>> I think this would be a good idea; maybe the name should be
>> baseinteger?

Michael> I would like to urge caution before making this change. Despite
Michael> what the PEP may say, I actually think that creating a
Michael> 'baseint' type is the WRONG design choice for the long term. I
Michael> envision an eventual Python which has just one type, called
Michael> 'int'.

I agree. I made a suggestion that we consider the entire tree of numeric types, but I had int/long unification in the back of my mind as well. I will take /F's suggestion and poke around the peps when I have some time, but I see no pressing reason a base integer class, however it's spelled, needs to be added for 2.4.

Skip
Skip Montanaro wrote:
>> I think this would be a good idea; maybe the name should be
>> baseinteger?

Michael> I would like to urge caution before making this change. Despite
Michael> what the PEP may say, I actually think that creating a
Michael> 'baseint' type is the WRONG design choice for the long term. I
Michael> envision an eventual Python which has just one type, called
Michael> 'int'.
I agree. I made a suggestion that we consider the entire tree of numeric types,
Is it a good time to consider the entire tree of ALL types then? For example:

object
    number (tree of numeric types as suggested by Gareth): complex, real, (Decimal), float, rational, fraction, integer, int, bool, long
    sequence: buffer, basestring, str, unicode, list, tuple
    mapping: dict
    set
    frozenset
    file
but I had int/long unification in the back of my mind as well. I will take /F's suggestion and poke around the peps when I have some time, but I see no pressing reason a base integer class, however it's spelled, needs to be added for 2.4.
-- Dmitry Vasiliev (dima at hlabs.spb.ru) http://hlabs.spb.ru
Dmitry Vasiliev wrote:
Is it a good time to consider the entire tree of ALL types then?
Please, no.

One of the best things about Python is that, for most purposes, you can get the appropriate reaction from the standard library simply by providing the relevant magic methods (e.g. __iter__ if you want to be treated like a sequence). Normal class inheritance means you are inheriting the *whole* interface, as well as the implementation of any parts you don't override. So 'sequence', for instance, couldn't have any methods or do much of anything without causing potentially undesirable behaviour.

About the only exception that makes any sense to me is for exceptions, where using inheritance for classification seems to be universal practice in all languages with exceptions (since 'catch X' generally means catch X, or any of its subclasses). (Bad pun noted, and not avoided).

Localised type hierarchies where it is useful, OK. Inheriting from object as a way to say 'new-style class please', OK. But please, please, please, let's keep interface introspection the dominant way of determining what can be done with a particular object. Creating a 'grand hierarchy of everything' runs directly counter to that goal.

Regards, Nick.
--
Nick Coghlan | Eugene, Oregon
Email: ncoghlan@email.com | USA
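Nick's point about magic methods can be sketched as follows. The class Countdown and the helper is_iterable are hypothetical names invented for this illustration; the sketch assumes Python 3.

```python
class Countdown:
    """Hypothetical class: derives only from object, not from any 'sequence' base."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Providing __iter__ is enough for list(), sum(), for-loops, etc.
        return iter(range(self.start, 0, -1))


def is_iterable(obj):
    # Interface introspection: ask what the object can do, not what it inherits.
    return hasattr(obj, "__iter__")


values = list(Countdown(3))
total = sum(Countdown(3))
print(values, total, is_iterable(Countdown(3)), is_iterable(42))
```

Nothing here depends on a shared base class, so the class inherits no interface it did not ask for.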
Michael Chermside
'basestring' is a completely different issue -- there will always be a need for both unicode and 8-bit-strings as separate types.
I'm not so sure about that. There will certainly be a need for something holding an arbitrary sequence of bytes, but it's not clear that it needs to be called anything with 'string' in it. I can envisage a future in which 'string' means unicode string, and byte sequences are called something else entirely, such as 'bytevector' or 'bytearray' (or maybe just 'array' :-).

Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand
greg@cosc.canterbury.ac.nz
A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.
Greg> I can envisage a future in which 'string' means unicode string,
Greg> and byte sequences are called something else entirely, such as
Greg> 'bytevector' or 'bytearray' (or maybe just 'array' :-).

Or just 'bytes'?

Skip
Greg> I can envisage a future in which 'string' means unicode string,
Greg> and byte sequences are called something else entirely, such as
Greg> 'bytevector' or 'bytearray' (or maybe just 'array' :-).

After another couple minutes of thought:

1. Maybe in 2.5 introduce a "bytes" builtin as a synonym for "str" and recommend its use when the intent is an arbitrary sequence of bytes. Add 'b' as a string literal prefix to generate a bytes object (which will really just be a string). At the same time, start raising exceptions when non-ascii string literals are used without a coding cookie. (Warnings are already issued for that.)

2. In 2.6, make str a synonym for unicode, remove basestring (or also make it a synonym for unicode), make bytes its own distinct (mutable? immutable?) type, and have b"..." literals generate bytes objects.

Then again, maybe this can't happen until Py3K. Before then we could do the following though:

1. Make bytes a synonym for str.

2. Warn about the use of bytes as a variable name.

3. Introduce b"..." literals as a synonym for current string literals, and have them *not* generate warnings if non-ascii characters were used in them without a coding cookie.

PEP time?

Skip
1. Make bytes a synonym for str.
Hmm... I worry that a simple alias would just encourage confused usage, since the compiler won't check. I'd rather see bytes an alias for a bytes array as defined by the array module.
2. Warn about the use of bytes as a variable name.
Is this really needed? Builtins don't byte variable names.
3. Introduce b"..." literals as a synonym for current string literals, and have them *not* generate warnings if non-ascii characters were used in them without a coding cookie.
I expect all sorts of problems with that, such as what it would mean if Unicode or multibyte characters are used in the source. Do we really need byte array literals at all? I don't expect there to be much of a demand. Rather, byte arrays would eventually be returned by the read() method when a file is opened in binary mode. (Isn't this roughly how Java does this?) We could start doing this relatively soon if we used a new mode character ("B" anyone?). --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Do we really need byte array literals at all? I don't expect there to be much of a demand.
I'm not so sure. For example, in httplib, in the line

h.putrequest('GET', selector)

I would claim that 'GET' is a byte string, not a character string: it is sent as-is onto the wire, which is a byte-oriented wire.

Now, it would also "work" if it were a character string, which then gets converted to a byte string using the system default encoding - but I believe that an application that relies on the system default encoding is somewhat broken: Explicit is better than implicit.
Rather, byte arrays would eventually be returned by the read() method when a file is opened in binary mode. (Isn't this roughly how Java does this?)
Java also supports byte arrays in the source, although they are difficult to type:

byte[] request = {'G', 'E', 'T'};

As for reading from streams: Java has multiple reader APIs; some return byte strings, some character strings.

Regards, Martin
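Martin's argument that 'GET' should reach the wire through an explicit encoding rather than a system default can be sketched like this. It assumes Python 3 string semantics (str is text, bytes is binary); the variable names are illustrative only.

```python
method = "GET"                      # a character string
wire = method.encode("ascii")       # explicit: the exact bytes sent on the wire

# Text outside ASCII fails loudly instead of being silently reinterpreted
# by whatever default encoding the site happens to use.
try:
    "G\u00c9T".encode("ascii")
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(list(wire), rejected)
```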
Martin:
Now, it would also "work" if it were a character string, which then gets converted to a byte string using the system default encoding
Or the encoding associated with the file object, which could be set to ascii, either by default or explicitly.

Greg Ewing
Guido van Rossum wrote:
Do we really need byte array literals at all? I don't expect there to be much of a demand.
I'm not so sure. For example, in httplib, in the line
h.putrequest('GET', selector)
I would claim that 'GET' is a byte string, not a character string: it is sent as-is onto the wire, which is a byte-oriented wire.
Now, it would also "work" if it were a character string, which then gets converted to a byte string using the system default encoding - but I believe that an application that relies on the system default encoding is somewhat broken: Explicit is better than implicit.
Alternatively, we could postulate that the stream to which the string is written determines the encoding. This would still be explicit. Anyway, if we really do have enough use cases for byte array literals, we might add them. I still think that would be confusing though, because byte arrays are most useful if they are mutable: and then we'd have mutable literals -- blechhhh! --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido:
Anyway, if we really do have enough use cases for byte array literals, we might add them. I still think that would be confusing though, because byte arrays are most useful if they are mutable: and then we'd have mutable literals -- blechhhh!
Perhaps the constructor for a byte array could accept a string argument as long as it contained only ascii characters?

h.putrequest(bytes('GET'), selector)

Greg Ewing
Greg Ewing wrote:
Perhaps the constructor for a byte array could accept a string argument as long as it contained only ascii characters?
h.putrequest(bytes('GET'), selector)
That is probably most reasonable. It would be harmless to extend it to Latin-1, which would allow to represent all byte values in a string literal - even using the appropriate \x escapes. Regards, Martin
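The claim that Latin-1 can represent every byte value is easy to check. A minimal sketch, assuming Python 3 (the names are illustrative):

```python
# Every byte value 0..255 corresponds to exactly one Latin-1 code point,
# so decoding and re-encoding is lossless.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("latin-1")
round_trip = as_text.encode("latin-1")

# \x escapes in a text literal therefore map straight to byte values.
escaped = "\xff\x00GET".encode("latin-1")
print(round_trip == all_bytes, list(escaped))
```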
Guido:
Anyway, if we really do have enough use cases for byte array literals, we might add them. I still think that would be confusing though, because byte arrays are most useful if they are mutable: and then we'd have mutable literals -- blechhhh!
Greg:
Perhaps the constructor for a byte array could accept a string argument as long as it contained only ascii characters?
h.putrequest(bytes('GET'), selector)
Yeah, but that's what Martin called depending on the default encoding. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido:
Perhaps the constructor for a byte array could accept a string argument as long as it contained only ascii characters?
h.putrequest(bytes('GET'), selector)
Yeah, but that's what Martin called depending on the default encoding.
I don't see anything wrong with that. It would be a fixed default, defined by the language, not something site-dependent that could shift under you.

Greg Ewing
Guido van Rossum wrote:
Anyway, if we really do have enough use cases for byte array literals, we might add them. I still think that would be confusing though, because byte arrays are most useful if they are mutable: and then we'd have mutable literals -- blechhhh!
I see. How would you like byte array displays then? This is the approach taken in the other languages: Every time the array display is executed, a new array is created. There is then no problem with that being mutable.

Of course, if the syntax is too similar to string literals, people might be tricked into believing they are actually literals. Perhaps

bytes('G','E','T')

would be sufficient, or even

bytes("GET")

which would implicitly convert each character to Latin-1.

Regards, Martin
Martin> bytes("GET")
Martin> which would implicitly convert each character to Latin-1.

That's why I think a special literal is necessary. There'd be no unicode foolishness involved. ;-) They'd just be raw uninterpreted bytes.

Skip
Skip Montanaro wrote:
That's why I think a special literal is necessary. There'd be no unicode foolishness involved. ;-) They'd just be raw uninterpreted bytes.
But you'd spell them b"GET", no? If so, which numeric value has "G"? Regards, Martin
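Martin's question has no single answer until an encoding is chosen. A minimal sketch, assuming Python 3 and its bundled 'cp500' (an EBCDIC variant) and 'utf-16-le' codecs:

```python
char = "G"

ascii_value = list(char.encode("ascii"))       # one byte, value 71
ebcdic_value = list(char.encode("cp500"))      # one byte, but a different value
utf16_value = list(char.encode("utf-16-le"))   # two bytes per character

print(ascii_value, ebcdic_value, utf16_value)
```

The character is the same in all three cases; only the chosen encoding decides which numbers end up in the byte sequence.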
>> That's why I think a special literal is necessary. There'd be no
>> unicode foolishness involved. ;-) They'd just be raw uninterpreted
>> bytes.

Martin> But you'd spell them b"GET", no? If so, which numeric value has
Martin> "G"?

Good point...
Skip Montanaro wrote:
>> That's why I think a special literal is necessary. There'd be no
>> unicode foolishness involved. ;-) They'd just be raw uninterpreted
>> bytes.

Martin> But you'd spell them b"GET", no? If so, which numeric value has
Martin> "G"?
Good point...
I don't think I understand the example... What's binary about 'GET' ? Why would you want to put non-ASCII into a binary literal definition ? If we switch the binding of 'yyy' to mean unicode('yyy') some day, why can't we just continue to use the existing implementation for 8-bit strings for b'xxx' (the current implementation is already doing the right thing, meaning that it is 8-bit safe regardless of the source code encoding) ? Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 13 2004)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
On Fri, Aug 13, 2004, M.-A. Lemburg wrote:
What's binary about 'GET' ?
It's an ASCII, human-readable representation of a set of bytes sent over a network interface to command a server. It could just as easily have been "\010\011\012", but it was selected for the convenience of English-speaking humans. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "To me vi is Zen. To use vi is to practice zen. Every command is a koan. Profound to the user, unintelligible to the uninitiated. You discover truth everytime you use it." --reddy@lion.austin.ibm.com
M.-A. Lemburg wrote:
If we switch the binding of 'yyy' to mean unicode('yyy') some day, why can't we just continue to use the existing implementation for 8-bit strings for b'xxx' (the current implementation is already doing the right thing, meaning that it is 8-bit safe regardless of the source code encoding) ?
Not exactly - the current implementation is not safe with respect to re-encoding source in a different encoding. Regards, Martin
>> ... 8-bit strings for b'xxx' (the current implementation is already
>> doing the right thing, meaning that it is 8-bit safe regardless of
>> the source code encoding) ?

Martin> Not exactly - the current implementation is not safe with
Martin> respect to re-encoding source in a different encoding.

Can't such re-encoding tools (whatever they are) be intelligent about b"..."? If they are Python-aware that seems fairly trivial. If not, the job is more difficult.

Skip
Skip Montanaro wrote:
Can't such re-encoding tools (whatever they are) be intelligent about b"..."? If they are Python-aware that seems fairly trivial. If not, the job is more difficult.
What precisely should they do, though? A byte sequence in one encoding might be invalid in another. Regards, Martin
Martin v. Löwis wrote:
M.-A. Lemburg wrote:
If we switch the binding of 'yyy' to mean unicode('yyy') some day, why can't we just continue to use the existing implementation for 8-bit strings for b'xxx' (the current implementation is already doing the right thing, meaning that it is 8-bit safe regardless of the source code encoding) ?
Not exactly - the current implementation is not safe with respect to re-encoding source in a different encoding.
It is if you stick to writing your binary data using an ASCII compatible encoding -- I wouldn't expect any other encoding for binary data anyway. The most common are ASCII + escape sequences, base64 or hex, all of which are ASCII compatible. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 16 2004)
M.-A. Lemburg wrote:
It is if you stick to writing your binary data using an ASCII compatible encoding -- I wouldn't expect any other encoding for binary data anyway. The most common are ASCII + escape sequences, base64 or hex, all of which are ASCII compatible.
We probably have a different notion of "ASCII compatible" then. I would define it as:

An encoding E is "ASCII compatible" if strings that only consist of ASCII characters use the same byte representation in E that they use in ASCII.

In that sense, ISO-8859-1 and UTF-8 are also ASCII compatible. Notice that this is also the definition that PEP 263 assumes.

However, byte strings used in source code are not "safe" if they are encoded in ISO-8859-1 under recoding: If the source code is converted to UTF-8 (including the encoding declaration), then the length of the strings changes, as do the byte values inside the string.

Regards, Martin
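The recoding hazard Martin describes can be made concrete. A minimal sketch, assuming Python 3; the sample strings are illustrative only:

```python
text = "caf\u00e9"                     # 'cafe' with an acute accent on the e

latin1 = text.encode("iso-8859-1")     # the accented letter is one byte
utf8 = text.encode("utf-8")            # the same letter becomes two bytes

# Pure-ASCII text is byte-for-byte identical in both encodings, which is
# the sense in which both are "ASCII compatible".
ascii_only = "GET"
ascii_stable = ascii_only.encode("iso-8859-1") == ascii_only.encode("utf-8")

print(len(latin1), len(utf8), ascii_stable)
```

A byte string written with that accented character in an ISO-8859-1 source file would therefore change both length and content if the file were recoded to UTF-8.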
However, byte strings used in source code are not "safe" if they are encoded in ISO-8859-1 under recoding: If the source code is converted to UTF-8 (including the encoding declaration), then the length of the strings changes, as do the byte values inside the string.
This suggests that byte string literals should be restricted to ASCII characters and \x escapes. Would that be safe enough?

Greg Ewing
This suggests that byte string literals should be restricted to ASCII characters and \x escapes. Would that be safe enough?
Make that printable ASCII and I'm +0. (I'm still not sold on the concept of bytes literals at all.) --Guido van Rossum (home page: http://www.python.org/~guido/)
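The restriction to printable ASCII plus \x escapes could be checked along these lines. This is a sketch only: uses_only_printable_ascii is a hypothetical helper, not part of any proposal in the thread, and it inspects the source text of a literal rather than its decoded value.

```python
def uses_only_printable_ascii(source_text):
    # Printable ASCII runs from space (32) to tilde (126) inclusive.
    # A backslash escape such as \xff is itself spelled with printable
    # characters, so it passes this check.
    return all(32 <= ord(ch) <= 126 for ch in source_text)


print(uses_only_printable_ascii(r"GET \xff"))   # escapes spelled in ASCII
print(uses_only_printable_ascii("caf\u00e9"))   # a raw non-ASCII character
print(uses_only_printable_ascii("a\tb"))        # a raw control character
```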
Guido van Rossum wrote:
(I'm still not sold on the concept of bytes literals at all.)
Ok. Here's a case - in shtoom, I generate audio data. Lots
of audio data. This is broken into packets, then gets a small
header put onto each RTP packet. Right now, I'm using strings
for this. If there was a 'byte literal', I'd use it. This isn't
a huge problem right now, because strings are good enough. But
if we end up in an 'all the strings are unicode', I'll need
_some_ way to construct these packets.
--
Anthony Baxter
Anthony Baxter wrote:
Ok. Here's a case - in shtoom, I generate audio data. Lots of audio data. This is broken into packets, then gets a small header put onto each RTP packet. Right now, I'm using strings for this. If there was a 'byte literal', I'd use it. This isn't a huge problem right now, because strings are good enough. But if we end up in an 'all the strings are unicode', I'll need _some_ way to construct these packets.
Maybe you are missing the point here, maybe not: there is no debate that Python should always have a byte string type (although there is debate on whether that type should be mutable). The current question is whether you want to denote objects of the byte string type *in source code*. I.e. do you have the "Lots of audio data" stored in .py files? Regards, Martin
Martin v. Löwis wrote:
The current question is whether you want to denote objects of the byte string type *in source code*. I.e. do you have the "Lots of audio data" stored in .py files?
Generally, no - with the exception of test cases. In that case, I often end up with byte literals in the source code. (To check that a particular en/de coding operation Did The Right Thing). Anthony
Guido van Rossum wrote:
(I'm still not sold on the concept of bytes literals at all.)
[Anthony]
Ok. Here's a case - in shtoom, I generate audio data. Lots of audio data. This is broken into packets, then gets a small header put onto each RTP packet. Right now, I'm using strings for this. If there was a 'byte literal', I'd use it. This isn't a huge problem right now, because strings are good enough. But if we end up in an 'all the strings are unicode', I'll need _some_ way to construct these packets.
I see that as a huge case for a bytes type, which I've proposed myself; but what's the use case for bytes literals, assuming you can write bytes("foo")? Does b"foo" really make much of a difference? Is it so hard to have to write bytes([0x66, 0x6f, 0x6f]) instead of b"\x66\x6f\x6f"? IOW, how many *literal* packet fragments are in shtoom? --Guido van Rossum (home page: http://www.python.org/~guido/)
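For comparison, the spellings Guido weighs here can be tried side by side. A minimal sketch, assuming Python 3, where both a bytes type and b"..." literals exist and where the constructor requires the encoding of a text argument to be named:

```python
from_ints = bytes([0x66, 0x6F, 0x6F])     # constructor taking a list of integers
from_escapes = b"\x66\x6f\x6f"            # literal with \x escapes
from_text = bytes("foo", "ascii")         # constructor taking text plus an encoding

print(from_ints == from_escapes == from_text)
```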
Guido van Rossum wrote:
I see that as a huge case for a bytes type, which I've proposed myself; but what's the use case for bytes literals, assuming you can write bytes("foo")? Does b"foo" really make much of a difference? Is it so hard to have to write bytes([0x66, 0x6f, 0x6f]) instead of b"\x66\x6f\x6f"?
It's a pretty marginal case for it. I just played with it a bit, and I think after playing with it, I actually prefer the non b'' case. A big +1 for a bytes() type, though. I'm not sure on the details, but it'd be nice if it was possible to pass a bytes() object to, for instance, write() directly.
On Tuesday 2004-08-17 12:13, Anthony Baxter wrote:
Guido van Rossum wrote:
I see that as a huge case for a bytes type, which I've proposed myself; but what's the use case for bytes literals, assuming you can write bytes("foo")? Does b"foo" really make much of a difference? Is it so hard to have to write bytes([0x66, 0x6f, 0x6f]) instead of b"\x66\x6f\x6f"?
It's a pretty marginal case for it. I just played with it a bit, and I think after playing with it, I actually prefer the non b'' case.
Another option, with pros and cons of its own: something along the lines of b"666f6f". -- g
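A hex spelling along these lines can be approximated without new syntax. A minimal sketch, assuming Python 3.5 or later for bytes.hex():

```python
packet = bytes.fromhex("666f6f")    # two hex digits per byte
spelled_back = packet.hex()         # and back to the hex spelling

print(packet, spelled_back)
```

The hex form contains only ASCII digits and letters, so it is unaffected by the source encoding, at the cost of being unreadable for text-like protocol data such as 'GET'.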
Anthony Baxter
Guido van Rossum wrote:
I see that as a huge case for a bytes type, which I've proposed myself; but what's the use case for bytes literals, assuming you can write bytes("foo")? Does b"foo" really make much of a difference? Is it so hard to have to write bytes([0x66, 0x6f, 0x6f]) instead of b"\x66\x6f\x6f"?
It's a pretty marginal case for it. I just played with it a bit, and I think after playing with it, I actually prefer the non b'' case.
Is this getting to (hopefully uncontroversial!) PEP time? Is there any consensus forming on whether bytes() instances are mutable or not?
A big +1 for a bytes() type, though. I'm not sure on the details, but it'd be nice if it was possible to pass a bytes() object to, for instance, write() directly.
If bytes() doesn't implement the read buffer interface, someone somewhere is going to need shooting :-) Cheers, mwh -- <Yosomono> rasterman is the millionth monkey -- from Twisted.Quotes
Michael> Is this getting to (hopefully uncontroversial!) PEP time?

One would hope. I sent in a skeletal PEP last week asking for a number but haven't heard back.

Michael> Is there any consensus forming on whether bytes() instances are
Michael> mutable or not?

ISTR Guido thought mutable was the way to go. I don't think that efficiency concerns will be a huge deal since it won't be used all over the place the way strings are. Will they need to be used as dict keys? Doesn't seem likely to me.

Skip
ISTR Guido thought mutable was the way to go. I don't think that efficiency concerns will be a huge deal since it won't be used all over the place the way strings are. Will they need to be used as dict keys? Doesn't seem likely to me.
Mutable is far far more useful. I can't see them being used commonly
for dict keys, but I do know of a lot of protocols where you have
to go back and patch a 'length' field in a packet header after you've
finished the packet construction.
Maybe we could use the @ symbol for the byte literal? <wink>
Anthony
--
Anthony Baxter
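Anthony's use case, going back to patch a length field once the packet is complete, can be sketched as follows. The sketch assumes Python 3's mutable bytearray and the standard struct module; the three-byte header layout (one type byte, two length bytes) and the function build_packet are invented for illustration and are not taken from shtoom or RTP.

```python
import struct

# Hypothetical header: 1-byte packet type, 2-byte payload length, network order.
HEADER = struct.Struct("!BH")


def build_packet(packet_type, chunks):
    packet = bytearray(HEADER.size)       # reserve the header; length unknown yet
    for chunk in chunks:
        packet += chunk                   # append payload in place
    payload_length = len(packet) - HEADER.size
    # Go back and patch the header now that the length is known.
    HEADER.pack_into(packet, 0, packet_type, payload_length)
    return packet


packet = build_packet(1, [b"abc", b"de"])
print(bytes(packet))
```

With an immutable type the same job needs either a second copy of the data or a pre-pass to compute the length.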
Skip Montanaro
Michael> Is this getting to (hopefully uncontroversial!) PEP time?
One would hope. I sent in a skeletal PEP last week asking for a number but haven't heard back.
Goody.
Michael> Is there any consensus forming on whether bytes() instances are
Michael> mutable or not?
ISTR Guido thought mutable was the way to go.
OK.
I don't think that efficiency concerns will be a huge deal since it won't be used all over the place the way strings are.
Who cares? :-)
Will they need to be used as dict keys? Doesn't seem likely to me.
This is indeed the question. I agree it's unlikely, and you can presumably do something like convert a bytes() instance into a tuple of small integers if you really want to key into a dict. Cheers, mwh -- Clue: You've got the appropriate amount of hostility for the Monastery, however you are metaphorically getting out of the safari jeep and kicking the lions. -- coonec -- http://home.xnet.com/~raven/Sysadmin/ASR.Quotes.html
This is indeed the question. I agree it's unlikely, and you can presumably do something like convert a bytes() instance into a tuple of small integers if you really want to key into a dict.
Or even a Unicode string using the Latin-1 encoding. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)
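Both workarounds for dict keys can be sketched with Python 3's mutable bytearray standing in for the proposed type (an assumption; the thread's bytes type did not exist yet). The names are illustrative.

```python
mutable = bytearray(b"GET")

# A mutable byte sequence cannot be hashed, so it cannot be a dict key.
try:
    hash(mutable)
    hashable = True
except TypeError:
    hashable = False

as_tuple = tuple(mutable)               # tuple of small integers
as_text = mutable.decode("latin-1")     # text string via Latin-1

handlers = {as_tuple: "keyed by tuple", as_text: "keyed by text"}
print(hashable, as_tuple, as_text)
```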
Is this getting to (hopefully uncontroversial!) PEP time?
Sure.
Is there any consensus forming on whether bytes() instances are mutable or not?
Mutable! --Guido van Rossum (home page: http://www.python.org/~guido/)
At 08:31 AM 8/17/04 -0700, Guido van Rossum wrote:
Is this getting to (hopefully uncontroversial!) PEP time?
Sure.
Is there any consensus forming on whether bytes() instances are mutable or not?
Mutable!
So, how will it be different from:

from array import array
def bytes(*initializer): return array('B',*initializer)

Even if it's desirable for 'bytes' to be an actual type (e.g. subclassing ArrayType), it might help the definition process to describe the difference between the new type and a byte array.
So, how will it be different from:
from array import array
def bytes(*initializer): return array('B',*initializer)
Even if it's desirable for 'bytes' to be an actual type (e.g. subclassing ArrayType), it might help the definition process to describe the difference between the new type and a byte array.
Not a whole lot different, except for the ability to use a string as alternate argument to the constructor, and the fact that it's going to be an actual type, and that it should support the buffer API (which array mysteriously doesn't?). The string argument support may not even be necessary -- an alternative way to spell that would be to let s.decode() return a bytes object, which has the advantage of being explicit about the encoding; there's even a base64 encoding already! But it would be a bigger incompatibility, more likely to break existing code using decode() and expecting to get a string. --Guido van Rossum (home page: http://www.python.org/~guido/)
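The array-based helper quoted at the top of this message can be tried as-is. In this sketch it is renamed byte_array (a hypothetical name) so that it does not shadow the bytes builtin of Python 3, which is the interpreter assumed here; in that interpreter the array type does expose the buffer protocol, which may differ from the 2.4-era behaviour under discussion.

```python
from array import array


def byte_array(*initializer):
    # Same body as the helper in the message; only the name differs.
    return array("B", *initializer)


data = byte_array([71, 69, 84])     # unsigned bytes for 'G', 'E', 'T'
data[0] = ord("P")                  # arrays are mutable in place

# Assumption: Python 3, where memoryview accepts an array object.
view = memoryview(data)
print(data.tobytes(), view.tobytes())
```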
On Aug 17, 2004, at 5:33 PM, Guido van Rossum wrote:
So, how will it be different from:
from array import array
def bytes(*initializer): return array('B',*initializer)
Even if it's desirable for 'bytes' to be an actual type (e.g. subclassing ArrayType), it might help the definition process to describe the difference between the new type and a byte array.
Not a whole lot different, except for the ability to use a string as alternate argument to the constructor, and the fact that it's going to be an actual type, and that it should support the buffer API (which array mysteriously doesn't?).
The string argument support may not even be necessary -- an alternative way to spell that would be to let s.decode() return a bytes object, which has the advantage of being explicit about the encoding; there's even a base64 encoding already! But it would be a bigger incompatibility, more likely to break existing code using decode() and expecting to get a string.
IMHO current uses of decode and encode are really confusing. Many decodes are from str -> unicode, and many encodes are from unicode -> str (or str -> unicode -> str implicitly, which is usually going to fail miserably)... while yet others like zlib, base64, etc. are str <-> str. Technically unicode.decode(base64) should certainly work, but it doesn't because unicode doesn't have a decode method. I don't have a proposed solution at the moment, but perhaps these operations should either be outside of the data types altogether (i.e. use codecs only) or there should be separate methods for doing separate things (character translations versus data->data transformations). -bob
Bob Ippolito wrote:
On Aug 17, 2004, at 5:33 PM, Guido van Rossum wrote:
So, how will it be different from:
from array import array
def bytes(*initializer): return array('B',*initializer)
Even if it's desirable for 'bytes' to be an actual type (e.g. subclassing ArrayType), it might help the definition process to describe the difference between the new type and a byte array.
Not a whole lot different, except for the ability to use a string as alternate argument to the constructor, and the fact that it's going to be an actual type, and that it should support the buffer API (which array mysteriously doesn't?).
The string argument support may not even be necessary -- an alternative way to spell that would be to let s.decode() return a bytes object, which has the advantage of being explicit about the encoding; there's even a base64 encoding already! But it would be a bigger incompatibility, more likely to break existing code using decode() and expecting to get a string.
IMHO current uses of decode and encode are really confusing. Many decodes are from str -> unicode, and many encodes are from unicode -> str (or str -> unicode -> str implicitly, which is usually going to fail miserably)... while yet others like zlib, base64, etc. are str <-> str. Technically unicode.decode(base64) should certainly work, but it doesn't because unicode doesn't have a decode method.
They do in 2.4. Note that in 2.4 .decode() and .encode() guarantee that you get a basestring instance. If you want more flexibility in terms of return type, the new codecs.encode() and codecs.decode() will allow arbitrary types as return value.
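The two kinds of codec that Bob distinguishes, character translations and data-to-data transformations, can be seen through the codecs module functions mentioned here. A minimal sketch, assuming Python 3.4 or later, where the 'base64' codec name works on bytes through codecs.encode() and codecs.decode():

```python
import codecs

# Data -> data: bytes in, bytes out.
encoded = codecs.encode(b"foo", "base64")
decoded = codecs.decode(encoded, "base64")

# Character translation: text in, bytes out (and back).
text_to_bytes = "foo".encode("utf-8")
bytes_to_text = text_to_bytes.decode("utf-8")

print(encoded, decoded, text_to_bytes, bytes_to_text)
```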
I don't have a proposed solution at the moment, but perhaps these operations should either be outside of the data types altogether (i.e. use codecs only) or there should be separate methods for doing separate things (character translations versus data->data transformations).
It all depends on whether you are discussing placing binary data into the Python source file (by some means of using literals) or just working with bytes you got from a file, generator, socket, etc. The current discussion is mixing these contexts a bit too much, I believe, which is probably why people keep misunderstanding each other (at least that's how I perceive the debate). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 17 2004)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
Guido> The string argument support may not even be necessary -- an
Guido> alternative way to spell that would be to let s.decode() return a
Guido> bytes object, which has the advantage of being explicit about the
Guido> encoding; there's even a base64 encoding already!

I'm sorry folks, but I still don't understand all this discussion overlap between unicode/string objects (which require explicit or implicit decoding) and bytes objects (which clearly must not). Everyone keeps talking about decoding stuff into bytes objects and whether or not bytes literals would be compatible with the current source encoding. My understanding is that bytes objects are just that, raw sequences of bytes in the range 0x00 to 0xff, inclusive, with no interpretation of any type. Skip
Skip Montanaro wrote:
My understanding is that bytes objects are just that, raw sequences of bytes in the range 0x00 to 0xff, inclusive, with no interpretation of any type.
Yes, but your understanding is limited :-) This idea is good, but it falls short once we talk about source code, because source code does have an encoding. So if you don't want to incorporate the notion of encodings into the byte string types, yet be able to declare them in source code, you have to go for a numeric representation. I.e. you write bytes(71, 69, 84) instead of b"GET".
As soon as you use some kind of string notation for denoting byte code values, you immediately *have* to deal with encodings. Regards, Martin
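For comparison, a sketch of the two spellings as they exist in today's Python 3 (not in the 2.4 under discussion). Note the assumption baked in: the constructor that was eventually adopted takes one iterable of integers rather than the varargs form bytes(71, 69, 84) written above.

```python
# Numeric spelling: no source encoding is involved at all.
numeric = bytes([71, 69, 84])

# Literal spelling: restricted to ASCII characters and escapes.
literal = b"GET"

same = numeric == literal
```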
Martin v. Löwis wrote:
Skip Montanaro wrote:
My understanding is that bytes objects are just that, raw sequences of bytes in the range 0x00 to 0xff, inclusive, with no interpretation of any type.
Yes, but your understanding is limited :-) This idea is good, but it falls short once we talk about source code, because source code does have an encoding. So if you don't want to incorporate the notion of encodings into the byte string types, yet be able to declare them in source code, you have to go for a numeric representation. I.e. you write bytes(71,69, 84) instead of b"GET"
As soon as you use some kind of string notation for denoting byte code values, you immediately *have* to deal with encodings.
Of course you do, but aren't you making things too complicated, Martin ? If you write your string literal using just ASCII characters and escapes, I don't see much of a problem with different source code encodings. If it makes you feel better, we could even enforce this by only allowing these characters in binary string literals. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 18 2004)
M.-A. Lemburg wrote:
If you write your string literal using just ASCII characters and escapes, I don't see much of a problem with different source code encodings.
That is correct. I personally have no problem if byte fields and unicode strings are connected through some encoding; I personally think making it fixed at Latin-1 might be best. I was merely responding to Skip's question why an encoding comes into play at all, as byte fields inherently have no encoding, and might not even represent character data. I was responding that this is mostly true, except for source code. Regards, Martin
>> > Is there any consensus forming on whether bytes() instances are
>> > mutable or not?
>>
>> Mutable!

Phillip> So, how will it be different from:
Phillip>     from array import array
Phillip>     def bytes(*initializer):
Phillip>         return array('B',*initializer)

That might well be a decent trial implementation, though it seems that if we allow strings as initializers we should also allow strings in assignment:

>>> b = bytes("abc\xff")
>>> b
array('B', [97, 98, 99, 255])
>>> print buffer(b)
abcÿ
>>> b[3] = '\xfe'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: an integer is required
>>> b += "\xfe"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: can only extend array with array (not "str")

I must admit I'm having a bit of trouble getting past this point with a more traditional subclass (can array objects not be subclassed?) in part I think because I don't understand new-style classes very well. In particular, I couldn't find a description of __new__, and once I fumbled past that, I didn't seem to be able to override append() or extend(). Skip
I must admit I'm having a bit of trouble getting past this point with a more traditional subclass (can array objects not be subclassed?) in part I think because I don't understand new-style classes very well. In particular, I couldn't find a description of __new__, and once I fumbled past that, I didn't seem to be able to override append() or extend().
Alas, the array type doesn't seem completely fit for subclassing; I tried similar things and couldn't get it to work! Even more mysterious is that the array implementation appears to support the buffer API and yet it can't be used as an argument to write(). What's going on? -- --Guido van Rossum (home page: http://www.python.org/~guido/)
[GvR]
Alas, the array type doesn't seem completely fit for subclassing;
Famous last words: It doesn't look like it would be hard to make arrays subclassable. I'll work on that this weekend. Reminder: there is an outstanding array feature request awaiting your adjudication: www.python.org/sf/992967 Raymond
Guido van Rossum
Even more mysterious is that the array implementation appears to support the buffer API and yet it can't be used as an argument to write(). What's going on?
It supports the "read" buffer API but not the "character" buffer API, so the file has to be opened in binary mode for it to work:
>>> a = array.array('c', 'fish')
>>> open('/dev/null', 'w').write(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: argument 1 must be string or read-only character buffer, not array.array
>>> open('/dev/null', 'wb').write(a)
That restriction(?) comes from this in file_write: PyArg_ParseTuple(args, f->f_binary ? "s#" : "t#", ... where s# requires a read buffer and t# requires a character buffer. array.array is the only type in the core that's a read buffer but not a character buffer, and I can't find any semantic differences between read and character buffers. If someone can explain the differences or confirm that there aren't any, I'll make this work. The easiest thing to do would be to make array support the character buffer API (but maybe only for [cbBu] types?). Dima.
array.array is the only type in the core that's a read buffer but not a character buffer, and I can't find any semantic differences between read and character buffers. If someone can explain the differences or confirm that there aren't any, I'll make this work. The easiest thing to do would be to make array support the character buffer API (but maybe only for [cbBu] types?).
Ah, that makes sense. And no please, keep it that way. It's compatible with the idea that bytes arrays should be read from / written to binary files. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
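The same split can be observed in modern Python 3, where it follows from the text/binary file distinction rather than from the "s#"/"t#" format codes quoted above. A minimal sketch, assuming a temporary file stands in for /dev/null and the 'B' typecode replaces the 'c' typecode, which no longer exists:

```python
import array
import os
import tempfile

a = array.array("B", b"fish")

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # Binary mode accepts any object exposing its raw bytes.
    with open(path, "wb") as f:
        f.write(a)
    with open(path, "rb") as f:
        written = f.read()

    # Text mode refuses it: an array of bytes is not character data.
    try:
        with open(path, "w") as f:
            f.write(a)
        text_mode_error = None
    except TypeError as exc:
        text_mode_error = exc
finally:
    os.remove(path)
```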
Dima Dorfman
It supports the "read" buffer API but not the "character" buffer API, so the file has to be opened in binary mode for it to work:
If we're serious about bytes not being characters, isn't this the *right* behaviour for a byte array?
Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand
greg@cosc.canterbury.ac.nz
A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.
Skip Montanaro wrote:
>>> b[3] = '\xfe'
I think you meant to write b[3] = 0xfe # or byte(0xfe) here. Assigning to an index always takes the element type in Python. It's only strings where reading an index returns the container type.
>>> b += "\xfe"
And here, you probably meant to write b += bytes("\xfe") Regards, Martin
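Putting Phillip's array-based trial implementation together with the two corrections above gives something like the following sketch. It is written for modern Python 3, and the helper is named make_bytes (a made-up name) so that it does not shadow the built-in bytes that now exists.

```python
from array import array


def make_bytes(*initializer):
    # Trial implementation from the thread: an array of unsigned bytes.
    return array("B", *initializer)


b = make_bytes(b"abc\xff")

# Assigning to an index takes the element type, i.e. an integer.
b[3] = 0xFE

# In-place extension needs another array, not a string.
b += make_bytes(b"\xfe")

values = b.tolist()
```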
Michael Hudson wrote:
Anthony Baxter writes:
A big +1 for a bytes() type, though. I'm not sure on the details, but it'd be nice if it was possible to pass a bytes() object to, for instance, write() directly.
If bytes() doesn't implement the read buffer interface, someone somewhere is going to need shooting :-)
Is there any reason you cannot use buffer() ?! It already implements all the necessary things and has been available for many years. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 17 2004)
On Tue, 17 Aug 2004, M.-A. Lemburg wrote:
Michael Hudson wrote:
Anthony Baxter writes:
A big +1 for a bytes() type, though. I'm not sure on the details, but it'd be nice if it was possible to pass a bytes() object to, for instance, write() directly.
If bytes() doesn't implement the read buffer interface, someone somewhere is going to need shooting :-)
Is there any reason you cannot use buffer() ?!
Is it mutable? My guess: no:
d = u'123124'
ddd[0]
'1'
ddd[1]
'\x00'
ddd[1] = '1'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer is read-only
It already implements all the necessary things and has been available for many years.
It was in the shadows because we had byte-strings. Sincerely yours, Roman Suzi -- rnd@onego.ru =\= My AI powered by GNU/Linux RedHat 7.3
Roman Suzi wrote:
On Tue, 17 Aug 2004, M.-A. Lemburg wrote:
Michael Hudson wrote:
Anthony Baxter writes:
A big +1 for a bytes() type, though. I'm not sure on the details, but it'd be nice if it was possible to pass a bytes() object to, for instance, write() directly.
If bytes() doesn't implement the read buffer interface, someone somewhere is going to need shooting :-)
Is there any reason you cannot use buffer() ?!
Is it mutable? My guess: no:
The buffer object itself can be read-only or read-write. Unfortunately, the buffer() built-in always returns read-only buffers. At C level it is easy to create a buffer object from a read-write capable object.
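The read-only versus read-write distinction can be illustrated with memoryview, which is the closest analogue in modern Python 3 (the buffer() built-in itself was later removed). This is only a sketch of the idea, not of the C-level buffer object described above: the view inherits writability from the object it wraps.

```python
# A view over an immutable object is read-only.
read_only = memoryview(b"123124")

# A view over a mutable object is read-write.
writable = memoryview(bytearray(b"123124"))

try:
    read_only[1] = ord("1")
    failed = False
except TypeError:
    failed = True

writable[1] = ord("9")
result = bytes(writable)
```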
d = u'123124' ddd[0]
'1'
ddd[1]
'\x00'
ddd[1] = '1'
Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: buffer is read-only
It already implements all the necessary things and has been available for many years.
It was in the shadows because we had byte-strings.
Right, so why not revive it ?! Anyway, this whole discussion about a new bytes type doesn't really solve the problem that the b'...' literal was intended for: that of having a nice way to define (read-only) 8-bit binary string literals. We already have a number of read-write types for storing binary data, e.g. arrays, cStringIO and buffers. Inventing yet another way to spell binary data won't make life easier. However, what will be missing is a nice way to spell read-only binary data. Since 'tada' will return a Unicode object in Py3k, I think we should reuse the existing 8-bit string object under the new literal constructor b'tada\x00' (and apply the same source code encoding semantics we apply today for 'tada\x00'). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 17 2004)
M.-A. Lemburg wrote:
...
We already have a number of read-write types for storing binary data, e.g. arrays, cStringIO and buffers. Inventing yet another way to spell binary data won't make life easier.
The point is to canonicalize one and start to make APIs that expect it. Otherwise we will never make the leap from the bad habit of using strings as byte arrays. Can you pass arrays to the write() function? Can you decode buffers to strings? A byte array type would have a certain mix of functions and API compatibility that is missing from the plethora of similar thingees. Paul Prescod
Paul Prescod wrote:
M.-A. Lemburg wrote:
...
We already have a number of read-write types for storing binary data, e.g. arrays, cStringIO and buffers. Inventing yet another way to spell binary data won't make life easier.
The point is to canonicalize one and start to make APIs that expect it. Otherwise we will never make the leap from the bad habit of using strings as byte arrays.
Can you pass arrays to the write() function? Can you decode buffers to strings? A byte array type would have a certain mix of functions and API compatibility that is missing from the plethora of similar thingees.
Wouldn't it be possible to extend the existing buffer type to meet those standards ? -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 18 2004)
Martin v. Löwis wrote:
M.-A. Lemburg wrote:
Wouldn't it be possible to extend the existing buffer type to meet those standards ?
Yes; you then need to change all codecs to return buffers from .encode.
I'm still not convinced that we can simply drop the existing immutable 8-bit string type and replace it with a mutable bytes or buffer type, e.g. would buffer.lower() work on the buffer itself or return a lowered copy ? However, if that's where Python will be heading, then you're right (for most of the codecs: some might want to return Unicode objects, e.g. unicode-escape). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 18 2004)
M.-A. Lemburg wrote:
I'm still not convinced that we can simply drop the existing immutable 8-bit string type and replace it with a mutable bytes or buffer type, e.g. would buffer.lower() work on the buffer itself or return a lowered copy ?
The byte string type would not have a .lower method, as "lowering" is not meaningful for bytes, only for characters. Regards, Martin
Martin v. Löwis wrote:
M.-A. Lemburg wrote:
I'm still not convinced that we can simply drop the existing immutable 8-bit string type and replace it with a mutable bytes or buffer type, e.g. would buffer.lower() work on the buffer itself or return a lowered copy ?
The byte string type would not have a .lower method, as "lowering" is not meaningful for bytes, only for characters.
Indeed... and the same is true for almost all other methods (except maybe .replace()). Sounds like a lot of code will break. OTOH, it will also enforce the notion of doing encoding and decoding only at IO boundaries and being explicit about character sets, which is good in the long run. Auto-conversion using the default encoding will get us all ASCII character (Unicode) strings converted to buffers without problems (and without having the need for an extra b'something' modifier). This leaves the question of how to deal with the byte range 0x80 - 0xFF. The straightforward solution would be to switch to Latin-1 as default encoding and let the same magic take care of that byte range as well. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 18 2004)
M.-A. Lemburg wrote:
Indeed... and the same is true for almost all other methods (except maybe .replace()).
Sounds like a lot of code will break.
We will see. The default string type will be Unicode, so code using .lower will continue to work in many cases. The question is what open(path,"r").read() will return. It seems that Guido wants users to specify "rb" if they want that to be byte strings. Regards, Martin
Martin v. Löwis wrote:
M.-A. Lemburg wrote:
Indeed... and the same is true for almost all other methods (except maybe .replace()).
Sounds like a lot of code will break.
We will see. The default string type will be Unicode, so code using .lower will continue to work in many cases.
The question is what open(path,"r").read() will return. It seems that Guido wants users to specify "rb" if they want that to be byte strings.
This would imply that we'd need to add an encoding parameter that becomes a required parameter in case 'r' (without 'b') is specified as mode. Perhaps we should drop 'b' altogether and make encoding a required parameter. We could have a 'binary' codec which then passes through the data as is (as buffer object instead of as Unicode object for most other codecs). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 18 2004)
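For reference, a sketch of how the mode and encoding interact in Python 3 as it was eventually released. There the encoding became an optional keyword argument rather than a required one, and 'b' was kept instead of a 'binary' codec. The temporary file is just scaffolding for the example.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(b"caf\xe9\n")

    # Text mode: bytes are decoded with the stated encoding.
    with open(path, "r", encoding="latin-1") as f:
        as_text = f.read()

    # Binary mode: the data passes through untouched.
    with open(path, "rb") as f:
        as_bytes = f.read()
finally:
    os.remove(path)
```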
I'm still not convinced that we can simply drop the existing immutable 8-bit string type and replace it with a mutable bytes or buffer type, e.g. would buffer.lower() work on the buffer itself or return a lowered copy ?
Byte strings probably shouldn't have a lower() method at all, or any other method that assumes the contents represent characters.
Greg Ewing, Computer Science Dept, University of Canterbury, Christchurch, New Zealand
greg@cosc.canterbury.ac.nz
A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.
On Tue, 17 Aug 2004, M.-A. Lemburg wrote:
Roman Suzi wrote:
On Tue, 17 Aug 2004, M.-A. Lemburg wrote:
It was in the shadows because we had byte-strings.
Right, so why not revive it ?!
Anyway, this whole discussion about a new bytes type doesn't really solve the problem that the b'...' literal was intended for: that of having a nice way to define (read-only) 8-bit binary string literals.
I think a new _mutable_ bytes() type is better than old 8-bit binary strings for binary data processing purposes. Or do we need them for legacy text-processing software?
We already have a number of read-write types for storing binary data, e.g. arrays, cStringIO and buffers. Inventing yet another way to spell binary data won't make life easier.
However, what will be missing is a nice way to spell read-only binary data.
Since 'tada' will return a Unicode object in Py3k, I think we should reuse the existing 8-bit string object under the new literal constructor b'tada\x00' (and apply the same source code encoding semantics we apply today for 'tada\x00').
Sincerely yours, Roman Suzi -- rnd@onego.ru =\= My AI powered by GNU/Linux RedHat 7.3
Roman Suzi wrote:
On Tue, 17 Aug 2004, M.-A. Lemburg wrote:
Roman Suzi wrote:
On Tue, 17 Aug 2004, M.-A. Lemburg wrote:
It was in the shadows because we had byte-strings.
Right, so why not revive it ?!
Anyway, this whole discussion about a new bytes type doesn't really solve the problem that the b'...' literal was intended for: that of having a nice way to define (read-only) 8-bit binary string literals.
I think a new _mutable_ bytes() type is better than old 8-bit binary strings for binary data processing purposes. Or do we need them for legacy text-processing software?
Hmm, who ever said that we are going to drop the current 8-bit string implementation ? I'm only suggesting to look at what's there instead of trying to redo everything in a slightly different way, e.g. you can already get the bytes() functionality from the buffer type at C level - it's just that this functionality is not exposed at Python level.
We already have a number of read-write types for storing binary data, e.g. arrays, cStringIO and buffers. Inventing yet another way to spell binary data won't make life easier.
However, what will be missing is a nice way to spell read-only binary data.
Since 'tada' will return a Unicode object in Py3k, I think we should reuse the existing 8-bit string object under the new literal constructor b'tada\x00' (and apply the same source code encoding semantics we apply today for 'tada\x00').
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 18 2004)
Hmm, who ever said that we are going to drop the current 8-bit string implementation ?
I expect that in Python 3.0 aka Python 3000 we'll have only Unicode strings and byte arrays, so yes, the current 8-bit string implementation will eventually die. Jython and IronPython are ahead of us in this respect... --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Hmm, who ever said that we are going to drop the current 8-bit string implementation ?
I expect that in Python 3.0 aka Python 3000 we'll have only Unicode strings and byte arrays, so yes, the current 8-bit string implementation will eventually die. Jython and IronPython are ahead of us in this respect...
Ok, so I suppose that we can learn from Jython and IronPython in this respect... How do they handle binary data and the interfacing between various I/O facilities, e.g. files, sockets, pipes, user input, etc. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 19 2004)
Ok, so I suppose that we can learn from Jython and IronPython in this respect...
How do they handle binary data and the interfacing between various I/O facilities, e.g. files, sockets, pipes, user input, etc.
I'm not sure, but I expect that in most cases they use Unicode strings in order to be compatible with Python's standard library. That's not the outcome I'd like to see though. I believe Jython at least also has a bytes-like type (probably a thin wrapper around Java's byte array) that's used for interfacing to java classes. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
Ok, so I suppose that we can learn from Jython and IronPython in this respect...
How do they handle binary data and the interfacing between various I/O facilities, e.g. files, sockets, pipes, user input, etc.
I'm not sure, but I expect that in most cases they use Unicode strings in order to be compatible with Python's standard library. That's not the outcome I'd like to see though. I believe Jython at least also has a bytes-like type (probably a thin wrapper around Java's byte array) that's used for interfacing to java classes.
I've had a discussion with Jack Janssen about using bytes as default return value for I/O operations where no encoding is specified (or unknown). He raised the issue of bytes not being usable as dictionary keys due to their mutability. He was also concerned about the increase in complexity when writing programs that work with non-text data or mixed text/data I/O. If we want to make the move from Python 2.x to 3.0 feasible for large code bases, then we have to do something about these issues. It seems that the simple solution of going with Unicode + bytes type is not going to be a suitable approach. Anyway, we still have 4-5 years to think about this :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 23 2004)
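The dictionary-key concern can be shown with the two byte types that modern Python 3 ended up with, an immutable bytes and a mutable bytearray. This is offered as an illustration of the mutability issue, not as what was being proposed at the time.

```python
# An immutable byte string is hashable and works as a key.
table = {b"GET": "immutable key works"}

# A mutable byte array is not hashable, so it cannot be a key.
try:
    table[bytearray(b"GET")] = "mutable key"
    unhashable = False
except TypeError:
    unhashable = True
```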
M.-A. Lemburg wrote:
Anyway, this whole discussion about a new bytes type doesn't really solve the problem that the b'...' literal was intended for: that of having a nice way to define (read-only) 8-bit binary string literals.
But why do you need a way to spell 8-bit string literals? You can always do
"string".encode("L1")
If that is too much typing, do
def b(s):return s.encode("L1")
b("string")
Regards, Martin
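Spelled out as runnable code, the helper looks like this. The sketch assumes a Python where plain string literals are Unicode (true of Python 3) and relies on "L1" being a registered alias for the Latin-1 codec.

```python
def b(s):
    # Latin-1 maps code points 0-255 straight onto byte values 0-255.
    return s.encode("L1")


plain = b("string")
with_high_byte = b("tada\x00\xff")
```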
Martin v. Löwis wrote:
M.-A. Lemburg wrote:
Anyway, this whole discussion about a new bytes type doesn't really solve the problem that the b'...' literal was intended for: that of having a nice way to define (read-only) 8-bit binary string literals.
But why do you need a way to spell 8-bit string literals?
You can always do
"string".encode("L1")
If that is too much typing, do
def b(s):return s.encode("L1")
b("string")
You need to think about the important use-case of having to convert Py2 applications to Py3 style. In many cases, the application can be made to run under Py3 by adding the small 'b' in front of the used string literals. Even better: if we add the b'xxx' notation now, we could start working towards the switch-over by slowly converting the Python standard library to actually work in -U mode (which basically implements the switch-over by making 'abc' behave as u'abc'). Since the code is already in place and the change is minimal, I don't see any reason not to use it. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 18 2004)
M.-A. Lemburg wrote:
You need to think about the important use-case of having to convert Py2 applications to Py3 style. In many cases, the application can be made to run under Py3 be adding the small 'b' in front of the used string literals.
That is hard to tell, because Py3 is not implemented, yet. It might be that in many cases, no change is necessary at all, because the system default encoding will convert the strings to bytes.
Since the code is already in place and the change is minimal, I don't see any reason not to use it.
I do. It would mean that we commit to the b"" notation, when there is no real need for that. Regards, Martin
A big +1 for a bytes() type, though. I'm not sure on the details, but it'd be nice if it was possible to pass a bytes() object to, for instance, write() directly.
That should already be possible with the array module, but somehow it doesn't quite work, even though the array type appears to support the buffer API. (Does anybody understand why not?) --Guido van Rossum (home page: http://www.python.org/~guido/)
Does b"foo" really make much of a difference?
Yes. My guess is that if you leave it out, you'll see var = u"foo".encode("ASCII") all over the place (assuming that encode() will produce a bytes type). Wouldn't b"foo" be more readable all around?
Is it so hard to have to write bytes([0x66, 0x6f, 0x6f]) instead of b"\x66\x6f\x6f"?
No, that's true. But if you have a bytes literal syntax, might as well allow \x in it. Bill
Bill Janssen wrote:
Yes. My guess is that if you leave it out, you'll see
var = u"foo".encode("ASCII")
all over the place (assuming that encode() will produce a bytes type).
If you also had
var = bytes(u"foo")
then I guess people would prefer that. People who want to save typing can do
b = bytes
and, given that the u prefix will be redundant, write
var = b("foo")
Regards, Martin
On Aug 17, 2004, at 4:07 PM, Martin v. Löwis wrote:
Bill Janssen wrote:
Yes. My guess is that if you leave it out, you'll see var = u"foo".encode("ASCII") all over the place (assuming that encode() will produce a bytes type).
If you also had
var = bytes(u"foo")
then I guess people would prefer that. People who want to save typing can do
b = bytes
and, given that the u prefix will be redundant, write
var = b("foo")
How would you embed raw bytes if the string was unicode? Maybe there should be something roughly equivalent to this: bytesvar = r"delimited packet\x00".decode("string_escape") "string_escape" would probably be a bad name for it, of course. -bob
Bob Ippolito wrote:
How would you embed raw bytes if the string was unicode?
The most direct notation would be bytes("delimited packet\x00") However, people might not understand what is happening, and Guido doesn't like it if the bytes are >127. Regards, Martin
On Aug 17, 2004, at 5:11 PM, Martin v. Löwis wrote:
Bob Ippolito wrote:
How would you embed raw bytes if the string was unicode?
The most direct notation would be
bytes("delimited packet\x00")
However, people might not understand what is happening, and Guido doesn't like it if the bytes are >127.
I guess that was a bad example, what if the delimiter was \xff? I know that map(ord, u'delimited packet\xff') would get correct results.. but I don't think I like that either. -bob
How would you embed raw bytes if the string was unicode?
The most direct notation would be
bytes("delimited packet\x00")
However, people might not understand what is happening, and Guido doesn't like it if the bytes are >127.
I guess that was a bad example, what if the delimiter was \xff? I know that map(ord, u'delimited packet\xff') would get correct results.. but I don't think I like that either.
Maybe the constructor could be bytes(<string>[, <encoding>])? --Guido van Rossum (home page: http://www.python.org/~guido/)
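As it happens, the bytes constructor in modern Python 3 accepts this two-argument shape. The sketch below only shows how that later constructor behaves; the optional encoding in the proposal above turned out to be mandatory for string arguments.

```python
# With an explicit encoding the high byte survives as 0xff.
packet = bytes("delimited packet\xff", "latin-1")
last_byte = packet[-1]

# Without an encoding, a string argument is rejected.
try:
    bytes("delimited packet\xff")
    needs_encoding = False
except TypeError:
    needs_encoding = True
```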
On Aug 17, 2004, at 5:18 PM, Bob Ippolito wrote:
On Aug 17, 2004, at 5:11 PM, Martin v. Löwis wrote:
Bob Ippolito wrote:
How would you embed raw bytes if the string was unicode?
The most direct notation would be
bytes("delimited packet\x00")
However, people might not understand what is happening, and Guido doesn't like it if the bytes are >127.
I guess that was a bad example, what if the delimiter was \xff?
Indeed, if all strings are unicode, the question becomes: what encoding does bytes() use to translate unicode characters to bytes. Two alternatives have been proposed so far: 1) ASCII (translate chars as their codepoint if < 128, else error) 2) ISO-8859-1 (translate chars as their codepoint if < 256, else error) I think I'd choose #2, myself.
I know that map(ord, u'delimited packet\xff') would get correct results.. but I don't think I like that either.
Why would you consider that wrong? ord(u'\xff') *should* return 255. Just as ord(u'\u1000') returns 4096. There's nothing mysterious there. James
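The difference between the two alternatives, and the equivalence with map(ord, ...), can be checked directly. A short sketch in modern Python 3, where every string literal is already Unicode:

```python
text = "delimited packet\xff"

# Alternative 2: ISO-8859-1 maps each code point below 256 to one byte.
as_latin1 = text.encode("iso-8859-1")
code_points = list(map(ord, text))

# Alternative 1: ASCII rejects anything at or above 128.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

high = ord("\u1000")
```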
Yes, that works for me. Martin writes:
If you also had
var = bytes(u"foo")
then I guess people would prefer that. People who want to save typing can do
b = bytes
and, given that the u prefix will be redundant, write
var = b("foo")
Bill
Greg Ewing wrote:
This suggests that byte string literals should be restricted to ASCII characters and \x escapes. Would that be safe enough?
Yes, that should work fine (Guido's restriction to *printable* characters is also useful, as line endings are easily changed, too, when moving from one system to another). Regards, Martin
Martin v. Löwis wrote:
M.-A. Lemburg wrote:
It is if you stick to writing your binary data using an ASCII compatible encoding -- I wouldn't expect any other encoding for binary data anyway. The most common are ASCII + escape sequences, base64 or hex, all of which are ASCII compatible.
We probably have a different notion of "ASCII compatible" then. I would define it as:
An encoding E is "ASCII compatible" if strings that only consist of ASCII characters use the same byte representation in E that they use in ASCII.
In that sense, ISO-8859-1 and UTF-8 are also ASCII compatible. Notice that this is also the definition that PEP 263 assumes.
Sorry, wrong wording on my part: I meant a string literal that only uses ASCII characters for the literal definition, i.e. literaldefinition.decode('ascii').encode('ascii') == literaldefinition.
However, byte strings used in source code are not "safe" if they are encoded in ISO-8859-1 under recoding: If the source code is converted to UTF-8 (including the encoding declaration), then the length of the strings changes, as do the byte values inside the string.
Agreed. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 17 2004)
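The recoding problem discussed above can be illustrated with a short sketch. It uses Python 3's str.encode() to stand in for "the bytes an editor would write for this literal"; the sample strings are chosen only for illustration.

```python
# Sketch: the bytes behind a literal depend on the source encoding
# unless the literal sticks to ASCII characters.
text = "caf\xe9"                       # one character above 127

as_latin1 = text.encode("iso-8859-1")
as_utf8 = text.encode("utf-8")
print(len(as_latin1), len(as_utf8))    # 4 5

# A literal written with ASCII characters only is the same in all three encodings.
ascii_only = "GET"
same = (ascii_only.encode("ascii")
        == ascii_only.encode("iso-8859-1")
        == ascii_only.encode("utf-8"))
print(same)                            # True
```

Both the length and the byte values change when the non-ASCII text is recoded to UTF-8, while the ASCII-only text is unaffected.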
Guido van Rossum wrote:
Anyway, if we really do have enough use cases for byte array literals, we might add them. I still think that would be confusing though, because byte arrays are most useful if they are mutable: and then we'd have mutable literals -- blechhhh!
Martin:
I see. How would you like byte array displays then?
Not as characters (not by default anyway), because more often than not they will contain binary or encoded gibberish!
This is the approach taken in the other languages: Every time the array display is executed, a new array is created. There is then no problem with that being mutable.
The downside of that is that then for performance reasons you might end up having to move bytes literals out of expressions if they are in fact used read-only (which the compiler can't know but the user can).
Of course, if the syntax is too similar to string literals, people might be tricked into believing they are actually literals. Perhaps
bytes('G','E','T')
would be sufficient, or even
bytes("GET")
which would implicitly convert each character to Latin-1.
The first form would also need an encoding, since Python doesn't have character literals! I don't think we should use Latin-1, for the same reasons that the default encoding is ASCII. Better would be to have a second argument that's the encoding, so you can write
bytes(u"<chinese text>", "utf-8")
Hm, u"<chinese text>".encode("utf-8") should probably return a bytes array, and that might be sufficient.
Perhaps bytes should by default be considered as arrays of tiny unsigned ints, so we could use
bytes(map(ord, "GET"))
and it would display itself as
bytes([71, 69, 84])
Very long ones should probably use ellipses rather than print a million numbers. --Guido van Rossum (home page: http://www.python.org/~guido/)
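The forms suggested in this message can be tried in Python 3, where a bytes type with an optional encoding argument exists. This is a sketch against that later type, which is immutable, rather than the mutable array under discussion here; the sample text is an assumption standing in for "<chinese text>".

```python
# Sketch: bytes seen as a sequence of small unsigned integers.
from_ints = bytes(map(ord, "GET"))
print(list(from_ints))                     # [71, 69, 84]

# Explicit encoding as a second argument, and the equivalent encode() call.
text = "\u4e2d\u6587"                      # two CJK characters, for illustration only
explicit = bytes(text, "utf-8")
print(explicit == text.encode("utf-8"))    # True
print(len(explicit))                       # 6
```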
>> This is the approach taken in the other languages: Every time the array display is executed, a new array is created. There is then no problem with that being mutable.
Guido> The downside of that is that then for performance reasons you might end up having to move bytes literals out of expressions if they are in fact used read-only (which the compiler can't know but the user can).
Wouldn't the compiler be able to tell it was to be treated specially if it saw b"GET"? In that case, the code generated for
x = b"GET"
would be something like
LOAD_CONST "GET"
LOAD_NAME bytes
CALL_FUNCTION 1
STORE_FAST x
Skip
Wouldn't the compiler be able to tell it was to be treated specially if it saw b"GET"? In that case, the code generated for
x = b"GET"
would be something like
LOAD_CONST "GET"
LOAD_NAME bytes
CALL_FUNCTION 1
STORE_FAST x
This is actually an illustration of what I meant: a performance-aware person might want to move that CALL_FUNCTION out of an inner loop if they knew the result was never modified inside the loop. --Guido van Rossum (home page: http://www.python.org/~guido/)
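The hoisting Guido describes might look like the sketch below. It uses Python 3's bytearray as a stand-in for the mutable byte array under discussion; the function names and sample data are hypothetical.

```python
# Sketch: a mutable byte array is rebuilt on every evaluation, so a
# read-only use inside a loop can be moved out of the loop by hand.
def count_in_loop(lines):
    hits = 0
    for line in lines:
        prefix = bytearray(b"GET")     # constructor call on every iteration
        if line.startswith(prefix):
            hits += 1
    return hits

def count_hoisted(lines):
    prefix = bytearray(b"GET")         # built once; safe only because it is never modified
    hits = 0
    for line in lines:
        if line.startswith(prefix):
            hits += 1
    return hits

requests = [b"GET /index.html", b"POST /form", b"GET /about"]
print(count_in_loop(requests), count_hoisted(requests))   # 2 2
```

Both versions give the same answer; only the programmer knows that the second is safe, because the compiler cannot tell that prefix is never mutated.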
Guido> Anyway, if we really do have enough use cases for byte array literals, we might add them. I still think that would be confusing though, because byte arrays are most useful if they are mutable: and then we'd have mutable literals -- blechhhh!
Today I can initialize mutable objects from immutable strings:
>>> print list("abc")
['a', 'b', 'c']
>>> print set("abc")
set(['a', 'c', 'b'])
I see no reason that mutable bytes objects couldn't be created from otherwise immutable sequences either. Would it be a problem to ensure that
a = b"abc"
b = b"abc"
print a is b
prints False? The main difference as I see it is that byte literals would be completely devoid of any sort of interpretation as unicode sequences. It would be nice if this was possible:
# -*- coding: utf-8 -*-
b = b"â"
though that would probably wreak havoc with editors and hex escapes would have to be used in most situations.
Skip
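Skip's question about identity can be explored with Python 3's bytearray, used here as an assumed stand-in for the proposed mutable bytes object. In this sketch each construction from the same immutable data yields a distinct object.

```python
# Sketch: mutable objects initialized from immutable data.
letters = list("abc")
unique = set("abc")

def make():
    # A new mutable object is built from the immutable constant on each call.
    return bytearray(b"abc")

a = make()
b = make()
print(a == b, a is b)      # True False

a[0] = ord("x")            # changing one copy leaves the other alone
print(a, b)                # bytearray(b'xbc') bytearray(b'abc')
```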
At 10:45 PM 8/11/04 -0700, Guido van Rossum wrote:
Anyway, if we really do have enough use cases for byte array literals, we might add them. I still think that would be confusing though, because byte arrays are most useful if they are mutable: and then we'd have mutable literals -- blechhhh!
Not if they work like list or dictionary "literals". That is, if they're just short for 'array("B","literal here")'.
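The comparison with list displays can be sketched using the array module directly. The code below assumes Python 3, where array("B", ...) accepts a bytes initializer; the function name is hypothetical.

```python
from array import array

def request_method():
    # Evaluated on every call, just as a list display such as [71, 69, 84] would be.
    return array("B", b"GET")

first = request_method()
second = request_method()
first[0] = ord("P")                        # mutate one result only
print(first.tobytes(), second.tobytes())   # b'PET' b'GET'
```

Because the expression is re-evaluated each time, mutating one result cannot affect later evaluations, which is the sense in which such a form would not be a "mutable literal".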
Martin> Java also supports byte arrays in the source, although they are difficult to type:
Martin> byte[] request = {'G', 'E', 'T'};
Seems to me that b"GET" would be more Pythonic given existing Python string literals.
Martin> As for reading from streams: Java has multiple reader APIs; some return byte strings, some character strings.
I think Guido's proposed 'B' might make sense here. OTOH, today's 'b' might work as well, though a switch of that magnitude could probably not be made until 3.0 if bytes objects are not synonyms for strings.
Skip
(changing the subject - sorry I didn't do it in my first two messages...)
>> 1. Make bytes a synonym for str.
Guido> Hmm... I worry that a simple alias would just encourage confused usage, since the compiler won't check. I'd rather see bytes an alias for a bytes array as defined by the array module.
You're right. This could probably be added now ("now" being 2.5) with little or no problem. My thought was to get the name in there quickly (could be done in 2.4) with some supporting documentation so people could begin modifying their code.
>> 2. Warn about the use of bytes as a variable name.
Guido> Is this really needed? Builtins don't byte variable names.
I suppose this could be dispensed with. Let pychecker handle it.
>> 3. Introduce b"..." literals as a synonym for current string literals, and have them *not* generate warnings if non-ascii characters were used in them without a coding cookie.
Guido> I expect all sorts of problems with that, such as what it would mean if Unicode or multibyte characters are used in the source.
My intent in proposing b"..." literals was that they would be allowed in any source file. Their contents would not be interpreted in any way. One simple use case: identifying the magic number of a binary file type of some sort. That might well be a constant to programmers manipulating that sort of file and have nothing to do with Unicode at all.
Guido> Do we really need byte array literals at all?
I think so. Martin already pointed out an example where a string literal is used today for a sequence of bytes that's put out on the wire as-is. It's just convenient that the protocol was developed in such a way that most of its meta-data is plain ASCII.
Skip
>> 1. Make bytes a synonym for str.
Guido> Hmm... I worry that a simple alias would just encourage confused usage, since the compiler won't check. I'd rather see bytes an alias for a bytes array as defined by the array module.
You're right. This could probably be added now ("now" being 2.5) with little or no problem. My thought was to get the name in there quickly (could be done in 2.4) with some supporting documentation so people could begin modifying their code.
I think very few people would do so until the semantics of bytes were clearer. Let's just put it in meaning byte array when we're ready.
>> 2. Warn about the use of bytes as a variable name.
Guido> Is this really needed? Builtins don't byte variable names.
I suppose this could be dispensed with. Let pychecker handle it.
>> 3. Introduce b"..." literals as a synonym for current string literals, and have them *not* generate warnings if non-ascii characters were used in them without a coding cookie.
Guido> I expect all sorts of problems with that, such as what it would mean if Unicode or multibyte characters are used in the source.
My intent in proposing b"..." literals was that they would be allowed in any source file. Their contents would not be interpreted in any way.
But they would be manipulated if they were non-ASCII and the source file was converted to a different encoding. Better be safe and only allow printable ASCII and hex escapes there.
One simple use case: identifying the magic number of a binary file type of some sort. That might well be a constant to programmers manipulating that sort of file and have nothing to do with Unicode at all.
Not a very strong use case, this could easily be done using just hex.
Guido> Do we really need byte array literals at all?
I think so. Martin already pointed out an example where a string literal is used today for a sequence of bytes that's put out on the wire as-is. It's just convenient that the protocol was developed in such a way that most of its meta-data is plain ASCII.
See my response to that. --Guido van Rossum (home page: http://www.python.org/~guido/)
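The magic-number use case can be written with printable ASCII and hex escapes only, as Guido suggests. The sketch below uses the eight-byte PNG signature purely as an illustration (the thread does not name a file type) and Python 3's b"..." syntax, which postdates this discussion; the helper name is hypothetical.

```python
# Sketch: a file-type magic number using printable ASCII and hex escapes only.
PNG_MAGIC = b"\x89PNG\x0d\x0a\x1a\x0a"

def looks_like_png(data):
    """Return True if data starts with the eight-byte PNG signature."""
    return data.startswith(PNG_MAGIC)

print(len(PNG_MAGIC))                        # 8
print(looks_like_png(PNG_MAGIC + b"rest"))   # True
print(looks_like_png(b"GIF89a"))             # False
```

Written this way, the constant survives any recoding of the source file, because every character in the literal is ASCII.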
Skip Montanaro
My intent in proposing b"..." literals was that they would be allowed in any source file. Their contents would not be interpreted in any way.
I think this is a bad idea. If a coding cookie says a file is in utf-8, then the file really has to be valid utf-8 data, for implementation sanity and not freaking out editors. The use-case for including arbitrary chunks of binary data in a source file seems stretched in the extreme. Cheers, mwh -- My hat is lined with tinfoil for protection in the unlikely event that the droid gets his PowerPoint presentation working. -- Alan W. Frame, alt.sysadmin.recovery
I would like to urge caution before making this change. Despite what the PEP may say, I actually think that creating a 'baseint' type is the WRONG design choice for the long term. I envision an eventual Python which has just one type, called 'int'. The fact that an efficient implementation is used when the ints are small and an arbitrary-precision version when they get too big would be hidden from the user by automatic promotion of overflow. (By "hidden" I mean the user doesn't need to care, not that they can't find out if they want to.) We are almost there already, but if people start coding to 'baseinteger' it takes us down a different path entirely. 'basestring' is a completely different issue -- there will always be a need for both unicode and 8-bit-strings as separate types.
Not so sure. I expect that, like Jython and IronPython, Python 3000 will use unicode for strings, and have a separate mutable byte array for 8-bit bytes. In Python 3000 I expect that indeed somehow the existence of long is completely hidden from the user, but that's a long time away, and until then baseinteger might be a better solution than requiring people to write isinstance(x, (int, long)). --Guido van Rossum (home page: http://www.python.org/~guido/)
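Without a common base class, the isinstance(x, (int, long)) test mentioned above is usually wrapped in a tuple of types. The following is a minimal sketch that runs on both Python 2 and Python 3; the names integer_types and is_builtin_integer are hypothetical and are not the proposed baseinteger.

```python
# Sketch: one name for "any built-in integer type".
try:
    integer_types = (int, long)    # Python 2: two built-in integer types
except NameError:
    integer_types = (int,)         # Python 3: long is gone, int is the only one

def is_builtin_integer(x):
    return isinstance(x, integer_types)

print(is_builtin_integer(10 ** 30))   # True
print(is_builtin_integer(1.5))        # False
```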
On Wednesday 2004-08-11 22:02, Michael Chermside wrote:
I would like to urge caution before making this change. Despite what the PEP may say, I actually think that creating a 'baseint' type is the WRONG design choice for the long term. I envision an eventual Python which has just one type, called 'int'. The fact that an efficient implementation is used when the ints are small and an arbitrary-precision version when they get too big would be hidden from the user by automatic promotion of overflow. (By "hidden" I mean the user doesn't need to care, not that they can't find out if they want to.) We are almost there already, but if people start coding to 'baseinteger' it takes us down a different path entirely. 'basestring' is a completely different issue -- there will always be a need for both unicode and 8-bit-strings as separate types.
This is why "integer" is a better name than "baseinteger". For now it can be the common supertype of int and long. In the future, it can be the name of the single integer type. -- g
This is why "integer" is a better name than "baseinteger". For now it can be the common supertype of int and long. In the future, it can be the name of the single integer type.
No, that will be int, of course! Like 'basestring', 'baseinteger' is intentionally cumbersome, because it is only the base class of all *built-in* integral types. Note that UserString is *not* subclassing basestring, and likewise if you wrote an integer-like class from scratch it should not inherit from baseinteger. Code testing for these types is interested in knowing whether something is a member of one of the *built-in* types, which is often needed because other built-in operations (e.g. many extension modules) only handle the built-in types. If you want to test for integer-like or string-like behavior, you won't be able to use isinstance(), but instead you'll have to check for the presence of certain methods. I know this is not easy in the case of integers, but I don't want to start requiring inheritance from a marker base type now. Python is built on duck typing. (Google for it.) --Guido van Rossum (home page: http://www.python.org/~guido/)
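The distinction between testing for a built-in type and testing for integer-like behavior can be sketched as follows. The class Natural and the particular methods checked are assumptions made for illustration; the message above does not say which methods should count.

```python
class Natural(object):
    """Hypothetical integer-like class written from scratch (no int base class)."""

    def __init__(self, value):
        self.value = value

    def __int__(self):
        return self.value

    def __add__(self, other):
        return Natural(self.value + int(other))

def looks_integer_like(obj):
    # Which methods count as "integer-like" is an assumption of this sketch.
    return all(hasattr(obj, name) for name in ("__int__", "__add__"))

n = Natural(3)
print(isinstance(n, int))        # False
print(looks_integer_like(n))     # True
print(int(n + 4))                # 7
```

Note that looks_integer_like(1.5) is also True, since floats define both methods; this is one reason a method-based check is not easy in the case of integers.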
[I said:]
This is why "integer" is a better name than "baseinteger". For now it can be the common supertype of int and long. In the future, it can be the name of the single integer type.
[Guido:]
No, that will be int, of course!
Dang. I suppose it has to be, for hysterical raisins.
Like 'basestring', 'baseinteger' is intentionally cumbersome, because it is only the base class of all *built-in* integral types. Note that UserString is *not* subclassing basestring, and likewise if you wrote an integer-like class from scratch it should not inherit from baseinteger. Code testing for these types is interested in knowing whether something is a member of one of the *built-in* types, which is often needed because other built-in operations (e.g. many extension modules) only handle the built-in types.
If you want to test for integer-like or string-like behavior, you won't be able to use isinstance(), but instead you'll have to check for the presence of certain methods.
I know this is not easy in the case of integers, but I don't want to start requiring inheritance from a marker base type now. Python is built on duck typing. (Google for it.)
I don't think this is a good reason for rejecting "integer" as the name of the common supertype of int and long. You'd use isinstance(x,integer) to check whether x is an integer of built-in type, just as you currently use isinstance(x,float) to check whether x is a floating-point number of built-in type. I see no reason why a cumbersome name is much advantage.
I am, however, convinced by the other argument: All the steps towards int/long unification that have been taken so far assume that the endpoint is having a single type called "int", and that would be derailed by changing the target name to "integer". Not to mention that there's any amount of code out there that uses "int" for conversions and the like, which there's no reason to break.
I can't get away from the feeling that it would, in some possibly over-abstract sense, have been neater to introduce "integer" as the supertype and encourage people to *change* from using "int" (and "long") to using "integer", so that the meanings of "int" and "long" never change in the unification process. But, even if I could convince you, it's too late for that :-).
By the way, I wasn't at any point proposing that inheritance from a "marker" type should be required for anything, and I'm not sure why you thought I was. I expect I was unclear, and I'm sorry about that. -- g
participants (21)
- "Martin v. Löwis"
- Aahz
- Anthony Baxter
- Bill Janssen
- Bob Ippolito
- Dima Dorfman
- Dmitry Vasiliev
- Gareth McCaughan
- Greg Ewing
- Guido van Rossum
- Guido van Rossum
- James Y Knight
- M.-A. Lemburg
- Michael Chermside
- Michael Hudson
- Nick Coghlan
- Paul Prescod
- Phillip J. Eby
- Raymond Hettinger
- Roman Suzi
- Skip Montanaro