Re: Python 1.6a2 Unicode bug (was Re: comparing strings and ints)
I wrote:
A utf-8-encoded 8-bit string in Python is *not* a string, but a "ByteArray".
Another way of putting this is:

- utf-8 in an 8-bit string is to a unicode string what a pickle is to an object.
- defaulting to utf-8 upon coercing is like implicitly trying to unpickle an 8-bit string when comparing it to an instance. Bad idea.

Defaulting to Latin-1 is the only logical choice, no matter how western-culture-centric this may seem.

Just
Just van Rossum wrote:
I wrote:
A utf-8-encoded 8-bit string in Python is *not* a string, but a "ByteArray".
Another way of putting this is:

- utf-8 in an 8-bit string is to a unicode string what a pickle is to an object.
- defaulting to utf-8 upon coercing is like implicitly trying to unpickle an 8-bit string when comparing it to an instance. Bad idea.
Defaulting to Latin-1 is the only logical choice, no matter how western-culture-centric this may seem.
Please note that the support for mixing strings and Unicode objects is really only there to aid porting applications to Unicode. New code should use Unicode directly and apply all needed conversions explicitly using one of the many ways to encode or decode Unicode data. The auto-conversions are only there to help out and provide some convenience. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
(forwarded from c.l.py, on request)
New code should use Unicode directly and apply all needed conversions explicitly using one of the many ways to encode or decode Unicode data. The auto-conversions are only there to help out and provide some convenience.
does this mean that the 8-bit string type is deprecated ??? </F>
Is it necessary for us to also have this scrag-fight in public? Most of the thread on c.l.py is filled in by people who are also py-dev members! [MAL writes]
Please note that the support for mixing strings and Unicode objects is really only there to aid porting applications to Unicode.
New code should use Unicode directly and apply all needed conversions explicitly using one of the many ways to encode or decode Unicode data.
This will _never_ happen. The Python programmer should never need to be aware they have a Unicode string versus a standard string - just a "string"! The fact there are 2 string types should be considered an implementation detail, and not a conceptual model for people to work within.

I think we will be mixing Unicode and strings for ever! The only way to avoid it would be a unified type - possibly Py3k. Until then, people will still generally use strings as literals in their code, and should not even be aware they are mixing. I'm never going to prefix my ascii-only strings with u"" just to avoid the possibility of mixing!

Listening to the arguments, I've got to say I'm coming down squarely on the side of Fredrik and Just. Strings must be sequences of characters, whose length is the number of characters. A string holding an encoding should be considered logically a byte array, and conversions should be explicit.
The auto-conversions are only there to help out and provide some convenience.
Doesn't sound like it is working :-( Mark.
Mark Hammond writes:
It is necessary for us to also have this scrag-fight in public? Most of the thread on c.l.py is filled in by people who are also py-dev members!
Attempting to walk a delicate line here, my reading of the situation is that Fredrik's frustration level is increasing as he points out problems, but nothing much is done about them. Marc-Andre will usually respond, but there's been no indication from Guido about what to do. But GvR might be waiting to hear from more users about their experience with Unicode; so far I don't know if anyone has much experience with the new code.

But why not have it in public? The python-dev archives are publicly available anyway, so it's not like this discussion was going on behind closed doors. The problem with discussing this on c.l.py is that not everyone reads c.l.py any more due to volume.

--amk
[Just van Rossum]
... Defaulting to Latin-1 is the only logical choice, no matter how western-culture-centric this may seem.
Indeed, if someone from an inferior culture wants to chime in, let them find Python-Dev with their own beady little eyes <wink>. western-culture-is-better-than-none-&-at-least-*we*-understand-it-ly y'rs - tim
At 10:27 PM -0400 26-04-2000, Tim Peters wrote:
Indeed, if someone from an inferior culture wants to chime in, let them find Python-Dev with their own beady little eyes <wink>.
All irony aside, I think you've nailed one of the problems spot on: - most core Python developers seem to be too busy to read *anything* at all in c.l.py - most people that care about the issues are not on python-dev Just
[Just van Rossum]
All irony aside, I think you've nailed one of the problems spot on: - most core Python developers seem to be too busy to read *anything* at all in c.l.py - most people that care about the issues are not on python-dev
But they're not on c.l.py either, are they? I still read everything there, although that's gotten so time-consuming I rarely reply anymore. In any case, I've seen almost nothing useful about Unicode issues on c.l.py that wasn't also on Python-Dev; perhaps I missed something. ask-10-more-people-&-you'll-get-20-more-opinions-ly y'rs - tim
I'd like to reset this discussion. I don't think we need to involve c.l.py yet -- I haven't seen anyone with Asian language experience chime in there, and that's where this matters most. I am directing this to the Python i18n-sig mailing list, because that's where the debate belongs, and there interested parties can join the discussion without having to be vetted as "fit for python-dev" first.

I apologize for having been less than responsive in the matter; unfortunately there's lots of other stuff on my mind right now that has recently had a tendency to distract me with higher priority crises.

I've heard a few people claim that strings should always be considered to contain "characters" and that there should be one character per string element. I've also heard a clamoring that there should only be one string type. You folks have never used Asian encodings. In countries like Japan, China and Korea, encodings are a fact of life, and the most popular encodings are ASCII supersets that use a variable number of bytes per character, just like UTF-8. Each country or language uses different encodings, even though their characters look mostly the same to western eyes. UTF-8 and Unicode is having a hard time getting adopted in these countries because most software that people use deals only with the local encodings. (Sounds familiar?)

These encodings are much less "pure" than UTF-8, because they only encode the local characters (and ASCII), and because of various problems with slicing: if you look "in the middle" of an encoded string or file, you may not know how to interpret the bytes you see. There are overlaps (in most of these encodings anyway) between the codes used for single-byte and double-byte encodings, and you may have to look back one or more characters to know what to make of the particular byte you see. To get an idea of the nightmares that non-UTF-8 multibyte encodings give C/C++ programmers, see the Multibyte Character Set (MBCS) Survival Guide (http://msdn.microsoft.com/library/backgrnd/html/msdn_mbcssg.htm). See also the home page of the i18n-sig for more background information on encoding (and other i18n) issues (http://www.python.org/sigs/i18n-sig/).

UTF-8 attempts to solve some of these problems: the multi-byte encodings are chosen such that you can tell by the high bits of each byte whether it is (1) a single-byte (ASCII) character (top bit off), (2) the start of a multi-byte character (at least two top bits on; how many indicates the total number of bytes comprising the character), or (3) a continuation byte in a multi-byte character (top bit on, next bit off).

Many of the problems with non-UTF-8 multibyte encodings are the same as for UTF-8 though: #bytes != #characters, a byte may not be a valid character, regular expression patterns using "." may give the wrong results, and so on.

The truth of the matter is: the encoding of string objects is in the mind of the programmer. When I read a GIF file into a string object, the encoding is "binary goop". When I read a line of Japanese text from a file, the encoding may be JIS, shift-JIS, or EUC -- this has to be an assumption built into my program, or perhaps information supplied separately (there's no easy way to guess based on the actual data). When I type a string literal using Latin-1 characters, the encoding is Latin-1. When I use octal escapes in a string literal, e.g. '\303\247', the encoding could be UTF-8 (this is a c-cedilla). When I type a 7-bit string literal, the encoding is ASCII.

The moral of all this?
8-bit strings are not going away. They are not encoded in UTF-8 henceforth. Like before, and like 8-bit text files, they are encoded in whatever encoding you want. All you get is an extra mechanism to convert them to Unicode, and the Unicode conversion defaults to UTF-8 because it is the only conversion that is reversible. And, as Tim Peters quoted Andy Robinson (paraphrasing Tim's paraphrase), UTF-8 annoys everyone equally.

Where does the current approach require work?

- We need a way to indicate the encoding of Python source code. (Probably a "magic comment".)

- We need a way to indicate the encoding of input and output data files, and we need shortcuts to set the encoding of stdin, stdout and stderr (and maybe all files opened without an explicit encoding). Marc-Andre showed some sample code, but I believe it is still cumbersome. (I have to play with it more to see how it could be improved.)

- We need to discuss whether there should be a way to change the default conversion between Unicode and 8-bit strings (currently hardcoded to UTF-8), in order to make life easier for people who want to continue to use their favorite 8-bit encoding (e.g. Latin-1, or shift-JIS) but who also want to make use of the new Unicode datatype.

We're still in alpha, so we can still fix things.

--Guido van Rossum (home page: http://www.python.org/~guido/)
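[A minimal sketch, added editorially in present-day Python and not part of the original message, illustrating the UTF-8 byte classes described above and the #bytes != #characters point:]

data = u"caf\u00e9".encode("utf-8")      # 4 characters become 5 bytes

for byte in data:                        # iterating over bytes yields integers in Python 3
    if byte < 0x80:
        kind = "single-byte (ASCII) character"
    elif byte & 0xC0 == 0xC0:
        kind = "start of a multi-byte character"
    else:                                # top bit on, next bit off
        kind = "continuation byte"
    print("%02x  %s" % (byte, kind))

print("%d bytes, %d characters" % (len(data), len(data.decode("utf-8"))))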
Hi,

I'm not sure how much value I can add, as I know little about the charsets etc. and a bit more about Python. As a user of these, and running a consultancy firm in Hong Kong, I can at least pass on some points and perhaps help you with testing later on. My first touch on international PCs was fixing a Japanese 8086 back in 1989, it didn't even have colour!

Hong Kong is quite an experience as there are two formats in common use, plus occasionally another gets thrown in. In HK they use the Traditional Chinese, whereas the mainland uses Simplified, as Guido says, there are a number of different types of these. Occasionally we see the Taiwanese charsets used.

It seems to me that having each individual string variable encoded might just be too atomic, perhaps creating a cumbersome overhead in the system. For most applications I can settle for the entire app to be using a single charset, however from experience there are exceptions. We are normally working with prior knowledge of the charset being used, rather than having to deal with any charset which may come along (at an application level), and therefore generally work in a context, just as a European programmer would be working in say English or German.

As you know, storage/retrieval is not a problem, but manipulation and comparison is. A nice way to handle this would be like operator overloading such that string operations would be performed in the context of the current charset, I could then change context as needed, removing the need for metadata surrounding the actual data. This should speed things up as each overloaded library could be optimised given the different quirks, and new ones could be added easily. My code could be easily re-used on different charsets by simply changing context externally to the code, rather than passing in lots of stuff and expecting Python to deal with it.

Also I'd like very much to compile/load in only the International charsets that I need. I wouldn't want to see Java type bloat occurring to Python, and adding internationalisation for everything is huge.

I think what I am suggesting is a different approach which obviously places more onus on the programmer rather than Python. Perhaps this is not acceptable, I don't know as I've never developed a programming language. I hope this is a helpful point of view to get you thinking further, otherwise ... please ignore me and I'll keep quiet : )

Regards
Paul
Guido van Rossum [guido@python.org] wrote:
I've heard a few people claim that strings should always be considered to contain "characters" and that there should be one character per string element. I've also heard a clamoring that there should only be one string type. You folks have never used Asian encodings. In countries like Japan, China and Korea, encodings are a fact of life, and the most popular encodings are ASCII supersets that use a variable number of bytes per character, just like UTF-8. Each country or language uses different encodings, even though their characters look mostly the same to western eyes. UTF-8 and Unicode is having a hard time getting adopted in these countries because most software that people use deals only with the local encodings. (Sounds familiar?)
Actually a bigger concern that we hear from our customers in Japan is that Unicode has *serious* problems in Asian languages. They took the "unification" of Chinese and Japanese, rather than both, and therefore cannot represent lots of phrases quite right. I can have someone write up a better description, but I was told by several Japanese people that they wouldn't use Unicode come hell or high water, basically. Basically it's JIS, Shift-JIS or nothing for most Japanese companies. This was my experience working with Konica a few years ago as well.

Chris
--
| Christopher Petrilli
| petrilli@amber.org
Guido van Rossum wrote:
...
I've heard a few people claim that strings should always be considered to contain "characters" and that there should be one character per string element. I've also heard a clamoring that there should only be one string type. You folks have never used Asian encodings. In countries like Japan, China and Korea, encodings are a fact of life, and the most popular encodings are ASCII supersets that use a variable number of bytes per character, just like UTF-8. Each country or language uses different encodings, even though their characters look mostly the same to western eyes. UTF-8 and Unicode is having a hard time getting adopted in these countries because most software that people use deals only with the local encodings. (Sounds familiar?)
I think that maybe an important point is getting lost here. I could be wrong, but it seems that all of this emphasis on encodings is misplaced. The physical and logical makeup of character strings are entirely separate issues. Unicode is a character set. It works in the logical domain. Dozens of different physical encodings can be used for Unicode characters.

There are XML users who work with XML (and thus Unicode) every day and never see UTF-8, UTF-16 or any other Unicode-consortium "sponsored" encoding. If you invent an encoding tomorrow, it can still be XML-compatible. There are many encodings older than Unicode that are XML (and Unicode) compatible.

I have not heard complaints about the XML way of looking at the world and in fact it was explicitly endorsed by many of the world's leading experts on internationalization. I haven't followed the Java situation as closely but I have also not heard screams about its support for i18n.
The truth of the matter is: the encoding of string objects is in the mind of the programmer. When I read a GIF file into a string object, the encoding is "binary goop".
IMHO, it's a mistake of history that you would even think it makes sense to read a GIF file into a "string" object and we should be trying to erase that mistake, as quickly as possible (which is admittedly not very quickly) not building more and more infrastructure around it. How can we make the transition to a "binary goops are not strings" world easiest?
The moral of all this? 8-bit strings are not going away.
If that is a statement of your long term vision, then I think that it is very unfortunate. Treating string literals as if they were isomorphic with byte arrays was probably the right thing in 1991 but it won't be in 2005. It doesn't meet the definition of string used in the Unicode spec., nor in XML, nor in Java, nor at the W3C nor in most other up and coming specifications.
[Paul Prescod]
I think that maybe an important point is getting lost here. I could be wrong, but it seems that all of this emphasis on encodings is misplaced.
In practical applications that manipulate text, encodings creep up all the time. I remember a talk or message by Andy Robinson about the messiness of producing printed reports in Japanese for a large investment firm. Most of the issues that took his time had to do with encodings, if I recall correctly. (Andy, do you remember what I'm talking about? Do you have a URL?)
The truth of the matter is: the encoding of string objects is in the mind of the programmer. When I read a GIF file into a string object, the encoding is "binary goop".
IMHO, it's a mistake of history that you would even think it makes sense to read a GIF file into a "string" object and we should be trying to erase that mistake, as quickly as possible (which is admittedly not very quickly) not building more and more infrastructure around it. How can we make the transition to a "binary goops are not strings" world easiest?
I'm afraid that's a bigger issue than we can solve for Python 1.6. We're committed to by and large backwards compatibility while supporting Unicode -- the backwards compatibility with tons of extension modules (many 3rd party) requires that we deal with 8-bit strings in basically the same way as we did before.
The moral of all this? 8-bit strings are not going away.
If that is a statement of your long term vision, then I think that it is very unfortunate. Treating string literals as if they were isomorphic with byte arrays was probably the right thing in 1991 but it won't be in 2005.
I think you're a tad too optimistic about the evolution speed of software (Windows 2000 *still* has to support DOS programs), but I see your point. As I stated in another message, in Python 3000 we'll have to consider a more Java-esque solution: *character* strings are Unicode, and for bytes we have (mutable!) byte arrays. Certainly 8-bit bytes as the smallest storage unit aren't going away.
It doesn't meet the definition of string used in the Unicode spec., nor in XML, nor in Java, nor at the W3C nor in most other up and coming specifications.
OK, so that's a good indication of where you're coming from. Maybe you should spend a little more time in the trenches and a little less in standards bodies. Standards are good, but sometimes disconnected from reality (remember ISO networking? :-).
From the W3C site:
""While ISO-2022-JP is not sufficient for every ISO10646 document, it is the case that ISO10646 is a sufficient document character set for any entity encoded with ISO-2022-JP.""
And this is exactly why encodings will remain important: entities encoded in ISO-2022-JP have no compelling reason to be recoded permanently into ISO10646, and there are lots of forces that make it convenient to keep it encoded in ISO-2022-JP (like existing tools).
I know that document well. --Guido van Rossum (home page: http://www.python.org/~guido/)
I agree with most of what you say, but... On Fri, 28 Apr 2000, Guido van Rossum wrote:
As I stated in another message, in Python 3000 we'll have to consider a more Java-esque solution: *character* strings are Unicode, and for bytes we have (mutable!) byte arrays.
I would prefer a different distinction:

           mutable    immutable
  chars    string     string_buffer
  bytes    bytes      bytes_buffer

Why not allow me the freedom to index a dictionary with goop? (Here's a sample application: UNIX "file" command)

--
Moshe Zadka <mzadka@geocities.com>.
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com
Guido van Rossum wrote:
[Paul Prescod]
I think that maybe an important point is getting lost here. I could be wrong, but it seems that all of this emphasis on encodings is misplaced.
In practical applications that manipulate text, encodings creep up all the time.
I'm not saying that encodings are unimportant. I'm saying that that they are *different* than what Fredrik was talking about. He was talking about a coherent logical model for characters and character strings based on the conventions of more modern languages and systems than C and Python.
How can we make the transition to a "binary goops are not strings" world easiest?
I'm afraid that's a bigger issue than we can solve for Python 1.6.
I understand that we can't fix the problem now. I just think that we shouldn't go out of our way to make it worse. If we make byte-array strings "magically" cast themselves into character-strings, people will expect that behavior forever.
It doesn't meet the definition of string used in the Unicode spec., nor in XML, nor in Java, nor at the W3C nor in most other up and coming specifications.
OK, so that's a good indication of where you're coming from. Maybe you should spend a little more time in the trenches and a little less in standards bodies. Standards are good, but sometimes disconnected from reality (remember ISO networking? :-).
As far as I know, XML and Java are used a fair bit in the real world...even somewhat in Asia. In fact, there is a book titled "XML and Java" written by three Japanese men.
And this is exactly why encodings will remain important: entities encoded in ISO-2022-JP have no compelling reason to be recoded permanently into ISO10646, and there are lots of forces that make it convenient to keep it encoded in ISO-2022-JP (like existing tools).
You cannot recode an ISO-2022-JP document into ISO10646 because 10646 is a character *set* and not an encoding. ISO-2022-JP says how you should represent characters in terms of bits and bytes. ISO10646 defines a mapping from integers to characters. They are both important, but separate. I think that this automagical re-encoding conflates them. -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html
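[An editorial sketch, not from the thread, of the distinction Paul draws: the character set maps integers to characters, while an encoding maps characters to bytes, so one code point can have several byte representations:]

ch = u"\u3042"                           # HIRAGANA LETTER A, a single code point

print(ord(ch))                           # 12354 -- the character-set level (integer <-> character)
print(ch.encode("utf-8"))                # one byte-level representation
print(ch.encode("iso2022_jp"))           # another, escape-sequence based, representation of the same character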
[Guido]
And this is exactly why encodings will remain important: entities encoded in ISO-2022-JP have no compelling reason to be recoded permanently into ISO10646, and there are lots of forces that make it convenient to keep it encoded in ISO-2022-JP (like existing tools).
[Paul]
You cannot recode an ISO-2022-JP document into ISO10646 because 10646 is a character *set* and not an encoding. ISO-2022-JP says how you should represent characters in terms of bits and bytes. ISO10646 defines a mapping from integers to characters.
OK. I really meant recoding in UTF-8 -- I maintain that there are lots of forces that prevent recoding most ISO-2022-JP documents in UTF-8.
They are both important, but separate. I think that this automagical re-encoding conflates them.
Who is proposing any automagical re-encoding? Are you sure you understand what we are arguing about? *I* am not even sure what we are arguing about. I am simply saying that 8-bit strings (literals or otherwise) in Python have always been able to contain encoded strings. Earlier, you quoted some reference documentation that defines 8-bit strings as containing characters. That's taken out of context -- this was written in a time when there was (for most people anyway) no difference between characters and bytes, and I really meant bytes. There's plenty of use of 8-bit Python strings for non-character uses so your "proof" that 8-bit strings should contain "characters" according to your definition is invalid. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum writes:
OK. I really meant recoding in UTF-8 -- I maintain that there are lots of forces that prevent recoding most ISO-2022-JP documents in UTF-8.
Such as? -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"
Tom Emerson wrote:
Guido van Rossum writes:
OK. I really meant recoding in UTF-8 -- I maintain that there are lots of forces that prevent recoding most ISO-2022-JP documents in UTF-8.
Such as?
ISO-2022-JP includes language/locale information, UTF-8 doesn't. if you just recode the character codes, you'll lose important information. </F>
Fredrik Lundh writes:
ISO-2022-JP includes language/locale information, UTF-8 doesn't. if you just recode the character codes, you'll lose important information.
So encode them using the Plane 14 language tags. I won't start with whether language/locale should be encoded in a character encoding... 8-) -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"
Guido van Rossum writes:
OK. I really meant recoding in UTF-8 -- I maintain that there are lots of forces that prevent recoding most ISO-2022-JP documents in UTF-8.
[Tom Emerson]
Such as?
The standard forces that work against all change -- existing tools, user habits, compatibility, etc. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum writes:
The standard forces that work against all change -- existing tools, user habits, compatibility, etc.
Ah... I misread your original statement, which I took to be a technical reason why one couldn't convert ISO-2022-JP to UTF-8. Of course one cannot expect everyone to switch en masse to a new encoding, pulling their existing documents with them. I'm in full agreement there. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"
Uche asked for a summary so I cc:ed the xml-sig. Guido van Rossum wrote:
...
OK. I really meant recoding in UTF-8 -- I maintain that there are lots of forces that prevent recoding most ISO-2022-JP documents in UTF-8.
Absolutely agree.
Are you sure you understand what we are arguing about?
Here's what I thought we were arguing about:

If you put a bunch of "funny characters" into a Python string literal, and then compare that string literal against a Unicode object, should those funny characters be treated as logical units of text (characters) or as bytes? And if bytes, should some transformation be automatically performed to have those bytes be reinterpreted as characters according to some particular encoding scheme (probably UTF-8).

I claim that we should *as far as possible* treat strings as character lists and not add any new functionality that depends on them being byte lists. Ideally, we could add a byte array type and start deprecating the use of strings in that manner. Yes, it will take a long time to fix this bug but that's what happens when good software lives a long time and the world changes around it.
Earlier, you quoted some reference documentation that defines 8-bit strings as containing characters. That's taken out of context -- this was written in a time when there was (for most people anyway) no difference between characters and bytes, and I really meant bytes.
Actually, I think that that was Fredrik. Anyhow, you wrote the documentation that way because it was the most intuitive way of thinking about strings. It remains the most intuitive way. I think that that was the point Fredrik was trying to make. We can't make "byte-list" strings go away soon but we can start moving people towards the "character-list" model. In concrete terms I would suggest that old fashioned lists be automatically coerced to Unicode by interpreting each byte as a Unicode character. Trying to go the other way could cause the moral equivalent of an OverflowError but that's not a problem.
>>> a=1000000000000000000000000000000000000L
>>> int(a)
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OverflowError: long int too long to convert
And just as with ints and longs, we would expect to eventually unify strings and unicode strings (but not byte arrays). -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html
Are you sure you understand what we are arguing about?
Here's what I thought we were arguing about:
If you put a bunch of "funny characters" into a Python string literal, and then compare that string literal against a Unicode object, should those funny characters be treated as logical units of text (characters) or as bytes? And if bytes, should some transformation be automatically performed to have those bytes be reinterpreted as characters according to some particular encoding scheme (probably UTF-8).
I claim that we should *as far as possible* treat strings as character lists and not add any new functionality that depends on them being byte lists. Ideally, we could add a byte array type and start deprecating the use of strings in that manner. Yes, it will take a long time to fix this bug but that's what happens when good software lives a long time and the world changes around it.
Earlier, you quoted some reference documentation that defines 8-bit strings as containing characters. That's taken out of context -- this was written in a time when there was (for most people anyway) no difference between characters and bytes, and I really meant bytes.
Actually, I think that that was Fredrik.
Yes, I came across the post again later. Sorry.
Anyhow, you wrote the documentation that way because it was the most intuitive way of thinking about strings. It remains the most intuitive way. I think that that was the point Fredrik was trying to make.
I just wish he made the point more eloquently. The eff-bot seems to be in a crunchy mood lately...
We can't make "byte-list" strings go away soon but we can start moving people towards the "character-list" model. In concrete terms I would suggest that old fashioned lists be automatically coerced to Unicode by interpreting each byte as a Unicode character. Trying to go the other way could cause the moral equivalent of an OverflowError but that's not a problem.
>>> a=1000000000000000000000000000000000000L
>>> int(a)
Traceback (innermost last):
  File "<stdin>", line 1, in ?
OverflowError: long int too long to convert
And just as with ints and longs, we would expect to eventually unify strings and unicode strings (but not byte arrays).
OK, you've made your claim -- like Fredrik, you want to interpret 8-bit strings as Latin-1 when converting (not just comparing!) them to Unicode.

I don't think I've heard a good *argument* for this rule though. "A character is a character is a character" sounds like an axiom to me -- something you can't prove or disprove rationally.

I have a bunch of good reasons (I think) for liking UTF-8: it allows you to convert between Unicode and 8-bit strings without losses, Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), it is not Western-language-centric.

Another reason: while you may claim that your (and /F's, and Just's) preferred solution doesn't enter into the encodings issue, I claim it does: Latin-1 is just as much an encoding as any other one. I claim that as long as we're using an encoding we might as well use the most accepted 8-bit encoding of Unicode as the default encoding.

I also think that the issue is blown out of proportions: this ONLY happens when you use Unicode objects, and it ONLY matters when some other part of the program uses 8-bit string objects containing non-ASCII characters. Given the long tradition of using different encodings in 8-bit strings, at that point it is anybody's guess what encoding is used, and UTF-8 is a better guess than Latin-1.

--Guido van Rossum (home page: http://www.python.org/~guido/)
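[An editorial sketch of the reversibility point Guido makes: UTF-8 can encode any Unicode string and get it back unchanged, while an 8-bit encoding such as Latin-1 cannot represent most of Unicode:]

s = u"\u0107\u3042"                      # characters outside the Latin-1 range

assert s.encode("utf-8").decode("utf-8") == s    # UTF-8 round-trips losslessly

try:
    s.encode("latin-1")                  # Latin-1 only covers U+0000..U+00FF
except UnicodeEncodeError:
    print("not representable in Latin-1")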
Guido van Rossum <guido@python.org> wrote:
I just wish he made the point more eloquently. The eff-bot seems to be in a crunchy mood lately...
I've posted a few thousand messages on this topic, most of which seem to have been ignored. if you'd read all my messages, and seen all the replies, you'd be cranky too...
I don't think I've heard a good *argument* for this rule though. "A character is a character is a character" sounds like an axiom to me -- something you can't prove or disprove rationally.
maybe, but it's a darn good axiom, and it's used by everyone else. Perl uses it, Tcl uses it, XML uses it, etc. see: http://www.python.org/pipermail/python-dev/2000-April/005218.html
I have a bunch of good reasons (I think) for liking UTF-8: it allows you to convert between Unicode and 8-bit strings without losses, Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), it is not Western-language-centric.
the "Tcl uses it" is a red herring -- their internal implementation uses 16-bit integers, and the external interface works very hard to keep the "strings are character sequences" illusion. in other words, the length of a string is *always* the number of characters, the character at index i is *always* the i'th character in the string, etc. that's not true in Python 1.6a2. (as for Tkinter, you only have to add 2-3 lines of code to make it use 16-bit strings instead...)
Another reason: while you may claim that your (and /F's, and Just's) preferred solution doesn't enter into the encodings issue, I claim it does: Latin-1 is just as much an encoding as any other one.
this is another red herring: my argument is that 8-bit strings should contain unicode characters, using unicode character codes. there should be only one character repertoire, and that repertoire is unicode. for a definition of these terms, see: http://www.python.org/pipermail/python-dev/2000-April/005225.html

obviously, you can only store 256 different values in a single 8-bit character (just like you can only store 4294967296 different values in a single 32-bit int). to store larger values, use unicode strings (or long integers). conversion from a small type to a large type always works, conversion from a large type to a small one may result in an OverflowError. it has nothing to do with encodings.
I claim that as long as we're using an encoding we might as well use the most accepted 8-bit encoding of Unicode as the default encoding.
yeah, and I claim that it won't fly, as long as it breaks the "strings are character sequences" rule used by all other contemporary (and competing) systems. (if you like, I can post more "fun with unicode" messages ;-)

and as I've mentioned before, there are (at least) two ways to solve this:

1. teach 8-bit strings about UTF-8 (this is how it's done in Tcl and Perl). make sure len(s) returns the number of characters in the string, make sure s[i] returns the i'th character (not necessarily starting at the i'th byte, and not necessarily one byte), etc. to make this run reasonably fast, use as many implementation tricks as you can come up with (I've described three ways to implement this in an earlier post).

2. define 8-bit strings as holding an 8-bit subset of unicode: ord(s[i]) is a unicode character code, whether s is an 8-bit string or a unicode string.

for alternative 1 to work, you need to add some way to explicitly work with binary strings (like it's done in Perl and Tcl). alternative 2 doesn't need that; 8-bit strings can still be used to hold any kind of binary data, as in 1.5.2. just keep in mind you cannot use all methods on such an object...
I also think that the issue is blown out of proportions: this ONLY happens when you use Unicode objects, and it ONLY matters when some other part of the program uses 8-bit string objects containing non-ASCII characters. Given the long tradition of using different encodings in 8-bit strings, at that point it is anybody's guess what encoding is used, and UTF-8 is a better guess than Latin-1.
I still think it's very unfortunate that you think that unicode strings are a special kind of strings. Perl and Tcl don't, so why should we? </F>
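[An editorial sketch of Fredrik's "alternative 2" above, using present-day Python for illustration: each byte of an 8-bit string is read as the Unicode character with the same code value, which is what the latin-1 codec does, so no encoding guess is involved; the byte values here are arbitrary:]

raw = b"\xe5\xe4\xf6"                    # three arbitrary 8-bit values
chars = raw.decode("latin-1")            # byte value == character code, no reinterpretation

assert [ord(c) for c in chars] == list(raw)      # ord(s[i]) is the i'th byte value
assert chars.encode("latin-1") == raw            # going back to bytes is lossless for codes < 256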
[Fredrik Lundh]
... (if you like, I can post more "fun with unicode" messages ;-)
By all means! Exposing a gotcha to ridicule does more good than a dozen abstract arguments. But next time stoop to explaining what it is that's surprising <wink>.
Sorry for the long message. Of course you need only respond to that which is interesting to you. I don't think that most of it is redundant. Guido van Rossum wrote:
...
OK, you've made your claim -- like Fredrik, you want to interpret 8-bit strings as Latin-1 when converting (not just comparing!) them to Unicode.
If the user provides an explicit conversion function (e.g. UTF-8-decode) then of course we should use that function. Under my character is a character is a character model, this "conversion" is morally equivalent to ROT-13, strupr or some other text->text translation. So you could apply UTF-8-decode even to a Unicode string as long as each character in the string has ord()<256 (so that it could be interpreted as a character representation for a byte).
I don't think I've heard a good *argument* for this rule though. "A character is a character is a character" sounds like an axiom to me -- something you can't prove or disprove rationally.
I don't see it as an axiom, but rather as a design decision you make to keep your language simple. Along the lines of "all values are objects" and (now) all integer values are representable with a single type.

Are you happy with this?

a="\244"
b=u"\244"
assert len(a)==len(b)
assert ord(a[0])==ord(b[0]) # same thing, right?
print b==a
# Traceback (most recent call last):
#   File "<stdin>", line 1, in ?
# UnicodeError: UTF-8 decoding error: unexpected code byte

If I type "\244" it means I want character 244, not the first half of a UTF-8 escape sequence. "\244" is a string with one character. It has no encoding. It is not latin-1. It is not UTF-8. It is a string with one character and should compare as equal with another string with the same character.

I would laugh my ass off if I was using Perl and it did something weird like this to me (as long as it didn't take a month to track down the bug!). Now it isn't so funny.
I have a bunch of good reasons (I think) for liking UTF-8:
I'm not against UTF-8. It could be an internal representation for some Unicode objects.
it allows you to convert between Unicode and 8-bit strings without losses,
Here's the heart of our disagreement:

******
I don't want, in Py3K, to think about "converting between Unicode and 8-bit strings." I want strings and I want byte-arrays and I want to worry about converting between *them*. There should be only one string type, its characters should all live in the Unicode character repertoire and the character numbers should all come from Unicode. "Special" characters can be assigned to the Unicode Private User Area. Byte arrays would be entirely separate and would be converted to Unicode strings with explicit conversion functions.
*****

In the meantime I'm just trying to get other people thinking in this mode so that the transition is easier. If I see people embedding UTF-8 escape sequences in literal strings today, I'm going to hit them. I recognize that we can't design the universe right now but we could agree on this direction and use it to guide our decision-making.

By the way, if we DID think of 8-bit strings as essentially "byte arrays" then let's use that terminology and imagine some future documentation:

"Python's string type is equivalent to a list of bytes. For clarity, we will call this type a byte list from now on. In contexts where a Unicode character-string is desired, Python automatically converts byte lists to character strings by doing a UTF-8 decode on them."

What would you think if Java had a default (I say "magical") conversion from byte arrays to character strings?

The only reason we are discussing this is because Python strings have a dual personality which was useful in the past but will (IMHO, of course) become increasingly confusing in the future. We want the best of both worlds without confusing anybody and I don't think that we can have it. If you want 8-bit strings to be really byte arrays in perpetuity then let's be consistent in that view. We can compare them to Unicode as we would two completely separate types. "U" comes after "S" so unicode strings always compare greater than 8-bit strings. The use of the word "string" for both objects can be considered just a historical accident.
Tcl uses it (so displaying Unicode in Tkinter *just* *works*...),
Don't follow this entirely. Shouldn't the next version of Tkinter accept and return Unicode strings? It would be rather ugly for two Unicode-aware systems (Python and Tk) to talk to each other in 8-bit strings. I mean I don't care what you do at the C level but at the Python level arguments should be "just strings."

Consider that len() on the Tkinter side would return a different value than on the Python side. What about integral indexes into buffers? I'm totally ignorant about Tkinter but let me ask: wouldn't Tkinter say (e.g.) that the cursor is between the 5th and 6th character when in an 8-bit string the equivalent index might be the 11th or 12th byte?
it is not Western-language-centric.
If you look at encoding efficiency it is.
Another reason: while you may claim that your (and /F's, and Just's) preferred solution doesn't enter into the encodings issue, I claim it does: Latin-1 is just as much an encoding as any other one.
The fact that my proposal has the same effect as making Latin-1 the "default encoding" is a near-term side effect of the definition of Unicode. My long term proposal is to do away with the concept of 8-bit strings (and thus, conversions from 8-bit to Unicode) altogether. One string to rule them all! Is Unicode going to be the canonical Py3K character set or will we have different objects for different character sets/encodings with different default (I say "magical") conversions between them. Such a design would not be entirely insane though it would be a PITA to implement and maintain. If we aren't ready to establish Unicode as the one true character set then we should probably make no special concessions for Unicode at all. Let a thousand string objects bloom! Even if we agreed to allow many string objects, byte==character should not be the default string object. Unicode should be the default.
I also think that the issue is blown out of proportions: this ONLY happens when you use Unicode objects, and it ONLY matters when some other part of the program uses 8-bit string objects containing non-ASCII characters.
Won't this be totally common? Most people are going to use 8-bit literals in their program text but work with Unicode data from XML parsers, COM, WebDAV, Tkinter, etc?
Given the long tradition of using different encodings in 8-bit strings, at that point it is anybody's guess what encoding is used, and UTF-8 is a better guess than Latin-1.
If we are guessing then we are doing something wrong. My answer to the question of "default encoding" falls out naturally from a certain way of looking at text, popularized in various other languages and increasingly "the norm" on the Web. If you accept the model (a character is a character is a character), the right behavior is obvious. "\244"==u"\244" Nobody is ever going to have trouble understanding how this works. Choose simplicity! -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html
Paul, we're both just saying the same thing over and over without convincing each other. I'll wait till someone who wasn't in this debate before chimes in. Have you tried using this? --Guido van Rossum (home page: http://www.python.org/~guido/)
On Mon, 1 May 2000, Guido van Rossum wrote:
Paul, we're both just saying the same thing over and over without convincing each other. I'll wait till someone who wasn't in this debate before chimes in.
Well, I'm guessing you had someone specific in mind (Neil?), but I want to say something too, as the only one here (I think) using ISO-8859-8 natively.

I much prefer the Fredrik-Paul position, known also as the character is a character position, to the UTF-8 as default encoding. Unicode is western-centered -- the first 256 characters are Latin 1. UTF-8 is even more horribly western-centered (or I should say USA centered) -- ASCII documents are the same. I'd much prefer Python to reflect a fundamental truth about Unicode, which at least makes sure binary-goop can pass through Unicode and remain unharmed, than to reflect a nasty problem with UTF-8 (not everything is legal).

If I'm using Hebrew characters in my source (which I won't for a long while), I'll use them in Unicode strings only, and make sure I use Unicode. If I'm reading Hebrew from an ISO-8859-8 file, I'll set a conversion to Unicode on the fly anyway, since most bidi libraries work on Unicode. So having UTF-8 conversions magically happen won't help me at all, and will only cause problems when I use "sort-for-uniqueness" on a list with mixed binary-goop and Unicode strings. In short, this sounds like a recipe for disaster.

internationally y'rs, Z.
--
Moshe Zadka <moshez@math.huji.ac.il>
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com
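[An editorial sketch of the explicit on-the-fly conversion Moshe describes, with a made-up file name; the ISO-8859-8 bytes are decoded at the I/O boundary and everything past that point is Unicode:]

import codecs

# "hebrew.txt" is a hypothetical file assumed to be ISO-8859-8 encoded.
f = codecs.open("hebrew.txt", "r", encoding="iso-8859-8")
text = f.read()                          # a Unicode string from here on
f.close()

print(len(text))                         # counts characters, not bytes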
Moshe Zadka wrote:
I'd much prefer Python to reflect a fundamental truth about Unicode, which at least makes sure binary-goop can pass through Unicode and remain unharmed, then to reflect a nasty problem with UTF-8 (not everything is legal).
Let's not do the same mistake again: Unicode objects should *not* be used to hold binary data. Please use buffers instead. BTW, I think that this behaviour should be changed:
>>> buffer('binary') + 'data'
'binarydata'
while:
>>> 'data' + buffer('binary')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation
IMHO, buffer objects should never coerce to strings, but instead return a buffer object holding the combined contents. The same applies to slicing buffer objects:
>>> buffer('binary')[2:5]
'nar'
should preferably be buffer('nar'). -- Hmm, perhaps we need something like a data string object to get this 100% right ?!
>>> d = data("...data...")
or
>>> d = d"...data..."

>>> print type(d)
<type 'data'>

>>> 'string' + d
d"string...data..."
>>> u'string' + d
d"s\000t\000r\000i\000n\000g\000...data..."

>>> d[:5]
d"...da"
etc. Ideally, string and Unicode objects would then be subclasses of this type in Py3K. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
[MAL]
Let's not do the same mistake again: Unicode objects should *not* be used to hold binary data. Please use buffers instead.
Easier said than done -- Python doesn't really have a buffer data type. Or do you mean the array module? It's not trivial to read a file into an array (although it's possible, there are even two ways). Fact is, most of Python's standard library and built-in objects use (8-bit) strings as buffers. I agree there's no reason to extend this to Unicode strings.
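[An editorial sketch of the two ways Guido alludes to for reading a file into an array; the file name is hypothetical, and 1.6-era Python spelled the first method fromstring() rather than frombytes():]

import array, os

fname = "data.bin"                       # hypothetical binary input file
n = os.path.getsize(fname)

a = array.array('b')
f = open(fname, "rb")
a.frombytes(f.read())                    # way one: read the bytes, then copy them into the array
f.close()

b = array.array('b')
f = open(fname, "rb")
b.fromfile(f, n)                         # way two: let the array read n items straight from the file
f.close()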
BTW, I think that this behaviour should be changed:
>>> buffer('binary') + 'data'
'binarydata'
while:
>>> 'data' + buffer('binary')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation
IMHO, buffer objects should never coerce to strings, but instead return a buffer object holding the combined contents. The same applies to slicing buffer objects:
>>> buffer('binary')[2:5]
'nar'
should preferably be buffer('nar').
Note that a buffer object doesn't hold data! It's only a pointer to data. I can't off-hand explain the asymmetry though.
--
Hmm, perhaps we need something like a data string object to get this 100% right ?!
>>> d = data("...data...")
or
>>> d = d"...data..."

>>> print type(d)
<type 'data'>

>>> 'string' + d
d"string...data..."
>>> u'string' + d
d"s\000t\000r\000i\000n\000g\000...data..."

>>> d[:5]
d"...da"
etc.
Ideally, string and Unicode objects would then be subclasses of this type in Py3K.
Not clear. I'd rather do the equivalent of byte arrays in Java, for which no "string literal" notations exist. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
[MAL]
Let's not do the same mistake again: Unicode objects should *not* be used to hold binary data. Please use buffers instead.
Easier said than done -- Python doesn't really have a buffer data type. Or do you mean the array module? It's not trivial to read a file into an array (although it's possible, there are even two ways). Fact is, most of Python's standard library and built-in objects use (8-bit) strings as buffers.
I agree there's no reason to extend this to Unicode strings.
BTW, I think that this behaviour should be changed:
>>> buffer('binary') + 'data'
'binarydata'
while:
>>> 'data' + buffer('binary')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation
IMHO, buffer objects should never coerce to strings, but instead return a buffer object holding the combined contents. The same applies to slicing buffer objects:
>>> buffer('binary')[2:5]
'nar'
should preferably be buffer('nar').
Note that a buffer object doesn't hold data! It's only a pointer to data. I can't off-hand explain the asymmetry though.
Dang, you're right...
--
Hmm, perhaps we need something like a data string object to get this 100% right ?!
>>> d = data("...data...")
or
>>> d = d"...data..."

>>> print type(d)
<type 'data'>

>>> 'string' + d
d"string...data..."
>>> u'string' + d
d"s\000t\000r\000i\000n\000g\000...data..."

>>> d[:5]
d"...da"
etc.
Ideally, string and Unicode objects would then be subclasses of this type in Py3K.
Not clear. I'd rather do the equivalent of byte arrays in Java, for which no "string literal" notations exist.
Anyway, one way or another I think we should make it clear to users that they should start using some other type for storing binary data. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
[ damn, I wish people would pay more attention to changing the subject line to reflect the contents of the email ... I could not figure out if there were any further responses to this without opening most of those dang "Unicode debate" emails. sheesh... ] On Tue, 2 May 2000, M.-A. Lemburg wrote:
Guido van Rossum wrote:
[MAL]
Let's not do the same mistake again: Unicode objects should *not* be used to hold binary data. Please use buffers instead.
Easier said than done -- Python doesn't really have a buffer data type.
The buffer object. We *do* have the type.
Or do you mean the array module? It's not trivial to read a file into an array (although it's possible, there are even two ways). Fact is, most of Python's standard library and built-in objects use (8-bit) strings as buffers.
For historical reasons only. It would be very easy to change these to use buffer objects, except for the simple fact that callers might expect a *string* rather than something with string-like behavior.
...
BTW, I think that this behaviour should be changed:
buffer('binary') + 'data'
'binarydata'
In several places, bufferobject.c uses PyString_FromStringAndSize(). It wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then copy the data in. A new API could also help out here:

    PyBuffer_CopyMemory(void *ptr, int size)
while:
'data' + buffer('binary')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation
The string object can't handle the buffer on the right side. Buffer objects use the buffer interface, so they can deal with strings on the right. Therefore: asymmetry :-(
IMHO, buffer objects should never coerce to strings, but instead return a buffer object holding the combined contents. The same applies to slicing buffer objects:
buffer('binary')[2:5]
'nar'
should preferably be buffer('nar').
Sure. Wouldn't be a problem. The FromStringAndSize() thing.
Note that a buffer object doesn't hold data! It's only a pointer to data. I can't off-hand explain the asymmetry though.
Dang, you're right...
Untrue. There is an API call which will construct a buffer object with its own memory: PyObject * PyBuffer_New(int size) The resulting buffer object will be read/write, and you can stuff values into it using the slice notation.
Hmm, perhaps we need something like a data string object to get this 100% right ?!
Nope. The buffer object is intended to be exactly this.
...
Not clear. I'd rather do the equivalent of byte arrays in Java, for which no "string literal" notations exist.
Anyway, one way or another I think we should make it clear to users that they should start using some other type for storing binary data.
Buffer objects. There are a couple changes to make this a bit easier for people:

1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to create a read/write buffer of a particular size. buffer() should create a zero-length read/write buffer.

2) if slice assignment is updated to allow changes to the length (for example: buf[1:2] = 'abcdefgh'), then the buffer object definition must change. Specifically: when the buffer object owns the memory, it does this by appending the memory after the PyObject_HEAD and setting its internal pointer to it; when the dealloc() occurs, the target memory goes with the object. A flag would need to be added to tell the buffer object to do a second free() for the case where a realloc has returned a new pointer.

[ I'm not sure that I would agree with this change, however; but it does make them a bit easier to work with; on the other hand, people have been working with immutable strings for a long time, so they're okay with concatenation, so I'm okay with saying length-altering operations must simply be done thru concatenation. ]

IMO, extensions should be using the buffer object for raw bytes. I know that Mark has been updating some of the Win32 extensions to do this. Python programs could use the objects if the buffer() builtin is tweaked to allow a bit more flexibility in the arguments.

Cheers, -g

-- Greg Stein, http://www.lyra.org/
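The buffer(size) form Greg proposes does not exist; as a point of reference, the closest thing to an owned, writable block of bytes in 1.6-era Python is probably the array module. A minimal sketch of that workaround (not of the proposed API):

    import array

    buf = array.array('c', '\0' * 1024)    # 1K of zeroed, writable bytes
    buf[0:5] = array.array('c', 'hello')   # in-place slice assignment
    print buf.tostring()[:5]               # 'hello'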
[Greg Stein]
IMO, extensions should be using the buffer object for raw bytes. I know that Mark has been updating some of the Win32 extensions to do this. Python programs could use the objects if the buffer() builtin is tweaked to allow a bit more flexibility in the arguments.
Forgive me for rewinding this to the very beginning. But what is a buffer object useful for? I'm trying to think about buffer objects in terms of jpython, so my primary interest is the user experience of buffer objects. Please correct my misunderstandings.

- There is not a buffer protocol exposed to python objects (in the way the sequence protocol __getitem__ & friends are exposed).
- A buffer object typically gives access to the raw bytes which underlie the backing object, regardless of the structure of the bytes.
- It is only intended for objects which have a natural byte storage to implement the buffer interface.
- Of the builtin objects, only string, unicode and array support the buffer interface.
- When slicing a buffer object, the result is always a string, regardless of the buffer object's base.

In jpython, only byte arrays like jarrays.array('b', [0,1,2]) can be said to have some natural byte storage. The jpython string type doesn't. It would take some awful bit shifting to present a jpython string as an array of bytes. Would it make any sense to have a buffer object which only accepts a byte array as base? So that jpython would say:
buffer("abc") Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: buffer object expected
Would it make sense to tell python users that they cannot depend on the portability of using strings (both 8bit and 16bit) as buffer object base? Because it is so difficult to look at java storage as a sequence of bytes, I think I'm all for keeping the buffer() builtin and buffer object as obscure and unknown as possible <wink>. regards, finn
[Finn Bock]
Forgive me for rewinding this to the very beginning. But what is a buffer object useful for? I'm trying to think about buffer objects in terms of jpython, so my primary interest is the user experience of buffer objects.
Please correct my misunderstandings.
- There is not a buffer protocol exposed to python objects (in the way the sequence protocol __getitem__ & friends are exposed).
- A buffer object typically gives access to the raw bytes which underlie the backing object, regardless of the structure of the bytes.
- It is only intended for objects which have a natural byte storage to implement the buffer interface.
All true.
- Of the builtin objects, only string, unicode and array support the buffer interface.
And the new mmap module.
- When slicing a buffer object, the result is always a string regardless of the buffer object base.
In jpython, only byte arrays like jarrays.array('b', [0,1,2]) can be said to have some natural byte storage. The jpython string type doesn't. It would take some awful bit shifting to present a jpython string as an array of bytes.
I don't recall why JPython has jarray instead of array -- how do they differ? I think it's a shame that similar functionality is embodied in different APIs.
Would it make any sense to have a buffer object which only accept a byte array as base? So that jpython would say:
buffer("abc") Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: buffer object expected
Would it make sense to tell python users that they cannot depend on the portability of using strings (both 8bit and 16bit) as buffer object base?
I think that the portability of many string properties is in danger with the Unicode proposal. Supporting this in the next version of JPython will be a bit tricky.
Because it is so difficult to look at java storage as a sequence of bytes, I think I'm all for keeping the buffer() builtin and buffer object as obscure and unknown as possible <wink>.
I basically agree, and in a private email to Greg Stein I've told him this. I think that the array module should be promoted to a built-in function/type, and should be the recommended solution for data storage. The buffer API should remain a C-level API, and the buffer() built-in should be labeled with "for experts only". --Guido van Rossum (home page: http://www.python.org/~guido/)
[Guido]
I don't recall why JPython has jarray instead of array -- how do they differ? I think it's a shame that similar functionality is embodied in different APIs.
The jarray module is a paper thin factory for the PyArray type which is primarily (I believe) a wrapper around any existing java array instance. It exists to make arrays returned from java code useful for jpython. Since a PyArray must always wrap the original java array, it cannot resize the array. In contrast an array instance would own the memory and can resize it as necessary. Due to the different purposes I agree with Jim's decision of making the two modules incompatible. And they are truly incompatible: jarray.array has reversed the (typecode, seq) arguments. OTOH creating a mostly compatible array module for jpython should not be too hard. regards, finn
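For readers who have not used both, here is a small sketch of the argument-order difference Finn describes; the array call assumes CPython, the commented-out jarray call assumes JPython, and neither runs on the other implementation:

    import array
    a = array.array('b', [0, 1, 2])        # CPython: typecode first, then the values

    # Under JPython the equivalent would be (sequence first, then typecode):
    # import jarray
    # j = jarray.array([0, 1, 2], 'b')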
I don't recall why JPython has jarray instead of array -- how do they differ? I think it's a shame that similar functionality is embodied in different APIs.
The jarray module is a paper thin factory for the PyArray type which is primarily (I believe) a wrapper around any existing java array instance. It exists to make arrays returned from java code useful for jpython. Since a PyArray must always wrap the original java array, it cannot resize the array.
Understood. This is a bit like the buffer API in CPython then (except for Greg's vision where the buffer object manages storage as well :-).
In contrast an array instance would own the memory and can resize it as necessary.
OK, this makes sense.
Due to the different purposes I agree with Jim's decision of making the two modules incompatible. And they are truly incompatible: jarray.array has reversed the (typecode, seq) arguments.
This I'm not so sure of. Why be different just to be different?
OTOH creating a mostly compatible array module for jpython should not be too hard.
OK, when we make array() a built-in, this should be done for Java too. --Guido van Rossum (home page: http://www.python.org/~guido/)
Greg Stein wrote:
[ damn, I wish people would pay more attention to changing the subject line to reflect the contents of the email ... I could not figure out if there were any further responses to this without opening most of those dang "Unicode debate" emails. sheesh... ]
On Tue, 2 May 2000, M.-A. Lemburg wrote:
Guido van Rossum wrote:
[MAL]
Let's not do the same mistake again: Unicode objects should *not* be used to hold binary data. Please use buffers instead.
Easier said than done -- Python doesn't really have a buffer data type.
The buffer object. We *do* have the type.
Or do you mean the array module? It's not trivial to read a file into an array (although it's possible, there are even two ways). Fact is, most of Python's standard library and built-in objects use (8-bit) strings as buffers.
For historical reasons only. It would be very easy to change these to use buffer objects, except for the simple fact that callers might expect a *string* rather than something with string-like behavior.
Would this be too drastic a change, then? I think that we should at least make use of buffers in the standard lib.
...
BTW, I think that this behaviour should be changed:
> buffer('binary') + 'data'
> 'binarydata'
In several places, bufferobject.c uses PyString_FromStringAndSize(). It wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then copy the data in. A new API could also help out here:
PyBuffer_CopyMemory(void *ptr, int size)
while:
> 'data' + buffer('binary')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: illegal argument type for built-in operation
The string object can't handle the buffer on the right side. Buffer objects use the buffer interface, so they can deal with strings on the right. Therefore: asymmetry :-(
IMHO, buffer objects should never coerce to strings, but instead return a buffer object holding the combined contents. The same applies to slicing buffer objects:
> buffer('binary')[2:5]
> 'nar'
should preferably be buffer('nar').
Sure. Wouldn't be a problem. The FromStringAndSize() thing.
Right. Before digging deeper into this, I think we should hear Guido's opinion on this again: he said that he wanted to use Java's binary arrays for binary data... perhaps we need to tweak the array type and make it more directly accessible (from C and Python) instead.
Note that a buffer object doesn't hold data! It's only a pointer to data. I can't off-hand explain the asymmetry though.
Dang, you're right...
Untrue. There is an API call which will construct a buffer object with its own memory:
PyObject * PyBuffer_New(int size)
The resulting buffer object will be read/write, and you can stuff values into it using the slice notation.
Yes, but that API is not reachable from within Python, AFAIK.
Hmm, perhaps we need something like a data string object to get this 100% right ?!
Nope. The buffer object is intended to be exactly this.
...
Not clear. I'd rather do the equivalent of byte arrays in Java, for which no "string literal" notations exist.
Anyway, one way or another I think we should make it clear to users that they should start using some other type for storing binary data.
Buffer objects. There are a couple changes to make this a bit easier for people:
1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to create a read/write buffer of a particular size. buffer() should create a zero-length read/write buffer.
This looks a lot like function overloading... I don't think we should get into this: how about having the buffer() API take keywords instead ?!

    buffer(size=1024, mode='rw')  - 1K of owned read write memory
    buffer(obj)                   - read-only referenced memory from obj
    buffer(obj, mode='rw')        - read-write referenced memory in obj

etc. Or we could allow passing None as object to obtain an owned read-write memory block (much like passing NULL to the C functions).
2) if slice assignment is updated to allow changes to the length (for example: buf[1:2] = 'abcdefgh'), then the buffer object definition must change. Specifically: when the buffer object owns the memory, it does this by appending the memory after the PyObject_HEAD and setting its internal pointer to it; when the dealloc() occurs, the target memory goes with the object. A flag would need to be added to tell the buffer object to do a second free() for the case where a realloc has returned a new pointer.

[ I'm not sure that I would agree with this change, however; but it does make them a bit easier to work with; on the other hand, people have been working with immutable strings for a long time, so they're okay with concatenation, so I'm okay with saying length-altering operations must simply be done thru concatenation. ]
I don't think I like this either: what happens when the buffer doesn't own the memory ?
IMO, extensions should be using the buffer object for raw bytes. I know that Mark has been updating some of the Win32 extensions to do this. Python programs could use the objects if the buffer() builtin is tweaked to allow a bit more flexibility in the arguments.
Right. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
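As a rough illustration of the keyword-style constructor MAL sketches above: neither the mode keyword nor buffer(size=...) exists in any release, so this stand-in fakes the owned read-write case with an array and falls back to the real read-only buffer() builtin otherwise.

    import array

    def makebuffer(obj=None, size=0, mode='r'):
        if obj is None:
            # owned read-write memory of the requested size
            return array.array('c', '\0' * size)
        if mode == 'r':
            return buffer(obj)             # read-only view on obj's bytes
        raise ValueError("read-write views on arbitrary objects not supported here")

    b = makebuffer(size=1024, mode='rw')   # 1K of owned, writable bytes
    r = makebuffer('binary')               # read-only reference to the string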
[Moshe Zadka]
... I'd much prefer Python to reflect a fundamental truth about Unicode, which at least makes sure binary-goop can pass through Unicode and remain unharmed, than to reflect a nasty problem with UTF-8 (not everything is legal).
Then you don't want Unicode at all, Moshe. All the official encoding schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of Unicode not yet having assigned a character to this position, it's that the standard explicitly makes this sequence illegal and guarantees it will always be illegal! the other place this comes up is with surrogates, where what's legal depends on both parts of a character pair; and, again, the illegalities here are guaranteed illegal for all time). UCS-4 is the closest thing to binary-transparent Unicode encodings get, but even there the length of a thing is constrained to be a multiple of 4 bytes. Unicode and binary goop will never coexist peacefully.
Tim Peters <tim_one@email.msn.com> wrote:
[Moshe Zadka]
... I'd much prefer Python to reflect a fundamental truth about Unicode, which at least makes sure binary-goop can pass through Unicode and remain unharmed, than to reflect a nasty problem with UTF-8 (not everything is legal).
Then you don't want Unicode at all, Moshe. All the official encoding schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of Unicode not yet having assigned a character to this position, it's that the standard explicitly makes this sequence illegal and guarantees it will always be illegal!
in context, I think what Moshe meant was that with a straight character code mapping, any 8-bit string can always be mapped to a unicode string and back again. given a byte array "b":

    u = unicode(b, "default")
    assert map(ord, u) == map(ord, b)

again, this is no different from casting an integer to a long integer and back again. (imagine having to do that on the bits and bytes level!). and again, the internal unicode encoding used by the unicode string type itself, or when serializing that string type, has nothing to do with that.

</F>
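A concrete version of the round-trip property Fredrik is describing, under 1.6/2.x semantics: Latin-1 maps every byte value to the code point with the same number and back, while UTF-8 rejects arbitrary binary data. (A sketch only; the encoding names are the standard codec names.)

    s = ''.join(map(chr, range(256)))      # every possible byte value

    u = unicode(s, 'latin-1')              # always succeeds
    assert map(ord, u) == map(ord, s)      # code points == byte values
    assert u.encode('latin-1') == s        # and back again, lossless

    try:
        unicode(s, 'utf-8')                # 0x80-0xFF on their own are not valid UTF-8
    except UnicodeError:
        print "UTF-8 cannot decode arbitrary binary goop"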
On Wed, 3 May 2000, Tim Peters wrote: [Moshe Zadka]
... I'd much prefer Python to reflect a fundamental truth about Unicode, which at least makes sure binary-goop can pass through Unicode and remain unharmed, than to reflect a nasty problem with UTF-8 (not everything is legal).
[Tim Peters]
Then you don't want Unicode at all, Moshe. All the official encoding schemes for Unicode 3.0 suffer illegal byte sequences
Of course I don't, and of course you're right. But what I do want is for my binary goop to pass unharmed through the evil Unicode forest. Which is why I don't want it to interpret my goop as a sequence of bytes it tries to decode, but I want the numeric values of my bytes to pass through to Unicode unharmed -- that means Latin-1, because of the second design decision of the horribly western-specific Unicode: the first 256 characters are the same as Latin-1. If it were up to me, I'd use Latin-3, but it wasn't, so it's not.
(for example, 0xffff is illegal in UTF-16 (whether BE or LE)
Tim, one of us must have cracked a chip. 0xffff is the same in BE and LE -- isn't it? -- Moshe Zadka <moshez@math.huji.ac.il> http://www.oreilly.com/news/prescod_0300.html http://www.linux.org.il -- we put the penguin in .com
Guido van Rossum wrote:
....
Have you tried using this?
Yes. I haven't had large problems with it. As long as you know what is going on, it doesn't usually hurt anything because you can just explicitly set up the decoding you want. It's like the int division problem. You get bitten a few times and then get careful. It's the naive user who will be surprised by these random UTF-8 decoding errors.

That's why this is NOT a convenience issue (are you listening MAL???). It's a short and long term simplicity issue. There are lots of languages where it is de rigueur to discover and work around inconvenient and confusing default behaviors. I just don't think that we should be ADDING such behaviors.

-- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html
It's the naive user who will be surprised by these random UTF-8 decoding errors.
That's why this is NOT a convenience issue (are you listening MAL???). It's a short and long term simplicity issue. There are lots of languages where it is de rigueur to discover and work around inconvenient and confusing default behaviors. I just don't think that we should be ADDING such behaviors.
So what do you think of my new proposal of using ASCII as the default "encoding"? It takes care of "a character is a character" but also (almost) guarantees an error message when mixing encoded 8-bit strings with Unicode strings without specifying an explicit conversion -- *any* 8-bit byte with the top bit set is rejected by the default conversion to Unicode.

I think this is less confusing than Latin-1: when an unsuspecting user is reading encoded text from a file into 8-bit strings and attempts to use it in a Unicode context, an error is raised instead of producing garbage Unicode characters. It encourages the use of Unicode strings for everything beyond ASCII -- there's no way around ASCII since that's the source encoding etc., but Latin-1 is an inconvenient default in most parts of the world.

ASCII is accepted everywhere as the base character set (e.g. for email and for text-based protocols like FTP and HTTP), just like English is the one natural language that we can all use to communicate (to some extent).

--Guido van Rossum (home page: http://www.python.org/~guido/)
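In practice the proposal would behave roughly like the sketch below. This illustrates the *proposed* ASCII default, not what 1.6a2 currently ships with; UnicodeError is a subclass of ValueError in current releases, so catching ValueError covers either spelling of the error.

    u = u"abc" + "def"                     # pure-ASCII bytes convert silently

    try:
        u"abc" + "caf\xe9"                 # Latin-1 e-acute: top bit set -> rejected
    except ValueError:
        # the fix is an explicit conversion with a named encoding
        print repr(u"abc" + unicode("caf\xe9", "latin-1"))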
Guido van Rossum wrote:
...
So what do you think of my new proposal of using ASCII as the default "encoding"?
I can live with it. I am mildly uncomfortable with the idea that I could write a whole bunch of software that works great until some European inserts one of their name characters. Nevertheless, being hard-assed is better than being permissive because we can loosen up later. What do we do about str( my_unicode_string )? Perhaps escape the Unicode characters with backslashed numbers? -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html
[me]
So what do you think of my new proposal of using ASCII as the default "encoding"?
[Paul]
I can live with it. I am mildly uncomfortable with the idea that I could write a whole bunch of software that works great until some European inserts one of their name characters.
Better that than having it produce gibberish when some Japanese insert *their* name characters.
Nevertheless, being hard-assed is better than being permissive because we can loosen up later.
Exactly -- just as nobody should *count* on 10**10 raising OverflowError, nobody (except maybe parts of the standard library :-) should *count* on unicode("\347") raising ValueError. I think that's fine.
What do we do about str( my_unicode_string )? Perhaps escape the Unicode characters with backslashed numbers?
Hm, good question. Tcl displays unknown characters as \x or \u escapes. I think this may make more sense than raising an error. But there must be a way to turn on Unicode-awareness on e.g. stdout and then printing a Unicode object should not use str() (as it currently does). --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
So what do you think of my new proposal of using ASCII as the default "encoding"?
How about using unicode-escape or raw-unicode-escape as default encoding ? (They would have to be adapted to disallow Latin-1 char input, though.) The advantage would be that they are compatible with ASCII while still providing loss-less conversion and since they use escape characters, you can even read them using an ASCII based editor. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
Guido van Rossum wrote:
So what do you think of my new proposal of using ASCII as the default "encoding"?
[MAL]
How about using unicode-escape or raw-unicode-escape as default encoding ? (They would have to be adapted to disallow Latin-1 char input, though.)
The advantage would be that they are compatible with ASCII while still providing loss-less conversion and since they use escape characters, you can even read them using an ASCII based editor.
No, the backslash should mean itself when encoding from ASCII to Unicode. --Guido van Rossum (home page: http://www.python.org/~guido/)
M.-A. Lemburg wrote:
Guido van Rossum wrote:
So what do you think of my new proposal of using ASCII as the default "encoding"?
How about using unicode-escape or raw-unicode-escape as default encoding ? (They would have to be adapted to disallow Latin-1 char input, though.)
The advantage would be that they are compatible with ASCII while still providing loss-less conversion and since they use escape characters, you can even read them using an ASCII based editor.
umm. if you disallow latin-1 characters, how can you call this one loss-less? looks like political correctness taken to an entirely new level... </F>
Fredrik Lundh wrote:
M.-A. Lemburg wrote:
Guido van Rossum wrote:
So what do you think of my new proposal of using ASCII as the default "encoding"?
How about using unicode-escape or raw-unicode-escape as default encoding ? (They would have to be adapted to disallow Latin-1 char input, though.)
The advantage would be that they are compatible with ASCII while still providing loss-less conversion and since they use escape characters, you can even read them using an ASCII based editor.
umm. if you disallow latin-1 characters, how can you call this one loss-less?
[Guido didn't like this one, so it's probably moot investing any more time on this...] I meant that the unicode-escape codec should only take ASCII characters as input and disallow non-escaped Latin-1 characters. Anyway, I'm out of this discussion... I'll wait a week or so until things have been sorted out. Have fun, -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
Guido van Rossum <guido@python.org> wrote:
What do we do about str( my_unicode_string )? Perhaps escape the Unicode characters with backslashed numbers?
Hm, good question. Tcl displays unknown characters as \x or \u escapes. I think this may make more sense than raising an error.
but that's on the display side of things, right? similar to repr, in other words.
But there must be a way to turn on Unicode-awareness on e.g. stdout and then printing a Unicode object should not use str() (as it currently does).
to throw some extra gasoline on this, how about allowing str() to return unicode strings? (extra questions: how about renaming "unicode" to "string", and getting rid of "unichr"?) count to ten before replying, please. </F>
On Wed, 3 May 2000, Fredrik Lundh wrote:
Guido van Rossum <guido@python.org> wrote:
But there must be a way to turn on Unicode-awareness on e.g. stdout and then printing a Unicode object should not use str() (as it currently does).
to throw some extra gasoline on this, how about allowing str() to return unicode strings?
You still need to *print* them somehow. One way or another, stdout is still just a stream with bytes on it, unless we augment file objects to understand encodings.

stdout sends bytes to something -- and that something will interpret the stream of bytes in some encoding (could be Latin-1, UTF-8, ISO-2022-JP, whatever). So either:

1. You explicitly downconvert to bytes, and specify the encoding each time you do. Then write the bytes to stdout (or your file object).

2. The file object is smart and can be told what encoding to use, and Unicode strings written to the file are automatically converted to bytes.

Another thread mentioned having separate read/write and binary_read/binary_write methods on files. I suggest doing it the other way, actually: since read/write operate on byte streams now, *they* are the binary operations; the new methods should be the ones that do the extra encoding/decoding work, and could be called uniread/uniwrite, uread/uwrite, textread/textwrite, etc.
(extra questions: how about renaming "unicode" to "string", and getting rid of "unichr"?)
Would you expect chr(x) to return an 8-bit string when x < 128, and a Unicode string when x >= 128? -- ?!ng
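A rough sketch of what Ping's option 2 could look like as a user-level wrapper; the EncodedStream class and its name are invented here for illustration and nothing like it ships with 1.6:

    import sys

    class EncodedStream:
        def __init__(self, stream, encoding):
            self.stream = stream
            self.encoding = encoding
        def write(self, s):
            if type(s) is type(u""):       # Unicode: encode explicitly
                s = s.encode(self.encoding)
            self.stream.write(s)           # 8-bit strings pass through untouched

    out = EncodedStream(sys.stdout, "utf-8")
    out.write(u"\u20ac and plain ASCII both come out as bytes\n")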
Ka-Ping Yee <ping@lfw.org> wrote:
to throw some extra gasoline on this, how about allowing str() to return unicode strings?
You still need to *print* them somehow. One way or another, stdout is still just a stream with bytes on it, unless we augment file objects to understand encodings.
stdout sends bytes to something -- and that something will interpret the stream of bytes in some encoding (could be Latin-1, UTF-8, ISO-2022-JP, whatever). So either:
1. You explicitly downconvert to bytes, and specify the encoding each time you do. Then write the bytes to stdout (or your file object).
2. The file object is smart and can be told what encoding to use, and Unicode strings written to the file are automatically converted to bytes.
which one's more convenient? (no, I won't tell you what I prefer. guido doesn't want more arguments from the old "characters are characters" proponents, so I gotta trick someone else to spell them out ;-)
(extra questions: how about renaming "unicode" to "string", and getting rid of "unichr"?)
Would you expect chr(x) to return an 8-bit string when x < 128, and a Unicode string when x >= 128?
that will break too much existing code, I think. but what about replacing 128 with 256? </F>
[Ping]
stdout sends bytes to something -- and that something will interpret the stream of bytes in some encoding (could be Latin-1, UTF-8, ISO-2022-JP, whatever). So either:
1. You explicitly downconvert to bytes, and specify the encoding each time you do. Then write the bytes to stdout (or your file object).
2. The file object is smart and can be told what encoding to use, and Unicode strings written to the file are automatically converted to bytes.
[Fredrik]
which one's more convenient?
Marc-Andre's codec module contains file-like objects that support this (or could easily be made to). However the problem is that print *always* first converts the object using str(), and str() enforces that the result is an 8-bit string. I'm afraid that loosening this will break too much code. (This all really happens at the C level.)

I'm also afraid that this means that str(unicode) may have to be defined to yield UTF-8. My argument goes as follows:

1. We want to be able to set things up so that print u"..." does the right thing. (What "the right thing" is, is not defined here, as long as the user sees the glyphs implied by u"...".)

2. print u is equivalent to sys.stdout.write(str(u)).

3. str() must always return an 8-bit string.

4. So the solution must involve assigning an object to sys.stdout that does the right thing given an 8-bit encoding of u.

5. So we need str(u) to produce a lossless 8-bit encoding of Unicode.

6. UTF-8 is the only sensible candidate.

Note that (apart from print) str() is never implicitly invoked -- all implicit conversions when Unicode and 8-bit strings are combined go from 8-bit to Unicode.

(There might be an alternative, but it would depend on having yet another hook (similar to Ping's sys.display) that gets invoked when printing an object (as opposed to displaying it at the interactive prompt). I'm not too keen on this because it would break code that temporarily sets sys.stdout to a file of its own choosing and then invokes print -- a common idiom to capture printed output in a string, for example, which could be embedded deep inside a module. If the main program were to install a naive print hook that always sent output to a designated place, this strategy might fail.)
(extra questions: how about renaming "unicode" to "string", and getting rid of "unichr"?)
Would you expect chr(x) to return an 8-bit string when x < 128, and a Unicode string when x >= 128?
that will break too much existing code, I think. but what about replacing 128 with 256?
If the 8-bit Unicode proposal were accepted, this would make sense. In my "only ASCII is implicitly convertible" proposal, this would be a mistake, because chr(128) == "\x80" != u"\x80" == unichr(128).

I agree with everyone that things would be much simpler if we had separate data types for byte arrays and 8-bit character strings. But we don't have this distinction yet, and I don't see a quick way to add it in 1.6 without seriously upsetting the release schedule. So all of my proposals are to be considered hacks to maintain as much b/w compatibility as possible while still supporting some form of Unicode.

The fact that half the time 8-bit strings are really being used as byte arrays, while Python can't tell the difference, means (to me) that the default encoding is an important thing to argue about. I don't know if I want to push it out all the way to Py3k, but I just don't see a way to implement "a character is a character" in 1.6 given all the current constraints. (BTW I promise that 1.7 will be speedy once 1.6 is out of the door -- there's a lot else that was put off to 1.7.)

Fredrik, I believe I haven't seen your response to my ASCII proposal. Is it just as bad as UTF-8 to you, or could you live with it? On a scale of 0-9 (0: UTF-8, 9: 8-bit Unicode), where is ASCII for you?

Where's my sre snapshot?

--Guido van Rossum (home page: http://www.python.org/~guido/)
On Wed, 3 May 2000, Guido van Rossum wrote:
(There might be an alternative, but it would depend on having yet another hook (similar to Ping's sys.display) that gets invoked when printing an object (as opposed to displaying it at the interactive prompt). I'm not too keen on this because it would break code that temporarily sets sys.stdout to a file of its own choosing and then invokes print -- a common idiom to capture printed output in a string, for example, which could be embedded deep inside a module. If the main program were to install a naive print hook that always sent output to a designated place, this strategy might fail.)
I know this is not a small change, but i'm pretty convinced the right answer here is that the print hook should call a *method* on sys.stdout, whatever sys.stdout happens to be. The details are described in the other long message i wrote ("Printing objects on files"). Here is an addendum that might actually make that proposal feasible enough (compatibility-wise) to fly in the short term:

    print x

does, conceptually:

    try:
        sys.stdout.printout(x)
    except AttributeError:
        sys.stdout.write(str(x))
    sys.stdout.write("\n")

The rest can then be added, and the change in 'print x' will work nicely for any file objects, but will not break on file-like substitutes that don't define a 'printout' method.

Any reactions to the other benefit of this proposal -- namely, the ability to control the printing parameters of object components as they're being traversed for printing? That was actually the original motivation for doing the file.printout thing: it gives you some of the effect of "passing down str-ness" that we were discussing so heatedly a little while ago.

The other thing that just might justify this much of a change is that, as you reasoned clearly in your other message, without adequate resolution to the printing problem we may have painted ourselves into a corner with regard to str(u"") conversion, and i don't like the look of that corner much. *Even* if we were to get people to agree that it's okay for str(u"") to produce UTF-8, it still seems pretty hackish to me that we're forced to choose this encoding as a way of working around that fact that we can't simply give the file the thing we want to print.

-- ?!ng
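To make the 'printout' idea concrete, here is a toy sketch of a file-like object that implements the proposed method, plus a function that does what the amended 'print x' would do; the method name comes from Ping's pseudo-code above, everything else is invented for illustration:

    import sys

    class SmartFile:
        def __init__(self, stream, encoding="utf-8"):
            self.stream = stream
            self.encoding = encoding
        def write(self, s):
            self.stream.write(s)
        def printout(self, x):             # the stream decides how x is rendered
            if type(x) is type(u""):
                self.stream.write(x.encode(self.encoding))
            else:
                self.stream.write(str(x))

    def print_(x, out=None):               # what 'print x' would do, conceptually
        out = out or sys.stdout
        try:
            out.printout(x)
        except AttributeError:             # plain files: fall back to old behaviour
            out.write(str(x))
        out.write("\n")

    print_(u"\u20ac", SmartFile(sys.stdout))   # goes through printout()
    print_(42)                                 # ordinary file object, falls back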
Ka-Ping Yee <ping@lfw.org> wrote:
I know this is not a small change, but i'm pretty convinced the right answer here is that the print hook should call a *method* on sys.stdout, whatever sys.stdout happens to be. The details are described in the other long message i wrote ("Printing objects on files").
Here is an addendum that might actually make that proposal feasible enough (compatibility-wise) to fly in the short term:
print x
does, conceptually:
try:
    sys.stdout.printout(x)
except AttributeError:
    sys.stdout.write(str(x))
sys.stdout.write("\n")
The rest can then be added, and the change in 'print x' will work nicely for any file objects, but will not break on file-like substitutes that don't define a 'printout' method.
another approach is (simplified):

    try:
        sys.stdout.write(x.encode(sys.stdout.encoding))
    except AttributeError:
        sys.stdout.write(str(x))

or, if str is changed to return any kind of string:

    x = str(x)
    try:
        x = x.encode(sys.stdout.encoding)
    except AttributeError:
        pass
    sys.stdout.write(x)

</F>
On Thu, 4 May 2000, Fredrik Lundh wrote:
another approach is (simplified):
try:
    sys.stdout.write(x.encode(sys.stdout.encoding))
except AttributeError:
    sys.stdout.write(str(x))
Indeed, that would work to solve just this specific Unicode issue -- but there is a lot of flexibility and power to be gained from the general solution of putting a method on the stream object, as the example with the formatted list items showed. I think it is a good idea, for instance, to leave decisions about how to print Unicode up to the Unicode object, and not hardcode bits of it into print. Guido, have you digested my earlier 'printout' suggestions? -- ?!ng "Old code doesn't die -- it just smells that way." -- Bill Frantz
Guido, have you digested my earlier 'printout' suggestions?
Not quite, except to the point that they require more thought than to rush them into 1.6. --Guido van Rossum (home page: http://www.python.org/~guido/)
[Ka-Ping Yee]
Would you expect chr(x) to return an 8-bit string when x < 128, and a Unicode string when x >= 128?
[Fredrik Lundh]
that will break too much existing code, I think. but what about replacing 128 with 256?
Hihi... and *poof* -- we're back to Latin-1 for narrow strings ;-) Just
Fredrik Lundh wrote:
Guido van Rossum <guido@python.org> wrote:
What do we do about str( my_unicode_string )? Perhaps escape the Unicode characters with backslashed numbers?
Hm, good question. Tcl displays unknown characters as \x or \u escapes. I think this may make more sense than raising an error.
but that's on the display side of things, right? similar to repr, in other words.
But there must be a way to turn on Unicode-awareness on e.g. stdout and then printing a Unicode object should not use str() (as it currently does).
to throw some extra gasoline on this, how about allowing str() to return unicode strings?
(extra questions: how about renaming "unicode" to "string", and getting rid of "unichr"?)
count to ten before replying, please.
1 2 3 4 5 6 7 8 9 10 ... ok ;-)

Guido's problem with printing Unicode can easily be solved using the standard codecs.StreamRecoder class as I've done in the example I posted some days ago.

Basically, what the stdout wrapper would do is take strings as input, converting them to Unicode and then writing them encoded to the original stdout. For Unicode objects the conversion can be skipped and the encoded output written directly to stdout. This can be done for any encoding supported by Python; e.g. you could do the indirection in site.py and then have Unicode printed as Latin-1 or UTF-8 or one of the many code pages supported through the mapping codec.

About having str() return Unicode objects: I see str() as constructor for string objects and under that assumption str() will always have to return string objects. unicode() does the same for Unicode objects, so renaming it to something else doesn't really help all that much. BTW, __str__() has to return strings too. Perhaps we need __unicode__() and a corresponding slot function too ?!

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
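A minimal sketch of the stdout wrapping MAL describes, using the codec registry's StreamWriter rather than StreamRecoder itself; codecs.lookup() returns an (encoder, decoder, StreamReader, StreamWriter) tuple:

    import sys, codecs

    Writer = codecs.lookup("utf-8")[3]     # the StreamWriter factory for UTF-8
    wrapped = Writer(sys.stdout)           # Unicode written here is encoded first

    wrapped.write(u"\u20ac10, says the wrapped stream\n")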
Paul Prescod <paul@prescod.net> wrote:
I would laugh my ass off if I was using Perl and it did something weird like this to me.
you don't have to -- in Perl 5.6, a character is a character... does anyone on this list follow the perl-porters list? was this as controversial over in Perl land as it appears to be over here? </F>
Paul Prescod writes:
The fact that my proposal has the same effect as making Latin-1 the "default encoding" is a near-term side effect of the definition of Unicode. My long term proposal is to do away with the concept of 8-bit strings (and thus, conversions from 8-bit to Unicode) altogether. One string to rule them all!

Why must this be a long term proposal? I would find it quite attractive if:

- the old string type became an immutable list of bytes
- automatic conversion between byte lists and unicode strings were performed via user customizable conversion functions (a la __import__).

Dieter
At 11:01 AM -0400 27-04-2000, Guido van Rossum wrote:
Where does the current approach require work?
- We need a way to indicate the encoding of Python source code. (Probably a "magic comment".)
How will other parts of a program know which encoding was used for non-unicode string literals? It seems to me that an encoding attribute for 8-bit strings solves this nicely. The attribute should only be set automatically if the encoding of the source file was specified or when the string has been encoded from a unicode string. The attribute should *only* be used when converting to unicode. (Hm, it could even be used when calling unicode() without the encoding argument.) It should *not* be used when comparing (or adding, etc.) 8-bit strings to each other, since they still may contain binary goop, even in a source file with a specified encoding!
- We need a way to indicate the encoding of input and output data files, and we need shortcuts to set the encoding of stdin, stdout and stderr (and maybe all files opened without an explicit encoding).
Can you open a file *with* an explicit encoding? Just
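Just's encoding-attribute idea could be prototyped along the following lines; the EncodedString class, its name and its behaviour are invented here purely to illustrate the proposal (byte operations ignore the attribute, only the conversion to Unicode consults it):

    class EncodedString:
        def __init__(self, data, encoding=None):
            self.data = data               # raw bytes
            self.encoding = encoding       # None means "unknown / binary goop"
        def __add__(self, other):          # byte ops ignore the encoding entirely
            if isinstance(other, EncodedString):
                other = other.data
            return EncodedString(self.data + other, self.encoding)
        def tounicode(self):               # the one place the attribute matters
            if self.encoding is None:
                raise ValueError("no known encoding for this string")
            return unicode(self.data, self.encoding)

    s = EncodedString("caf\xe9", "latin-1")    # literal from a Latin-1 source file
    print repr(s.tounicode())                  # u'caf\xe9'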
Just van Rossum writes:
How will other parts of a program know which encoding was used for non-unicode string literals?
This is the exact reason that Unicode should be used for all string literals: from a language design perspective I don't understand the rationale for providing "traditional" and "unicode" strings.
It seems to me that an encoding attribute for 8-bit strings solves this nicely. The attribute should only be set automatically if the encoding of the source file was specified or when the string has been encoded from a unicode string. The attribute should *only* be used when converting to unicode. (Hm, it could even be used when calling unicode() without the encoding argument.) It should *not* be used when comparing (or adding, etc.) 8-bit strings to each other, since they still may contain binary goop, even in a source file with a specified encoding!
In Dylan there is an explicit split between 'characters' (which are always Unicode) and 'bytes'. What are the compelling reasons not to use UTF-8 as the (source) document encoding? In the past the usual response is, "the tools aren't there for authoring UTF-8 documents". This argument becomes more specious as more OS's move towards Unicode. I firmly believe this can be done without Java's bloat.

One off-the-cuff solution is this: All character strings are Unicode (utf-8 encoding). Language terminals and operators are restricted to US-ASCII, which is encoded identically in UTF-8. The contents of comments are not interpreted in any way.
- We need a way to indicate the encoding of input and output data files, and we need shortcuts to set the encoding of stdin, stdout and stderr (and maybe all files opened without an explicit encoding).
Can you open a file *with* an explicit encoding?
If you cannot, you lose. You absolutely must be able to specify the encoding of a file when opening it, so that the runtime can transcode into the native encoding as you read it. This should be otherwise transparent to the user.

-tree

-- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"
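For what it's worth, the codecs module already provides exactly this kind of open(); a short sketch follows (the file names are made up, and the example assumes the text actually fits in the target encoding):

    import codecs

    f = codecs.open("data.txt", "r", "utf-8")      # transcodes on the way in
    text = f.read()                                # comes back as a Unicode string
    f.close()

    out = codecs.open("copy.txt", "w", "latin-1")  # transcodes on the way out
    out.write(text)
    out.close()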
This is the exact reason that Unicode should be used for all string literals: from a language design perspective I don't understand the rationale for providing "traditional" and "unicode" string.
In Python 3000, you would have a point. In current Python, there simply are too many programs and extensions written in other languages that manipulate 8-bit strings to ignore their existence. We're trying to add Unicode support to Python 1.6 without breaking code that used to run under Python 1.5.x; practicalities just make it impossible to go with Unicode for everything.

I think that if Python didn't have so many extension modules (many maintained by 3rd parties) it would be a lot easier to switch to Unicode for all strings (I think JavaScript has done this). In Python 3000, we'll have to seriously consider having separate character string and byte array objects, along the lines of Java's model.

Note that I say "seriously consider." We'll first have to see how well the current solution works *in practice*. There's time before we fix Py3k in stone. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)
[GvR]
- We need a way to indicate the encoding of Python source code. (Probably a "magic comment".)
[JvR]
How will other parts of a program know which encoding was used for non-unicode string literals?
It seems to me that an encoding attribute for 8-bit strings solves this nicely. The attribute should only be set automatically if the encoding of the source file was specified or when the string has been encoded from a unicode string. The attribute should *only* be used when converting to unicode. (Hm, it could even be used when calling unicode() without the encoding argument.) It should *not* be used when comparing (or adding, etc.) 8-bit strings to each other, since they still may contain binary goop, even in a source file with a specified encoding!
Marc-Andre took this idea a bit further, but I think it's not practical given the current implementation: there are too many places where the C code would have to be changed in order to propagate the string encoding information, and there are too many sources of strings with unknown encodings to make it very useful. Plus, it would slow down 8-bit string ops.

I have a better idea: rather than carrying around 8-bit strings with an encoding, use Unicode literals in your source code. If the source encoding is known, these will be converted using the appropriate codec.

If you object to having to write u"..." all the time, we could say that "..." is a Unicode literal if it contains any characters with the top bit on (of course the source file encoding would be used just like for u"..."). But I think this should be enabled by a separate pragma -- people who want to write Unicode-unaware code manipulating 8-bit strings in their favorite encoding (e.g. shift-JIS or Latin-1) should not silently get Unicode strings.

(I thought about an option to make *all strings* (not just literals) Unicode, but the current implementation would require too much hacking. This is what JPython does, and maybe it should be what Python 3000 does; I don't see it as a realistic option for the 1.x series.)

--Guido van Rossum (home page: http://www.python.org/~guido/)
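What "use Unicode literals and convert explicitly at the boundaries" looks like in practice -- a small sketch with a made-up file name, spelling the non-ASCII character as an escape so the source file itself stays pure ASCII:

    text = u"Mot\u00f6rhead"                # Unicode literal, no source encoding needed

    data = text.encode("utf-8")             # explicit encode on the way out
    open("band.txt", "wb").write(data)

    raw = open("band.txt", "rb").read()     # raw bytes back in
    assert unicode(raw, "utf-8") == text    # explicit decode, no default involved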
[GvR, on string.encoding ]
Marc-Andre took this idea a bit further, but I think it's not practical given the current implementation: there are too many places where the C code would have to be changed in order to propagate the string encoding information,
I may miss something, but the encoding attr just travels with the string object, no? Like I said in my reply to MAL, I think it's undesirable to do *anything* with the encoding attr if not in combination with a unicode string.
and there are too many sources of strings with unknown encodings to make it very useful.
That's why the default encoding must be settable as well, as Fredrik suggested.
Plus, it would slow down 8-bit string ops.
Not if you ignore it most of the time, and just pass it along when concatenating.
I have a better idea: rather than carrying around 8-bit strings with an encoding, use Unicode literals in your source code.
Explain that to newbies... My guess is that they will want simple 8-bit strings in their native encoding. Dunno.
If the source encoding is known, these will be converted using the appropriate codec.
If you object to having to write u"..." all the time, we could say that "..." is a Unicode literal if it contains any characters with the top bit on (of course the source file encoding would be used just like for u"...").
Only if "\377" would still yield an 8-bit string, for binary goop... Just
[GvR, on string.encoding ]
Marc-Andre took this idea a bit further, but I think it's not practical given the current implementation: there are too many places where the C code would have to be changed in order to propagate the string encoding information,
[JvR]
I may miss something, but the encoding attr just travels with the string object, no? Like I said in my reply to MAL, I think it's undesirable to do *anything* with the encoding attr if not in combination with a unicode string.
But just propagating affects every string op -- s+s, s*n, s[i], s[:], s.strip(), s.split(), s.lower(), ...
and there are too many sources of strings with unknown encodings to make it very useful.
That's why the default encoding must be settable as well, as Fredrik suggested.
I'm open for debate about this. There's just something about a changeable global default encoding that worries me -- like any global property, it requires conventions and defensive programming to make things work in larger programs. For example, a module that deals with Latin-1 strings can't just set the default encoding to Latin-1: it might be imported by a program that needs it to be UTF-8.

This model is currently used by the locale in C, where all locale properties are global, and it doesn't work well. For example, Python needs to go through a lot of hoops so that Python numeric literals use "." for the decimal indicator even if the user's locale specifies "," -- we can't change Python to swap the meaning of "." and "," in all contexts.

So I think that a changeable default encoding is of limited value. That's different from being able to set the *source file* encoding -- this only affects Unicode string literals.
Plus, it would slow down 8-bit string ops.
Not if you ignore it most of the time, and just pass it along when concatenating.
And slicing, and indexing, and...
I have a better idea: rather than carrying around 8-bit strings with an encoding, use Unicode literals in your source code.
Explain that to newbies... My guess is that they will want simple 8-bit strings in their native encoding. Dunno.
If they are happy with their native 8-bit encoding, there's no need for them to ever use Unicode objects in their program, so they should be fine. 8-bit strings aren't ever interpreted or encoded except when mixed with Unicode objects.
If the source encoding is known, these will be converted using the appropriate codec.
If you object to having to write u"..." all the time, we could say that "..." is a Unicode literal if it contains any characters with the top bit on (of course the source file encoding would be used just like for u"...").
Only if "\377" would still yield an 8-bit string, for binary goop...
Correct. --Guido van Rossum (home page: http://www.python.org/~guido/)
On Thu, 27 Apr 2000, Just van Rossum wrote:
At 10:27 PM -0400 26-04-2000, Tim Peters wrote:
Indeed, if someone from an inferior culture wants to chime in, let them find Python-Dev with their own beady little eyes <wink>.
All irony aside, I think you've nailed one of the problems spot on: - most core Python developers seem to be too busy to read *anything* at all in c.l.py
Datapoint: I stopped reading c.l.py almost two years ago. For a while, I would pop up a newsreader every month or so and skim what kinds of things were happening. That stopped at least a year or so ago. I get a couple hundred messages a day. Another 100+ from c.l.py would be way too much. Cheers, -g -- Greg Stein, http://www.lyra.org/
participants (16)

- Andrew Kuchling
- bckfnn at worldonline.dk
- Christopher Petrilli
- Dieter Maurer
- Fredrik Lundh
- Greg Stein
- Guido van Rossum
- Just van Rossum
- Ka-Ping Yee
- M.-A. Lemburg
- Mark Hammond
- Moshe Zadka
- Paul Gresham
- Paul Prescod
- Tim Peters
- Tom Emerson