Re: [Python-Dev] Allowing u.encode() to return non-strings

Tim, I'm not sure this needs to be on the list. My major point, I guess, is that the byte vectors we tend to call strings in Python have no string-ness, as understood in the 21st century. There is no character set associated with them, which means that there is effectively no way to look at the "next character" in a string (you don't know how long a character is), no way to count the number of characters, etc. The documentation, particularly the language manual, is extremely confusing on this point, in classifying "string" and "Unicode" objects as the same sort of thing, and then not documenting them clearly. "struct.pack", for instance, doesn't really return a string -- it returns a byte vector.

Unicode is really the only kind of *string* type that's supported, which is problematic, as it's not integrated with the file streams support. For instance, how do I write a function that opens a file containing text in some multi-byte format (which, we'll assume, I know the name of -- perhaps from a content-type field), and reads the first three characters of the text? Can't. That's because the "file" constructor doesn't take an encoding, and "read" and "readline" don't return Unicode objects. I could try reading some bytes, then using unicode() to turn them into a string, then seeing how many characters I'd read, but that's pretty imprecise. I go round and round the "codecs" module thinking that someone must have thought of this -- or maybe there's an optional argument to file() that makes it return real (Unicode) strings -- but no luck. I find it hard to believe that I've dreamed up something that neither you nor (especially) Martin has thought of till now.

But consider this idea. Any file that is not explicitly opened as binary (with the 'b' flag -- and, by the way, why isn't the 'b' flag the default for file opening? It would save a lot of grief dealing with Windows) should be considered a text file, and it should have an associated "encoding" attribute (as file objects already do), which would also be a keyword parameter to the constructor. The default would be sys.getdefaultencoding(). The "size" parameter to the methods "read" and "readline" should refer to characters, not bytes, for text files. The return values from "next", "read" and "readline" would be Unicode objects for text files. Similarly, the methods "write" and "writelines" should, for text files, take Unicode objects and raise an exception if fed a "byte vector".

I'd go further. I'd introduce the notation

    v = b"abc"

which means that "v" has assigned to it an 8-bit "string" byte vector. Then, after a release or two, I'd make plain old "foo" mean what u"foo" means today, so that string literals are by default Unicode (modulo PEP 263).

Bill
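For concreteness, the imprecise workaround alluded to above looks something like this -- a sketch, assuming a UTF-8 file and Python 2-era unicode():

    raw = open("text.txt", "rb").read(12)   # guess how many bytes cover 3 chars
    try:
        text = unicode(raw, "utf-8")
    except UnicodeDecodeError:
        # the byte-count guess may have split a multi-byte character in two
        text = unicode(raw, "utf-8", "ignore")
    first_three = text[:3]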

Bill Janssen wrote:
Unicode is really the only kind of *string* type that's supported, which is problematic, as it's not integrated with the file streams support. For instance, how do I write a function that opens a file containing text in some multi-byte format (which, we'll assume, I know the name of -- perhaps from a content-type field), and reads the first three characters of the text? Can't.
That's really not true. To process such a file, you do

    f = codecs.open(filename, "r", encoding="big5")
    data = f.read()
    first_three = data[:3]
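Wrapped up as the function Bill asked for, that becomes something like the following -- a minimal sketch, with the file name and encoding assumed to come from the caller (say, a content-type field):

    import codecs

    def first_chars(filename, encoding, count=3):
        # codecs.open() returns a StreamReader-wrapped file, so read()
        # hands back a Unicode object rather than a byte string.
        f = codecs.open(filename, "r", encoding=encoding)
        try:
            data = f.read()
        finally:
            f.close()
        return data[:count]    # slicing counts characters, not bytes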
Any file that is not explicitly opened as binary (with the 'b' flag (and, by the way, why isn't the 'b' flag the default for file opening?
Because it isn't in C.
I'd go further. I'd introduce the notation
v = b"abc"
Yes, introduction of byte string literals, and changing standard string literals, has been proposed before. There is the -U option for the interpreter that changes all literals to Unicode literals. Unfortunately, a lot of code breaks under this change, so such breakage needs to be fixed before the change can happen. Regards, Martin
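The effect of -U is easy to see from the type of a plain literal -- for example (assuming a Python 2.x interpreter):

    # save as check_literals.py, then compare:
    #   python check_literals.py       prints <type 'str'>
    #   python -U check_literals.py    prints <type 'unicode'>
    print type("abc")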

Tim, Thanks for pointing this out. I find it very hard to know that from any of the documentation, unless of course you already know it, in which case you don't need the documentation :-). In particular, I'd suggest adding some text to the documentation on codecs.open() which points out that read and readlines and friends will in fact return Unicode objects. I assume, though, that the args to "read()" and friends are still about bytes.
Any file that is not explicitly opened as binary (with the 'b' flag (and, by the way, why isn't the 'b' flag the default for file opening?
Because it isn't in C.
That's probably why Python doesn't have list comprehensions, either :-). Bill

Bill Janssen wrote:
I assume, though, that the args to "read()" and friends are still about bytes.
Yes. It is not possible to determine, in advance, the number of bytes needed to decode a given number of characters. Therefore, a codec typically needs to either read more bytes than requested, or return fewer characters (if the bytes read don't happen to end on a character boundary). So the size parameter to .read() is just a hint - a codec might choose to completely ignore it. Regards, Martin
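To see the hint behaviour concretely -- a sketch, assuming a UTF-8 file whose characters may be several bytes long:

    import codecs

    f = codecs.open("text.txt", "r", encoding="utf-8")
    chunk = f.read(4)    # 4 is a byte-oriented hint, not a character count;
                         # the codec may return more or fewer characters so
                         # that it never splits a multi-byte sequence
    print len(chunk)     # number of characters actually decoded
    f.close()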

Bill Janssen wrote:
I'd go further. I'd introduce the notation
v = b"abc"
which means that "v" has assigned to it an 8-bit "string" byte vector. Then, after a release or two, I'd make plain old
"foo"
mean what
u"foo"
means today, so that string literals are by default Unicode (modulo PEP 263).
This would be ideal indeed, and it was dreamed up early on, in 2000, when the whole Unicode thing happened. The option -U was added to be able to test the standard lib against such an approach. Unfortunately, many modules don't work under such an assumption, so we are still far from being able to make -U the default. Meanwhile, it's best to always use Unicode for text data and strings for everything else. -- Marc-Andre Lemburg, eGenix.com
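In code, that advice amounts to decoding at the program's boundaries and working with Unicode in between -- a sketch, assuming UTF-8 data and made-up file names:

    raw = open("data.bin", "rb").read()    # bytes in: stays a byte string
    text = raw.decode("utf-8")             # decode once, at the boundary
    # ... operate on 'text' as a Unicode object internally ...
    out = text.encode("utf-8")             # encode once, on the way out
    open("copy.bin", "wb").write(out)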

"Bill Janssen" <janssen@parc.com> wrote in message news:04Jun29.224113pdt."58612"@synergy1.parc.xerox.com...
is that the byte vectors we tend to call strings in Python have no string-ness, as understood in the 21st century.
Python strings are sequences of 0 to n chars from an abstract 256-char alphabet. This meets my understanding of the standard 20th century CS definition of string. Has there been a significant change in the last few years?
There is no character set associated with them,
The byte set is intentionally not any *particular* natural-language char set, but a possible carrier for any of them. Perhaps unfortunately, it lacks a single standard glyph set or graphic representation, but I believe Unicode also differentiates between characters (code points?) and glyphs (which are also not standardized). The byte set also (fortunately) lacks the complications of letters, capitals, signs, marks, ligatures, symbols, and so on, which complications usually make the character set for a particular language somewhat fuzzy.
documentation, particularly the language manual, is extremely confusing on this point, in classifying "string" and "Unicode" objects as the same sort of thing.
I think it a matter a viewpoint whether one emphasizes the similarities or differences.
And then not documenting them clearly.
The subject of strings, Unicode, internationalization, and Python could use a manual in itself.
Unicode ... is not integrated with the file streams support.
Reading numbers other than bytes is also not integrated with the file type. Adding a 'bytes' parameter to file(), or a readbytes(n) method, would be generally helpful for anyone wanting to iterate through a file in chunks other than 'lines'. Terry J. Reedy
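Such a helper is easy to write today in pure Python -- a minimal sketch (the name readchunks is made up for illustration):

    def readchunks(f, n):
        # Yield successive n-byte chunks from an open file until EOF;
        # the final chunk may be shorter than n.
        while True:
            chunk = f.read(n)
            if not chunk:
                break
            yield chunk

    # usage:
    # for block in readchunks(open("data.bin", "rb"), 4096):
    #     process(block)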

Terry Reedy wrote:
Python strings are sequences of 0 to n chars from an abstract 256-char alphabet. This meets my understanding of the standard 20th century CS definition of string. Has there been a significant change in the last few years?
Yes. Abstract 256-char alphabets have been found useless for the representation of natural-language text. You need concrete alphabets, and having more than 256 characters is often important.
The byte set is intentionally not any *particular* natural-language char set, but a possible carrier for any of them. Perhaps unfortunately, it lacks a single standard glyph set or graphic representation, but I believe Unicode also differentiates between characters (code points?) and glyphs (which are also not standardized).
Yes. But Unicode does define concrete characters - even if it leaves the choice of glyphs. Regards, Martin
participants (4)
- "Martin v. Löwis"
- Bill Janssen
- M.-A. Lemburg
- Terry Reedy