stdout sends bytes to something -- and that something will interpret the stream of bytes in some encoding (could be Latin-1, UTF-8, ISO-2022-JP, whatever). So either:
1. You explicitly downconvert to bytes, and specify the encoding each time you do. Then write the bytes to stdout (or your file object).

2. The file object is smart and can be told what encoding to use, and Unicode strings written to the file are automatically converted to bytes.
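Option 1 can be sketched like this (in modern Python terms, where the encoding step is explicit at every call site; the helper name write_encoded is mine):

```python
import io
import sys

def write_encoded(stream, text, encoding="utf-8"):
    # Option 1: the caller downconverts to bytes, choosing the
    # encoding explicitly on every write.
    data = text.encode(encoding)
    stream.write(data)
    return data

# In modern Python, sys.stdout.buffer is the underlying byte stream.
write_encoded(sys.stdout.buffer, u"caf\u00e9\n", "utf-8")
```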
which one's more convenient?
Marc-Andre's codec module contains file-like objects that support this (or could easily be made to).
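Option 2 is what those file-like objects provide; a sketch using the codecs module's StreamWriter machinery (io.BytesIO stands in for a byte-oriented stdout):

```python
import codecs
import io

# Option 2: tell the file object its encoding once; Unicode strings
# written to it are converted to bytes automatically.
raw = io.BytesIO()                   # stands in for a byte stream like stdout
out = codecs.getwriter("utf-8")(raw) # a StreamWriter wrapping the byte stream
out.write(u"caf\u00e9")              # no explicit .encode() at the call site
assert raw.getvalue() == b"caf\xc3\xa9"
```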
However, the problem is that print *always* first converts the object using str(), and str() enforces that the result is an 8-bit string. I'm afraid that loosening this will break too much code. (This all really happens at the C level.)
I'm also afraid that this means that str(unicode) may have to be defined to yield UTF-8. My argument goes as follows:
1. We want to be able to set things up so that print u"..." does the right thing. (What "the right thing" is, is not defined here, as long as the user sees the glyphs implied by u"...".)
2. print u is equivalent to sys.stdout.write(str(u)).
3. str() must always return an 8-bit string.
4. So the solution must involve assigning an object to sys.stdout that does the right thing given an 8-bit encoding of u.
5. So we need str(u) to produce a lossless 8-bit encoding of Unicode.
6. UTF-8 is the only sensible candidate.
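The losslessness required in point 5 is easy to check: UTF-8 round-trips any Unicode string unchanged, while a fixed single-byte encoding cannot (a sketch in modern Python, where str is the Unicode type):

```python
# Point 5 asks for a lossless 8-bit encoding of Unicode.
samples = [u"plain ascii", u"caf\u00e9", u"\u65e5\u672c\u8a9e"]
for s in samples:
    encoded = s.encode("utf-8")           # Unicode -> 8-bit byte string
    assert encoded.decode("utf-8") == s   # ...and back, losslessly

# Latin-1, by contrast, is lossy over full Unicode: code points
# above U+00FF simply don't fit in one byte.
try:
    u"\u65e5\u672c\u8a9e".encode("latin-1")
except UnicodeEncodeError:
    pass
```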
Note that (apart from print) str() is never implicitly invoked -- all implicit conversions when Unicode and 8-bit strings are combined go from 8-bit to Unicode.
(There might be an alternative, but it would depend on having yet another hook (similar to Ping's sys.display) that gets invoked when printing an object (as opposed to displaying it at the interactive prompt). I'm not too keen on this because it would break code that temporarily sets sys.stdout to a file of its own choosing and then invokes print -- a common idiom to capture printed output in a string, for example, which could be embedded deep inside a module. If the main program were to install a naive print hook that always sent output to a designated place, this strategy might fail.)
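The capture idiom mentioned there is the usual temporary rebinding of sys.stdout (modern Python sketch; the helper name capture_prints is mine):

```python
import io
import sys

def capture_prints(func):
    # Temporarily point sys.stdout at an in-memory buffer, run the
    # code, then restore the real stdout -- the idiom that a naive
    # global print hook would defeat.
    saved, sys.stdout = sys.stdout, io.StringIO()
    try:
        func()
        return sys.stdout.getvalue()
    finally:
        sys.stdout = saved

text = capture_prints(lambda: print("hello from deep inside a module"))
assert text == "hello from deep inside a module\n"
```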
(extra questions: how about renaming "unicode" to "string", and getting rid of "unichr"?)
Would you expect chr(x) to return an 8-bit string when x < 128, and a Unicode string when x >= 128?
That will break too much existing code, I think. But what about replacing 128 with 256?
If the 8-bit Unicode proposal were accepted, this would make sense. In my "only ASCII is implicitly convertible" proposal, this would be a mistake, because chr(128) == "\x80" != u"\x80" == unichr(128).
I agree with everyone that things would be much simpler if we had separate data types for byte arrays and 8-bit character strings. But we don't have this distinction yet, and I don't see a quick way to add it in 1.6 without seriously upsetting the release schedule.
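For contrast, the separation being argued for can be sketched in today's Python, where byte arrays and character strings did eventually become distinct types (so the chr/unichr split disappeared):

```python
# In modern Python, chr() always builds a (Unicode) str, for any
# code point, and byte 128 lives in a different, never-equal type.
assert chr(128) == "\x80"        # a one-character Unicode string
assert bytes([128]) == b"\x80"   # a one-byte byte string
assert (b"\x80" == "\x80") is False   # bytes and str never compare equal
```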
So all of my proposals are to be considered hacks to maintain as much b/w compatibility as possible while still supporting some form of Unicode. The fact that half the time 8-bit strings are really being used as byte arrays, while Python can't tell the difference, means (to me) that the default encoding is an important thing to argue about.
I don't know if I want to push it out all the way to Py3k, but I just don't see a way to implement "a character is a character" in 1.6 given all the current constraints. (BTW I promise that 1.7 will be speedy once 1.6 is out of the door -- there's a lot else that was put off to 1.7.)
Fredrik, I believe I haven't seen your response to my ASCII proposal. Is it just as bad as UTF-8 to you, or could you live with it? On a scale of 0-9 (0: UTF-8, 9: 8-bit Unicode), where is ASCII for you?
Where's my sre snapshot?
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)