[Python-Dev] Unicode debate
Guido van Rossum
Wed, 03 May 2000 08:04:29 -0400
> > stdout sends bytes to something -- and that something will
> > interpret the stream of bytes in some encoding (could be
> > Latin-1, UTF-8, ISO-2022-JP, whatever). So either:
> > 1. You explicitly downconvert to bytes, and specify
> > the encoding each time you do. Then write the
> > bytes to stdout (or your file object).
> > 2. The file object is smart and can be told what
> > encoding to use, and Unicode strings written to
> > the file are automatically converted to bytes.
> which one's more convenient?
Marc-Andre's codec module contains file-like objects that support this
(or could easily be made to).
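A minimal sketch of such a smart file-like object in today's Python terms (the class and its name are mine for illustration, not the actual API of Marc-Andre's codec module):

```python
import io

class EncodingWriter:
    """File-like wrapper that is told an encoding once, then
    automatically converts Unicode strings to bytes on write."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw            # an underlying binary stream
        self.encoding = encoding  # the encoding the stream expects

    def write(self, text):
        # The caller writes Unicode; the stream receives bytes.
        self.raw.write(text.encode(self.encoding))

buf = io.BytesIO()
out = EncodingWriter(buf, "utf-8")
out.write("caf\xe9")  # the e-acute becomes two UTF-8 bytes
```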
However the problem is that print *always* first converts the object
using str(), and str() enforces that the result is an 8-bit string.
I'm afraid that loosening this will break too much code. (This all
really happens at the C level.)
I'm also afraid that this means that str(unicode) may have to be
defined to yield UTF-8. My argument goes as follows:
1. We want to be able to set things up so that print u"..." does the
right thing. (What "the right thing" is, is not defined here,
as long as the user sees the glyphs implied by u"...".)
2. print u is equivalent to sys.stdout.write(str(u)).
3. str() must always return an 8-bit string.
4. So the solution must involve assigning an object to sys.stdout that
does the right thing given an 8-bit encoding of u.
5. So we need str(u) to produce a lossless 8-bit encoding of Unicode.
6. UTF-8 is the only sensible candidate.
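The "lossless" requirement in step 5 is the crux: every Unicode string must survive a round trip through the 8-bit encoding. In modern Python 3 syntax (where str is Unicode and bytes is the 8-bit type), that property can be checked directly:

```python
# UTF-8 round-trips any Unicode string without loss.
for s in ["hello", "caf\xe9", "\u20ac100", "\u65e5\u672c\u8a9e"]:
    encoded = s.encode("utf-8")          # an 8-bit byte string
    assert encoded.decode("utf-8") == s  # decodes back unchanged

# By contrast, a fixed 8-bit charset like Latin-1 is lossy:
try:
    "\u20ac".encode("latin-1")  # the euro sign has no Latin-1 byte
except UnicodeEncodeError:
    pass
```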
Note that (apart from print) str() is never implicitly invoked -- all
implicit conversions when Unicode and 8-bit strings are combined
go from 8-bit to Unicode.
(There might be an alternative, but it would depend on having yet
another hook (similar to Ping's sys.display) that gets invoked when
printing an object (as opposed to displaying it at the interactive
prompt). I'm not too keen on this because it would break code that
temporarily sets sys.stdout to a file of its own choosing and then
invokes print -- a common idiom to capture printed output in a string,
for example, which could be embedded deep inside a module. If the
main program were to install a naive print hook that always sent
output to a designated place, this strategy might fail.)
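The capture idiom in question, sketched in modern Python (io.StringIO standing in for the era's StringIO module):

```python
import io
import sys

def captured_print():
    # Temporarily point sys.stdout at an in-memory buffer.
    old_stdout = sys.stdout
    sys.stdout = io.StringIO()
    try:
        print("hello from deep inside a module")
        return sys.stdout.getvalue()
    finally:
        sys.stdout = old_stdout  # always restore the real stdout

text = captured_print()
```

A naive global print hook would bypass the swapped-in object and break this pattern, which is exactly the objection above.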
> > > (extra questions: how about renaming "unicode" to "string",
> > > and getting rid of "unichr"?)
> > Would you expect chr(x) to return an 8-bit string when x < 128,
> > and a Unicode string when x >= 128?
> that will break too much existing code, I think. but what
> about replacing 128 with 256?
If the 8-bit Unicode proposal were accepted, this would make sense.
In my "only ASCII is implicitly convertible" proposal, this would be a
mistake, because chr(128) == "\x80" != u"\x80" == unichr(128).
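That asymmetry is visible in today's Python, where the byte/text split finally exists: byte values below 128 have an encoding-independent meaning, while byte 128 and above only mean something once an encoding is named (latin-1 below is just the "8-bit Unicode" view for illustration):

```python
# Under "only ASCII is implicitly convertible", bytes below 128
# map unambiguously to the same code points:
assert b"\x7f".decode("ascii") == "\x7f"

# ...but byte 128 has no encoding-independent interpretation:
try:
    b"\x80".decode("ascii")
except UnicodeDecodeError:
    pass  # chr(128) as a byte is not yet a character

# Under the "8-bit Unicode" (Latin-1) proposal, they would line up:
assert b"\x80".decode("latin-1") == "\x80"
```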
I agree with everyone that things would be much simpler if we had
separate data types for byte arrays and 8-bit character strings. But
we don't have this distinction yet, and I don't see a quick way to add
it in 1.6 without seriously upsetting the release schedule.
So all of my proposals are to be considered hacks to maintain as much
b/w compatibility as possible while still supporting some form of
Unicode. The fact that half the time 8-bit strings are really being
used as byte arrays, while Python can't tell the difference, means (to
me) that the default encoding is an important thing to argue about.
I don't know if I want to push it out all the way to Py3k, but I just
don't see a way to implement "a character is a character" in 1.6 given
all the current constraints. (BTW I promise that 1.7 will be speedy
once 1.6 is out of the door -- there's a lot else that was put off to
1.7.)
Fredrik, I believe I haven't seen your response to my ASCII proposal.
Is it just as bad as UTF-8 to you, or could you live with it? On a
scale of 0-9 (0: UTF-8, 9: 8-bit Unicode), where is ASCII for you?
Where's my sre snapshot?
--Guido van Rossum (home page: http://www.python.org/~guido/)