[Python-Dev] Unicode

Guido van Rossum guido@python.org
Sun, 28 Apr 2002 20:38:42 -0400


[Guido]
> > No syntactic changes, no.  But the way we do things would become
> > significantly different.  And think of binary I/O vs. textual I/O --
> > currently, file.read() returns a string.  Code dealing with binary
> > files will look significantly different, and old code won't work.

[Jack]
> It could be argued that open(..., 'r').read() returns a text 
> string and open(..., 'rb').read() returns a binary blob.

They might even return different kinds of objects -- arguably, binary
files don't need readline() etc., and text files may not need read(n)
(though the arg-less variant is handy).
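
A minimal sketch of Jack's suggestion, assuming a Python in which
text-mode and binary-mode reads return distinct types (later Python 3
releases behave this way; the Python under discussion in this thread
does not).  The file name and contents are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample file, written as raw bytes (UTF-8 encoded text).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write("caf\u00e9\n".encode("utf-8"))

# Text mode: read() decodes the bytes and returns a text string.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: read() returns the undecoded bytes (a "blob").
with open(path, "rb") as f:
    blob = f.read()

print(type(text).__name__, len(text))  # str 5
print(type(blob).__name__, len(blob))  # bytes 6
```

The lengths differ because the accented character is one code point in
the text string but two bytes in the UTF-8 blob.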

If only I had the time to reinvent the I/O library...

> If textstrings and blobs become wholly different objects this 
> shouldn't create too many problems [see below], except for code 
> that opens a file in binary mode and (partially) reads the 
> resulting file expecting text. But this code would need 
> revisiting anyway if the normal textstring became Unicode.

Yeah, that's usually just stubborn Unix users who don't believe in the
distinction between binary and text mode. :-)

Anyway, the proper way to convert between blobs and textstrings would
be encodings.  That's how Java does it.
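
As a rough illustration of conversion through encodings, here is a
sketch using the str/bytes split of later Python 3 releases (an
assumption; no such split exists in the Python of this thread).  The
sample string is arbitrary.

```python
# A text string made of Unicode code points.
text = "na\u00efve"

# Encoding turns text into a blob; the bytes depend on the encoding chosen.
utf8_blob = text.encode("utf-8")
latin1_blob = text.encode("latin-1")

# Decoding turns a blob back into text, and needs the matching encoding.
assert utf8_blob.decode("utf-8") == text
assert latin1_blob.decode("latin-1") == text

print(len(text), len(utf8_blob), len(latin1_blob))  # 5 6 5
```

The same text yields different blobs under different encodings, which
is why the encoding has to be named explicitly at the boundary.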

> [here's below] To my surprise I think that having blobs and 
> textstrings be unrelated objects creates fewer problems than 
> having the one be a subtype of the other. At least, every time I 
> try to do the subtyping in my head I flip back and forth between 
> textstrings-are-a-subtype-of-general-binary-buffers and 
> binary-buffers-are-a-special-case-of-python-strings every couple 
> of seconds. I think having them both be subtypes of a common 
> base type (basestring) might work, but I'm not sure.

I think they don't need anything in common (apart from their
sequence-ness).  I think Java's byte[] vs. String distinction is about
right.
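
To make "nothing in common apart from sequence-ness" concrete, here is
a small sketch using the unrelated str and bytes types of later
Python 3 releases (again an assumption relative to this thread, shown
only as one possible shape of the idea).

```python
text = "abc"
blob = b"abc"

# Both are sequences: they support len(), indexing and iteration.
print(len(text), len(blob))  # 3 3

# Their elements differ: characters versus small integers.
print(text[0], blob[0])  # a 97

# Neither type is a subtype of the other.
print(issubclass(str, bytes), issubclass(bytes, str))  # False False

# Mixing them is an error rather than an implicit conversion.
try:
    text + blob
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False
```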

--Guido van Rossum (home page: http://www.python.org/~guido/)