[Python-Dev] Unicode Implementation in JPython

Guido van Rossum guido@python.org
Mon, 21 Feb 2000 15:13:19 -0500


> My feeling on the unicode proposal and its implementation is that most
> of the changes can be integrated directly into JPython without breaking
> any existing JPython code. One thing concerns me though:
> 
>    open("out", "wb").write(u"hello")

(Note that the file is opened in *binary* mode; in text mode, this
would write the 5 bytes of "hello".)

> This writes 10 bytes to the file "out".
> 
> I have two problems with that:
> 
> 1. In java, files are always byte-based. To move from unicode chars to
> bytes some kind of encoder must always be applied. It is also strange to
> see the actual byte layout of the data, which in my "out" file seems to
> be platform dependent. Is that the case? If it is, then the
> write(u"..") strikes me as somewhat random (unknown).
> 
> 2. To get this behavior under JPython, it is necessary to introduce a
> new string type which in all other respects is equal to the existing
> string type. Only when passed to file.write should the new string type
> return a faked representation of its memory. When a normal string is
> passed to .write, some byte representation of the string is written to
> the file. I would prefer that in jpython a unicode string is the same as
> a normal string (type("") == type(u"")). 
> 
> Perhaps the real reason for my dislike of this feature of the unicode
> implementation is based on my (from java) assumption that a unicode
> character is an atomic data type. 

Hm, I agree that it's not a great feature.  On the other hand it's
hard to decide what to do instead without breaking other corners of
the Unicode design.  Could we leave this implementation-dependent?
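
For illustration, here is a minimal sketch of the explicit-encoder
alternative raised in point 1 above. It assumes the codec names of
present-day Python ("utf-16-le", "utf-16-be", "utf-8"); the helper
write_text is hypothetical and not part of any proposal in this thread.
It shows why the raw byte layout looks platform dependent: the same five
characters give different bytes depending on the byte order chosen.

```python
import io

def write_text(stream, text, encoding):
    """Hypothetical helper: encode explicitly, then write the bytes."""
    data = text.encode(encoding)
    stream.write(data)
    return data

text = u"hello"

# Two bytes per character, but the order of each pair differs.
little = write_text(io.BytesIO(), text, "utf-16-le")
big = write_text(io.BytesIO(), text, "utf-16-be")

# One byte per character for this ASCII-only string.
utf8 = write_text(io.BytesIO(), text, "utf-8")

print(len(little), len(big), len(utf8))
print(little == big)
```

With the encoding named at the call site, the bytes that reach the file
no longer depend on how the implementation stores the string in memory.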

--Guido van Rossum (home page: http://www.python.org/~guido/)