[Python-Dev] More Unicode support

M.-A. Lemburg mal@lemburg.com
Mon, 06 Nov 2000 18:36:54 +0100

Guido van Rossum wrote:
> [GvR]
> > > Hm...  There's also the problem that there's no easy way to do Unicode
> > > I/O.  I'd like to have a way to turn a particular file into a Unicode
> > > output device (where the actual encoding might be UTF-8 or UTF-16 or a
> > > local encoding), which should mean that writing Unicode objects to the
> > > file should "do the right thing" (in particular should not try to
> > > coerce it to an 8-bit string using the default encoding first, like
> > > print and str() currently do) and that writing 8-bit string objects to
> > > it should first convert them to Unicode using the default encoding
> > > (meaning that at least ASCII strings can be written to a Unicode file
> > > without having to specify a conversion).  I suppose that reading from
> > > a "Unicode file" should always return a Unicode string object (even if
> > > the actual characters read all happen to fall in the ASCII range).
> > >
> > > This requires some serious changes to the current I/O mechanisms; in
> > > particular str() needs to be fixed, or perhaps a ustr() needs to be
> > > added that is used in certain cases.  Tricky, tricky!
> [MAL]
> > It's not all that tricky since you can write a StreamRecoder
> > subclass which implements this. AFAIR, I posted such an implementation
> > on i18n-sig.
> >
> > BTW, one of my patches on SF adds unistr(). Could be that it's
> > time to apply it :-)
> Adding unistr() and StreamRecoder isn't enough.  The problem is that
> when you set sys.stdout to a StreamRecoder, the print statement
> doesn't do the right thing!  Try it.  print u"foo" will work, but
> print u"\u1234" will fail because print always applies the default
> encoding.

Hmm, that's due to PyFile_WriteObject() calling PyObject_Str().
Perhaps we ought to let it call PyObject_Unicode() (which you
find in the patch on SF) instead for Unicode objects. That way
the file-like .write() method will be given a Unicode object
and StreamRecoder could then do the trick.
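
To illustrate the idea, here is a minimal sketch in present-day Python
(where str is the Unicode type), not the patch itself. It assumes a plain
codecs stream writer as a stand-in for the StreamRecoder subclass; the
names raw, recoding_out and coerced_ok are made up for this example. It
contrasts coercing through the default (ASCII) encoding first, which is
what fails today, with handing the Unicode object straight to .write():

```python
import codecs
import io

# Byte-oriented "file" plus an encoding wrapper around it.
# Assumption: codecs.getwriter() stands in for a StreamRecoder subclass.
raw = io.BytesIO()
recoding_out = codecs.getwriter("utf-8")(raw)

text = "\u1234"

# Path 1: coerce through the default encoding before writing.
# This mirrors the failure described above for non-ASCII characters.
try:
    text.encode("ascii")
    coerced_ok = True
except UnicodeEncodeError:
    coerced_ok = False

# Path 2: pass the Unicode object to .write() unchanged and let the
# stream wrapper pick the output encoding.
recoding_out.write(text)
encoded = raw.getvalue()
```

With the second path the choice of output encoding lives entirely in the
stream object, so the caller never has to know about it.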

Haven't tried this, but it could work (the paths objects take
through Python to get printed are somewhat strange at times
-- there are just so many different possibilities and special
cases that it becomes hard to tell from just looking at the code).

> The required changes to print are what's tricky.  Whether we even need
> unistr() depends on the solution we find there.

I think we'll need PyObject_Unicode() and unistr() one way
or another. Those two APIs simply complement PyObject_Str()
and str() in that they always return Unicode objects and
do the necessary conversion based on the input object type.
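
A rough sketch of the semantics described above, written in present-day
Python rather than taken from the patch on SF. Here str plays the role of
the Unicode type and bytes the role of the 8-bit string; the encoding
parameter and its "ascii" default are assumptions standing in for the
default encoding:

```python
def unistr(obj, encoding="ascii"):
    """Always return a Unicode object, converting based on the input type.

    Illustrative only: the signature and the default encoding are
    assumptions, not the API from the actual patch.
    """
    if isinstance(obj, str):
        # Already Unicode: return it unchanged, no round trip through bytes.
        return obj
    if isinstance(obj, bytes):
        # 8-bit string: decode using the (assumed) default encoding.
        return obj.decode(encoding)
    # Anything else: fall back to the object's own string conversion.
    return str(obj)


samples = [unistr("\u1234"), unistr(b"foo"), unistr(3.5)]
```

The point is that a Unicode input is never squeezed through the default
encoding, while 8-bit and other objects still get converted.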

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/