[Python-Dev] Re: Unicode debate

Ka-Ping Yee ping@lfw.org
Wed, 3 May 2000 02:32:31 -0700 (PDT)


On Tue, 2 May 2000, Guido van Rossum wrote:
> > P. P. S.  If always having to specify encodings is really too much,
> > i'd probably be willing to consider a default-encoding state on the
> > Unicode class, but it would have to be a stack of values, not a
> > single value.
> 
> Please elaborate?

On general principle, it seems bad to just have a "set" method
that encourages people to set static state in a way that
irretrievably loses the current state.  For something like this,
you want a "push" method and a "pop" method with which to bracket
a series of operations, so that you can easily write code which
politely leaves other code unaffected.

For example:

    >>> x = unicode("d\351but")        # assume Guido-ASCII wins
    UnicodeError: ASCII decoding error: value out of range
    >>> x = unicode("d\351but", "latin-1")
    >>> x
    u'd\351but'
    >>> print x.encode("latin-1")      # on my xterm with Latin-1 fonts
    début
    >>> x.encode("utf-8")
    'd\303\251but'

Now:

    >>> u"".pushenc("latin-1")         # need a better interface to this?
    >>> x = unicode("d\351but")        # okay now
    >>> x
    u'd\351but'
    >>> u"".pushenc("utf-8")
    >>> x = unicode("d\351but")
    UnicodeError: UTF-8 decoding error: invalid data
    >>> x = unicode("d\303\251but")
    >>> print x.encode("latin-1")
    début
    >>> str(x)
    'd\303\251but'
    >>> u"".popenc()                   # back to the Latin-1 encoding
    >>> str(x)
    'd\351but'
        .
        .
        .
    >>> u"".popenc()                   # back to the ASCII encoding
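
The push/pop bracketing above can be sketched in present-day Python,
where text is str and 8-bit data is bytes.  This is only a rough
illustration under stated assumptions: EncodingStack, decode(),
encode() and using() are hypothetical names standing in for the
default-encoding state on the Unicode class, not an existing API.

```python
from contextlib import contextmanager


class EncodingStack:
    """Hypothetical stand-in for a stack of default encodings."""

    def __init__(self, base="ascii"):
        self._stack = [base]            # the bottom entry is never popped

    def pushenc(self, encoding):
        self._stack.append(encoding)

    def popenc(self):
        if len(self._stack) == 1:
            raise IndexError("cannot pop the base encoding")
        return self._stack.pop()

    @property
    def current(self):
        return self._stack[-1]

    def decode(self, data):
        # Analogue of unicode(data) with no explicit encoding.
        return data.decode(self.current)

    def encode(self, text):
        # Analogue of str(x) on a Unicode object.
        return text.encode(self.current)

    @contextmanager
    def using(self, encoding):
        # Brackets a series of operations; always restores the old state.
        self.pushenc(encoding)
        try:
            yield self
        finally:
            self.popenc()


defaults = EncodingStack()
with defaults.using("latin-1"):
    x = defaults.decode(b"d\351but")        # okay now
    with defaults.using("utf-8"):
        utf8_bytes = defaults.encode(x)
    latin1_bytes = defaults.encode(x)       # back to Latin-1 here
```

Because the pop happens in a finally clause, code that raises in the
middle still leaves the caller's encoding as it found it, which is
the whole point of preferring a stack to a plain "set" method.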

Similarly, imagine:

    >>> x = u"<Japanese text...>"

    >>> file = open("foo.jis", "w")
    >>> file.pushenc("iso-2022-jp")
    >>> file.uniwrite(x)
        .
        .
        .
    >>> file.popenc()

    >>> import sys
    >>> sys.stdout.write(x)            # bad! x contains chars > 127
    UnicodeError: ASCII encoding error: value out of range

    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> sys.stdout.write(x)            # on a kterm with kanji fonts
    <Japanese text...>
        .
        .
        .
    >>> sys.stdout.popenc()
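
A minimal sketch of the per-file version, again in present-day
Python.  EncodingFile is a hypothetical wrapper around a byte
stream, and the three-character sample string is just an
assumption standing in for the Japanese text; only pushenc,
popenc and uniwrite come from the examples above.

```python
import io


class EncodingFile:
    """Hypothetical wrapper: a byte stream plus a stack of encodings."""

    def __init__(self, raw, base="ascii"):
        self.raw = raw
        self._encodings = [base]

    def pushenc(self, encoding):
        self._encodings.append(encoding)

    def popenc(self):
        if len(self._encodings) == 1:
            raise IndexError("cannot pop the base encoding")
        return self._encodings.pop()

    def uniwrite(self, text):
        # The file does the conversion, using the top of its own stack.
        self.raw.write(text.encode(self._encodings[-1]))


out = EncodingFile(io.BytesIO())
x = "\u65e5\u672c\u8a9e"                    # sample Japanese text

try:
    out.uniwrite(x)                         # ASCII is still on top
    failed = False
except UnicodeEncodeError:
    failed = True                           # chars > 127, as expected

out.pushenc("iso-2022-jp")
out.uniwrite(x)                             # converted by the file itself
out.popenc()
written = out.raw.getvalue()
```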

The above examples incorporate the Guido-ASCII proposal, which
makes a fair amount of sense to me now.  How do they look to y'all?



This illustrates the remaining wart:

    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> print x                        # still bad! str is still doing ASCII
    UnicodeError: ASCII encoding error: value out of range

    >>> u"".pushenc("iso-2022-jp")
    >>> print x                        # on a kterm with kanji fonts
    <Japanese text...>

Writing to files asks the file object to convert from Unicode to
bytes, then write the bytes.

Printing converts the Unicode to bytes first with str(), then
hands the bytes to the file object to write.

This wart is really a larger printing issue.  If we want to
solve it, files have to know what to do with objects, i.e.

    print x

doesn't mean

    sys.stdout.write(str(x) + "\n")

instead it means

    sys.stdout.printout(x)
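
As a rough sketch of that idea in present-day Python: the file
object, not str(), decides how an object becomes bytes.
PrintingFile and print_to are hypothetical names invented for
this illustration; only printout comes from the proposal above.

```python
import io


class PrintingFile:
    """Hypothetical file object that owns the object-to-bytes conversion."""

    def __init__(self, raw, encoding="ascii"):
        self.raw = raw
        self.encoding = encoding

    def write(self, data):
        self.raw.write(data)                # bytes only

    def printout(self, obj):
        # Text is encoded with the file's own encoding; any other
        # object is turned into text with str() first.
        text = obj if isinstance(obj, str) else str(obj)
        self.write((text + "\n").encode(self.encoding))


def print_to(file, obj):
    # Stand-in for the print statement: delegate to the file.
    file.printout(obj)


out = PrintingFile(io.BytesIO(), encoding="iso-2022-jp")
print_to(out, "\u65e5\u672c\u8a9e")         # no detour through an ASCII str()
print_to(out, 42)
```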

Hmm.  I think this might deserve a separate subject line.


-- ?!ng