[I18n-sig] Random thoughts on Unicode and Python

M.-A. Lemburg mal@lemburg.com
Sat, 10 Feb 2001 22:56:09 +0100

Tom Emerson wrote:
> Andy has raised some important and interesting points. I'd like to
> chime in with some random thoughts.
> > 2. I have been told that there are angry mumblings on the
> > Python-Japan mailing list that such a change would break all
> > their existing Python programs; I'm trying to set up my tools to
> > ask out loud in that forum.
> Both Shift-JIS and EUC-JP are 8-bit, multibyte encodings. You can use
> them on systems that are 8-bit clean and things "just work". You don't
> need to worry about embedded nulls or any other such noise. While you
> can't use len() to get the number of *characters* in a
> Shift-JIS/EUC-JP encoded string, you can find out how many "octets"
> are in it so you can loop over it and calculate the character length.
> In essence the Japanese (and Chinese and Koreans) are using the
> existing Python string type as a raw-byte string, and imposing the
> semantics over that.
> The Ruby string class is a byte-string. You can specify how the bytes
> are to be treated for operations such as regular expression searches
> and such. It supports EUC-JP, Shift JIS, UTF-8, or just plain
> bytes. You can set the default when you configure the sources, on the
> command-line when you invoke the interpreter, or (I believe) at
> runtime.
> Ruby also contains a library with a replacement String class for
> dealing with EUC-JP and Shift-JIS encoded strings.

How does Ruby (which seems to be the direct Python-competitor
in Japan) deal with the difference between binary data and
text data ?

I think that much of the concern about these proposals stems from a
misunderstanding of the general idea behind the proposed move to
Unicode for text data:

We are trying to tell people that storing text data is better
done in Unicode than in a raw data buffer like Python's current
string data type. This doesn't mean that working with text encoded
in such a binary data buffer will somehow fail in a future Python
version. It only means that the programmer will sooner or later
have to decide whether she wants to store text data or binary
data, and then choose the proper type of storage in order to
take advantage of the advanced features which a text data type
can provide over a binary data buffer.
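
To make the distinction concrete, here is a small sketch. It is
written against present-day Python 3 (an assumption; it is not the
1.6-2.1 API discussed in this thread), where bytes plays the role of
the binary data buffer and str the role of the Unicode text type. The
sample string is only an illustration.

```python
# Sketch only: modern Python 3, where bytes is the raw buffer and
# str is the Unicode text type.
text = "日本語"                     # three characters of text
raw = text.encode("shift_jis")     # the same text as Shift-JIS octets

octet_count = len(raw)                      # len() on the buffer counts octets
char_count = len(raw.decode("shift_jis"))   # decoding first counts characters

print(octet_count, char_count)
```

The raw buffer keeps working as before; the text type simply knows
about characters, which the buffer does not.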

The model which we are currently talking about can be outlined
as follows:

                  binary data string *)
                  text data string 
                    |           |
                    |           |
         Unicode string      encoded 8-bit string (with encoding 
           *)                                      information !)

*) these are implemented in Python 1.6-2.1.
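
As a rough illustration of the "encoded 8-bit string (with encoding
information)" box, one could imagine something like the following.
The class name EncodedString and its methods are purely hypothetical
and are not part of any Python release; the sketch again assumes
present-day Python 3.

```python
# Hypothetical sketch: raw octets that carry the name of their encoding.
class EncodedString:
    def __init__(self, data: bytes, encoding: str):
        self.data = data            # the raw 8-bit buffer
        self.encoding = encoding    # e.g. "euc_jp" or "shift_jis"

    def to_unicode(self) -> str:
        # Convert to the Unicode text type using the stored encoding.
        return self.data.decode(self.encoding)

    def __len__(self) -> int:
        # Length in characters rather than octets.
        return len(self.to_unicode())


s = EncodedString("日本語".encode("euc_jp"), "euc_jp")
print(len(s.data), len(s))
```

Because the encoding travels with the data, text operations no longer
depend on a process-wide default.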

How does this compare to e.g. Ruby ?

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/