[I18n-sig] Pre-PEP: Proposed Python Character Model

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 6 Feb 2001 21:49:42 +0100


Hi Paul,

Interesting remarks. I comment only on those where I disagree.

>     1. Python should have a single string type. 

I disagree. There should be a character string type and a byte string
type, at least. I would agree that a single character string type is
desirable.

>     type("") == type(chr(150)) == type(chr(1500)) == type(file.read())

I disagree. For the last one, much depends on what file is. If it is a
byte-oriented file, reading from it should not return character
strings.

>     2. It should be easier and more efficient to encode and decode
>        information being sent to and retrieved from devices.

I disagree. Easier, maybe; more efficient - I don't think Python is
particularly inefficient at encoding/decoding.

>         It is not possible to have a concept of "character" without having
>         a character set. After all, characters must be chosen from some
>         repertoire and there must be a mapping from characters to integers
>         (defined by ord).

Sure it is possible. Different character sets (in your terminology)
have common characters, which is a phenomenon that your definition
cannot describe. Mathematically speaking, there is an unlimited domain
CHAR (the set of all characters), and then a character set would map a
subset of NAT (the set of all natural numbers, including zero) to a
subset of CHAR. Then, a character is an element of CHAR. Depending on
the character set, it has different associated numbers, though (or may
not have an associated ordinal at all).
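
To illustrate (a sketch; the character and the two coded character
sets are just examples, and the corresponding codecs are assumed to be
installed):

    zhe = u"\u0416"           # one element of CHAR: CYRILLIC CAPITAL LETTER ZHE

    # The same character has a different associated number in each
    # character set that happens to contain it:
    zhe.encode("iso8859-5")   # '\xb6'  -> number 0xB6 in ISO 8859-5
    zhe.encode("koi8-r")      # '\xf6'  -> number 0xF6 in KOI8-R
    ord(zhe)                  # 1046    -> number 0x0416 in ISO 10646/Unicode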

>         A character encoding is a mechanism for representing characters
>         in terms of bits. 

More generally, it is a mechanism for representing character sequences
in terms of bit sequences. Otherwise you cannot cover encodings in
which the encoding of a string is not the concatenation of the
encodings of its individual characters.
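
A small illustration (a sketch; UTF-16 with its byte order mark is
just one example of such an encoding):

    u = u"ab"
    whole = u.encode("utf-16")                       # one BOM, then 'a' and 'b'
    parts = u[0].encode("utf-16") + u[1].encode("utf-16")
    # parts contains a BOM before *each* character, so whole != parts:
    # the encoding of the string is not the concatenation of the
    # encodings of its characters.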

Also, this term is often called "coded character set" (CCS).

>         A Python programmer does not need to know or care whether a long
>         integer is represented as twos complement, ones complement or
>         in terms of ASCII digits.

In general, I agree. But they do need to know at least the size of the
internal representation if they want to explain the outcome of, say,
hex(~1).

>         Similarly a Python programmer does not need to know or care
>         how characters are represented in memory. We might even
>         change the representation over time to achieve higher
>         performance.

Programmers need to know the character set, at a minimum. Since you
assume that you can't have characters without character sets, I guess
you consider that implied.

>     Universal Character Set
> 
>         There is only one standardized international character set that
>         allows for mixed-language information. 

Not true. E.g. ISO 8859-5 allows both Russian and English text,
ISO 8859-2 allows English, Polish, German, Slovakian, and a few
others. ISO 2022 (and by reference all incorporated character sets)
supports virtually all existing languages.

>         A popular subset of the Universal Character Set is called
>         Unicode. The most popular subset of Unicode is called the "Unicode
>         Basic Multilingual Plane (Unicode BMP)". 

Isn't the BMP the same as Unicode, as it is the BMP (i.e. group 0,
plane 0) of ISO 10646?

>             Java 
>         It is the author's belief this "running code" is evidence of
>         Unicode's practical applicability. 

At least in the case of Java, I disagree. It very much depends on the
exact version of the JVM that you are using, but I had the following
problems:
- AWT would not find a font to display a specific character, although
  such a font was available. After changing JDK configuration files,
  AWT would not be able to display strings that mix languages.

- JDK could not print a non-Latin-1 string to System.out; there was no
  way of telling it that it should use UTF-8 for output. (sounds
  familiar ?-)

- While javac would accept non-ASCII letters in class names, the
  interpreter would refuse to load class files with "funny
  characters".

Please note that all of these occurred on the first attempt to use a
certain feature which works "in theory". Since Java's Unicode support
is considered the most advanced by many, I think there is still a long
way to go.

BTW, for dealing with GUI output, I believe that Tk's handling is most
advanced.
  

>     As discussed before, Python's native character set happens to consist
>     of exactly 255 characters.  If we increase the size of Python's
>     character set, no existing code would break and there would be no
>     cost in functionality.

Sure it would: code that treats character strings as if they were byte
strings will break.
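
For example (a hypothetical sketch; the function and the choice of
UTF-8 are mine, not part of the proposal):

    def content_length(body):
        # Correct as long as body is a byte string; wrong once body is
        # a character string that still has to be encoded.
        return len(body)

    body = u"Gr\u00fc\u00dfe"         # 5 characters
    content_length(body)              # 5
    len(body.encode("utf-8"))         # 7 -- the number of bytes actually sent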

>     Once Python moves to that character set it will no longer be necessary
>     to have a distinction between "Unicode string" and "regular string."

Right. The distinction will be between "character string" and "byte string".

>     This means that Unicode literals and escape codes can also be
>     merged with ordinary literals and escape codes. unichr can be merged
>     with chr.

Not sure. That means that there won't be byte string literals. It is
particularly worrying that you want to remove the way to get the
numeric value of a byte in a byte string.
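
By "the numeric value of a byte" I mean the everyday idiom (a sketch;
the file name is made up):

    data = open("logo.gif", "rb").read()   # a byte string
    first = ord(data[0])                   # the numeric value of its first byte, 0-255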


>     Two of the most common constructs in computer science are strings of
>     characters and strings of bytes. A string of bytes can be represented
>     as a string of characters between 0 and 255. Therefore the only
>     reason to have a distinction between Unicode strings and byte
>     strings is for implementation simplicity and performance purposes.
>     This distinction should only be made visible to the average Python
>     programmer in rare circumstances.

Are you saying that byte strings are visible to the average programmer
in rare circumstances only? Then I disagree; byte strings are
extremely common, as they are what file.read returns.

>     Unfortunately, there is not one, single, dominant encoding. There are
>     at least a dozen popular ones including ASCII (which supports only
>     0-127), ISO Latin 1 (which supports only 0-255), others in the ISO
>     "extended ASCII" family (which support different European scripts),
>     UTF-8 (used heavily in C programs and on Unix), UTF-16 (preferred by
>     Java and Windows), Shift-JIS (preferred in Japan) and so forth. This
>     means that the only safe way to read data from a file into Python
>     strings is to specify the encoding explicitly.

Note how you are mixing character sets and encodings here. As you
defined earlier, a single character set (such as US-ASCII) can have
multiple encodings (e.g. with a checksum bit or without).
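
Another illustration of that distinction (a sketch; here Unicode is
the single character set, and UTF-8 and UTF-16 are two of its
encodings):

    pi = u"\u03c0"              # GREEK SMALL LETTER PI
    pi.encode("utf-8")          # '\xcf\x80'
    pi.encode("utf-16-be")      # '\x03\xc0'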

>     Python's current assumption is that each byte translates into a
>     character of the same ordinal. This is only true for "ISO Latin 1".

I disagree. With your definition of character set, many character sets
have the property that a single byte is sufficient to represent a
single character (e.g. all of ISO 8859). You seem to assume that the
current Python character set is Latin-1, which it is not. Instead,
Python's character set is defined by the application and the operating
system.

>     Any code that does I/O should be changed to require the user to
>     specify the encoding that the I/O should use. It is the opinion of
>     the author that there should be no default encoding at all.

Not sure. IMO, the default should be to read and write byte strings.

>     Here is some Python code demonstrating a proposed API:
> 
>         fileobj = fopen("foo", "r", "ASCII") # only accepts values < 128 
>         fileobj2 = fopen("bar", "r", "ISO Latin 1")  # byte-values "as is" 
>         fileobj3 = fopen("baz", "r", "UTF-8")

Sounds good. Note that the proper way to write this is

   import codecs
   fileobj = codecs.open("foo", "r", "ASCII")
   # etc

>         fileobj2.encoding = "UTF-16" # changed my mind!  

Why is that a requirement? In a normal stream, you cannot change the
encoding in the middle - in particular not from single-byte Latin 1 to
UTF-16.

>     For efficiency, it should also be possible to read raw bytes into
>     a memory buffer without doing any interpretation:
> 
>     moredata = fileobj2.readbytes(1024)

Disagree. If a file is opened for reading characters, reading raw bytes
from the middle is not possible. Even if it were made possible, it would
not be more efficient, as you would have to keep track of the codec's
state. Instead, the right way to write this is

     fileobj2 = open("bar", "rb")
     moredata = fileobj2.read(1024)

>     It should be possible to create Python files in any of the common
>     encodings that are backwards compatible with ASCII.

By "Python files", you mean source code, I assume?

>     #?encoding="UTF-8"
>     #?encoding="ISO-8859-1"

The specific syntax may be debatable; I dislike semantics being put in
comments. There should be first-class syntax for that. I agree with
the approach in principle.

>     Python files which use non-ASCII characters without defining an
>     encoding should be immediately deprecated and made illegal in some
>     future version of Python.

Agree.

>     Python already has a rule that allows the automatic conversion
>     of characters up to 255 into their C equivalents.

If it is a character (i.e. Unicode) string, it only converts the 128
ASCII characters (0-127) in that way.
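
For instance, the Python-level analogue of that rule (a sketch of
Python 2.0's behaviour, assuming the default ASCII codec has not been
changed in site.py):

    str(u"abc")       # 'abc' -- characters below 128 convert silently
    str(u"\u0416")    # raises UnicodeError -- ordinal not in range(128)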

>     Once the Python character type is expanded, characters outside
>     of that range should trigger an exception (just as converting a
>     large long integer to a C int triggers an exception).

Agree; that is what it does today.

>     Some might claim it is inappropriate to presume that
>     the character-for-byte mapping is the correct "encoding" for
>     information passing from Python to C.

Indeed, I would claim so. I could not phrase a rebuttal, though,
because your understanding of the desired Python type system seems not
to match mine.

>     Python's built-in modules should migrate from char to wchar_t (aka
>     Py_UNICODE) over time. That is, more and more functions should
>     support characters greater than 255 over time.

Some certainly should. Others, which were designed for dealing with
byte strings, should not.

>         The StringType and UnicodeType objects should be aliases for
>         the same object. All PyString_* and PyUnicode_* functions should 
>         work with objects of this type.

Disagree. There should be support for a byte string type.

>         Ordinary string literals should allow large character escape codes
>         and generate Unicode string objects.

That is available today with the -U option. I'm -0 on disallowing byte
string literals, as I don't consider them too important.

>         The format string "S" and the PyString_AsString functions should
>         accept Unicode values and convert them to character arrays
>         by converting each value to its equivalent byte-value. Values
>         greater than 255 should generate an exception.

Disagree. Conversion should be automatic only up to 127; everything
else gives questionable results.

>         fopen should be like Python's current open function except that
>         it should allow and require an encoding parameter.

Disagree. This is codecs.open.

>         In general, it should be possible to use byte arrays where-ever
>         it is possible to use strings. Byte arrays could be thought of
>         as a special kind of "limited but efficient" string. Arguably we
>         could go so far as to call them "byte strings" and reuse Python's
>         current string implementation. The primary differences would be
>         in their "repr", "type" and literal syntax.

Agreed.

> Appendix: Using Non-Unicode character sets
> 
>     Let's presume that a linguistics researcher objected to the
>     unification of Han characters in Unicode and wanted to invent a
>     character set that included separate characters for all Chinese,
>     Japanese and Korean character sets. 

With ISO 10646, he could easily do so in a private-use plane. Of
course, implementations that only provide BMP support are somewhat
handicapped here.

>     Python needs to support international characters. The "ASCII" of
>     internationalized characters is Unicode. Most other languages have
>     moved or are moving their basic character and string types to
>     support Unicode. Python should also.

And indeed, Python does today. I don't see a problem *at all* with the
structure of the Unicode support in Python 2.0. As initial experiences
show, applications *will* need to be modified to take Unicode into
account; I doubt that any enhancements will change that.

Regards,
Martin