[I18n-sig] Pre-PEP: Proposed Python Character Model

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 7 Feb 2001 09:06:40 +0100


> Once I have a file object, I don't know of a way to read unicode from it
> without reading bytes and then decoding into another string...but I may
> just not know that there is a more efficient way.

Just try

  import codecs

  # element 2 of the lookup result is the StreamReader factory
  reader = codecs.lookup("ISO-8859-2")[2]
  charfile = reader(file)

There could be a convenience function, but that also is a detail.
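
In fact, codecs.open (which comes up again below) already is such a
convenience function; the file name here is just for illustration:

  import codecs

  charfile = codecs.open("foo.txt", "r", "ISO-8859-2")
  data = charfile.read()   # a Unicode string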

> CHAR is not a useful set in a computer science sense because if items
> from it are addressable or comparable then there exists an ord()
> function. 

This domain was for definition purposes only; I would not assume that
items are addressable or comparable except for equality (i.e. they are
unordered).

> Therefore there is a character set. If the items are not
> addressable or comparable then how would you make use of it?

To represent a character in a computer, you need to have a character
set; I certainly agree with that. I was just pointing out that the
*same* character can exist in different character sets (e.g. LATIN
SMALL LETTER A WITH DIAERESIS is byte 0xE4 in ISO 8859-1 and U+00E4
in Unicode).

> > >         There is only one standardized international character set that
> > >         allows for mixed-language information.
> > 
> > Not true. E.g. ISO 8859-5 allows both Russian and English text,
> > ISO 8859-2 allows English, Polish, German, Slovakian, and a few
> > others. 
> 
> If you want to use a definition of "international" that means "European"
> then I guess that's fair. But you don't say you've internationalized a
> computer program when you've added support for the Canadian dollar along
> with the American one. :)

My definition of "international standard" is "defined by an
international organization", such as ISO. So ISO 8859 certainly
qualifies. ISO 646 (aka ASCII) is also an international standard; it
even allows for "national variants", but it does not allow
mixed-language information. As for ISO 8859, it also supports Arabic
and Hebrew, BTW.

> > Isn't the BMP the same as Unicode, as it is the BMP (i.e. group 0,
> > plane 0) of ISO 10646?
> 
> No, Unicode has space for 16 planes:
> 
> UTF-16 extra planes (to be filled by Unicode 4 and ISO-10646-2) 

Ok. Good that they consider that part of Unicode now; that was not
always the case.

> I don't recall suggesting any such thing! chr() of a byte string should
> return the byte value. chr() of a unicode string should return the
> character value.

chr() of a byte string? How exactly do I write that down? I.e. if I
have chr(42), what do I get?
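
Today there are two spellings, which is why I am asking:

  >>> chr(42)      # a byte string of length one
  '*'
  >>> unichr(42)   # a Unicode string of length one
  u'*'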

> Not under my proposal. file.read returns a character string. Sometimes
> the character string contains characters between 0 and 255 and is
> indistinguishable from today's string type. Sometimes the file object
> knows that you want the data decoded and it returns large characters.

I guess we have to defer this until I see whether it is feasible
(which I believe it is not - it was the mistake Sun made in the early
JDKs).


> I believe that ASCII is both a character set and an encoding. If not,
> what is the name for the encoding we've been using prior to Unicode?

For ASCII, only a single encoding is common today. I think there used
to be other modes of operation, but nobody cared to give them names.

> > Sounds good. Note that the proper way to write this is
> 
> We need a built-in function that everyone uses as an alternative to the
> byte/string-ambiguous "open".

Why is that a requirement?

> >    fileobj = codecs.open("foo", "r", "ASCII")
> >    # etc
> > 
> > >         fileobj2.encoding = "UTF-16" # changed my mind!
> > 
> > Why is that a requirement? In a normal stream, you cannot change the
> > encoding in the middle - in particular not from Latin 1 single-byte to
> > UTF-16.
> 
> What is a "normal stream?" 

I meant the one returned from open().

> I can imagine all kinds of pickle-like or structured stream file
> formats that switch back and forth between binary information,
> strings and unicode.

For example? If a format supports mixing binary and text information,
it needs to specify what encoding to use for the text fragments, and
it needs to specify how exactly conversion is performed (in case of
stateful codecs). It is certainly the application's job to get this
right; only the application knows how the format is supposed to work.
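
As a sketch (the framing is invented for illustration; a real format
would pin these details down in its specification):

  import struct

  def read_text_fragment(f):
      # hypothetical framing: a 4-byte big-endian length, followed
      # by that many bytes of UTF-8 encoded text
      (length,) = struct.unpack(">i", f.read(4))
      return unicode(f.read(length), "utf-8")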

> BTW, you only know the encoding of an XML file after you've read the
> first line...

Certainly. You don't know the encoding of a MIME message until you
have seen the Content-Type and Content-Transfer-Encoding fields.
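
For XML, a crude sketch (ignoring byte order marks and encodings in
which the declaration itself is not ASCII-compatible):

  import re, codecs

  f = open("doc.xml", "rb")
  m = re.search('encoding="([^"]+)"', f.readline())
  encoding = m and m.group(1) or "UTF-8"
  f.seek(0)
  charfile = codecs.lookup(encoding)[2](f)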

> > The specific syntax may be debatable; I dislike semantics being put in
> > comments. There should be first-class syntax for that. I agree with
> > the approach in principle.
> 
> We need a backwards-compatible syntax...

Why is that? The backwards-compatible way of writing funny bytes is
to use \x escapes.
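
E.g.

  s = "Gr\xfc\xdfe"          # Latin-1 bytes: u-umlaut, sharp s
  u = unicode(s, "latin-1")  # the corresponding character string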

> This is a fundamental disagreement that we will have to work through.
> What is "questionable" about interpreting a unicode 245 as a character
> 245? If you wanted UTF-8 you would have asked for UTF-8!!!

Likewise, if you want Latin-1 you should ask for it. Explicit is
better than implicit.
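
Concretely, for byte 245:

  unicode("\xf5")             # fails: the default codec is ASCII
  unicode("\xf5", "latin-1")  # explicit, gives u'\xf5'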

> > Disagree. This is codecs.open.
> 
> codecs.open will never become popular.

Why is that?

> Let's say you are a Chinese Tcl programmer. If you know the escape code
> for a Kanji character you put it in a string literal just as a Westerner
> would do. 

If, as a programmer, I have to use escape codes to put a character
into my source, I consider this quite inconvenient. Instead, I'd like
to use my keyboard to put in the characters I care about, and I'd like
them to be printed in the way I recognize them.

> The same Chinese Python programmer must use a special syntax of string
> literal and the object he creates has a different type and lots and lots
> of trivial

That Chinese Python programmer should use his editor of choice, and
put _() around strings that are meant as text (as opposed to strings
that are protocol). At the beginning of the module, he should write

  def _(s): return unicode(s, "BIG-5")

(assuming BIG-5 is what his editor produces). Not that inconvenient,
and I doubt the same thing is easier in Tcl.
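
Usage then stays ordinary (the literal stands for whatever bytes his
editor writes):

  title = _("...")      # BIG-5 bytes typed directly in the editor
  if len(title) > 10:   # len() now counts characters, not bytes
      title = title[:10]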

> otherwise language-agnostic code crashes because it tests for
> type("") when it could handle large character codes without a
> problem.

Yes, using type("") is a problem. I'd like to see a symbolic name

  StringTypes = [StringType, UnicodeType]

in the types module.
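
Until then, a module can spell out the pair itself (is_text is just
my name for the obvious helper):

  from types import StringType, UnicodeType

  StringTypes = (StringType, UnicodeType)

  def is_text(obj):
      return type(obj) in StringTypes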

Regards,
Martin