[I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 8 Feb 2001 01:37:56 +0100


> > They don't write them to a file. Instead, they print them in the IDLE
> > terminal, or display them in a Tk or PythonWin window. Both support
> > arbitrarily many characters, and will treat the bytes as characters
> > originating from Latin-1 (according to their ordinals).
> 
> I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data
> in a string literal. PythonWin and Tk expect Unicode. How could they
> display the characters correctly?

No: PythonWin and Tk can both tell Unicode strings and byte strings
apart (although Tk uses quite a funny algorithm to do so). If they see
a byte string, they convert it to a Unicode string using the platform
encoding (which is user-settable on both Windows and Unix), and
display that.
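
A minimal sketch of that conversion step (Python 2.0-style; the
'latin-1' value and the variable names are just assumptions for
illustration, since the real platform encoding is user-settable):

    # A byte string holding the single byte 0xE9.
    s = '\xe9'

    # Assume the platform encoding happens to be Latin-1, where
    # 0xE9 is LATIN SMALL LETTER E WITH ACUTE.
    platform_encoding = 'latin-1'

    # Decode the byte string into a Unicode character string;
    # this is what the widget then displays.
    u = unicode(s, platform_encoding)   # -> u'\xe9'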

> > Or, they pass them as attributes in a DOM method, which, on
> > write-back, will encode every string as UTF-8 (as that is the default
> > encoding of XML). Then the characters will get changed, when they
> > shouldn't.
> 
> What do you think *should* happen? These are the only choices I can
> think of:
> 
>  1. DOM encodes it as UTF-8
>  2. DOM blindly passes it through and creates illegal XML
>  3. (correct) User explicitly decodes data into Unicode charset.

What users expect to happen is 2: blind pass-through. They think
they can get it right; given enough control, this is feasible. It was
even common practice in the absence of Unicode objects, so a lot of
code depends on libraries passing things through as-is.

> The only sane thing to do when you don't know is to pass the characters
> as-is, char->ord->char.

So libraries need a way of telling for sure. With Python 2.0, they can
look at the type() and tell that something is really meant as a
character string; otherwise, I agree, they have to pass the data
through as-is.
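
As a sketch of that strategy (the helper name is hypothetical; the
type() check is the point), a serializer could do:

    def serialize_text(s):
        # A real character string: safe to encode as UTF-8 for XML.
        if type(s) is type(u''):
            return s.encode('utf-8')
        # A byte string of unknown encoding: pass it through as-is.
        return s

    serialize_text(u'\xe9')   # -> '\xc3\xa9' (UTF-8 bytes)
    serialize_text('\xe9')    # -> '\xe9' (untouched)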

Under your proposal, this strategy fails: libraries can no longer
tell for sure that something is really meant as a character string.

> > > The encoding-smart alternatives should also be
> > > documented as preferred replacements as soon as possible.
> > 
> > I'm not sure they are preferred. They are if you know the encoding of
> > your data sources. If you don't, you'd better be safe than sorry.
> 
> If you don't know the encoding of your data sources then you should say
> that explicitly in code rather than using the same functions as people
> who *do* know what their encoding is. Explicit is better than implicit,
> right? Our current default is totally implicit.

No, it's not. The current default is: always produce byte strings. In
many applications, people certainly *should* use character strings,
but they have to change their code for that. Telling everybody to use
plain open for everything is wrong; telling them to use codecs.open
for
character streams is right.
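
For example (the file name and the KOI8-R encoding are assumptions;
codecs.open has been available since Python 2.0):

    import codecs

    # Byte stream: read() returns a byte string, bytes untouched.
    f = open('data.txt')

    # Character stream: read() decodes the KOI8-R bytes and
    # returns a Unicode string.
    g = codecs.open('data.txt', 'r', 'koi8-r')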

Regards,
Martin