[I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model
Martin v. Loewis
Wed, 7 Feb 2001 08:32:53 +0100
> > Then, people who put KOI8-R into their Python source code will
> > complain why the strings come out incorrectly, even though they set
> > their language to Russian, and even though it worked that way in
> > earlier Python versions.
> I don't follow.
> If I have:
> XXX is a series of non-ASCII bytes. Those are mapped into Unicode
> characters with the same ordinals. Now you write them to a file. You
> presumably do not specify an encoding on the file write operation. So
> the characters get mapped back to bytes with the same ordinals. It all
> behaves as it did in Python 1.0 ...
They don't write them to a file. Instead, they print them in the IDLE
terminal, or display them in a Tk or PythonWin window. Both support
arbitrarily many characters, and will treat the bytes as characters
originating from Latin-1 (according to their ordinals).
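
To make this concrete, here is a minimal sketch of the display problem.
It is written in present-day Python 3 rather than the Python of this
thread, and the sample word is an assumption chosen only for
illustration. Mapping each byte to the character with the same ordinal
is equivalent to decoding as Latin-1, so a Unicode-aware display shows
accented Latin letters instead of the Cyrillic the author typed.

```python
# Hypothetical sample text: a Russian word stored as KOI8-R bytes,
# as it might appear in a legacy source file.
koi8_bytes = "привет".encode("koi8_r")

# "Same ordinal" mapping of bytes to characters is Latin-1 decoding.
as_latin1 = koi8_bytes.decode("latin-1")

# What the author actually meant.
intended = koi8_bytes.decode("koi8_r")

# A byte-for-byte write still round-trips, which is why files looked fine.
round_trip = as_latin1.encode("latin-1")

print(repr(as_latin1))  # what a Unicode-aware widget would display
print(repr(intended))   # what the author expected to see
```

The round trip to a file is lossless, but anything that renders
`as_latin1` as characters shows the wrong text.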
Or, they pass them as attributes in a DOM method, which, on
write-back, will encode every string as UTF-8 (as that is the default
encoding of XML). Then the characters will get changed, when they
> You can only introduce characters greater than 256 into strings
> explicitly and presumably legacy code does not do that because there
> was no way to do that!
Legacy code will pass them to applications that know to operate with
the full Unicode character set, e.g. by applying encodings where
necessary, or selecting proper fonts (which might include applying
encodings). *That* is where it will break, and the library has no way
of telling whether the strings were meant as byte strings (in an
unspecified character set), or as Unicode character strings.
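
A small sketch of that breakage, again in present-day Python 3 and with
a made-up sample word: a library that believes it holds Unicode
characters and writes them out as UTF-8 (as an XML serializer would by
default) produces bytes that no longer match the original KOI8-R data.

```python
# Hypothetical legacy data: KOI8-R bytes that were turned into
# characters by the "same ordinal" (Latin-1) mapping.
koi8_bytes = "привет".encode("koi8_r")
as_chars = koi8_bytes.decode("latin-1")

# A Unicode-aware library applies an encoding on output, here UTF-8.
utf8_out = as_chars.encode("utf-8")

# Every original byte was >= 0x80, so each one became two UTF-8 bytes.
print(len(koi8_bytes), len(utf8_out))

# A consumer that decodes the output as UTF-8 gets the Latin-1
# characters back, not the Cyrillic text the author intended.
decoded = utf8_out.decode("utf-8")
```

Nothing in `as_chars` tells the library that the characters were really
undecoded KOI8-R bytes, so it cannot avoid the conversion.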
> It isn't the appropriate time to create such a core code patch. I'm
> trying to figure out our direction so that we can figure out what can be
> done in the short term. The only two things I can think of are merge
> chr/unichr (easy) and provide encoding-smart alternatives to open() and
> read() (also easy). The encoding-smart alternatives should also be
> documented as preferred replacements as soon as possible.
I'm not sure they are preferred. They are if you know the encoding of
your data sources. If you don't, you had better be safe than sorry.
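
As an illustration of the distinction, the following sketch uses the
`encoding` argument of the built-in `open()` in present-day Python 3,
which postdates this thread; the file name and sample text are
assumptions. When the encoding is known, an encoding-aware read yields
real characters. When it is not, reading in binary mode keeps the data
as bytes and defers the decision.

```python
import os
import tempfile

text = "привет"  # hypothetical sample content
path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file

# Known encoding: state it explicitly on write and on read.
with open(path, "w", encoding="koi8_r") as f:
    f.write(text)
with open(path, encoding="koi8_r") as f:
    known = f.read()

# Unknown encoding: stay with bytes and do not guess.
with open(path, "rb") as f:
    raw = f.read()

print(repr(known))
print(repr(raw))
```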