[I18n-sig] Strawman Proposal (2): Encoding attributes

M.-A. Lemburg mal@lemburg.com
Sat, 10 Feb 2001 13:37:06 +0100

Paul Prescod wrote:
> "M.-A. Lemburg" wrote:
> >
> > > ...
> > > Also, if we wanted a quick hack, couldn't we implement it at first by
> > > "decoding" to UTF-8? Then the parser could look for UTF-8 in Unicode
> > > string literals and translate those into real Unicode.
> >
> > I don't want to do "quick hacks", so this is a non-option.
> If it works and it is easy, there should not be a problem!

This is how I got into the Unicode debate (by making UTF-8 the default
encoding). It doesn't work out... let's not restart that discussion.

> > Making the parser Unicode aware is non-trivial as it requires
> > changing lots of the internals which expect 8-bit C char buffers.
> Are you talking about the Python internals or the parser internals? If
> the former, then I do not think you are correct. Only the parser needs
> to change.

The parser would have to accept Py_UNICODE strings and work
on these. The compiler needs to be able to convert Py_UNICODE
back to char for e.g. string literals.

We'd also have to provide external interfaces which convert
char input for the parser into Unicode. This would introduce
many new locations of possible breakage (please remember that
variable length encodings are *very* touchy about wrong byte 

> > If we change the parser to use Unicode, then we would
> > have to decode *all* program text into Unicode and this is very
> > likely to fail for people who put non-ASCII characters into their
> > string literals.
> Files with no declaration could be interpreted byte for char just as
> they are today!

Then we'd have to write two sets of parsers and compilers:
one for Py_UNICODE and one for char... no way ;-)
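
To illustrate the decoding problem discussed above, here is a minimal
sketch. It is written for a current Python 3 interpreter, not the Python
of this thread, and the source line and the helper name try_decode are
hypothetical, made up for the example. It shows that a string literal
containing a Latin-1 byte cannot be decoded as UTF-8 or ASCII, while a
byte-for-char reading (Latin-1) always succeeds and round-trips.

```python
# Hypothetical source line whose string literal holds a Latin-1 "a-umlaut"
# (single byte 0xE4). This is an assumption made up for the example.
source_bytes = b's = "M\xe4rchen"\n'


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xE4 starts a three-byte UTF-8 sequence, but the next byte ("r") is not a
# continuation byte, so a strict UTF-8 decoder rejects the whole line.
as_utf8 = try_decode(source_bytes, "utf-8")

# ASCII rejects every byte above 0x7F.
as_ascii = try_decode(source_bytes, "ascii")

# Latin-1 maps each byte to the code point with the same number, so it never
# fails and the original bytes can be recovered exactly.
as_latin1 = try_decode(source_bytes, "latin-1")
round_trip = as_latin1.encode("latin-1")

print("utf-8  :", as_utf8)
print("ascii  :", as_ascii)
print("latin-1:", repr(as_latin1))
```

The sketch only shows that decoding behaviour differs by encoding; it says
nothing about how a parser would have to be restructured to use the result.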

> > ....
> > ASCII is not Euro-centric at all since it is a common subset
> > of very many common encodings which are in use today.
> Oh come on! The ASCII characters are sufficient to encode English and a
> very few other languages.

Paul, programs have been written in ASCII for many many years.
Are you trying to tell me that 30+ years of common usage should
be ignored ? Programmers have gotten along with ASCII quite well,
not only English speaking ones -- ASCII can be used to approximate
quite a few other languages as well (provided you ignore accents
and the like). For most other languages there are transliterations
into ASCII which are in common use.
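
The "common subset" point can be checked with a short sketch, again
assuming a current Python 3 interpreter rather than the Python of this
thread. The list of encodings is just a sample picked for the example.
All 128 ASCII byte values decode to the same text under each of them,
whereas a single byte above 0x7F means something different in each
single-byte encoding.

```python
# Every byte value in the ASCII range, 0x00 through 0x7F.
ascii_bytes = bytes(range(128))

# A sample of encodings (chosen for illustration, not an exhaustive list).
ENCODINGS = [
    "ascii", "utf-8", "latin-1", "cp1252", "iso8859_2",
    "iso8859_5", "iso8859_7", "koi8_r", "euc_jp", "big5",
]

# Decode the same bytes under each encoding and compare the results.
decoded = {enc: ascii_bytes.decode(enc) for enc in ENCODINGS}
all_agree = len(set(decoded.values())) == 1

# The same experiment with one non-ASCII byte, limited to single-byte
# encodings so that the lone byte is always valid input.
SINGLE_BYTE = ["latin-1", "iso8859_5", "iso8859_7", "koi8_r"]
high_byte = {enc: b"\xe4".decode(enc) for enc in SINGLE_BYTE}

print("ASCII range agrees everywhere:", all_agree)
for enc, char in high_byte.items():
    print(f"0xE4 in {enc}: U+{ord(char):04X}")
```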

For other good arguments, see Tim's post on the subject.

> > Latin-1
> > would be, though... which is why ASCII was chosen as standard
> > default encoding.
> We could go back and forth on this but let me suggest you type in a
> program with Latin 1 in your Unicode literals and try and see what
> happens. Python already "recognizes" that there is a single logical
> translation from "old style strings" to Unicode strings and vice versa.

Fact is, I would never use Latin-1 characters outside of literals.
All my programs are written in (more or less ;) English, even the
comments and doc-strings. If you ever write applications which
programmers from around the world are supposed to comprehend
and maintain, then English is the only reasonable common base,
at least IMHO.

> > The added flexibility in choosing identifiers would soon turn
> > against the programmers themselves. Others have tried this and
> > failed badly (e.g. look at the language specific versions of
> > Visual Basic).
> That's a totally different and unrelated issue. Nobody is talking about
> language specific Pythons. We're talking about allowing people to name
> variables in their own languages. I think that anything else is
> Euro-centric.

Funny, how you always refer to "Euro"-centric... ASCII is an
American standard ;-)

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/