[I18n-sig] Re: [Python-Dev] unichr

M.-A. Lemburg mal@lemburg.com
Thu, 08 Feb 2001 17:45:21 +0100

Paul Prescod wrote:
> On Thu, 8 Feb 2001, M.-A. Lemburg wrote:
> > You are forgetting that the range 128-255 is used by many codepages
> > to support language specific characters.
> No, I'm not forgetting that. I just don't think it is relevant.

It is not irrelevant, as you describe below...

> > chr(0xE0) will give different
> > characters in the US than e.g. in Russia. If we were to simply
> > let these conversions slip through, then people would find garbled
> > data in their text files.
> People in Russia understand the concept of code pages. They know that
> if they put "special" characters in their files they will be interpreted
> on other platforms as Western European characters. If we make it easy for
> them to explicitly state their encoding then they will do so and get better
> behavior than they did before. We can also simplify Python and remove an
> arbitrary restriction at the same time.

Well, we can remove the restriction for string literals, but
the same coercion happens for generated strings and these are not
under control of some source encoding parameter.
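
To make the codepage point concrete, here is a minimal sketch. It assumes
a current Python 3 interpreter (where bytes and text are separate types),
not the 2.x string type discussed in this thread, and the helper names
decode_under and decodes_as_ascii are made up for the illustration. It
decodes the single byte 0xE0 under three common 8-bit encodings and then
checks what a strict ASCII default does with it.

```python
# Illustration only: one byte value, read under several 8-bit encodings.
RAW = bytes([0xE0])


def decode_under(codecs, raw=RAW):
    """Return a {codec name: decoded text} mapping for the given bytes."""
    return {name: raw.decode(name) for name in codecs}


def decodes_as_ascii(raw):
    """Return True if raw is pure ASCII, False if strict decoding fails."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


results = decode_under(["latin-1", "cp1251", "koi8-r"])
for name, text in results.items():
    # Print code points rather than glyphs so the output is terminal-safe.
    print(name, hex(ord(text)))

# With ASCII as the assumed encoding, the byte is rejected instead of
# being silently mapped to some character.
print(decodes_as_ascii(RAW))
```

Each codec maps the same byte to a different code point, which is why
guessing one of them silently would garble data for everyone else; the
ASCII case fails loudly instead.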

I once suggested that strings (the 8-bit ones) get an .encoding
attribute to carry along that information, but it quickly became
clear that the idea would not be of much use, both because of the
generation problem and because the only sensible coercion between a
string with encoding information and one without it is to produce a
new string without encoding information (or maybe not to coerce them
at all).
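
The coercion rule can be sketched as follows. This is an illustration
only: TaggedBytes is a hypothetical name, not the code that was proposed
back then and not anything in Python, and the choice to drop the tag when
two different encodings meet is an assumption of the sketch. It is written
for a current Python 3 interpreter.

```python
class TaggedBytes:
    """Hypothetical 8-bit string that optionally remembers its encoding."""

    def __init__(self, data, encoding=None):
        self.data = bytes(data)
        self.encoding = encoding  # None means "encoding unknown"

    def __add__(self, other):
        if not isinstance(other, TaggedBytes):
            return NotImplemented
        # Keep the tag only when both operands carry the same known
        # encoding; in every other case the information is lost.
        if self.encoding is not None and self.encoding == other.encoding:
            encoding = self.encoding
        else:
            encoding = None
        return TaggedBytes(self.data + other.data, encoding)


tagged = TaggedBytes(b"\xe0", "cp1251")   # e.g. from a declared literal
generated = TaggedBytes(b"abc")           # e.g. produced at run time

mixed = tagged + generated
print(mixed.encoding)              # tag is gone after one mixed operation
print((tagged + tagged).encoding)  # tag survives only in the uniform case
```

Since generated strings carry no tag, a single concatenation is enough to
lose the information, which is the reason the attribute turned out to be
of little practical use.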

See the python-dev archives for more on this idea (early last year).

> > Of course, if a user explicitly sets the default encoding to
> > Latin-1, then everything will be fine, but for ASCII (which is
> > the base of most character encodings in use today) there is
> > little else we can do except raise an exception.
> I don't think the "default encoding" is a relevant concept. Most people
> came out strongly against it on the Python lists and it was hidden from
> user view for that reason. It is a terrible idea to encourage people to
> write software that works right on their computer but not on anyone
> else's. I think that we should view the "default encoding" as an
> implementation artifact and nothing more. We need to define portable rules
> that will consistently make sense everywhere.

That is exactly why we made it as hard as possible for people to
*change* the default. It is pretty obvious that they are on their
own when trying to fiddle with site.py or sitecustomize.py.

Still, I believe it's a valid idea. Back when I wrote the proposal
for Unicode integration I had fixed the default encoding to UTF-8.
As the first working patches appeared, there was a long and heated
discussion about what encoding to choose as default (people didn't
like UTF-8).

There were basically two camps: UTF-8 and Latin-1. We then decided
to make the encoding a variable to have people try out different
encodings.

Next, the idea of a locale-based default encoding was brought up.
Fredrik and I then implemented the needed magic to figure out the
platform-specific default encoding, but subsequently the idea was
dropped by our BDFL in favour of ASCII, which is what we see now.

The support code was left in the distribution... and Pythoneers
quickly found it ;-)

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/