[I18n-sig] Strawman Proposal (2): Encoding attributes

Paul Prescod paulp@ActiveState.com
Fri, 9 Feb 2001 07:29:54 -0800 (PST)

On Fri, 9 Feb 2001, M.-A. Lemburg wrote:

> ...
> I'd rather restrict this to ASCII since codec names must be ASCII
> and this would also allow detecting wrong formats of the source file
> in addition to make UTF-16 detection possible.

That's fine with me.

> > <some string> is the encoding name and must be associated with a
> > registered codec. The appropriate codec is used to decode the source
> > file.
> Decode to what other format ? Unicode, the current locale's encoding ?
> What would happen if the decoding step fails ?

We would decode to Unicode. If decoding fails, you get some kind of
EncodingException. For imported modules, the import machinery would trap
it and re-raise it as an ImportError.
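
A rough sketch of that decode step, written against today's Python rather
than anything that exists in the proposal. The helper name decode_source is
hypothetical, and LookupError/UnicodeDecodeError stand in for the
"EncodingException" mentioned above:

```python
import codecs


def decode_source(raw: bytes, encoding_name: str) -> str:
    """Decode raw source bytes to Unicode text using a registered codec.

    Hypothetical helper: the proposal does not name any such function.
    """
    try:
        # The declared name must map to a registered codec.
        codecs.lookup(encoding_name)
        return raw.decode(encoding_name)
    except (LookupError, UnicodeDecodeError) as exc:
        # Assumed behaviour for imported modules: surface the decoding
        # failure as an ImportError.
        raise ImportError(
            f"cannot decode source as {encoding_name!r}: {exc}"
        ) from exc
```

Whether a script run directly (rather than imported) should see the
decoding error itself or something else is left open here.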

> > The decoded result is passed to the compiler. Once the decoding is
> > done, the encoding declaration has no other effect. In other words, it
> > does not further affect the interpretation of string literals with
> > non-ASCII characters or anything else.
> But if it doesn't affect the interpretation of string literals then
> what benefits do we gain from knowing the encoding ?

Let's say that you have a string literal:


XX are bytes representing a character. If the character represented has an
ordinal less than 255 then this would work. More often you would say:


The system would treat those examples no differently than this one:


This keeps the model very simple and allows us to evolve to wide-character
variable names some day.
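
To illustrate the "decode first, then compile" model, here is a small
sketch. It assumes a modern Python in which every string literal is already
Unicode, which is not the situation discussed in this thread, and the
function name load is made up for the example. The point is only that once
the bytes are decoded, the compiler sees the same text no matter which
encoding the file was saved in:

```python
def load(raw: bytes, encoding_name: str) -> dict:
    """Decode source bytes, compile the resulting text, and run it.

    Hypothetical helper; the encoding plays no role after decoding.
    """
    source = raw.decode(encoding_name)
    namespace = {}
    exec(compile(source, "<decoded>", "exec"), namespace)
    return namespace


# The same program text, stored on disk in two different encodings.
text = 'greeting = "привет"\n'
from_koi8 = load(text.encode("koi8-r"), "koi8-r")
from_utf8 = load(text.encode("utf-8"), "utf-8")

# Both files produce the same Unicode string.
same = from_koi8["greeting"] == from_utf8["greeting"]
```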

> I think that such a scheme is indeed possible, but not until we
> have made all strings default to Unicode. Then decoding to Unicode
> would be the proper thing to do.

Making all strings default to Unicode is a good idea but it is a separate
project. I think that my proposal above is still useful. It means that a
Russian can type Unicode characters into their document using their KOI8-R

They can't type those Unicode characters directly into a string literal
but why would they want to now that we have Unicode? If there is some
reason they want to keep typing wide chars into string literals then there
must be some problem with our Unicode support and we should work that out.
Until we work that out, they probably just wouldn't use our encoding
declaration feature.

 Paul Prescod