[I18n-sig] Strawman Proposal (2): Encoding attributes

M.-A. Lemburg mal@lemburg.com
Fri, 09 Feb 2001 18:24:45 +0100

Paul Prescod wrote:
> On Fri, 9 Feb 2001, M.-A. Lemburg wrote:
> > ...
> > I'd rather restrict this to ASCII since codec names must be ASCII
> > and this would also allow detecting wrong formats of the source file
> > in addition to making UTF-16 detection possible.
> That's fine with me.
> > > <some string> is the encoding name and must be associated with a
> > > registered codec. The appropriate codec is used to decode the source
> > > file.
> >
> > Decode to what other format ? Unicode, the current locale's encoding ?
> > What would happen if the decoding step fails ?
> We would decode to Unicode. If decoding fails you get some kind of
> EncodingException error. This would be trapped in import machinery to be
> raised as an ImportError for imported modules.
> > > The decoded result is passed to the compiler. Once the decoding is
> > > done, the encoding declaration has no other effect. In other words, it
> > > does not further affect the interpretation of string literals with
> > > non-ASCII characters or anything else.
> >
> > But if it doesn't affect the interpretation of string literals then
> > what benefits do we gain from knowing the encoding ?
> Let's say that you have a string literal:
> a="XX"
> XX are bytes representing a character. If the character represented has an
> ordinal less than 255 then this would work. More often you would say:
> a=u"XX"
> The system would treat those examples no differently than this one:
> XX="a"
> This keeps the model very simple and allows us to evolve to wide-character
> variable names some day.
> > I think that such a scheme is indeed possible, but not until we
> > have made all strings default to Unicode. Then decoding to Unicode
> > would be the proper thing to do.
> Making all strings default to Unicode is a good idea but it is a separate
> project. I think that my proposal above is still useful. It means that a
> Russian can type Unicode characters into their document using their KOI8-R
> editor.
> They can't type those Unicode characters directly into a string literal
> but why would they want to now that we have Unicode? If there is some
> reason they want to keep typing wide chars into string literals then there
> must be some problem with our Unicode support and we should work that out.
> Until we work that out, they probably just wouldn't use our encoding
> declaration feature.

Ah, ok. The encoding information will only be applied to literal
Unicode strings (u"text"), right ?

That's in line with what we have already discussed here or on
python-dev some time ago. Only then we tried to achieve this using
some form of pragma statement.

So what this strawman suggests is, in summary:

1. add an encoding identifier to the top of a source code file
2. use that encoding information to decode u"..." literals into Unicode
3. leave all other literals and text alone

Sounds ok, even though it should probably be made clear that only
the u"" literals actually use the encoding information (perhaps
the name should be #?unicode-encoding="" ?) and nothing else.
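
To make the three steps concrete, here is a rough sketch of how I read
them. It is only an illustration, not a proposed implementation: the
#?unicode-encoding="..." spelling is just the name floated above, the
helper names split_declaration() and literals() are made up, the
fallback to ASCII when no declaration is present is my assumption, and
the literal matcher is deliberately naive (double quotes only, no
escapes).

```python
import codecs
import re

# Hypothetical declaration syntax; the name must be plain ASCII.
DECLARATION = re.compile(rb'^#\?unicode-encoding="([A-Za-z0-9._-]+)"')
# Naive literal matcher: optional u prefix, double quotes, no escapes.
LITERAL = re.compile(rb'(?<![A-Za-z0-9_])(u?)"([^"\n]*)"')


def split_declaration(source):
    """Step 1: read the encoding identifier from the first line."""
    first, _, rest = source.partition(b"\n")
    match = DECLARATION.match(first)
    if match is None:
        # Assumption: no declaration means plain ASCII.
        return "ascii", source
    name = match.group(1).decode("ascii")
    codecs.lookup(name)  # raises LookupError for unregistered codecs
    return name, rest


def literals(source):
    """Steps 2 and 3: decode u"..." literals, leave "..." as raw bytes."""
    encoding, body = split_declaration(source)
    found = []
    for prefix, raw in LITERAL.findall(body):
        if prefix == b"u":
            found.append(raw.decode(encoding))  # becomes Unicode
        else:
            found.append(raw)  # left alone
    return found


# A KOI8-R encoded source file with one plain and one Unicode literal.
word = "\u043c\u0438\u0440".encode("koi8-r")
source = (b'#?unicode-encoding="koi8-r"\n'
          b'a="' + word + b'"\n'
          b'b=u"' + word + b'"\n')
plain, decoded = literals(source)
```

Here plain stays the three KOI8-R bytes exactly as typed, while decoded
is the three-character Unicode string. A u"..." literal whose bytes do
not fit the declared encoding raises UnicodeDecodeError in this sketch,
which is where the import machinery would have to step in.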

Marc-Andre Lemburg
Company:                                        http://www.egenix.com/
Consulting:                                    http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/