PEP: Defining Python Source Code Encodings

Roman Suzi rnd at onego.ru
Tue Jul 17 09:37:50 EDT 2001


On Tue, 17 Jul 2001, M.-A. Lemburg wrote:

> > I think that if the encoding is not given, it must silently assume an "UNKNOWN"
> > encoding and do nothing, that is, be 8-bit clean (as it is now).
> 
> To be 8-bit clean it will have to use Latin-1 as fallback encoding
> since this encoding assures the roundtrip safety (decode to Unicode,
> then reencode).

Nope. There must be no encode-decode round trip, or it will slow down
starting Python _scripts_ unnecessarily.

That is why I suggested an "unknown" encoding - a safe default
for those who do not want any back-and-forth recoding.
  
> > Otherwise, it will slow down the parser considerably.
> 
> Yes, that could be an issue (I don't think it matters much though,
> since parsing is usually only done during byte-code compilation and
> the results are buffered in .pyc files).

No! Scripts are compiled each time they run.
If this is implemented, developers will need to resort to the trick
of making each script a module, and so on. That is not a good idea.

There clearly must be a way to prevent the encode-decode. And it would be
better if only an EXPLICITLY given encoding triggered the encode-decode
mechanism.

> > I also think that if an encoding is chosen, there is no need to re-encode
> > back to literal strings: let them be in Unicode.
> 
> That would be nice, but is not feasible at the moment (just try
> to run Python with the -U option and see what happens...).

Then indeed --encoding= is needed with -U ;-)
  
> > Or am I missing something?
> 
> It won't switch any time soon... there's still too much work
> ahead and I'm also pretty sure that the 8-bit string type won't
> go away for backward compatibility reasons.

...and efficiency reasons too. re was slowed down significantly by adding
Unicode support.

> > >    To make this backwards compatible, the implementation would have to
> > >    assume Latin-1 as the original file encoding if not given (otherwise,
> > >    binary data currently stored in 8-bit strings wouldn't make the
> > >    roundtrip).
> > 
> > ...as I said, there must be no assumed charset. Things must
> > be left as they are now when no explicit encoding is given.
> 
> This is what the Latin-1 encoding assures.

I still think something like "raw" is needed...
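
For reference, the round-trip guarantee in question is just that Latin-1 maps
every byte 0-255 to the Unicode code point with the same number, so a decode
followed by a re-encode gives the original bytes back. A minimal sketch,
Python 2-era syntax assumed:

    all_bytes = ''.join(map(chr, range(256)))    # every possible byte value
    assert unicode(all_bytes, 'latin-1').encode('latin-1') == all_bytes

"Raw" would simply skip even that step.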
  
> > > 4. The encoding used in a Python source file should be easily
> > >    parseable for an editor; a magic comment at the top of the
> > >    file seems to be what people want to see, so I'll drop the
> > >    directive (PEP 244) requirement in the PEP.
> > >
> > > Issues that still need to be resolved:
> > >
> > > - how to enable embedding of differently encoded data in Python
> > >   source code (e.g. UTF-8 encoded XML data in a Latin-1
> > >   source file)
> > 
> > Probably by adding explicit conversions.
> 
> Yes, but there are cases where the source file having the embedded
> data will not decode into Unicode (I got the example backwards:
> think of a UTF-8 encoded source file with a Latin-1 string literal).

Maybe UTF-7 for embedded things ;-)
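
The point is that UTF-7 output is pure ASCII, so it can sit in a source file
of any 8-bit encoding without being touched. A minimal sketch, assuming
Python 2-era syntax and that a utf-7 codec is available:

    data = u'\u043f\u0440\u0438\u0432\u0435\u0442'    # some non-ASCII text
    embedded = data.encode('utf-7')                    # ASCII-only 8-bit string
    assert unicode(embedded, 'utf-7') == data          # lossless round trip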
 
> Perhaps we should simply rule out this case and have the 
> programmer stick to the source file encoding + some escaping
> or a run-time recoding of the literal data into the preferred
> encoding.

This is probably wise. A Python program need not be a
whole zoo of encodings...
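
And run-time recoding is simple enough when it is really needed. A minimal
sketch, Python 2-era syntax and a made-up literal assumed: the source file
itself is Latin-1, but the data is wanted as UTF-8 at run time:

    xml_payload = '<name>Bj\xf8rn</name>'                # Latin-1 bytes as written
    utf8_payload = unicode(xml_payload, 'latin-1').encode('utf-8')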
  
> > No variant is ideal. The 2nd is the worst/best of all
> > (it depends on how you look at it!)
> > 
> > Python has no macro directives. In this situation
> > they could help greatly!
> 
> We've been discussing these on python-dev, but Guido is not
> too keen on having them.

And this is right. I even think the encoding information could be EXTERNAL.
For example, a directory would need to have an "__encodings__.py" file where
the encodings are listed.

Then Python, if started with some command-line switch, could check for such a
file and compile the modules accordingly.

If there is no __encodings__.py, then Python proceeds as it does now,
WITHOUT any conversions back and forth.

This would make script writers happy (no conversion overhead) as well as
those who want to write encoding-aware programs (they could provide
__encodings__.py).

I think this solves most problems.
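
Purely hypothetical, of course (the exact format would be whatever we agree
on), but such a file could be as simple as a dictionary from module name to
encoding:

    # __encodings__.py  (hypothetical per-directory encoding table)
    encodings = {
        'messages_ru': 'koi8-r',
        'messages_de': 'latin-1',
        '*':           'ascii',      # default for all other modules here
    }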

> > That "#!encoding" is a special case of a macro directive.
> > 
> > Maybe just put something like ''# <!DOCTYPE HTML PUBLIC''
> > at the beginning...
> > 
> > Or, an even better idea occurred to me: allow some XML
> > with meta-information (not only encoding), somehow escaped.
> > 
> > I think GvR could come up with some advice here...
> > 
> > > Comments are welcome !
> 
> Thanks for your comments,

I just hope the realisation of your PEP will not make Python scripts
run slower ;-) while allowing truly useful i18n functionality.

*

I have a feeling that the PEP is solving a non-problem. Or have I lost
the thread?
 
For example, I usually write scripts where I assume "koi8-r" or
"windows-1251" encodings. The only problems I have are when I use
"koi8-r" strings from modules when I need "1251" ones. (In this case I
explicitly recode.)
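
The explicit recoding itself is just a decode plus an encode. A minimal
sketch, Python 2-era syntax assumed:

    s_koi8 = '\xd0\xd2\xc9\xd7\xc5\xd4'                         # a koi8-r literal
    s_1251 = unicode(s_koi8, 'koi8-r').encode('windows-1251')   # explicit recode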

Aren't encodings better confined to documents (in XML, for example)
than to programs?

If programs need to be written in Unicode, then isn't the next step
to allow embedding sound, graphics and video?
(I can imagine a video docstring ;-)

I can admit that using utf-8 for writing programs could be justified,
but Unicode... It will bring a nightmare...

Sincerely yours, Roman A.Suzi
-- 
 - Petrozavodsk - Karelia - Russia - mailto:rnd at onego.ru -
 