PEP: Defining Python Source Code Encodings

M.-A. Lemburg mal at lemburg.com
Wed Jul 18 09:34:51 EDT 2001


Roman Suzi wrote:
> 
> On Tue, 17 Jul 2001, M.-A. Lemburg wrote:
> 
> > > I think that if encoding is not given, it must silently assume "UNKNOWN"
> > > encoding and do nothing, that is be 8-bit clean (as it is now).
> >
> > To be 8-bit clean it will have to use Latin-1 as fallback encoding
> > since this encoding assures the roundtrip safety (decode to Unicode,
> > then reencode).
> 
> Nope. There must be no encode-decode back. Or it will slow down
> starting Python _scripts_ unnecessarily.
> 
> That is why I suggested "unknown" encoding - a safe default
> for those who do not want any back-and-forth recodings.

If you want to avoid having to decode and then reencode data
in the parser, we would have to live with two sets of parsers
in Python -- one for Unicode and one for 8-bit data.

I don't think that anyone would like to maintain those two
sets, so it's basically either go all the way or not move
at all.
 
> > > Otherwise, it will slow down parser considerably.
> >
> > Yes, that could be an issue (I don't think it matters much though,
> > since parsing is usually only done during byte-code compilation and
> > the results are buffered in .pyc files).
> 
> No! Scripts are compiled each time they run.
> If this is implemented, developers will need to do the trick
> of making each script a module and so on. This is not a good idea.
> 
> There clearly must be the way to prevent encode-decode. And it would be
> better if only EXPLICITLY given encoding will trigger encode-decode
> mechanism.

That's not true: Python caches byte-code compiled versions of
scripts in .pyc|o files. So the performance problem is really not
all that important.

> > > I also think that if encoding is chosen, there is no need to reencode it
> > > back to literal strings: let them be in Unicode.
> >
> > That would be nice, but is not feasible at the moment (just try
> > to run Python with -U option and see what happens...).
> 
> Then indeed --encoding= is needed with -U ;-)

No, a lot of voluntary work is needed to make the Python standard
lib Unicode compatible !
 
> > > Or am I missing something?
> >
> > It won't switch any time soon... there's still too much work
> > ahead and I'm also pretty sure that the 8-bit string type won't
> > go away for backward compatibility reasons.
> 
> ...and efficiency reasons too. re was slowed down significantly by adding
> Unicode support.

I seriously doubt that. Fredrik (who wrote the sre engine) is an
optimization genius and in some cases even made the sre engine
faster than the string module implementations of e.g. find().
 
> > > >    To make this backwards compatible, the implementation would have to
> > > >    assume Latin-1 as the original file encoding if not given (otherwise,
> > > >    binary data currently stored in 8-bit strings wouldn't make the
> > > >    roundtrip).
> > >
> > > ...as I said, there must be no assumed charset. Things must
> > > be left as is now when no explicit encoding given.
> >
> > This is what the Latin-1 encoding assures.
> 
> I still think something like "raw" is needed...

Latin-1 gives you this "raw" feature.
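
To illustrate the round trip, here is a minimal sketch in present-day
Python syntax, where bytes plays the role of the 8-bit string (the
variable names are only for illustration). Latin-1 maps each byte
value 0-255 to the Unicode code point with the same number, so
decoding and then reencoding returns the original data unchanged:

```python
# Every possible byte value, standing in for arbitrary binary data
# stored in an 8-bit string literal.
raw = bytes(range(256))

# Decode to Unicode as the parser would, then reencode.
text = raw.decode("latin-1")
roundtrip = text.encode("latin-1")

# Latin-1 maps byte n to code point n, so nothing is lost.
latin1_is_lossless = (roundtrip == raw)
codepoints_match = all(ord(ch) == b for ch, b in zip(text, raw))

# By contrast, arbitrary bytes are not valid UTF-8, so UTF-8 could
# not serve as a silent fallback for binary data.
try:
    raw.decode("utf-8")
    utf8_accepts_all = True
except UnicodeDecodeError:
    utf8_accepts_all = False

print(latin1_is_lossless, codepoints_match, utf8_accepts_all)
```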
 
> > > > 4. The encoding used in a Python source file should be easily
> > > >    parseable for an editor; a magic comment at the top of the
> > > >    file seems to be what people want to see, so I'll drop the
> > > >    directive (PEP 244) requirement in the PEP.
> > > >
> > > > Issues that still need to be resolved:
> > > >
> > > > - how to enable embedding of differently encoded data in Python
> > > >   source code (e.g. UTF-8 encoded XML data in a Latin-1
> > > >   source file)
> > >
> > > Probably, adding explicit conversions.
> >
> > Yes, but there are cases where the source file having the embedded
> > data will not decode into Unicode (I got the example backwards:
> > think of a UTF-8 encoded source file with a Latin-1 string literal).
> 
> utf-7 bit for embedded things ;-)
> 
> > Perhaps we should simply rule out this case and have the
> > programmer stick to the source file encoding + some escaping
> > or a run-time recoding of the literal data into the preferred
> > encoding.
> 
> This is probably wise. A Python program need not be a
> whole zoo of encodings...

I'll put this into the PEP update.
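
A minimal sketch of the two workarounds, in present-day Python syntax
and assuming a UTF-8 encoded source file (the names are made up for
illustration): either spell the Latin-1 bytes with escapes, or write
the text as a Unicode literal and recode it at run time.

```python
# Option 1: escaping. The literal is spelled in plain ASCII, so it
# does not depend on the source file encoding at all.
latin1_escaped = b"caf\xe9"  # "cafe" with e-acute, as Latin-1 bytes

# Option 2: run-time recoding. The text is a Unicode literal (written
# with an escape here so the example stays ASCII-only) and is
# converted to the preferred encoding when the program runs.
text = "caf\u00e9"
latin1_recoded = text.encode("latin-1")
utf8_recoded = text.encode("utf-8")

# Both routes give the same Latin-1 bytes; the UTF-8 form of the same
# text is one byte longer.
print(latin1_escaped == latin1_recoded)
print(len(latin1_recoded), len(utf8_recoded))
```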
 
> > > No variant is ideal. The 2nd is worse/best than all
> > > (it depends on how to look at it!)
> > >
> > > Python has no macro directives. In this situation
> > > they could help greatly!
> >
> > We've been discussing these on python-dev, but Guido is not
> > too keen on having them.
> 
> And this is right. I even think encoding information could be EXTERNAL.

No -- how are editors supposed to know about these external
files ?
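
With a magic comment, by contrast, an editor only has to look at the
top of the file it is opening. A rough sketch, assuming a comment of
the form "# -*- coding: <name> -*-" within the first two lines; that
exact syntax and the helper name sniff_encoding are assumptions for
illustration, not something this thread has settled:

```python
import re

# Hypothetical pattern for the magic comment; the final syntax is
# still open in this discussion.
CODING_RE = re.compile(rb"^[ \t]*#.*?coding[:=][ \t]*([-\w.]+)")

def sniff_encoding(source, default="latin-1"):
    """Return the encoding named in the first two lines, else default."""
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default

with_comment = b"#!/usr/bin/env python\n# -*- coding: utf-8 -*-\nprint('hi')\n"
without_comment = b"print('hi')\n"

print(sniff_encoding(with_comment))
print(sniff_encoding(without_comment))
```

The default of Latin-1 mirrors the fallback discussed above for files
that give no encoding.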

> For example, directory will need to have "__encodings__.py" file where
> encodings are listed.
> 
> Then, Python if started with some key could check for such file and
> compile modules accordingly.
> 
> If there is no __encodings__.py, then Python proceeds as it does now,
> WITHOUT any conversions to and from.
> 
> This will make script-writer happy (no conversion overhead) and those who
> want to write encoding-enabled programs (they could specify
> __encodings__.py)
> 
> I think, this solves most problems.
> 
> > > That "#!encoding" is special case of macro directive.
> > >
> > > Maybe just put something like ''# <!DOCTYPE HTML PUBLIC''
> > > at the beginning...
> > >
> > > Or, even greater idea occurred to me: allow some XML
> > > with meta-information (not only encoding) somehow escaped.
> > >
> > > I think, GvR could come with some advice here...
> > >
> > > > Comments are welcome !
> >
> > Thanks for your comments,
> 
> I just hope the realisation of your PEP will not make Python scripts
> running slower ;-) while allowing truly useful i18n functionality.

By the time the PEP is implemented, CPUs will run at least 50%
faster than they do now -- this should answer your question ;-)
 
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/




