[Python-3000] PEP 3131 accepted

Ka-Ping Yee python at zesty.ca
Sat May 26 12:33:23 CEST 2007


Ka-Ping Yee wrote:
> Alas, the coding directive is not good enough.  Have a look at this:
>
>     http://zesty.ca/python/tricky.png
>
> That's an image of a text editor containing some Python code.  Can you
> tell whether running it (post-PEP-3131) will delete your .bashrc file?

Martin v. Löwis wrote:
> I would think that it doesn't (i.e. allowed should stay at 0).
>
> Why does os.remove get invoked?

Mike Klaas wrote:
> Perhaps a letter in the encoding declaration is non-ASCII, nullifying
> the encoding enforcement and allowing a Cyrillic 'a' in allowed = 0?

You got it.

See the actual source file at

    http://zesty.ca/python/tricky.py

There are three things going on here:

    1.  All three occurrences of "allowed" look the same.  And
        it seems they are truly the same, because the coding
        declaration on line 2 says the file is ASCII.  But in
        fact, they aren't the same -- one of them contains a
        Cyrillic "a", which changes the meaning of the program.

    2.  But how is that possible when the coding declaration
        says the file is ASCII?  If you believe it, then you
        also expect the coding declaration itself to be ASCII,
        i.e., a real coding declaration.  But it isn't -- the
        word "coding" contains a Cyrillic "c".

    3.  Then why doesn't Python complain about this non-ASCII
        character on line 2 of the file, since ASCII is supposed
        to be the default encoding?  Because there is a UTF-8 BOM
        at the beginning of the file.

        PEP 263 tries to prevent confusion by making Python complain
        if the coding declaration conflicts with the already-set
        UTF-8 encoding.  But even though line 2 looks like a coding
        declaration, Python doesn't recognize it as one, so you get
        no warning.
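
To make points 1 and 2 concrete, here is a minimal sketch.  It is not
the contents of tricky.py; the two-line program and the helper
non_ascii_chars are made up for illustration, and the regular
expression is the simplified pattern from PEP 263.  It assumes a
Python that accepts non-ASCII identifiers, as PEP 3131 specifies.

```python
import re
import unicodedata

CYRILLIC_A = "\u0430"   # renders like Latin "a"
CYRILLIC_ES = "\u0441"  # renders like Latin "c"

# Point 1: two names that look identical are distinct identifiers.
# (Illustrative program, not the real tricky.py.)
source = "allowed = 0\n" + CYRILLIC_A + "llowed = 1\n"
namespace = {}
exec(source, namespace)
names = sorted(k for k in namespace if k.endswith("llowed"))

# Point 2: a lookalike "coding" is not a coding declaration.
# Simplified form of the PEP 263 pattern.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")
real_decl = "# -*- coding: ascii -*-"
fake_decl = "# -*- " + CYRILLIC_ES + "oding: ascii -*-"
real_match = CODING_RE.search(real_decl)
fake_match = CODING_RE.search(fake_decl)


def non_ascii_chars(text):
    """Hypothetical helper: list (index, Unicode name) of non-ASCII characters."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(text)
        if ord(char) > 127
    ]


print(len(names), "distinct names; allowed =", namespace["allowed"])
print("real declaration matched:", real_match is not None)
print("fake declaration matched:", fake_match is not None)
print(non_ascii_chars(fake_decl))
```

Only naming the code points exposes the difference; nothing in the
rendered text does.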

The conclusion is that one cannot rely on the coding declaration
to know what the encoding is, because one cannot be sure what the
coding declaration actually says.  We could rely on it if it were
guaranteed to be encoded in ASCII.  But a BOM at the beginning of
the file enables UTF-8 as an invisible override, and that invisible
override is the source of the danger.  If we want to be able to
read the coding declaration with any confidence, we should get rid
of the invisible override.
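
As a rough sketch of the override in action, the following uses the
standard library's tokenize.detect_encoding, which applies the PEP 263
rules, on two byte strings built for the purpose.  The shebang line,
the file bodies, and the helper audit_header are assumptions made for
illustration; audit_header is one possible check, not a proposed
change to Python.  Note that on current Python 3 the default source
encoding is UTF-8 rather than ASCII, so the BOM changes less there
than it did when this was written.

```python
import codecs
import io
import tokenize


def audit_header(raw):
    """Hypothetical helper: warn about a BOM or non-ASCII bytes in lines 1-2."""
    warnings = []
    if raw.startswith(codecs.BOM_UTF8):
        warnings.append("UTF-8 BOM present")
        raw = raw[len(codecs.BOM_UTF8):]
    for number, line in enumerate(raw.splitlines()[:2], start=1):
        if any(byte > 127 for byte in line):
            warnings.append("non-ASCII byte on line %d" % number)
    return warnings


# Illustrative files: a shebang on line 1, a declaration on line 2.
shebang = b"#!/usr/bin/env python\n"
body = b"allowed = 0\n"
fake_decl = "# -*- \u0441oding: ascii -*-\n".encode("utf-8")  # Cyrillic "c"
real_decl = b"# -*- coding: ascii -*-\n"

tricky = codecs.BOM_UTF8 + shebang + fake_decl + body
honest = codecs.BOM_UTF8 + shebang + real_decl + body

# The lookalike declaration is not recognized, so the BOM decides silently.
encoding, _ = tokenize.detect_encoding(io.BytesIO(tricky).readline)

# A genuine ASCII declaration conflicts with the BOM and is rejected.
try:
    tokenize.detect_encoding(io.BytesIO(honest).readline)
    conflict_detected = False
except SyntaxError:
    conflict_detected = True

print("tricky file decoded as:", encoding)
print("honest file rejected:", conflict_detected)
print("audit of tricky file:", audit_header(tricky))
```

The file with the genuine declaration is the one that gets rejected;
the deceptive one passes without comment, and only a byte-level check
of the first two lines catches it.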


-- ?!ng

