[Python-Dev] What does a double coding cookie mean?

Thu Mar 17 13:53:07 EDT 2016

On 17.03.16 19:23, M.-A. Lemburg wrote:
> On 17.03.2016 15:02, Serhiy Storchaka wrote:
>> On 17.03.16 15:14, M.-A. Lemburg wrote:
>>> On 17.03.2016 01:29, Guido van Rossum wrote:
>>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>>
>>> I'd prefer a separate utility for this somewhere, since
>>> tokenize.detect_encoding() is not available in Python 2.
>>>
>>> I've attached an example implementation with tests, which works
>>> in Python 2.7 and 3.
>>
>> Sorry, but this code doesn't match the behaviour of the Python interpreter
>> or of other tools. I suggest backporting tokenize.detect_encoding() (but
>> be aware that the default encoding in Python 2 is ASCII, not UTF-8).
>
> Yes, I got the default for Python 3 wrong. I'll fix that. Thanks
> for the note.
>
> What other aspects differ from what Python implements?

1. If there is a BOM and a coding cookie, the source encoding is "utf-8-sig".

2. If there is a BOM and the coding cookie is not 'utf-8', this is an error.

3. If the first line is not a blank line or a comment line, the second
line is not searched for a coding cookie.

4. The encoding name should be normalized. "UTF8", "utf8", "utf_8" and
"utf-8" are the same encoding (and all become "utf-8-sig" with a BOM).

5. There is no 400-byte limit. Actually, the current code has a bug in
handling long lines, but even with this bug the limit is
larger.

6. I made a mistake in the regular expression: I missed the underscore.

tokenize.detect_encoding() is the closest imitation of the behavior of
the Python interpreter.
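
Points 1-4 can be checked against tokenize.detect_encoding() with a few
lines. This is only a rough sketch for Python 3: the helper names detect()
and bom_conflict_raises() are made up for the illustration, and the sample
sources are arbitrary.

```python
import codecs
import io
import tokenize


def detect(source_bytes):
    """Return the encoding name tokenize.detect_encoding() reports."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


def bom_conflict_raises(source_bytes):
    """Return True if detect_encoding() rejects the source with SyntaxError."""
    try:
        detect(source_bytes)
    except SyntaxError:
        return True
    return False


BOM = codecs.BOM_UTF8

# 1. BOM plus a utf-8 cookie is reported as "utf-8-sig".
with_bom = detect(BOM + b"# -*- coding: utf-8 -*-\n")

# 2. BOM plus a cookie naming another encoding is an error.
conflict = bom_conflict_raises(BOM + b"# -*- coding: latin-1 -*-\n")

# 3. The second line is searched only if the first is blank or a comment.
after_code = detect(b"import os\n# -*- coding: latin-1 -*-\n")
after_shebang = detect(b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n")

# 4. Spelling variants are normalized (here: underscore and upper case).
normalized = detect(b"# coding: UTF_8\n")

print(with_bom, conflict, after_code, after_shebang, normalized)
```

In the third case the cookie on the second line is ignored because the
first line contains code, so the Python 3 default is returned.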