[Python-Dev] What does a double coding cookie mean?

Thu Mar 17 14:35:02 EDT 2016

On 17.03.2016 18:53, Serhiy Storchaka wrote:
> On 17.03.16 19:23, M.-A. Lemburg wrote:
>> On 17.03.2016 15:02, Serhiy Storchaka wrote:
>>> On 17.03.16 15:14, M.-A. Lemburg wrote:
>>>> On 17.03.2016 01:29, Guido van Rossum wrote:
>>>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>>>
>>>> I'd prefer a separate utility for this somewhere, since
>>>> tokenize.detect_encoding() is not available in Python 2.
>>>>
>>>> I've attached an example implementation with tests, which works
>>>> in Python 2.7 and 3.
>>>
>>> Sorry, but this code doesn't match the behaviour of Python interpreter,
>>> nor other tools. I suggest to backport tokenize.detect_encoding() (but
>>> be aware that the default encoding in Python 2 is ASCII, not UTF-8).
>>
>> Yes, I got the default for Python 3 wrong. I'll fix that. Thanks
>> for the note.
>>
>> What other aspects are different than what Python implements ?
> 
> 1. If there is a BOM and coding cookie, the source encoding is "utf-8-sig".

Ok, that makes sense (even though it's not mandated by the PEP;
the utf-8-sig codec didn't exist yet when the PEP was written).
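
For illustration, a minimal sketch of point 1 using the Python 3 standard
library function tokenize.detect_encoding(). The sample bytes are made up
for this example; this is not the script discussed in this thread.

```python
import io
import tokenize

# Made-up sample: a UTF-8 BOM followed by an explicit utf-8 coding cookie.
source = b"\xef\xbb\xbf# -*- coding: utf-8 -*-\nx = 1\n"

# detect_encoding() takes a readline callable that yields bytes.
encoding, lines = tokenize.detect_encoding(io.BytesIO(source).readline)

print(encoding)  # utf-8-sig
print(lines)     # the lines consumed so far, with the BOM stripped
```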

> 2. If there is a BOM and coding cookie is not 'utf-8', this is an error.

It's an error for Python, but why should a detection function
always raise an error for this case? It would probably be a good
idea to have an errors parameter to leave this to the user to decide.

Same for unknown encodings.
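
One way such an errors parameter could look, sketched as a thin wrapper
around the Python 3 tokenize.detect_encoding(). The function name
detect_source_encoding and the parameters errors and fallback are
hypothetical names chosen for this example, not an existing API. The
sketch relies on detect_encoding() raising SyntaxError both for a BOM
with a non-UTF-8 cookie and for an unknown encoding name.

```python
import io
import tokenize


def detect_source_encoding(data, errors="strict", fallback="utf-8"):
    """Hypothetical helper: let the caller decide how conflicts are handled.

    errors="strict" re-raises the SyntaxError from detect_encoding();
    any other value returns `fallback` instead.
    """
    try:
        encoding, _ = tokenize.detect_encoding(io.BytesIO(data).readline)
    except SyntaxError:
        if errors == "strict":
            raise
        return fallback
    return encoding


# Made-up samples: BOM with a conflicting cookie, and an unknown codec name.
conflicting = b"\xef\xbb\xbf# -*- coding: latin-1 -*-\n"
unknown = b"# -*- coding: no-such-codec -*-\n"

print(detect_source_encoding(conflicting, errors="ignore"))  # utf-8
print(detect_source_encoding(unknown, errors="ignore"))      # utf-8
```

With errors="strict" (the default in this sketch) both samples raise
SyntaxError, which mirrors what the interpreter does.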

> 3. If the first line is not blank or comment line, the coding cookie is
> not searched in the second line.

Hmm, the PEP does allow having the coding cookie in the
second line, even if the first line is not a comment. Perhaps
that's not really needed.
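
The behaviour described in point 3 can be observed with the Python 3
tokenize.detect_encoding(), at least on recent Python 3 releases. The
helper name detect and the sample inputs are made up for this sketch.

```python
import io
import tokenize


def detect(data):
    # Illustrative helper: return only the detected encoding name.
    return tokenize.detect_encoding(io.BytesIO(data).readline)[0]


# Cookie on line 2, line 1 is a comment (shebang): the cookie is honoured.
after_comment = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n"

# Cookie on line 2, line 1 is code: the second line is not searched.
after_code = b"import os\n# -*- coding: latin-1 -*-\n"

print(detect(after_comment))  # iso-8859-1
print(detect(after_code))     # utf-8 (the Python 3 default)
```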

> 4. Encoding name should be canonized. "UTF8", "utf8", "utf_8" and
> "utf-8" is the same encoding (and all are changed to "utf-8-sig" with BOM).

Well, that's cosmetics :-) The codec system will take care of
this when needed.
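
As a small illustration of the codec system doing this normalization,
codecs.lookup() resolves the spellings from point 4 to the same codec.
Mapping the result to "utf-8-sig" when a BOM is present is left to the
caller in this sketch.

```python
import codecs

names = ("UTF8", "utf8", "utf_8", "utf-8")

# codecs.lookup() returns a CodecInfo whose .name is the canonical name.
for name in names:
    print("%s -> %s" % (name, codecs.lookup(name).name))

canonical = set(codecs.lookup(name).name for name in names)
print(canonical == {"utf-8"})  # True
```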

> 5. There isn't the limit of 400 bytes. Actually there is a bug with
> handling long lines in current code, but even with this bug the limit is
> larger.

I think it's a reasonable limit, since shebang lines may only be
127 characters long on at least Linux (and probably several other
Unix systems as well).

But just in case, I made this configurable :-)

> 6. I made a mistake in the regular expression, missed the underscore.

I added it.

> tokenize.detect_encoding() is the closest imitation of the behavior of
> Python interpreter.

Probably, but that doesn't help us on Python 2, right?

I'll upload the script to GitHub later today or tomorrow to
continue development.

-- 
Marc-Andre Lemburg
eGenix.com
