What does a double coding cookie mean?

I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)? Reference: https://github.com/python/mypy/issues/1281 -- --Guido van Rossum (python.org/~guido)
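For concreteness, the situation looks like this (a hypothetical file; the one from the mypy issue isn't reproduced here):

    # -*- coding: utf-8 -*-
    # -*- coding: latin-1 -*-
    # CPython decodes this file as utf-8; mypy (at the time) as latin-1.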

On 2016-03-15 20:30, Guido van Rossum wrote:
I think it should follow CPython. As I see it, CPython allows it to be on the second line because the first line might be needed for the shebang. If the first two lines both had an encoding, and then you inserted a shebang line, the second one would be ignored anyway.
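For reference, the layout the two-line rule exists to support: the shebang must be the very first line for the OS to honor it, so PEP 263 lets the cookie move to line two.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-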

On Tue, Mar 15, 2016 at 01:30:08PM -0700, Guido van Rossum wrote:
If it helps, what 'vim' appears to do is to read the first 'n' lines in order and then the last 'n' lines in reverse order, stopping if the second stage reaches a line already processed by the first stage. So with 'modelines=5', the following file:

    /* vim: set ts=1: */
    /* vim: set ts=2: */
    /* vim: set ts=3: */
    /* vim: set ts=4: */
    /* vim: set sw=5 ts=5: */
    /* vim: set ts=6: */
    /* vim: set ts=7: */
    /* vim: set ts=8: */

sets sw=5 and ts=6. Obviously CPython shouldn't be going through all that palaver! But it would be a bit more vim-like to use the second line rather than the first if both lines have the cookie. Take that as you will - I'm not saying being 'vim-like' is an inherent virtue ;-)

On Tue, 15 Mar 2016 at 13:31 Guido van Rossum <guido@python.org> wrote:
I think the spirit of PEP 263 is for the first specified encoding to win, since two lines are allowed in order to support shebangs, not multiple encodings :). I also think the fact that tokenize.detect_encoding() <https://docs.python.org/3/library/tokenize.html#tokenize.detect_encoding> doesn't automatically read two lines from its input suggests the intent is "first encoding wins" (and that is the semantics of the function).
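A quick check of that behavior (a sketch; 'iso-8859-1' is the normalized name tokenize is expected to report for a latin-1 cookie):

    import io
    import tokenize

    src = b"# coding: latin-1\n# coding: utf-8\n"
    encoding, consumed = tokenize.detect_encoding(io.BytesIO(src).readline)
    print(encoding)  # 'iso-8859-1' -- the cookie on the first line wins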

I agree that the spirit of the PEP is to stop at the first coding cookie found. Would it be okay if I updated the PEP to clarify this? I'll definitely also update the docs. On Tue, Mar 15, 2016 at 2:04 PM, Brett Cannon <brett@python.org> wrote:
-- --Guido van Rossum (python.org/~guido)

Guido van Rossum <guido@python.org> writes:
+1, it never occurred to me that the specification could mean otherwise. On reflection I can't see a good reason for it to mean otherwise. -- Ben Finney

On 16.03.16 02:28, Guido van Rossum wrote:
Could you please also update the regular expression in PEP 263 to "^[ \t\v]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)"? The coding cookie must be in a comment, only the first occurrence in the line must be taken into account (there is a bug in CPython here), the encoding name must be ASCII, and there must not be any Python statement on the line that contains the encoding declaration. [1] [1] https://bugs.python.org/issue18873
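A minimal sketch of the proposed expression in action (the pattern string is copied verbatim from the message above; the test line is invented):

    import re

    # Serhiy's proposed PEP 263 pattern; as he notes later in the thread,
    # the character class is missing an underscore.
    cookie_re = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)")

    line = "# -*- coding: utf-8 -*- also mentions coding: latin-1"
    m = cookie_re.match(line)
    print(m.group(1))  # 'utf-8' -- the non-greedy '.*?' stops at the first cookie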

On 3/16/2016 3:14 AM, Serhiy Storchaka wrote:
Also, I think there should be one 'official' function somewhere in the stdlib to get and return the encoding declaration. The patch for the issue above had to make the same change in four places other than tests, a violent violation of DRY. -- Terry Jan Reedy

On 16.03.2016 01:28, Guido van Rossum wrote:
+1 The only reason to read up to two lines was to address the use of the shebang on Unix, not to be able to define two competing source code encodings :-)
-- Marc-Andre Lemburg (eGenix.com)

On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg <mal@egenix.com> wrote:
I know. I was just surprised that the PEP was sufficiently vague about it that when I found that mypy picked the second if there were two, I couldn't prove to myself that it was violating the PEP. I'd rather clarify the PEP than rely on the reasoning presented earlier here. I don't like erroring out when there are two different cookies on two lines; I feel that the spirit of the PEP is to read up to two lines until a cookie is found, whichever comes first. I will update the regex in the PEP too (or change the wording to avoid "match"). I'm not sure what to do if there are two cookies on one line. If CPython currently picks the latter we may want to preserve that behavior. Should we recommend that everyone use tokenize.detect_encoding()? -- --Guido van Rossum (python.org/~guido)

I've updated the PEP. Please review. I decided not to update the Unicode howto (the thing is too obscure). Serhiy, you're probably in a better position to fix the code looking for cookies to pick the first one if there are two on the same line (or do whatever you think should be done there). Should we recommend that everyone use tokenize.detect_encoding()? On Wed, Mar 16, 2016 at 5:05 PM, Guido van Rossum <guido@python.org> wrote:
-- --Guido van Rossum (python.org/~guido)

On 3/16/2016 5:29 PM, Guido van Rossum wrote:
Oh sure. Updating the PEP is the best way forward. But the reasoning, although from somewhat vague specifications, seems sound enough to declare that it meant "find the first cookie in the first two lines". Which is what you've said in the update, although not quite that tersely. It now leaves no room for ambiguous interpretations.
The only reason for an error would be to alert people who had depended on the bugs or misinterpretations. Personally, I think if they haven't converted to UTF-8 by now, they've got bigger problems than this change.

On 17.03.16 02:29, Guido van Rossum wrote:
http://bugs.python.org/issue26581
Should we recommend that everyone use tokenize.detect_encoding()?
Likely. However, the interface of tokenize.detect_encoding() is not very simple.

On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
I just found that out yesterday. You have to give it a readline() function, which is cumbersome if all you have is a (byte) string and you don't want to split it on lines just yet. And detect_encoding() raises SyntaxError when the encoding isn't right. I wish there were a lower-level helper that just took a line and told you what the encoding in it was, if any. Then the rest of the logic can be handled by the caller (including the logic of trying up to two lines). -- --Guido van Rossum (python.org/~guido)
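No such helper exists in the stdlib; a rough sketch of what Guido is asking for might look like this (the names are invented, and the pattern is adapted from the PEP's):

    import re

    # Hypothetical helper: report the encoding declared in one line, or None.
    _COOKIE_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

    def encoding_in_line(line):
        m = _COOKIE_RE.match(line)
        return m.group(1).decode("ascii") if m else None

    # The caller keeps the rest of PEP 263's logic, e.g. trying at most
    # the first two lines and applying a default:
    def sniff(source, default="utf-8"):
        for line in source.splitlines()[:2]:
            enc = encoding_in_line(line)
            if enc is not None:
                return enc
        return default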

On 17.03.16 16:55, Guido van Rossum wrote:
The simplest way to detect the encoding of a bytes string:

    lines = data.splitlines()
    encoding = tokenize.detect_encoding(iter(lines).__next__)[0]

If you don't want to split all the data on lines, the most efficient way in Python 3.5 is:

    encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]

In Python 3.5 io.BytesIO(data) has constant complexity. In older versions, to detect the encoding without copying the data or splitting all of it on lines, you should write a line iterator. For example:

    def iterlines(data):
        start = 0
        while True:
            end = data.find(b'\n', start) + 1
            if not end:
                break
            yield data[start:end]
            start = end
        yield data[start:]

    encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]

or:

    it = (m.group() for m in re.finditer(b'.*\n?', data))
    encoding = tokenize.detect_encoding(it.__next__)

I don't know which approach is more efficient.

On Thu, Mar 17, 2016 at 9:50 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
This will raise SyntaxError if the encoding is unknown. That needs to be caught in mypy's case, and then it needs to get the line number from the exception. I tried this and it was too painful, so now I've just changed the regex that mypy uses to use non-greedy matching (https://github.com/python/mypy/commit/b291998a46d580df412ed28af1ba1658446b9f...).
Ditto with the SyntaxError though.
Having my own regex was simpler. :-( -- --Guido van Rossum (python.org/~guido)
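What the change amounts to, on an invented two-cookie line:

    import re

    line = "# -*- coding: ascii -*- prose that also says coding: utf-8"

    # A greedy '.*' backtracks from the end of the line, so the last
    # cookie wins; the non-greedy '.*?' stops at the first.
    print(re.search(r"^#.*coding[:=]\s*([-\w.]+)", line).group(1))   # utf-8
    print(re.search(r"^#.*?coding[:=]\s*([-\w.]+)", line).group(1))  # ascii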

On 17.03.2016 15:55, Guido van Rossum wrote:
I've uploaded the code I posted yesterday, modified to address some of the issues it had, to github: https://github.com/malemburg/python-snippets/blob/master/detect_source_encod... I'm pretty sure the two-lines read can be optimized away and put straight into the regular expression used for matching. -- Marc-Andre Lemburg (eGenix.com)

On 17.03.2016 01:29, Guido van Rossum wrote:
Thanks, will do.
Should we recommend that everyone use tokenize.detect_encoding()?
I'd prefer a separate utility for this somewhere, since tokenize.detect_encoding() is not available in Python 2. I've attached an example implementation with tests, which works in Python 2.7 and 3.
I suppose it's a rather rare case, since this is the first time I've heard of anyone thinking that a possible second line could be picked - after 15 years :-)
-- Marc-Andre Lemburg (eGenix.com)

On 17.03.2016 15:02, Serhiy Storchaka wrote:
Yes, I got the default for Python 3 wrong. I'll fix that. Thanks for the note. What other aspects are different from what Python implements? -- Marc-Andre Lemburg (eGenix.com)

On 17.03.16 19:23, M.-A. Lemburg wrote:
1. If there is a BOM and a coding cookie, the source encoding is "utf-8-sig".
2. If there is a BOM and the coding cookie is not 'utf-8', this is an error.
3. If the first line is not a blank or comment line, the coding cookie is not searched for in the second line.
4. The encoding name should be canonicalized: "UTF8", "utf8", "utf_8" and "utf-8" are the same encoding (and all are changed to "utf-8-sig" with a BOM).
5. There is no 400-byte limit. Actually there is a bug in the current code's handling of long lines, but even with this bug the limit is larger.
6. I made a mistake in the regular expression: I missed the underscore.

tokenize.detect_encoding() is the closest imitation of the behavior of the Python interpreter.
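Points 1 and 2 can be seen directly with tokenize (a sketch; behavior as Serhiy describes it):

    import io
    import tokenize

    # BOM plus a matching cookie: reported as 'utf-8-sig'.
    src = b"\xef\xbb\xbf# coding: utf-8\n"
    print(tokenize.detect_encoding(io.BytesIO(src).readline)[0])  # utf-8-sig

    # BOM plus a conflicting cookie: an error.
    src = b"\xef\xbb\xbf# coding: latin-1\n"
    try:
        tokenize.detect_encoding(io.BytesIO(src).readline)
    except SyntaxError as exc:
        print("rejected:", exc)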

On 17.03.2016 18:53, Serhiy Storchaka wrote:
Ok, that makes sense (even though it's not mandated by the PEP; the utf-8-sig codec didn't exist yet).
2. If there is a BOM and coding cookie is not 'utf-8', this is an error.
It's an error for Python, but why should a detection function always raise an error for this case? It would probably be a good idea to have an errors parameter to leave this to the user to decide. Same for unknown encodings.
3. If the first line is not blank or comment line, the coding cookie is not searched in the second line.
Hmm, the PEP does allow having the coding cookie in the second line, even if the first line is not a comment. Perhaps that's not really needed.
4. Encoding name should be canonized. "UTF8", "utf8", "utf_8" and "utf-8" is the same encoding (and all are changed to "utf-8-sig" with BOM).
Well, that's cosmetics :-) The codec system will take care of this when needed.
5. There is no 400-byte limit.

I think it's a reasonable limit, since shebang lines may only be 127 bytes long on at least Linux (and probably several other Unix systems as well). But just in case, I made this configurable :-)
6. I made a mistake in the regular expression, missed the underscore.
I added it.
tokenize.detect_encoding() is the closest imitation of the behavior of Python interpreter.
Probably, but that doesn't help us on Python 2, right? I'll upload the script to github later today or tomorrow to continue development. -- Marc-Andre Lemburg (eGenix.com)

On 15.03.16 22:30, Guido van Rossum wrote:
There is a similar question: if a file has two different coding cookies on the same line, which should win? Currently the last cookie wins, in the CPython parser, in the tokenize module, in IDLE, and in a number of other places. I think this is a bug.

On Wed, Mar 16, 2016 at 2:07 AM, Chris Angelico <rosuav@gmail.com> wrote:
+1. If multiple coding cookies are found and they do not all agree, I would expect an error to be raised. That it apparently does not raise an error currently is surprising to me. (If multiple coding cookies are found but they do agree, perhaps raising a warning would be a good idea.)

On 3/15/2016 11:07 PM, Chris Angelico wrote:
From PEP 263:

    To define a source code encoding, a magic comment must be placed
    into the source files either as first or second line in the file

So clearly there is only one magic comment: "either" the first or second line, not both. Both, therefore, should be an error. Again from PEP 263:

    More precise, the first or second line must match the regular
    expression "coding[:=]\s*([-\w.]+)"
Clearly the regular expression would only match the first of multiple cookies on the same line, so the first one should always win... but there should only be one, from the first PEP quote "a magic comment". Glenn
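Glenn's point is easy to check (the pattern is the PEP quote above; the line is invented):

    import re

    line = "# coding: ascii is declared, not coding: utf-8"
    m = re.search(r"coding[:=]\s*([-\w.]+)", line)
    print(m.group(1))  # 'ascii' -- an unanchored search finds the leftmost match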

On 3/16/2016 12:09 AM, Serhiy Storchaka wrote:
Sure. But there is no mention anywhere in the PEP of more than one being legal: just more than one position for it, EITHER line 1 or line 2. So while the regular expression mentioned is not anchored, to allow variation in syntax between emacs and vim, "must match the regular expression" doesn't imply "several times", and when searching for a regular expression that might not be anchored, one typically expects to find the first. Glenn

On 16.03.16 08:03, Serhiy Storchaka wrote:
I just tested with Emacs, and it looks like when you specify different codings on two different lines, the first coding wins, but when you specify different codings on the same line, the last coding wins. Therefore the current CPython behavior may be correct, and the regular expression in PEP 263 should be changed to use greedy repetition.

On 3/19/2016 8:19 AM, Serhiy Storchaka wrote:
Just because emacs works that way (and even though I'm an emacs user), that doesn't mean CPython should act like emacs.

(1) CPython should not necessarily act like emacs, unless the coding syntax exactly matches emacs, rather than the generic coding that CPython interprets, which matches emacs, vim, and other similar things that both emacs and vim would ignore.

(1a) Maybe if a similar test were run on vim with its syntax, and it also worked the same way, then one might think it is a trend worth following, but it is not clear to this non-vim user that vim syntax allows more than one coding specification per line.

(2) emacs has no requirement that the coding be placed on the first two lines. It specifically looks at the second line only if the first line has a "#!" or a "'\"" (for troff). (According to the docs, not experimentation.)

(3) emacs also allows Local Variables to be specified at the end of the file. If CPython were really to act like emacs, then it would need to allow for that too.

(4) there is no benefit to specifying the coding twice on a line; it only adds confusion, whether in CPython, emacs, or vim.

(4a) Here's an untested line that emacs would interpret as utf-8, and CPython with the greedy regular expression would interpret as latin-1, because emacs looks only between the -*- pair, and CPython ignores that:

    # -*- coding: utf-8 -*- this file does not use coding: latin-1

Glenn Linderman writes:
On 3/19/2016 8:19 AM, Serhiy Storchaka wrote:
We can't treat Emacs as a spec, because Emacs doesn't follow specs, doesn't respect standards, and above a certain level of inconvenience to developers doesn't respect backward compatibility. There's never any guarantee that Emacs will do the same thing tomorrow that it does today, although inertia has mostly the same effect. In this case, there's a reason why Emacs behaves the way it does, which is that you can put an arbitrary sequence of variable assignments in "-*- ... -*-" and they will be executed in order. So it makes sense that "last coding wins". But pragmas are severely deprecated in Python; cookies got a very special exception. So that rationale can't apply to Python.
(4) there is no benefit to specifying the coding twice on a line, it only adds confusion, whether in CPython, emacs, or vim.
Indeed. I see no point in reading past the first cookie found (whether a valid codec or not), unless an error would be raised. That might be a good idea, but I doubt it's worth the implementation complexity.

On 19.03.16 19:36, Glenn Linderman wrote:
Yes, but current CPython works that way. The behavior of Emacs is an argument that maybe this is not a bug.

Since Emacs allows specifying the coding twice on a line, and this can be ambiguous, and CPython already detects some ambiguous situations (a UTF-8 BOM with a non-UTF-8 coding cookie), it may be worth adding a check that the coding is specified only once on a line.
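A sketch of such a check (the function name and message are invented; the pattern is the thread's, with the underscore added):

    import re

    COOKIE_RE = re.compile(rb"coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

    def reject_double_cookie(line, lineno):
        # Hypothetical check: error out if one line declares two
        # different source encodings.
        names = {m.group(1).lower() for m in COOKIE_RE.finditer(line)}
        if len(names) > 1:
            raise SyntaxError("conflicting coding cookies on line %d" % lineno)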

On 3/19/2016 2:37 PM, Serhiy Storchaka wrote:
If CPython properly handles the following line as having only one proper coding declaration (utf-8), then I might reluctantly agree that the behavior of Emacs might be a relevant argument. Otherwise, vehemently not relevant.

    # -*- coding: utf-8 -*- this file does not use coding: latin-1
Diagnosing ambiguous conditions, even including my example above, might be useful... for a few files... is it worth the effort? What % of .py sources have coding specifications? What % of those have two?

participants (14)
- Ben Finney
- Brett Cannon
- Chris Angelico
- Ethan Furman
- Glenn Linderman
- Guido van Rossum
- Jon Ribbens
- Jonathan Goble
- M.-A. Lemburg
- MRAB
- Nick Coghlan
- Serhiy Storchaka
- Stephen J. Turnbull
- Terry Reedy