What does a double coding cookie mean?
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?

Reference: https://github.com/python/mypy/issues/1281

-- --Guido van Rossum (python.org/~guido)
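For concreteness, a file of the kind at issue looks like this (a hypothetical reconstruction, not the actual file from the mypy report):

    # coding: utf-8
    # coding: latin-1
    x = 1

CPython decodes this as utf-8; mypy at the time picked latin-1.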
On 2016-03-15 20:30, Guido van Rossum wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
I think it should follow CPython. As I see it, CPython allows it to be on the second line because the first line might be needed for the shebang. If the first two lines both had an encoding, and then you inserted a shebang line, the second one would be ignored anyway.
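For illustration, the layout that motivates the two-line allowance (a hypothetical example):

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

The shebang occupies the first line, so the cookie has to be allowed on the second.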
On 2016-03-15 20:53, MRAB wrote:
On 2016-03-15 20:30, Guido van Rossum wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
I think it should follow CPython.
As I see it, CPython allows it to be on the second line because the first line might be needed for the shebang.
If the first two lines both had an encoding, and then you inserted a shebang line, the second one would be ignored anyway.
A further thought: is mypy just assuming that the first line contains the shebang? If there's only one encoding line, and it's the first line, does mypy still get it right?
On Tue, Mar 15, 2016 at 01:30:08PM -0700, Guido van Rossum wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
If it helps, what 'vim' appears to do is to read the first 'n' lines in order and then the last 'n' lines in reverse order, stopping if the second stage reaches a line already processed by the first stage. So with 'modelines=5', the following file:

    /* vim: set ts=1: */
    /* vim: set ts=2: */
    /* vim: set ts=3: */
    /* vim: set ts=4: */
    /* vim: set sw=5 ts=5: */
    /* vim: set ts=6: */
    /* vim: set ts=7: */
    /* vim: set ts=8: */

sets sw=5 and ts=6. Obviously CPython shouldn't be going through all that palaver! But it would be a bit more vim-like to use the second line rather than the first if both lines have the cookie. Take that as you will - I'm not saying being 'vim-like' is an inherent virtue ;-)
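For what it's worth, a rough Python sketch of that scan order as described above (illustrative only, not vim's actual implementation):

    def modeline_scan_order(lines, modelines=5):
        """Yield lines in the order vim appears to examine them."""
        head = min(modelines, len(lines))
        for i in range(head):                          # first pass: top down
            yield lines[i]
        stop = max(len(lines) - modelines, 0)
        for i in range(len(lines) - 1, stop - 1, -1):  # second pass: bottom up
            if i < head:                               # reached a line the first pass saw
                break
            yield lines[i]

Feeding the eight-line file above through this, with later 'set' commands overriding earlier ones, reproduces the sw=5, ts=6 result.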
On Tue, 15 Mar 2016 at 13:31 Guido van Rossum <guido@python.org> wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
I think the spirit of PEP 263 is for the first specified encoding to win, as the support for two lines is there to allow shebangs, not multiple encodings :). I also think the fact that tokenize.detect_encoding() <https://docs.python.org/3/library/tokenize.html#tokenize.detect_encoding> doesn't automatically read two lines from its input suggests the intent is "first encoding wins" (and that is the semantics of the function).
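That behavior is easy to check (a quick sketch; note that tokenize normalizes "latin-1" to "iso-8859-1"):

    import io
    import tokenize

    source = b"# coding: latin-1\n# coding: utf-8\nx = 1\n"
    encoding, lines_read = tokenize.detect_encoding(io.BytesIO(source).readline)
    print(encoding)  # 'iso-8859-1': the cookie on the first line wins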
I agree that the spirit of the PEP is to stop at the first coding cookie found. Would it be okay if I updated the PEP to clarify this? I'll definitely also update the docs. On Tue, Mar 15, 2016 at 2:04 PM, Brett Cannon <brett@python.org> wrote:
On Tue, 15 Mar 2016 at 13:31 Guido van Rossum <guido@python.org> wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
I think the spirit of PEP 263 is for the first specified encoding to win, as the support for two lines is there to allow shebangs, not multiple encodings :). I also think the fact that tokenize.detect_encoding() doesn't automatically read two lines from its input suggests the intent is "first encoding wins" (and that is the semantics of the function).
-- --Guido van Rossum (python.org/~guido)
Guido van Rossum <guido@python.org> writes:
I agree that the spirit of the PEP is to stop at the first coding cookie found. Would it be okay if I updated the PEP to clarify this? I'll definitely also update the docs.
+1, it never occurred to me that the specification could mean otherwise. On reflection I can't see a good reason for it to mean otherwise. -- Ben Finney
On 16.03.16 02:28, Guido van Rossum wrote:
I agree that the spirit of the PEP is to stop at the first coding cookie found. Would it be okay if I updated the PEP to clarify this? I'll definitely also update the docs.
Could you please also update the regular expression in PEP 263 to "^[ \t\v]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)"? The coding cookie must be in a comment; only the first occurrence on the line must be taken into account (there is a bug in CPython here); the encoding name must be ASCII; and there must not be any Python statement on the line that contains the encoding declaration. [1]

[1] https://bugs.python.org/issue18873
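A quick demonstration of the proposed expression (a sketch with illustrative input lines):

    import re

    COOKIE_RE = re.compile(r"^[ \t\v]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)")

    for line in (
        "# -*- coding: utf-8 -*-",          # normal cookie
        "# coding: utf-8 coding: latin-1",  # first occurrence wins
        "x = 1  # coding: latin-1",         # statement on the line: no match
    ):
        m = COOKIE_RE.match(line)
        print(m.group(1) if m else None)    # utf-8, utf-8, None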
On 3/16/2016 3:14 AM, Serhiy Storchaka wrote:
On 16.03.16 02:28, Guido van Rossum wrote:
I agree that the spirit of the PEP is to stop at the first coding cookie found. Would it be okay if I updated the PEP to clarify this? I'll definitely also update the docs.
Could you please also update the regular expression in PEP 263 to "^[ \t\v]*#.*?coding[:=][ \t]*([-.a-zA-Z0-9]+)"?
The coding cookie must be in a comment; only the first occurrence on the line must be taken into account (there is a bug in CPython here); the encoding name must be ASCII; and there must not be any Python statement on the line that contains the encoding declaration. [1]
Also, I think there should be one 'official' function somewhere in the stdlib to get and return the encoding declaration. The patch for the issue above had to make the same change in four places other than tests, a violent violation of DRY. -- Terry Jan Reedy
On 16.03.2016 01:28, Guido van Rossum wrote:
I agree that the spirit of the PEP is to stop at the first coding cookie found. Would it be okay if I updated the PEP to clarify this? I'll definitely also update the docs.
+1 The only reason to read up to two lines was to address the use of the shebang on Unix, not to be able to define two competing source code encodings :-)
On Tue, Mar 15, 2016 at 2:04 PM, Brett Cannon <brett@python.org> wrote:
On Tue, 15 Mar 2016 at 13:31 Guido van Rossum <guido@python.org> wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
I think the spirit of PEP 263 is for the first specified encoding to win, as the support for two lines is there to allow shebangs, not multiple encodings :). I also think the fact that tokenize.detect_encoding() doesn't automatically read two lines from its input suggests the intent is "first encoding wins" (and that is the semantics of the function).
-- Marc-Andre Lemburg (eGenix.com)
On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg <mal@egenix.com> wrote:
The only reason to read up to two lines was to address the use of the shebang on Unix, not to be able to define two competing source code encodings :-)
I know. I was just surprised that the PEP was sufficiently vague about it that when I found that mypy picked the second if there were two, I couldn't prove to myself that it was violating the PEP. I'd rather clarify the PEP than rely on the reasoning presented earlier here.

I don't like erroring out when there are two different cookies on two lines; I feel that the spirit of the PEP is to read up to two lines until a cookie is found, whichever comes first.

I will update the regex in the PEP too (or change the wording to avoid "match").

I'm not sure what to do if there are two cookies on one line. If CPython currently picks the latter we may want to preserve that behavior.

Should we recommend that everyone use tokenize.detect_encoding()?

-- --Guido van Rossum (python.org/~guido)
I've updated the PEP. Please review. I decided not to update the Unicode howto (the thing is too obscure). Serhiy, you're probably in a better position to fix the code looking for cookies to pick the first one if there are two on the same line (or do whatever you think should be done there). Should we recommend that everyone use tokenize.detect_encoding()? On Wed, Mar 16, 2016 at 5:05 PM, Guido van Rossum <guido@python.org> wrote:
On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg <mal@egenix.com> wrote:
The only reason to read up to two lines was to address the use of the shebang on Unix, not to be able to define two competing source code encodings :-)
I know. I was just surprised that the PEP was sufficiently vague about it that when I found that mypy picked the second if there were two, I couldn't prove to myself that it was violating the PEP. I'd rather clarify the PEP than rely on the reasoning presented earlier here.
I don't like erroring out when there are two different cookies on two lines; I feel that the spirit of the PEP is to read up to two lines until a cookie is found, whichever comes first.
I will update the regex in the PEP too (or change the wording to avoid "match").
I'm not sure what to do if there are two cookies on one line. If CPython currently picks the latter we may want to preserve that behavior.
Should we recommend that everyone use tokenize.detect_encoding()?
-- --Guido van Rossum (python.org/~guido)
-- --Guido van Rossum (python.org/~guido)
On 3/16/2016 5:29 PM, Guido van Rossum wrote:
I've updated the PEP. Please review. I decided not to update the Unicode howto (the thing is too obscure). Serhiy, you're probably in a better position to fix the code looking for cookies to pick the first one if there are two on the same line (or do whatever you think should be done there).
Should we recommend that everyone use tokenize.detect_encoding()?
On Wed, Mar 16, 2016 at 5:05 PM, Guido van Rossum <guido@python.org> wrote:
On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg <mal@egenix.com> wrote:
The only reason to read up to two lines was to address the use of the shebang on Unix, not to be able to define two competing source code encodings :-)

I know. I was just surprised that the PEP was sufficiently vague about it that when I found that mypy picked the second if there were two, I couldn't prove to myself that it was violating the PEP. I'd rather clarify the PEP than rely on the reasoning presented earlier here.
Oh sure. Updating the PEP is the best way forward. But the reasoning, although from somewhat vague specifications, seems sound enough to declare that it meant "find the first cookie in the first two lines". Which is what you've said in the update, although not quite that tersely. It now leaves no room for ambiguous interpretations.
I don't like erroring out when there are two different cookies on two lines; I feel that the spirit of the PEP is to read up to two lines until a cookie is found, whichever comes first.
The only reason for an error would be to alert people who had depended on the bugs or misinterpretations. Personally, I think if they haven't converted to UTF-8 by now, they've got bigger problems than this change.
I will update the regex in the PEP too (or change the wording to avoid "match").
I'm not sure what to do if there are two cookies on one line. If CPython currently picks the latter we may want to preserve that behavior.
Should we recommend that everyone use tokenize.detect_encoding()?
-- --Guido van Rossum (python.org/~guido)
On 17.03.16 02:29, Guido van Rossum wrote:
I've updated the PEP. Please review. I decided not to update the Unicode howto (the thing is too obscure). Serhiy, you're probably in a better position to fix the code looking for cookies to pick the first one if there are two on the same line (or do whatever you think should be done there).
http://bugs.python.org/issue26581
Should we recommend that everyone use tokenize.detect_encoding()?
Likely. However the interface of tokenize.detect_encoding() is not very simple.
On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Should we recommend that everyone use tokenize.detect_encoding()?
Likely. However the interface of tokenize.detect_encoding() is not very simple.
I just found that out yesterday. You have to give it a readline() function, which is cumbersome if all you have is a (byte) string and you don't want to split it on lines just yet. And the readline() function raises SyntaxError when the encoding isn't right. I wish there were a lower-level helper that just took a line and told you what the encoding in it was, if any. Then the rest of the logic can be handled by the caller (including the logic of trying up to two lines). -- --Guido van Rossum (python.org/~guido)
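A minimal sketch of such a per-line helper (hypothetical, not a stdlib function; it uses the cookie regex discussed elsewhere in this thread, with the underscore included):

    import re

    _COOKIE_RE = re.compile(rb"^[ \t\v]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

    def coding_spec(line):
        """Return the encoding declared by a coding cookie in `line` (bytes),
        or None if the line carries no cookie."""
        m = _COOKIE_RE.match(line)
        return m.group(1).decode("ascii") if m else None

The caller can then apply the up-to-two-lines policy itself and decide what to do about unknown encodings, without catching SyntaxError.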
On Thu, 17 Mar 2016 at 07:56 Guido van Rossum <guido@python.org> wrote:
On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Should we recommend that everyone use tokenize.detect_encoding()?
Likely. However the interface of tokenize.detect_encoding() is not very simple.
I just found that out yesterday. You have to give it a readline() function, which is cumbersome if all you have is a (byte) string and you don't want to split it on lines just yet. And the readline() function raises SyntaxError when the encoding isn't right. I wish there were a lower-level helper that just took a line and told you what the encoding in it was, if any. Then the rest of the logic can be handled by the caller (including the logic of trying up to two lines).
Since this is for mypy my guess is you only want to know the encoding, but if you're simply trying to decode bytes of source then importlib.util.decode_source() will handle that for you.
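For reference, that helper takes the raw bytes and hands back decoded text in one step:

    import importlib.util

    source_bytes = b"# coding: latin-1\nname = '\xe9'\n"
    text = importlib.util.decode_source(source_bytes)
    print(text)  # decoded via the declared cookie, with universal newlines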
On 17.03.16 16:55, Guido van Rossum wrote:
On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Should we recommend that everyone use tokenize.detect_encoding()?
Likely. However the interface of tokenize.detect_encoding() is not very simple.
I just found that out yesterday. You have to give it a readline() function, which is cumbersome if all you have is a (byte) string and you don't want to split it on lines just yet. And the readline() function raises SyntaxError when the encoding isn't right. I wish there were a lower-level helper that just took a line and told you what the encoding in it was, if any. Then the rest of the logic can be handled by the caller (including the logic of trying up to two lines).
The simplest way to detect the encoding of a bytes string:

    lines = data.splitlines()
    encoding = tokenize.detect_encoding(iter(lines).__next__)[0]

If you don't want to split all the data into lines, the most efficient way in Python 3.5 is:

    encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]

In Python 3.5, io.BytesIO(data) has constant complexity. In older versions, for detecting the encoding without copying the data or splitting it all into lines, you should write a line iterator. For example:

    def iterlines(data):
        start = 0
        while True:
            end = data.find(b'\n', start) + 1
            if not end:
                break
            yield data[start:end]
            start = end
        yield data[start:]

    encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]

or:

    it = (m.group() for m in re.finditer(b'.*\n?', data))
    encoding = tokenize.detect_encoding(it.__next__)[0]

I don't know which approach is more efficient.
On Thu, Mar 17, 2016 at 9:50 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 17.03.16 16:55, Guido van Rossum wrote:
On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Should we recommend that everyone use tokenize.detect_encoding()?
Likely. However the interface of tokenize.detect_encoding() is not very simple.
I just found that out yesterday. You have to give it a readline() function, which is cumbersome if all you have is a (byte) string and you don't want to split it on lines just yet. And the readline() function raises SyntaxError when the encoding isn't right. I wish there were a lower-level helper that just took a line and told you what the encoding in it was, if any. Then the rest of the logic can be handled by the caller (including the logic of trying up to two lines).
The simplest way to detect the encoding of a bytes string:

    lines = data.splitlines()
    encoding = tokenize.detect_encoding(iter(lines).__next__)[0]
This will raise SyntaxError if the encoding is unknown. That needs to be caught in mypy's case and then it needs to get the line number from the exception. I tried this and it was too painful, so now I've just changed the regex that mypy uses to use non-eager matching (https://github.com/python/mypy/commit/b291998a46d580df412ed28af1ba1658446b9f...).
If you don't want to split all the data into lines, the most efficient way in Python 3.5 is:
encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]
In Python 3.5 io.BytesIO(data) has constant complexity.
Ditto with the SyntaxError though.
In older versions, for detecting the encoding without copying the data or splitting it all into lines, you should write a line iterator. For example:
    def iterlines(data):
        start = 0
        while True:
            end = data.find(b'\n', start) + 1
            if not end:
                break
            yield data[start:end]
            start = end
        yield data[start:]
encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]
or
    it = (m.group() for m in re.finditer(b'.*\n?', data))
    encoding = tokenize.detect_encoding(it.__next__)[0]

I don't know which approach is more efficient.
Having my own regex was simpler. :-( -- --Guido van Rossum (python.org/~guido)
On 17.03.16 21:11, Guido van Rossum wrote:
I tried this and it was too painful, so now I've just changed the regex that mypy uses to use non-eager matching (https://github.com/python/mypy/commit/b291998a46d580df412ed28af1ba1658446b9f...).
\s* matches newlines. {0,1}? is the same as ??.
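The first point is the important one; an illustrative pattern (not mypy's actual regex) shows how a leading \s* silently crosses onto the second line:

    import re

    data = b"\n# coding: latin-1\n"
    m = re.match(rb"\s*#\s*coding[:=]\s*([-\w.]+)", data)
    print(m.group(1))  # b'latin-1', found on line two despite re.match()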
On 17.03.16 21:11, Guido van Rossum wrote:
This will raise SyntaxError if the encoding is unknown. That needs to be caught in mypy's case and then it needs to get the line number from the exception.
Good point. The "lineno" and "offset" attributes of SyntaxError are set to None by tokenize.detect_encoding() and to 0 by the CPython interpreter. They should be set to useful values.
On 17.03.2016 15:55, Guido van Rossum wrote:
On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
Should we recommend that everyone use tokenize.detect_encoding()?
Likely. However the interface of tokenize.detect_encoding() is not very simple.
I just found that out yesterday. You have to give it a readline() function, which is cumbersome if all you have is a (byte) string and you don't want to split it on lines just yet. And the readline() function raises SyntaxError when the encoding isn't right. I wish there were a lower-level helper that just took a line and told you what the encoding in it was, if any. Then the rest of the logic can be handled by the caller (including the logic of trying up to two lines).
I've uploaded the code I posted yesterday, modified to address some of the issues it had, to github: https://github.com/malemburg/python-snippets/blob/master/detect_source_encod... I'm pretty sure the two-line read can be optimized away and put straight into the regular expression used for matching. -- Marc-Andre Lemburg (eGenix.com)
On 17.03.2016 01:29, Guido van Rossum wrote:
I've updated the PEP. Please review. I decided not to update the Unicode howto (the thing is too obscure). Serhiy, you're probably in a better position to fix the code looking for cookies to pick the first one if there are two on the same line (or do whatever you think should be done there).
Thanks, will do.
Should we recommend that everyone use tokenize.detect_encoding()?
I'd prefer a separate utility for this somewhere, since tokenize.detect_encoding() is not available in Python 2. I've attached an example implementation with tests, which works in Python 2.7 and 3.
On Wed, Mar 16, 2016 at 5:05 PM, Guido van Rossum <guido@python.org> wrote:
On Wed, Mar 16, 2016 at 12:59 AM, M.-A. Lemburg <mal@egenix.com> wrote:
The only reason to read up to two lines was to address the use of the shebang on Unix, not to be able to define two competing source code encodings :-)
I know. I was just surprised that the PEP was sufficiently vague about it that when I found that mypy picked the second if there were two, I couldn't prove to myself that it was violating the PEP. I'd rather clarify the PEP than rely on the reasoning presented earlier here.
I suppose it's a rather rare case, since it's the first time that I heard about anyone thinking that a possible second line could be picked - after 15 years :-)
I don't like erroring out when there are two different cookies on two lines; I feel that the spirit of the PEP is to read up to two lines until a cookie is found, whichever comes first.
I will update the regex in the PEP too (or change the wording to avoid "match").
I'm not sure what to do if there are two cookies on one line. If CPython currently picks the latter we may want to preserve that behavior.
Should we recommend that everyone use tokenize.detect_encoding()?
-- --Guido van Rossum (python.org/~guido)
-- Marc-Andre Lemburg (eGenix.com)
On 17.03.16 15:14, M.-A. Lemburg wrote:
On 17.03.2016 01:29, Guido van Rossum wrote:
Should we recommend that everyone use tokenize.detect_encoding()?
I'd prefer a separate utility for this somewhere, since tokenize.detect_encoding() is not available in Python 2.
I've attached an example implementation with tests, which works in Python 2.7 and 3.
Sorry, but this code matches neither the behaviour of the Python interpreter nor that of other tools. I suggest backporting tokenize.detect_encoding() (but be aware that the default encoding in Python 2 is ASCII, not UTF-8).
On 17.03.2016 15:02, Serhiy Storchaka wrote:
On 17.03.16 15:14, M.-A. Lemburg wrote:
On 17.03.2016 01:29, Guido van Rossum wrote:
Should we recommend that everyone use tokenize.detect_encoding()?
I'd prefer a separate utility for this somewhere, since tokenize.detect_encoding() is not available in Python 2.
I've attached an example implementation with tests, which works in Python 2.7 and 3.
Sorry, but this code matches neither the behaviour of the Python interpreter nor that of other tools. I suggest backporting tokenize.detect_encoding() (but be aware that the default encoding in Python 2 is ASCII, not UTF-8).
Yes, I got the default for Python 3 wrong. I'll fix that. Thanks for the note. What other aspects are different from what Python implements? -- Marc-Andre Lemburg (eGenix.com)
On 17.03.16 19:23, M.-A. Lemburg wrote:
On 17.03.2016 15:02, Serhiy Storchaka wrote:
On 17.03.16 15:14, M.-A. Lemburg wrote:
On 17.03.2016 01:29, Guido van Rossum wrote:
Should we recommend that everyone use tokenize.detect_encoding()?
I'd prefer a separate utility for this somewhere, since tokenize.detect_encoding() is not available in Python 2.
I've attached an example implementation with tests, which works in Python 2.7 and 3.
Sorry, but this code matches neither the behaviour of the Python interpreter nor that of other tools. I suggest backporting tokenize.detect_encoding() (but be aware that the default encoding in Python 2 is ASCII, not UTF-8).
Yes, I got the default for Python 3 wrong. I'll fix that. Thanks for the note.
What other aspects are different from what Python implements?
1. If there is a BOM and a coding cookie, the source encoding is "utf-8-sig".
2. If there is a BOM and the coding cookie is not 'utf-8', this is an error.
3. If the first line is not a blank or comment line, the coding cookie is not searched for on the second line.
4. The encoding name should be canonicalized: "UTF8", "utf8", "utf_8" and "utf-8" are the same encoding (and all are changed to "utf-8-sig" with a BOM).
5. There is no 400-byte limit. Actually there is a bug in handling long lines in the current code, but even with this bug the limit is larger.
6. I made a mistake in the regular expression and missed the underscore.

tokenize.detect_encoding() is the closest imitation of the behavior of the Python interpreter.
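Points 1 and 2 can be seen directly with tokenize (a quick sketch):

    import io
    import tokenize

    bom = b"\xef\xbb\xbf"
    ok = bom + b"# coding: utf-8\nx = 1\n"
    print(tokenize.detect_encoding(io.BytesIO(ok).readline)[0])  # 'utf-8-sig'

    bad = bom + b"# coding: latin-1\nx = 1\n"
    try:
        tokenize.detect_encoding(io.BytesIO(bad).readline)
    except SyntaxError:
        print("BOM contradicts the cookie")  # this branch runs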
On 17.03.2016 18:53, Serhiy Storchaka wrote:
On 17.03.16 19:23, M.-A. Lemburg wrote:
On 17.03.2016 15:02, Serhiy Storchaka wrote:
On 17.03.16 15:14, M.-A. Lemburg wrote:
On 17.03.2016 01:29, Guido van Rossum wrote:
Should we recommend that everyone use tokenize.detect_encoding()?
I'd prefer a separate utility for this somewhere, since tokenize.detect_encoding() is not available in Python 2.
I've attached an example implementation with tests, which works in Python 2.7 and 3.
Sorry, but this code doesn't match the behaviour of Python interpreter, nor other tools. I suggest to backport tokenize.detect_encoding() (but be aware that the default encoding in Python 2 is ASCII, not UTF-8).
Yes, I got the default for Python 3 wrong. I'll fix that. Thanks for the note.
What other aspects are different than what Python implements ?
1. If there is a BOM and coding cookie, the source encoding is "utf-8-sig".
Ok, that makes sense (even though it's not mandated by the PEP; the utf-8-sig codec didn't exist yet).
2. If there is a BOM and coding cookie is not 'utf-8', this is an error.
It's an error for Python, but why should a detection function always raise an error for this case? It would probably be a good idea to have an errors parameter to leave this to the user to decide. Same for unknown encodings.
3. If the first line is not blank or comment line, the coding cookie is not searched in the second line.
Hmm, the PEP does allow having the coding cookie in the second line, even if the first line is not a comment. Perhaps that's not really needed.
4. Encoding name should be canonized. "UTF8", "utf8", "utf_8" and "utf-8" is the same encoding (and all are changed to "utf-8-sig" with BOM).
Well, that's cosmetics :-) The codec system will take care of this when needed.
5. There isn't the limit of 400 bytes. Actually there is a bug with handling long lines in current code, but even with this bug the limit is larger.
I think it's a reasonable limit, since shebang lines may only be 127 characters long on at least Linux (and probably several other Unix systems as well). But just in case, I made this configurable :-)
6. I made a mistake in the regular expression, missed the underscore.
I added it.
tokenize.detect_encoding() is the closest imitation of the behavior of Python interpreter.
Probably, but that doesn't help us on Python 2, right? I'll upload the script to github later today or tomorrow to continue development. -- Marc-Andre Lemburg (eGenix.com)
On 15.03.16 22:30, Guido van Rossum wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
There is a similar question. If a file has two different coding cookies on the same line, which should win? Currently the last cookie wins in the CPython parser, in the tokenize module, in IDLE, and in a number of other places. I think this is a bug.
On Wed, Mar 16, 2016 at 5:03 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 15.03.16 22:30, Guido van Rossum wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
There is a similar question. If a file has two different coding cookies on the same line, which should win? Currently the last cookie wins in the CPython parser, in the tokenize module, in IDLE, and in a number of other places. I think this is a bug.
Why would you ever have two coding cookies in a file? Surely this should be either an error, or ill-defined (i.e. parsers are allowed to pick whichever they like, including raising)? ChrisA
On Wed, Mar 16, 2016 at 2:07 AM, Chris Angelico <rosuav@gmail.com> wrote:
Why would you ever have two coding cookies in a file? Surely this should be either an error, or ill-defined (ie parsers are allowed to pick whichever they like, including raising)?
ChrisA
+1. If multiple coding cookies are found, and all do not agree, I would expect an error to be raised. That it apparently does not raise an error currently is surprising to me. (If multiple coding cookies are found but do agree, perhaps raising a warning would be a good idea.)
On 3/15/2016 11:07 PM, Chris Angelico wrote:
On Wed, Mar 16, 2016 at 5:03 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
On 15.03.16 22:30, Guido van Rossum wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
There is a similar question. If a file has two different coding cookies on the same line, which should win? Currently the last cookie wins in the CPython parser, in the tokenize module, in IDLE, and in a number of other places. I think this is a bug.
Why would you ever have two coding cookies in a file? Surely this should be either an error, or ill-defined (i.e. parsers are allowed to pick whichever they like, including raising)?
ChrisA
From PEP 263:
To define a source code encoding, a magic comment must be placed into the source files either as first or second line in the file, such as:
So clearly there is only one magic comment: "either" the first or second line, not both. Both, therefore, should be an error. Again from PEP 263:
More precisely, the first or second line must match the regular expression "coding[:=]\s*([-\w.]+)". The first group of this expression is then interpreted as encoding name. If the encoding is unknown to Python, an error is raised during compilation. There must not be any Python statement on the line that contains the encoding declaration.
Clearly the regular expression would only match the first of multiple cookies on the same line, so the first one should always win... but there should only be one, from the first PEP quote "a magic comment". Glenn
On 16.03.16 08:34, Glenn Linderman wrote:
From PEP 263:
More precisely, the first or second line must match the regular expression "coding[:=]\s*([-\w.]+)". The first group of this expression is then interpreted as encoding name. If the encoding is unknown to Python, an error is raised during compilation. There must not be any Python statement on the line that contains the encoding declaration.
Clearly the regular expression would only match the first of multiple cookies on the same line, so the first one should always win... but there should only be one, from the first PEP quote "a magic comment".
"The first group of this expression" means the first regular expression group. Only the part between parenthesis "([-\w.]+)" is interpreted as encoding name, not all expression.
On 3/16/2016 12:09 AM, Serhiy Storchaka wrote:
On 16.03.16 08:34, Glenn Linderman wrote:
From PEP 263:
More precisely, the first or second line must match the regular expression "coding[:=]\s*([-\w.]+)". The first group of this expression is then interpreted as encoding name. If the encoding is unknown to Python, an error is raised during compilation. There must not be any Python statement on the line that contains the encoding declaration.
Clearly the regular expression would only match the first of multiple cookies on the same line, so the first one should always win... but there should only be one, from the first PEP quote "a magic comment".
"The first group of this expression" means the first regular expression group. Only the part between parenthesis "([-\w.]+)" is interpreted as encoding name, not all expression.
Sure. But there is no mention anywhere in the PEP of more than one being legal: just more than one position for it, EITHER line 1 or line 2. So while the regular expression mentioned is not anchored, to allow variation in syntax between emacs and vim, "must match the regular expression" doesn't imply "several times", and when searching for a regular expression that might not be anchored, one typically expects to find the first. Glenn
On 16.03.16 09:46, Glenn Linderman wrote:
On 3/16/2016 12:09 AM, Serhiy Storchaka wrote:
On 16.03.16 08:34, Glenn Linderman wrote:
From PEP 263:
More precisely, the first or second line must match the regular expression "coding[:=]\s*([-\w.]+)". The first group of this expression is then interpreted as encoding name. If the encoding is unknown to Python, an error is raised during compilation. There must not be any Python statement on the line that contains the encoding declaration.
Clearly the regular expression would only match the first of multiple cookies on the same line, so the first one should always win... but there should only be one, from the first PEP quote "a magic comment".
"The first group of this expression" means the first regular expression group. Only the part between parenthesis "([-\w.]+)" is interpreted as encoding name, not all expression.
Sure. But there is no mention anywhere in the PEP of more than one being legal: just more than one position for it, EITHER line 1 or line 2. So while the regular expression mentioned is not anchored, to allow variation in syntax between emacs and vim, "must match the regular expression" doesn't imply "several times", and when searching for a regular expression that might not be anchored, one typically expects to find the first.
Actually "must match the regular expression" is not correct, because re.match() implies anchoring at the start. I have proposed more correct regular expression in other branch of this thread.
On 3/16/2016 12:59 AM, Serhiy Storchaka wrote:
On 16.03.16 09:46, Glenn Linderman wrote:
On 3/16/2016 12:09 AM, Serhiy Storchaka wrote:
On 16.03.16 08:34, Glenn Linderman wrote:
From PEP 263:
More precisely, the first or second line must match the regular expression "coding[:=]\s*([-\w.]+)". The first group of this expression is then interpreted as encoding name. If the encoding is unknown to Python, an error is raised during compilation. There must not be any Python statement on the line that contains the encoding declaration.
Clearly the regular expression would only match the first of multiple cookies on the same line, so the first one should always win... but there should only be one, from the first PEP quote "a magic comment".
"The first group of this expression" means the first regular expression group. Only the part between parenthesis "([-\w.]+)" is interpreted as encoding name, not all expression.
Sure. But there is no mention anywhere in the PEP of more than one being legal: just more than one position for it, EITHER line 1 or line 2. So while the regular expression mentioned is not anchored, to allow variation in syntax between emacs and vim, "must match the regular expression" doesn't imply "several times", and when searching for a regular expression that might not be anchored, one typically expects to find the first.
Actually "must match the regular expression" is not correct, because re.match() implies anchoring at the start. I have proposed more correct regular expression in other branch of this thread.
"match" doesn't imply anchoring at the start. "re.match()" does (and as a result is very confusing to newbies to Python re, that have used other regexp systems).
On 03/17/2016 04:54 PM, Glenn Linderman wrote:
On 3/16/2016 12:59 AM, Serhiy Storchaka wrote:
Actually "must match the regular expression" is not correct, because re.match() implies anchoring at the start. I have proposed more correct regular expression in other branch of this thread.
"match" doesn't imply anchoring at the start. "re.match()" does (and as a result is very confusing to newbies to Python re, that have used other regexp systems).
It still confuses me from time to time. :( -- ~Ethan~
On 16.03.16 08:03, Serhiy Storchaka wrote:
On 15.03.16 22:30, Guido van Rossum wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
There is similar question. If a file has two different coding cookies on the same line, what should win? Currently the last cookie wins, in CPython parser, in the tokenize module, in IDLE, and in number of other code. I think this is a bug.
I just tested with Emacs, and it looks like when you specify different codings on two different lines the first coding wins, but when you specify different codings on the same line the last coding wins. Therefore the current CPython behavior may be correct, and the regular expression in PEP 263 should be changed to use greedy repetition.
On 3/19/2016 8:19 AM, Serhiy Storchaka wrote:
On 16.03.16 08:03, Serhiy Storchaka wrote:
On 15.03.16 22:30, Guido van Rossum wrote:
I came across a file that had two different coding cookies -- one on the first line and one on the second. CPython uses the first, but mypy happens to use the second. I couldn't find anything in the spec or docs ruling out the second interpretation. Does anyone have a suggestion (apart from following CPython)?
Reference: https://github.com/python/mypy/issues/1281
There is similar question. If a file has two different coding cookies on the same line, what should win? Currently the last cookie wins, in CPython parser, in the tokenize module, in IDLE, and in number of other code. I think this is a bug.
I just tested with Emacs, and it looks like when you specify different codings on two different lines the first coding wins, but when you specify different codings on the same line the last coding wins.
Therefore the current CPython behavior may be correct, and the regular expression in PEP 263 should be changed to use greedy repetition.
Just because emacs works that way (and even though I'm an emacs user), that doesn't mean CPython should act like emacs.

(1) CPython should not necessarily act like emacs, unless the coding syntax exactly matches emacs, rather than the generic coding that CPython interprets, which matches emacs, vim, and other similar things that both emacs and vim would ignore.

(1a) Maybe if a similar test were run on vim with its syntax, and it also worked the same way, then one might think it is a trend worth following, but it is not clear to this non-vim user that vim syntax allows more than one coding specification per line.

(2) emacs has no requirement that the coding be placed on the first two lines. It specifically looks at the second line only if the first line has a "#!" or a "'\"" (for troff). (According to the docs, not experimentation.)

(3) emacs also allows for Local Variables to be specified at the end of the file. If CPython were really to act like emacs, then it would need to allow for that too.

(4) there is no benefit to specifying the coding twice on a line; it only adds confusion, whether in CPython, emacs, or vim.

(4a) Here's an untested line that emacs would interpret as utf-8, and CPython with the greedy regular expression would interpret as latin-1, because emacs looks only between the -*- pair, and CPython ignores that:

    # -*- coding: utf-8 -*- this file does not use coding: latin-1
Glenn Linderman writes:
On 3/19/2016 8:19 AM, Serhiy Storchaka wrote:
Therefore the current CPython behavior may be correct, and the regular expression in PEP 263 should be changed to use greedy repetition.
Just because emacs works that way (and even though I'm an emacs user), that doesn't mean CPython should act like emacs.
(1) CPython should not necessarily act like emacs,
We can't treat Emacs as a spec, because Emacs doesn't follow specs, doesn't respect standards, and above a certain level of inconvenience to developers doesn't respect backward compatibility. There's never any guarantee that Emacs will do the same thing tomorrow that it does today, although inertia has mostly the same effect. In this case, there's a reason why Emacs behaves the way it does, which is that you can put an arbitrary sequence of variable assignments in "-*- ... -*-" and they will be executed in order. So it makes sense that "last coding wins". But pragmas are severely deprecated in Python; cookies got a very special exception. So that rationale can't apply to Python.
(4) there is no benefit to specifying the coding twice on a line, it only adds confusion, whether in CPython, emacs, or vim.
Indeed. I see no point in reading past the first cookie found (whether a valid codec or not), unless an error would be raised. That might be a good idea, but I doubt it's worth the implementation complexity.
On 19.03.16 19:36, Glenn Linderman wrote:
On 3/19/2016 8:19 AM, Serhiy Storchaka wrote:
On 16.03.16 08:03, Serhiy Storchaka wrote:
I just tested with Emacs, and it looks like when you specify different codings on two different lines the first coding wins, but when you specify different codings on the same line the last coding wins.
Therefore the current CPython behavior may be correct, and the regular expression in PEP 263 should be changed to use greedy repetition.
Just because emacs works that way (and even though I'm an emacs user), that doesn't mean CPython should act like emacs.
Yes. But current CPython works that way. The behavior of Emacs is an argument that maybe this is not a bug.
(4) there is no benefit to specifying the coding twice on a line; it only adds confusion, whether in CPython, emacs, or vim.

(4a) Here's an untested line that emacs would interpret as utf-8, and CPython with the greedy regular expression would interpret as latin-1, because emacs looks only between the -*- pair, and CPython ignores that:

    # -*- coding: utf-8 -*- this file does not use coding: latin-1
Since Emacs allows specifying the coding twice on a line, and this can be ambiguous, and CPython already detects some ambiguous situations (a UTF-8 BOM with a non-UTF-8 coding cookie), it may be worth adding a check that the coding is specified only once on a line.
On 3/19/2016 2:37 PM, Serhiy Storchaka wrote:
On 19.03.16 19:36, Glenn Linderman wrote:
On 3/19/2016 8:19 AM, Serhiy Storchaka wrote:
On 16.03.16 08:03, Serhiy Storchaka wrote:
I just tested with Emacs, and it looks like when you specify different codings on two different lines the first coding wins, but when you specify different codings on the same line the last coding wins.
Therefore the current CPython behavior may be correct, and the regular expression in PEP 263 should be changed to use greedy repetition.
Just because emacs works that way (and even though I'm an emacs user), that doesn't mean CPython should act like emacs.
Yes. But current CPython works that way. The behavior of Emacs is an argument that maybe this is not a bug.
If CPython properly handles the following line as having only one proper coding declaration (utf-8), then I might reluctantly agree that the behavior of Emacs might be a relevant argument. Otherwise, vehemently not relevant.

    # -*- coding: utf-8 -*- this file does not use coding: latin-1
(4) there is no benefit to specifying the coding twice on a line; it only adds confusion, whether in CPython, emacs, or vim.

(4a) Here's an untested line that emacs would interpret as utf-8, and CPython with the greedy regular expression would interpret as latin-1, because emacs looks only between the -*- pair, and CPython ignores that:

    # -*- coding: utf-8 -*- this file does not use coding: latin-1
Since Emacs allows specifying the coding twice on a line, and this can be ambiguous, and CPython already detects some ambiguous situations (a UTF-8 BOM with a non-UTF-8 coding cookie), it may be worth adding a check that the coding is specified only once on a line.
Diagnosing ambiguous conditions, even including my example above, might be useful... for a few files... is it worth the effort? What % of .py sources have coding specifications? What % of those have two?
On 20 March 2016 at 07:46, Glenn Linderman <v+python@g.nevcal.com> wrote:
Diagnosing ambiguous conditions, even including my example above, might be useful... for a few files... is it worth the effort? What % of .py sources have coding specifications? What % of those have two?
And there's a decent argument for leaving detecting such cases to linters rather than the tokeniser. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
participants (14)

- Ben Finney
- Brett Cannon
- Chris Angelico
- Ethan Furman
- Glenn Linderman
- Guido van Rossum
- Jon Ribbens
- Jonathan Goble
- M.-A. Lemburg
- MRAB
- Nick Coghlan
- Serhiy Storchaka
- Stephen J. Turnbull
- Terry Reedy