[issue14629] discrepancy between tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename()
report at bugs.python.org
Fri Apr 20 07:17:18 CEST 2012
New submission from Eric Snow <ericsnowcurrently at gmail.com>:
tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename() behave differently in an unexpected way, and this has a bearing on the current work on imports.
When a file has no encoding declaration (see PEP 263), Python falls back to UTF-8 (see PEP 3120). The tokenize module (Lib/tokenize.py) implements this in "detect_encoding()". The CPython internal tokenizer (Python/tokenizer.c) does so in "PyTokenizer_FindEncodingFilename()". Both check the first two lines of the file, per PEP 263.
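As a minimal sketch of the lookup described above, tokenize.detect_encoding() can be fed in-memory sources with and without a PEP 263 declaration. The sample byte strings and the detect() helper are made up for illustration and are not part of the report:

```python
import io
import tokenize

def detect(source_bytes):
    """Return the encoding name tokenize.detect_encoding() reports."""
    # detect_encoding() expects a readline callable that returns bytes.
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# Hypothetical sample sources, not taken from the CPython test suite.
with_cookie = b"# -*- coding: latin-1 -*-\nx = 1\n"
without_cookie = b"x = 1\n"
with_bom = b"\xef\xbb\xbfx = 1\n"

print(detect(with_cookie))     # normalized name for latin-1
print(detect(without_cookie))  # the PEP 3120 default
print(detect(with_bom))        # BOM-aware variant of the default
```

Only the first two lines are inspected, so a declaration further down the file has no effect on the result.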
When faced with an unparsable file (per the encoding), tokenize.detect_encoding() will gladly give you the encoding without any fuss. However, PyTokenizer_FindEncodingFilename() will raise a SyntaxError in that situation.
The 'badsyntax_pep3120' test (Lib/test/badsyntax_pep3120.py) is one module that demonstrates this discrepancy. I'll use it in the following example.
import tokenize
with open("cpython/Lib/test/badsyntax_pep3120.py", "rb") as f:  # readline must return bytes
    enc, _ = tokenize.detect_encoding(f.readline)
print(enc) # "utf-8" (no SyntaxError)
I've attached the source for a C extension module ('_tokenizer') that wraps PyTokenizer_FindEncodingFilename().
import _tokenizer  # the attached extension module
enc = _tokenizer.detect_encoding("cpython/Lib/test/badsyntax_pep3120.py")  # raises SyntaxError
print(enc) # never reached
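The comparison can be roughly approximated without building the extension. The sketch below rests on two assumptions: BAD_SOURCE is a hypothetical stand-in for the test file (Latin-1 bytes with no encoding declaration), and the built-in compile() is used as a stand-in for the C tokenizer path, since PyTokenizer_FindEncodingFilename() is not reachable from pure Python. What detect_encoding() reports for such input may differ between Python versions, so both outcomes are captured rather than assumed:

```python
import io
import tokenize

# Hypothetical stand-in for the test file: Latin-1 bytes, no coding declaration.
BAD_SOURCE = b'print("b\xf6se")\n'

def try_detect(source_bytes):
    """Report what tokenize.detect_encoding() does with source_bytes."""
    try:
        readline = io.BytesIO(source_bytes).readline
        encoding, _lines_read = tokenize.detect_encoding(readline)
        return encoding
    except (SyntaxError, UnicodeDecodeError) as exc:
        return "%s: %s" % (type(exc).__name__, exc)

def try_compile(source_bytes):
    """Report what the built-in compiler does with source_bytes."""
    try:
        compile(source_bytes, "<bad source>", "exec")
        return "compiled"
    except (SyntaxError, UnicodeDecodeError) as exc:
        return "%s: %s" % (type(exc).__name__, exc)

print("tokenize:", try_detect(BAD_SOURCE))
print("compile :", try_compile(BAD_SOURCE))
```

If the two lines disagree on a given interpreter, that is the same kind of mismatch the report describes.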
Some relevant, related notes:
The discrepancies extend further too. The following code raises a UnicodeDecodeError, rather than a SyntaxError:
In 3.1 (C-based import machinery, Python/import.c), the following results in a SyntaxError, during encoding detection. In the current repo tip (importlib-based import machinery, Lib/importlib/_bootstrap.py), the following results in a SyntaxError much later, during compilation.
importlib uses tokenize.detect_encoding() and import.c uses PyTokenizer_FindEncodingFilename()...
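For readers on later Python versions (3.4 and up), importlib exposes its source decoding step as importlib.util.decode_source(). A minimal sketch of that path, assuming a hypothetical module source with a valid PEP 263 declaration:

```python
import importlib.util

# Hypothetical module source: Latin-1 bytes with a PEP 263 declaration.
raw = b'# -*- coding: latin-1 -*-\nname = "b\xf6se"\n'

# Decode the bytes to text the way the import machinery would.
text = importlib.util.decode_source(raw)
print(repr(text))
```

The declared encoding is honored here, so the Latin-1 byte decodes cleanly. The interesting case in this report is the one where no declaration is present and the bytes are not valid UTF-8.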
components: Library (Lib)
nosy: brett.cannon, eric.snow, loewis
title: discrepancy between tokenize.detect_encoding() and PyTokenizer_FindEncodingFilename()
versions: Python 3.3
Added file: http://bugs.python.org/file25283/_tokenizer.c
Python tracker <report at bugs.python.org>