[issue20115] NUL bytes in commented lines

Fri Jan 10 19:22:35 CET 2014

Terry J. Reedy added the comment:

Python should have a uniform definition of 'Python source' in both the doc and in practice in all source code processing functions. Currently, "2. Lexical analysis" in the Language Manual just says "Python reads program text as Unicode code points; the encoding of a source file can be given by an encoding declaration and defaults to UTF-8." UTF-8 encodes code point U+0000 as a null byte and this code point is nowhere excluded in the doc. (The definition of string literals uses 'source character' without any additional specification, so I take it to mean 'Unicode code point'.)
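
The UTF-8 claim can be checked directly. This is a minimal sketch; the variable names are illustrative only and do not come from the doc or from CPython.

```python
# U+0000 is an ordinary code point as far as the UTF-8 codec is concerned:
# it encodes to exactly one zero byte and decodes back unchanged.
nul = "\x00"
encoded = nul.encode("utf-8")
print(encoded)       # b'\x00'
print(len(encoded))  # 1
print(encoded.decode("utf-8") == nul)  # True

# A commented line that contains U+0000 therefore contains one null byte
# once the source file is written out as UTF-8.
line = "x = 1  # before\x00after\n"
data = line.encode("utf-8")
print(data.count(b"\x00"))  # 1
```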

If U+0000 is a legal 'source character', then, like other control chars not given special meaning, it should raise a SyntaxError unless it occurs in a comment or string literal. Eval and exec exclude even the latter with
TypeError: source code string cannot contain null bytes
If null bytes are legal, this is wrong.
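
A small sketch of the exec behavior described above. The helper name rejects_nul is hypothetical. The exception type is an assumption to check against the interpreter in use: the message quoted above is a TypeError, but other Python versions may raise a different type, so the helper catches several and reports which one it saw.

```python
def rejects_nul(source):
    """Return the name of the exception exec() raises for source, or None."""
    try:
        exec(source, {})  # run in a throwaway namespace
    except (TypeError, ValueError, SyntaxError) as exc:
        # Assumption: the rejection surfaces as one of these three types.
        return type(exc).__name__
    return None

# A literal U+0000 inside a comment and inside a string literal.
in_comment = "x = 1  # comment with \x00 inside\n"
in_string = "s = 'a\x00b'\n"
# Here the source text holds the four characters \x00 (an escape sequence),
# not a literal U+0000, so there is no null character in the source itself.
escaped = "s = 'a\\x00b'\n"

print(rejects_nul(in_comment))
print(rejects_nul(in_string))
print(rejects_nul(escaped))
```

Only the first two sources contain U+0000 itself, and both are refused even though the character sits in a comment or a string literal. The third one runs, since the null character is produced by the escape sequence only after parsing.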

Simply truncating lines, as done by the CPython parser, is wrong whether or not U+0000 is legal.

The simplest change would be to change the parser to match exec and add " other than U+0000" after "Unicode code points" in the sentence quoted above.

----------
nosy: +terry.reedy

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue20115>
_______________________________________