Reading Python source file

I'm working on rewriting the Python tokenizer (in particular the part that reads and decodes a Python source file). The code is complicated. Currently there are the following cases:
- Reading from a string in memory.
- Interactive reading from a file.
- Reading from a file:
  - Raw reading ignoring the encoding, in the parser generator.
  - Raw reading of a UTF-8 encoded file.
  - Reading and recoding to UTF-8.
The file is read line by line. This makes it hard to check the correctness of the first line when the encoding is specified on the second line, and it causes very hard problems with null bytes and with desynchronization between buffered C and Python files. All these problems could easily be solved by reading the whole Python source file into memory and then parsing it as a string. This would allow us to drop a large, complex, and buggy part of the code.
Are there disadvantages to this solution? As for memory consumption, the source text itself will consume only a small part of the memory consumed by the AST and other structures. As for performance, reading and decoding the whole file can be faster than doing it line by line.
[1] http://bugs.python.org/issue25643
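[Editor's note: a minimal pure-Python sketch of the proposed approach, using the stdlib tokenize module for illustration - this is not the actual C tokenizer code.]

    import io
    import tokenize

    def compile_file(path):
        # Read the whole file at once instead of line by line.
        with open(path, 'rb') as f:
            raw = f.read()
        # Find the coding cookie or BOM in the first couple of lines.
        encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
        # Decode once, up front, then parse everything as a single string.
        source = raw.decode(encoding)
        return compile(source, path, 'exec')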

If you free the memory used for the source buffer before starting code generation you should be good.
On Mon, Nov 16, 2015 at 5:53 PM, Serhiy Storchaka storchaka@gmail.com wrote:
[Serhiy's original message quoted in full - snipped]

On 17.11.15 05:00, Guido van Rossum wrote:
If you free the memory used for the source buffer before starting code generation you should be good.
Thank you. The buffer is freed just after the AST is generated.

OK, but what are you going to do about the interactive REPL?
On Tue, Nov 17, 2015 at 7:40 AM, Serhiy Storchaka storchaka@gmail.com wrote:
On 17.11.15 05:00, Guido van Rossum wrote:
If you free the memory used for the source buffer before starting code generation you should be good.
Thank you. The buffer is freed just after the AST is generated.

Oh, cool! Sorry for the disturbance.
On Tue, Nov 17, 2015 at 8:27 AM, Serhiy Storchaka storchaka@gmail.com wrote:
On 17.11.15 18:06, Guido van Rossum wrote:
OK, but what are you going to do about the interactive REPL?
Nothing (except some simplification). This is a separate branch of the code.

On 2015-11-17 01:53, Serhiy Storchaka wrote:
[Serhiy's original message - snipped]
As I understand it, *nix expects the shebang to be b'#!', which means that the first line should be ASCII-compatible (it's possible that the UTF-8 BOM might be present). This kind of suggests that encodings like UTF-16 would cause a problem on such systems.
The encoding line also needs to be ASCII-compatible.
I believe that the recent thread "Support of UTF-16 and UTF-32 source encodings" also concluded that UTF-16 and UTF-32 shouldn't be supported.
This means that you could treat the first 2 lines as though they were some kind of extended ASCII (Latin-1?), the line ending being '\n' or '\r' or '\r\n'.
Once you'd identify the encoding, you could decode everything (including the shebang line) using that encoding.
(What should happen if the encoding line then decoded differently, i.e. encoding_line.decode(encoding) != encoding_line.decode('latin-1')?)
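[Editor's note: a rough sketch of that two-line scan; the cookie regex here is modelled on PEP 263 and is an assumption, not the tokenizer's actual pattern.]

    import re

    # PEP 263-style cookie, e.g. "# -*- coding: latin-1 -*-".
    COOKIE_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

    def scan_cookie(raw: bytes) -> str:
        # Only the first two lines may carry the cookie, and both must be
        # ASCII-compatible, so decoding them as Latin-1 is safe.
        for line in raw.split(b'\n')[:2]:
            m = COOKIE_RE.match(line.decode('latin-1'))
            if m:
                return m.group(1)
        return 'utf-8'  # the default source encoding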

MRAB python@mrabarnett.plus.com writes:
As I understand it, *nix expects the shebang to be b'#!', which means that the first line should be ASCII-compatible (it's possible that the UTF-8 BOM might be present).
The UTF-8 BOM interferes with it on Mac OSX and Linux, at least.

On 17.11.15 05:05, MRAB wrote:
As I understand it, *nix expects the shebang to be b'#!', which means that the first line should be ASCII-compatible (it's possible that the UTF-8 BOM might be present). This kind of suggests that encodings like UTF-16 would cause a problem on such systems.
The encoding line also needs to be ASCII-compatible.
I believe that the recent thread "Support of UTF-16 and UTF-32 source encodings" also concluded that UTF-16 and UTF-32 shouldn't be supported.
This means that you could treat the first 2 lines as though they were some kind of extended ASCII (Latin-1?), the line ending being '\n' or '\r' or '\r\n'.
Once you'd identify the encoding, you could decode everything (including the shebang line) using that encoding.
Yes, that is what I was going to implement (and I'm already halfway there). My question is whether it is worth complicating the code further to preserve line-by-line reading. In any case, after reading a first line that contains neither a coding cookie nor non-comment tokens, we have to wait for the second line.
(What should happen if the encoding line then decoded differently, i.e. encoding_line.decode(encoding) != encoding_line.decode('latin-1')?)
The parser should get the line decoded with the specified encoding.

On 17.11.2015 02:53, Serhiy Storchaka wrote:
[Serhiy's original message - snipped]
A problem with this approach is that you can no longer fail early and detect indentation errors et al. while parsing the data (which may well come from a pipe).
Another related problem is that you have to wait for the full input data before you can start compiling the code.
I don't think these situations are all that common, though, so reading in the full source code before compiling it sounds like a reasonable approach.
We use the same simplification in eGenix PyRun's emulation of the Python command line interface and it has so far not caused any problems.

On Tue, Nov 17, 2015 at 1:59 AM, M.-A. Lemburg mal@egenix.com wrote:
On 17.11.2015 02:53, Serhiy Storchaka wrote:
[Serhiy's original message - snipped]
A problem with this approach is that you can no longer fail early and detect indentation errors et al. while parsing the data (which may well come from a pipe).
Oh, this use case I had forgotten about. I don't know how common or important it is though.
But more important is the interactive REPL, which parses your input fully each time you hit ENTER.
Another related problem is that you have to wait for the full input data before you can start compiling the code.
That's always the case -- we don't start compiling before we have the full parse tree.
I don't think these situations are all that common, though, so reading in the full source code before compiling it sounds like a reasonable approach.
We use the same simplification in eGenix PyRun's emulation of the Python command line interface and it has so far not caused any problems.
Curious how you do it? I'd actually be quite disappointed if the amount of parsing done by the standard REPL went down.

On 17.11.15 17:22, Guido van Rossum wrote:
But more important is the interactive REPL, which parses your input fully each time you hit ENTER.
The interactive REPL runs different code. It is simpler than the code for reading from a file, because it doesn't have to care about a BOM or coding cookie.

On 17.11.2015 16:22, Guido van Rossum wrote:
On Tue, Nov 17, 2015 at 1:59 AM, M.-A. Lemburg mal@egenix.com wrote:
[moving from reading source line by line to reading all in one go]
We use the same simplification in eGenix PyRun's emulation of the Python command line interface and it has so far not caused any problems.
Curious how you do it? I'd actually be quite disappointed if the amount of parsing done by the standard REPL went down.
Oh, that's easy:
    elif sys.argv[0] == '-' and not (pyrun_as_string or pyrun_as_module):
        # Read the script from stdin
        pyrun_as_string = True
        pyrun_script = sys.stdin.read()
and then, later on:
    # Run the script
    try:
        pyrun_execute_script(pyrun_script, mode)
    except Exception as reason:
        if pyrun_interactive:
            import traceback
            traceback.print_exc()
            pyrun_prompt(banner='')
        else:
            raise
    else:
        # Enter interactive mode, in case wanted
        if pyrun_interactive:
            pyrun_prompt()
The REPL is not affected by this, since we use the standard code.interact() for the prompt. This reads the entry line by line, joins the lines and tries to compile the entry every time it receives a new line until it succeeds or fails.
Serhiy's proposed change should not affect this mode of operation.
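[Editor's note: for reference, the incremental behaviour code.interact() relies on is exposed as codeop.compile_command(), which returns None while an entry is still incomplete.]

    import codeop

    # Incomplete entry: more lines are needed, so None is returned.
    print(codeop.compile_command('if True:'))
    # Complete entry: a code object is returned.
    print(codeop.compile_command("if True:\n    x = 1\n"))
    # A genuine syntax error raises immediately.
    try:
        codeop.compile_command('if True)')
    except SyntaxError as exc:
        print('SyntaxError:', exc)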

On Thu, Nov 19, 2015 at 1:47 AM, M.-A. Lemburg mal@egenix.com wrote:
[MAL's explanation and PyRun code quoted above - snipped]
Yes, this makes sense.
The REPL is not affected by this, since we use the standard code.interact() for the prompt. This reads the entry line by line, joins the lines and tries to compile the entry every time it receives a new line until it succeeds or fails.
Heh, I wrote code.interact() as a poor-man's substitute for what the "real" REPL (implemented in C) does. :-) It usually ends up doing the same thing, but I'm sure there are edge cases where the "real" REPL is better. It doesn't re-parse after each line is read, it actually keeps the parser state and adds new tokens read from the tty. There is even a special grammar root symbol ('single') for this mode.
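[Editor's note: for illustration, that start symbol is also reachable from Python code via compile(); in 'single' mode an expression statement echoes its result, just like the REPL.]

    code = compile('6 * 7\n', '<stdin>', 'single')
    exec(code)  # prints 42, the way the REPL echoes an expression's value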
Serhiy's proposed change should not affect this mode of operation.
I sure hope not.
Though there is actually one case that IIRC doesn't work today: if sys.stdin is a stream that doesn't wrap a file descriptor. Would be nice to make that work. (Pretty esoteric use case though.)

On 17.11.15 11:59, M.-A. Lemburg wrote:
I don't think these situations are all that common, though, so reading in the full source code before compiling it sounds like a reasonable approach.
We use the same simplification in eGenix PyRun's emulation of the Python command line interface and it has so far not caused any problems.
The current implementation of the import system went the same way. As a result, importing a script as a module and running it from the command line can behave differently in corner cases.

On Tue, Nov 17, 2015 at 8:20 AM, Serhiy Storchaka storchaka@gmail.com wrote:
The current implementation of the import system went the same way. As a result, importing a script as a module and running it from the command line can behave differently in corner cases.
I'm confused. *Of course* these two behaviors differ, since Python uses a different __name__. Not sure how this relates to the REPL.

On 18 November 2015 at 02:50, Guido van Rossum guido@python.org wrote:
On Tue, Nov 17, 2015 at 8:20 AM, Serhiy Storchaka storchaka@gmail.com wrote:
The current implementation of the import system went the same way. As a result, importing a script as a module and running it from the command line can behave differently in corner cases.
I'm confused. *Of course* these two behaviors differ, since Python uses a different __name__. Not sure how this relates to the REPL.
I think Serhiy was referring to the fact that importlib already reads the entire file before compiling it - since the import abstraction doesn't require modules to be files on disk, it uses the get_code() API on the module loader internally, which is typically implemented by calling get_source() and then passing the result to compile().
That behaviour is then inherited at the command line by both the -m switch and the support for executing directories and zip archives. When we consider that the "-c" switch also executes an in-memory string, direct script execution is currently the odd one out in *not* reading the entire source file into memory first, so Serhiy's proposed simplification of the implementation makes sense to me.
Regards, Nick.
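[Editor's note: a quick way to see that import-side path from Python - a sketch using the stdlib loader API, not importlib's exact internals.]

    import importlib.util

    spec = importlib.util.find_spec('json')
    source = spec.loader.get_source('json')  # the whole file as one string
    code = spec.loader.get_code('json')      # compiled from that string
    print(len(source), code.co_filename)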

Aha, so the only code path that's being replaced is the code that reads the script file when invoking "python FILE" or "python <FILE" or "cat FILE | python" (but not "python" reading from a tty device)? That makes the whole endeavor even more attractive!
On Tue, Nov 17, 2015 at 6:31 PM, Nick Coghlan ncoghlan@gmail.com wrote:
[Nick's message quoted in full - snipped]

On 11/18/2015 03:31 AM, Nick Coghlan wrote:
That behaviour is then inherited at the command line by both the -m switch and the support for executing directories and zip archives. When we consider that the "-c" switch also executes an in-memory string, direct script execution is currently the odd one out in *not* reading the entire source file into memory first, so Serhiy's proposed simplification of the implementation makes sense to me.
Reading the whole script into memory will incur an overhead when executing scripts that contain (potentially large) data embedded after the end of the script source.
The technique of reading data from sys.argv[0] is probably obsolete now that Python supports executing zipped archives, but it is popular in shell scripting and might still be used for self-extracting scripts that must support older Python versions. This doesn't affect imports and -c, which are not expected to contain non-Python data.
Hrvoje
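[Editor's note: a hedged sketch of the technique Hrvoje describes; the '# ===DATA===' marker is hypothetical, and in Python the appended payload would still have to parse, e.g. as comment lines.]

    import sys

    # The script re-opens its own file to reach data appended after the code.
    with open(sys.argv[0], 'rb') as f:
        text = f.read()
    # Everything after the (hypothetical) marker line is the payload.
    payload = text.split(b'# ===DATA===\n', 1)[1]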

On Wed, Nov 18, 2015 at 4:15 AM, Hrvoje Niksic hrvoje.niksic@avl.com wrote:
Reading the whole script into memory will incur an overhead when executing scripts that contain (potentially large) data embedded after the end of the script source.
The technique of reading data from sys.argv[0] is probably obsolete now that Python supports executing zipped archives, but it is popular in shell scripting and might still be used for self-extracting scripts that must support older Python versions. This doesn't affect imports and -c, which are not expected to contain non-Python data.
That trick doesn't work unless the data looks like Python comments or data (e.g. a docstring). Python has always insisted on being able to parse until EOF. The only extreme case would be a small script followed by e.g. 4 GB of comments (where the old parser would indeed be more efficient). But unless you can point me to an occurrence of this in the wild I'm going to speculate that you just made this up based on the shell analogy (which isn't perfect).

On 11/18/2015 04:48 PM, Guido van Rossum wrote:
That trick doesn't work unless the data looks like Python comments or data (e.g. a docstring). Python has always insisted on being able to parse until EOF. The only extreme case would be a small script followed by e.g. 4 GB of comments (where the old parser would indeed be more efficient). But unless you can point me to an occurrence of this in the wild I'm going to speculate that you just made this up based on the shell analogy (which isn't perfect).
If this never really worked in Python, feel free to drop the issue. I may be misremembering the language in which the scripts I saw using this technique years ago were written - most likely sh or Perl.
Hrvoje

On 18 November 2015 at 15:57, Hrvoje Niksic hrvoje.niksic@avl.com wrote:
If this never really worked in Python, feel free to drop the issue. I may be misremembering the language in which the scripts I saw using this technique years ago were written - most likely sh or Perl.
It was Perl. In the past I've tried to emulate this in Python and it's not been very successful, so I'd say it's not something to worry about here.
Paul

Well, not quite the same thing, but https://github.com/kirbyfan64/pfbuild/blob/master/pfbuild embeds the compressed version of 16k LOC. Would it be affected negatively in any way by this?
Since all the data is on one line, I'd think the old (current) parser would end up reading in the whole line anyway.
On November 18, 2015 9:48:41 AM CST, Guido van Rossum guido@python.org wrote:
[Guido's message quoted in full - snipped]

On 19 November 2015 at 02:50, Ryan Gonzalez rymg19@gmail.com wrote:
Well, not quite the same thing, but https://github.com/kirbyfan64/pfbuild/blob/master/pfbuild embeds the compressed version of 16k LOC. Would it be affected negatively in any way by this?
Since all the data is on one line, I'd think the old (current) parser would end up reading in the whole line anyway.
Right. The other main "embedded binary blob" case I'm familiar with is get-pip.py, which embeds a base85 encoded copy of pip as a zip archive in a DATA variable, and there aren't any appending tricks there either - it's a normal variable, and the "if __name__ == '__main__':" block is the final two lines of the file, so Python already has to read the whole thing before it starts the process of unpacking it and executing it.
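[Editor's note: a self-contained sketch of that embedding style, round-tripping a tiny in-memory zip; get-pip.py itself just pastes the encoded bytes into its DATA variable.]

    import base64, io, zipfile

    # Build a small zip in memory and encode it as ASCII text.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('pkg/__init__.py', "print('hello')\n")
    DATA = base64.b85encode(buf.getvalue())  # what gets pasted into the script

    # At run time the script decodes DATA back into a usable archive.
    archive = zipfile.ZipFile(io.BytesIO(base64.b85decode(DATA)))
    print(archive.namelist())  # ['pkg/__init__.py']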
Two things worth keeping in mind here: if a script embeds enough data that reading the whole thing into memory poses a problem, then:
* compiling and executing it is likely to pose an even bigger problem than reading it in
* downloading it in the first place is also likely to be annoying
We couldn't make a change like this in a maintenance release, but for a new feature release, the consistency gain is definitely worth it.
Cheers, Nick.

On 17.11.15 18:50, Guido van Rossum wrote:
On Tue, Nov 17, 2015 at 8:20 AM, Serhiy Storchaka storchaka@gmail.com wrote:
The current implementation of the import system went the same way. As a result, importing a script as a module and running it from the command line can behave differently in corner cases.
I'm confused. *Of course* these two behaviors differ, since Python uses a different __name__. Not sure how this relates to the REPL.
Sorry for the confusion. I meant at the parser level. The file parser has a few bugs that can cause the source to be interpreted differently by the file and string parsers. For example, the attached script produces different output: "ä" if executed as a script, and "À" if imported as a module.
And there is a question about the null byte. Currently compile(), exec() and eval() raise an exception if the script contains a null byte. Formerly they accepted it, but the null byte ended the script. The behavior of the file parser is weirder still. A null byte makes the parser ignore the end of the line, including the newline byte [1]. E.g. "#\0\nprint('a')" is interpreted as "#print('a')". This differs from PyPy (and maybe other implementations), which interprets the null byte just as an ordinary character.
The question is whether we should support the null byte in Python sources at all.
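[Editor's note: the current erroring behaviour is easy to demonstrate.]

    # compile() and friends now reject embedded null bytes outright.
    try:
        compile("#\0\nprint('a')", '<test>', 'exec')
    except ValueError as exc:
        print(exc)  # source code string cannot contain null bytes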

On Thu, Nov 19, 2015 at 10:51 PM, Serhiy Storchaka storchaka@gmail.com wrote:
Interestingly, the file linked in the last comment on that issue [1] ties in with another part of this thread, regarding binary blobs in Python scripts. It uses open(sys.argv[0],'rb') to find itself, and has the binary data embedded as a comment. (Though the particular technique strikes me as fragile; to prevent newlines from messing things up, \n becomes #% and \r becomes #$, but I don't see any protection against those sequences occurring naturally in the blob. Given that it's a bz2 archive, I would expect any two-byte sequence to be capable of occurring.)
To be quite honest, I wouldn't mind if Python objected to this kind of code. If I were writing it myself, I'd use a triple-quoted string containing some kind of textualized version - either the repr of the string, or some kind of base 64 or base 85 encoding. Binary data embedded literally will prevent non-ASCII file encodings.
ChrisA
[1] http://ftp.waf.io/pub/release/waf-1.7.16
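[Editor's note: a hedged reconstruction of the escaping Chris describes, showing the ambiguity he points out - a literal b'#%' in the payload cannot be told apart from an escaped newline.]

    # waf-style unescaping as described: '#%' stands for '\n', '#$' for '\r'.
    def unescape(blob: bytes) -> bytes:
        return blob.replace(b'#%', b'\n').replace(b'#$', b'\r')

    escaped = b'da\nta'.replace(b'\n', b'#%').replace(b'\r', b'#$')
    assert unescape(escaped) == b'da\nta'  # round-trip works here...
    assert unescape(b'#%ok') == b'\nok'    # ...but a natural '#%' is corrupted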

Yeah, let's kill this undefined behavior.
On Thu, Nov 19, 2015 at 4:10 AM, Chris Angelico rosuav@gmail.com wrote:
[Chris's message quoted in full - snipped]

On Thu, Nov 19, 2015 at 3:51 AM, Serhiy Storchaka storchaka@gmail.com wrote:
Sorry for the confusion. I meant at the parser level. The file parser has a few bugs that can cause the source to be interpreted differently by the file and string parsers. For example, the attached script produces different output: "ä" if executed as a script, and "À" if imported as a module.
I see. Well, I trust you in this area (it's been too long since I wrote the original code for all that :-).
And there is a question about the null byte. Currently compile(), exec() and eval() raise an exception if the script contains a null byte. Formerly they accepted it, but the null byte ended the script.
I like erroring out better.
The behavior of the file parser is weirder still. A null byte makes the parser ignore the end of the line, including the newline byte [1]. E.g. "#\0\nprint('a')" is interpreted as "#print('a')". This differs from PyPy (and maybe other implementations), which interprets the null byte just as an ordinary character.
Yeah, this is just poorly written old code that didn't expect anyone to care about null bytes. IMO here too the null byte should be an error.
The question is whether we should support the null byte in Python sources at all.
Good luck, and thanks for working on this!
participants (10)
- Chris Angelico
- Guido van Rossum
- Hrvoje Niksic
- M.-A. Lemburg
- MRAB
- Nick Coghlan
- Paul Moore
- Random832
- Ryan Gonzalez
- Serhiy Storchaka