Reading Python source file

I'm working on rewriting the Python tokenizer (in particular the part that reads and decodes a Python source file). The code is complicated. Currently there are the following cases:
- Reading from a string in memory.
- Interactive reading from a file.
- Reading from a file:
  - Raw reading ignoring the encoding, in the parser generator.
  - Raw reading of a UTF-8 encoded file.
  - Reading and recoding to UTF-8.
The file is read line by line. This makes it hard to check the correctness of the first line when the encoding is specified on the second line, and it causes very hard problems with null bytes and with desynchronization between buffered C and Python files. All these problems could easily be solved by reading the whole Python source file into memory and then parsing it as a string. This would allow us to drop a large, complex, and buggy part of the code.
Are there disadvantages to this solution? As for memory consumption, the source text itself will consume only a small part of the memory consumed by the AST and other structures. As for performance, reading and decoding the whole file can be faster than doing it line by line.
[1] http://bugs.python.org/issue25643
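[Editor's note: a minimal pure-Python sketch of the proposed approach, using the stdlib tokenize module for illustration - this is not the actual C tokenizer code.]

    import io
    import tokenize

    def compile_file(path):
        # Read the whole file at once instead of line by line.
        with open(path, 'rb') as f:
            raw = f.read()
        # Find the coding cookie or BOM in the first couple of lines.
        encoding, _ = tokenize.detect_encoding(io.BytesIO(raw).readline)
        # Decode once, up front, then parse everything as a single string.
        source = raw.decode(encoding)
        return compile(source, path, 'exec')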

If you free the memory used for the source buffer before starting code generation you should be good.
On Mon, Nov 16, 2015 at 5:53 PM, Serhiy Storchaka storchaka@gmail.com wrote:
[Serhiy's original message quoted in full - snipped]

On 17.11.15 05:00, Guido van Rossum wrote:
If you free the memory used for the source buffer before starting code generation you should be good.
Thank you. The buffer is freed just after the AST is generated.

OK, but what are you going to do about the interactive REPL?
On Tue, Nov 17, 2015 at 7:40 AM, Serhiy Storchaka storchaka@gmail.com wrote:
On 17.11.15 05:00, Guido van Rossum wrote:
If you free the memory used for the source buffer before starting code generation you should be good.
Thank you. The buffer is freed just after the AST is generated.

Oh, cool! Sorry for the disturbance.
On Tue, Nov 17, 2015 at 8:27 AM, Serhiy Storchaka storchaka@gmail.com wrote:
On 17.11.15 18:06, Guido van Rossum wrote:
OK, but what are you going to do about the interactive REPL?
Nothing (except some simplification). This is a separate branch of the code.

On 2015-11-17 01:53, Serhiy Storchaka wrote:
[Serhiy's original message - snipped]
As I understand it, *nix expects the shebang to be b'#!', which means that the first line should be ASCII-compatible (it's possible that the UTF-8 BOM might be present). This kind of suggests that encodings like UTF-16 would cause a problem on such systems.
The encoding line also needs to be ASCII-compatible.
I believe that the recent thread "Support of UTF-16 and UTF-32 source encodings" also concluded that UTF-16 and UTF-32 shouldn't be supported.
This means that you could treat the first 2 lines as though they were some kind of extended ASCII (Latin-1?), the line ending being '\n' or '\r' or '\r\n'.
Once you'd identify the encoding, you could decode everything (including the shebang line) using that encoding.
(What should happen if the encoding line then decoded differently, i.e. encoding_line.decode(encoding) != encoding_line.decode('latin-1')?)
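[Editor's note: a rough sketch of that two-line scan; the cookie regex here is modelled on PEP 263 and is an assumption, not the tokenizer's actual pattern.]

    import re

    # PEP 263-style cookie, e.g. "# -*- coding: latin-1 -*-".
    COOKIE_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

    def scan_cookie(raw: bytes) -> str:
        # Only the first two lines may carry the cookie, and both must be
        # ASCII-compatible, so decoding them as Latin-1 is safe.
        for line in raw.split(b'\n')[:2]:
            m = COOKIE_RE.match(line.decode('latin-1'))
            if m:
                return m.group(1)
        return 'utf-8'  # the default source encoding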

MRAB python@mrabarnett.plus.com writes:
As I understand it, *nix expects the shebang to be b'#!', which means that the first line should be ASCII-compatible (it's possible that the UTF-8 BOM might be present).
The UTF-8 BOM interferes with it on Mac OSX and Linux, at least.

On 17.11.15 05:05, MRAB wrote:
As I understand it, *nix expects the shebang to be b'#!', which means that the first line should be ASCII-compatible (it's possible that the UTF-8 BOM might be present). This kind of suggests that encodings like UTF-16 would cause a problem on such systems.
The encoding line also needs to be ASCII-compatible.
I believe that the recent thread "Support of UTF-16 and UTF-32 source encodings" also concluded that UTF-16 and UTF-32 shouldn't be supported.
This means that you could treat the first 2 lines as though they were some kind of extended ASCII (Latin-1?), the line ending being '\n' or '\r' or '\r\n'.
Once you'd identify the encoding, you could decode everything (including the shebang line) using that encoding.
Yes, that is what I was going to implement (and I'm already halfway there). My question is whether it is worth complicating the code further to preserve line-by-line reading. In any case, after reading a first line that contains neither a coding cookie nor non-comment tokens, we have to wait for the second line.
(What should happen if the encoding line then decoded differently, i.e. encoding_line.decode(encoding) != encoding_line.decode('latin-1')?)
The parser should get the line decoded with the specified encoding.

On 17.11.2015 02:53, Serhiy Storchaka wrote:
[Serhiy's original message - snipped]
A problem with this approach is that you can no longer fail early and detect indentation errors et al. while parsing the data (which may well come from a pipe).
Another related problem is that you have to wait for the full input data before you can start compiling the code.
I don't think these situations are all that common, though, so reading in the full source code before compiling it sounds like a reasonable approach.
We use the same simplification in eGenix PyRun's emulation of the Python command line interface and it has so far not caused any problems.

On Tue, Nov 17, 2015 at 1:59 AM, M.-A. Lemburg mal@egenix.com wrote:
On 17.11.2015 02:53, Serhiy Storchaka wrote:
[Serhiy's original message - snipped]
A problem with this approach is that you can no longer fail early and detect indentation errors et al. while parsing the data (which may well come from a pipe).
Oh, this use case I had forgotten about. I don't know how common or important it is though.
But more important is the interactive REPL, which parses your input fully each time you hit ENTER.
Another related problem is that you have to wait for the full input data before you can start compiling the code.
That's always the case -- we don't start compiling before we have the full parse tree.
I don't think these situations are all that common, though, so reading in the full source code before compiling it sounds like a reasonable approach.
We use the same simplification in eGenix PyRun's emulation of the Python command line interface and it has so far not caused any problems.
Curious how you do it? I'd actually be quite disappointed if the amount of parsing done by the standard REPL went down.

On 17.11.15 17:22, Guido van Rossum wrote:
But more important is the interactive REPL, which parses your input fully each time you hit ENTER.
The interactive REPL runs different code. It is simpler than the code for reading from a file, because it doesn't have to care about a BOM or coding cookie.

On 17.11.2015 16:22, Guido van Rossum wrote:
On Tue, Nov 17, 2015 at 1:59 AM, M.-A. Lemburg mal@egenix.com wrote:
[moving from reading source line by line to reading all in one go]
We use the same simplification in eGenix PyRun's emulation of the Python command line interface and it has so far not caused any problems.
Curious how you do it? I'd actually be quite disappointed if the amount of parsing done by the standard REPL went down.
Oh, that's easy:
    elif sys.argv[0] == '-' and not (pyrun_as_string or pyrun_as_module):
        # Read the script from stdin
        pyrun_as_string = True
        pyrun_script = sys.stdin.read()
and then, later on:
    # Run the script
    try:
        pyrun_execute_script(pyrun_script, mode)
    except Exception as reason:
        if pyrun_interactive:
            import traceback
            traceback.print_exc()
            pyrun_prompt(banner='')
        else:
            raise
    else:
        # Enter interactive mode, in case wanted
        if pyrun_interactive:
            pyrun_prompt()
The REPL is not affected by this, since we use the standard code.interact() for the prompt. This reads the entry line by line, joins the lines and tries to compile the entry every time it receives a new line until it succeeds or fails.
Serhiy's proposed change should not affect this mode of operation.
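[Editor's note: for reference, the incremental behaviour code.interact() relies on is exposed as codeop.compile_command(), which returns None while an entry is still incomplete.]

    import codeop

    # Incomplete entry: more lines are needed, so None is returned.
    print(codeop.compile_command('if True:'))
    # Complete entry: a code object is returned.
    print(codeop.compile_command("if True:\n    x = 1\n"))
    # A genuine syntax error raises immediately.
    try:
        codeop.compile_command('if True)')
    except SyntaxError as exc:
        print('SyntaxError:', exc)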

On Thu, Nov 19, 2015 at 1:47 AM, M.-A. Lemburg mal@egenix.com wrote:
[MAL's explanation and PyRun code quoted above - snipped]
Yes, this makes sense.
The REPL is not affected by this, since we use the standard code.interact() for the prompt. This reads the entry line by line, joins the lines and tries to compile the entry every time it receives a new line until it succeeds or fails.
Heh, I wrote code.interact() as a poor-man's substitute for what the "real" REPL (implemented in C) does. :-) It usually ends up doing the same thing, but I'm sure there are edge cases where the "real" REPL is better. It doesn't re-parse after each line is read, it actually keeps the parser state and adds new tokens read from the tty. There is even a special grammar root symbol ('single') for this mode.
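[Editor's note: for illustration, that start symbol is also reachable from Python code via compile(); in 'single' mode an expression statement echoes its result, just like the REPL.]

    code = compile('6 * 7\n', '<stdin>', 'single')
    exec(code)  # prints 42, the way the REPL echoes an expression's value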
Serhiy's proposed change should not affect this mode of operation.
I sure hope not.
Though there is actually one case that IIRC doesn't work today: if sys.stdin is a stream that doesn't wrap a file descriptor. Would be nice to make that work. (Pretty esoteric use case though.)

On 17.11.15 11:59, M.-A. Lemburg wrote:
I don't think these situations are all that common, though, so reading in the full source code before compiling it sounds like a reasonable approach.
We use the same simplification in eGenix PyRun's emulation of the Python command line interface and it has so far not caused any problems.
The current implementation of the import system went the same way. As a result, importing a script as a module and running it from the command line can behave differently in corner cases.

On Tue, Nov 17, 2015 at 8:20 AM, Serhiy Storchaka storchaka@gmail.com wrote:
The current implementation of the import system went the same way. As a result, importing a script as a module and running it from the command line can behave differently in corner cases.
I'm confused. *Of course* these two behaviors differ, since Python uses a different __name__. Not sure how this relates to the REPL.

On 18 November 2015 at 02:50, Guido van Rossum guido@python.org wrote:
On Tue, Nov 17, 2015 at 8:20 AM, Serhiy Storchaka storchaka@gmail.com wrote:
The current implementation of the import system went the same way. As a result, importing a script as a module and running it from the command line can behave differently in corner cases.
I'm confused. *Of course* these two behaviors differ, since Python uses a different __name__. Not sure how this relates to the REPL.
I think Serhiy was referring to the fact that importlib already reads the entire file before compiling it - since the import abstraction doesn't require modules to be files on disk, it uses the get_code() API on the module loader internally, which is typically implemented by calling get_source() and then passing the result to compile().
That behaviour is then inherited at the command line by both the -m switch and the support for executing directories and zip archives. When we consider that the "-c" switch also executes an in-memory string, direct script execution is currently the odd one out in *not* reading the entire source file into memory first, so Serhiy's proposed simplification of the implementation makes sense to me.
Regards, Nick.
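[Editor's note: a quick way to see that import-side path from Python - a sketch using the stdlib loader API, not importlib's exact internals.]

    import importlib.util

    spec = importlib.util.find_spec('json')
    source = spec.loader.get_source('json')  # the whole file as one string
    code = spec.loader.get_code('json')      # compiled from that string
    print(len(source), code.co_filename)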

Aha, so the only code path that's being replaced is the code that reads the script file when invoking "python FILE" or "python <FILE" or "cat FILE | python" (but not "python" reading from a tty device)? That makes the whole endeavor even more attractive!
On Tue, Nov 17, 2015 at 6:31 PM, Nick Coghlan ncoghlan@gmail.com wrote:
[Nick's message quoted in full - snipped]

On 11/18/2015 03:31 AM, Nick Coghlan wrote:
That behaviour is then inherited at the command line by both the -m switch and the support for executing directories and zip archives. When we consider that the "-c" switch also executes an in-memory string, direct script execution is currently the odd one out in *not* reading the entire source file into memory first, so Serhiy's proposed simplification of the implementation makes sense to me.
Reading the whole script into memory will incur an overhead when executing scripts that contain (potentially large) data embedded after the end of the script source.
The technique of reading data from sys.argv[0] is probably obsolete now that Python supports executing zipped archives, but it is popular in shell scripting and might still be used for self-extracting scripts that must support older Python versions. This doesn't affect imports and -c, which are not expected to contain non-Python data.
Hrvoje
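[Editor's note: a hedged sketch of the technique Hrvoje describes; the '# ===DATA===' marker is hypothetical, and in Python the appended payload would still have to parse, e.g. as comment lines.]

    import sys

    # The script re-opens its own file to reach data appended after the code.
    with open(sys.argv[0], 'rb') as f:
        text = f.read()
    # Everything after the (hypothetical) marker line is the payload.
    payload = text.split(b'# ===DATA===\n', 1)[1]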

On Wed, Nov 18, 2015 at 4:15 AM, Hrvoje Niksic hrvoje.niksic@avl.com wrote:
Reading the whole script into memory will incur an overhead when executing scripts that contain (potentially large) data embedded after the end of the script source.
The technique of reading data from sys.argv[0] is probably obsolete now that Python supports executing zipped archives, but it is popular in shell scripting and might still be used for self-extracting scripts that must support older Python versions. This doesn't affect imports and -c, which are not expected to contain non-Python data.
That trick doesn't work unless the data looks like Python comments or data (e.g. a docstring). Python has always insisted on being able to parse until EOF. The only extreme case would be a small script followed by e.g. 4 GB of comments (where the old parser would indeed be more efficient). But unless you can point me to an occurrence of this in the wild I'm going to speculate that you just made this up based on the shell analogy (which isn't perfect).

On 11/18/2015 04:48 PM, Guido van Rossum wrote:
That trick doesn't work unless the data looks like Python comments or data (e.g. a docstring). Python has always insisted on being able to parse until EOF. The only extreme case would be a small script followed by e.g. 4 GB of comments (where the old parser would indeed be more efficient). But unless you can point me to an occurrence of this in the wild I'm going to speculate that you just made this up based on the shell analogy (which isn't perfect).
If this never really worked in Python, feel free to drop the issue. I may be misremembering the language in which the scripts I saw using this technique years ago were written - most likely sh or Perl.
Hrvoje

On 18 November 2015 at 15:57, Hrvoje Niksic hrvoje.niksic@avl.com wrote:
If this never really worked in Python, feel free to drop the issue. I may be misremembering the language in which the scripts I saw using this technique years ago were written - most likely sh or Perl.
It was Perl. In the past I've tried to emulate this in Python and it's not been very successful, so I'd say it's not something to worry about here.
Paul

Well, not quite the same thing, but https://github.com/kirbyfan64/pfbuild/blob/master/pfbuild embeds the compressed version of 16k LOC. Would it be affected negatively in any way by this?
Since all the data is on one line, I'd think the old (current) parser would end up reading in the whole line anyway.
On November 18, 2015 9:48:41 AM CST, Guido van Rossum guido@python.org wrote:
[Guido's message quoted in full - snipped]

On 19 November 2015 at 02:50, Ryan Gonzalez rymg19@gmail.com wrote:
Well, not quite the same thing, but https://github.com/kirbyfan64/pfbuild/blob/master/pfbuild embeds the compressed version of 16k LOC. Would it be affected negatively in any way by this?
Since all the data is on one line, I'd think the old (current) parser would end up reading in the whole line anyway.
Right. The other main "embedded binary blob" case I'm familiar with is get-pip.py, which embeds a base85 encoded copy of pip as a zip archive in a DATA variable, and there aren't any appending tricks there either - it's a normal variable, and the "if __name__ == '__main__':" block is the final two lines of the file, so Python already has to read the whole thing before it starts the process of unpacking it and executing it.
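[Editor's note: a self-contained sketch of that embedding style, round-tripping a tiny in-memory zip; get-pip.py itself just pastes the encoded bytes into its DATA variable.]

    import base64, io, zipfile

    # Build a small zip in memory and encode it as ASCII text.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('pkg/__init__.py', "print('hello')\n")
    DATA = base64.b85encode(buf.getvalue())  # what gets pasted into the script

    # At run time the script decodes DATA back into a usable archive.
    archive = zipfile.ZipFile(io.BytesIO(base64.b85decode(DATA)))
    print(archive.namelist())  # ['pkg/__init__.py']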
Two things worth keeping in mind here: if a script embeds enough data that reading the whole thing into memory poses a problem, then:
* compiling and executing it is likely to pose an even bigger problem than reading it in
* downloading it in the first place is also likely to be annoying
We couldn't make a change like this in a maintenance release, but for a new feature release, the consistency gain is definitely worth it.
Cheers, Nick.

On 17.11.15 18:50, Guido van Rossum wrote:
On Tue, Nov 17, 2015 at 8:20 AM, Serhiy Storchaka storchaka@gmail.com wrote:
The current implementation of the import system went the same way. As a result, importing a script as a module and running it from the command line can behave differently in corner cases.
I'm confused. *Of course* these two behaviors differ, since Python uses a different __name__. Not sure how this relates to the REPL.
Sorry for the confusion. I meant at the parser level. The file parser has a few bugs that can cause the source to be interpreted differently by the file and string parsers. For example, the attached script produces different output: "ä" if executed as a script, and "À" if imported as a module.
And there is a question about the null byte. Currently compile(), exec() and eval() raise an exception if the script contains a null byte. Formerly they accepted it, but the null byte ended the script. The behavior of the file parser is weirder still. A null byte makes the parser ignore the end of the line, including the newline byte [1]. E.g. "#\0\nprint('a')" is interpreted as "#print('a')". This differs from PyPy (and maybe other implementations), which interprets the null byte just as an ordinary character.
The question is whether we should support the null byte in Python sources at all.
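[Editor's note: the current erroring behaviour is easy to demonstrate.]

    # compile() and friends now reject embedded null bytes outright.
    try:
        compile("#\0\nprint('a')", '<test>', 'exec')
    except ValueError as exc:
        print(exc)  # source code string cannot contain null bytes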

On Thu, Nov 19, 2015 at 10:51 PM, Serhiy Storchaka storchaka@gmail.com wrote:
Interestingly, the file linked in the last comment on that issue [1] ties in with another part of this thread, regarding binary blobs in Python scripts. It uses open(sys.argv[0],'rb') to find itself, and has the binary data embedded as a comment. (Though the particular technique strikes me as fragile; to prevent newlines from messing things up, \n becomes #% and \r becomes #$, but I don't see any protection against those sequences occurring naturally in the blob. Given that it's a bz2 archive, I would expect any two-byte sequence to be capable of occurring.)
To be quite honest, I wouldn't mind if Python objected to this kind of code. If I were writing it myself, I'd use a triple-quoted string containing some kind of textualized version - either the repr of the string, or some kind of base 64 or base 85 encoding. Binary data embedded literally will prevent non-ASCII file encodings.
ChrisA
[1] http://ftp.waf.io/pub/release/waf-1.7.16
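[Editor's note: a hedged reconstruction of the escaping Chris describes, showing the ambiguity he points out - a literal b'#%' in the payload cannot be told apart from an escaped newline.]

    # waf-style unescaping as described: '#%' stands for '\n', '#$' for '\r'.
    def unescape(blob: bytes) -> bytes:
        return blob.replace(b'#%', b'\n').replace(b'#$', b'\r')

    escaped = b'da\nta'.replace(b'\n', b'#%').replace(b'\r', b'#$')
    assert unescape(escaped) == b'da\nta'  # round-trip works here...
    assert unescape(b'#%ok') == b'\nok'    # ...but a natural '#%' is corrupted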

Yeah, let's kill this undefined behavior.
On Thu, Nov 19, 2015 at 4:10 AM, Chris Angelico rosuav@gmail.com wrote:
[Chris's message quoted in full - snipped]

On Thu, Nov 19, 2015 at 3:51 AM, Serhiy Storchaka storchaka@gmail.com wrote:
Sorry for the confusion. I meant at the parser level. The file parser has a few bugs that can cause the source to be interpreted differently by the file and string parsers. For example, the attached script produces different output: "ä" if executed as a script, and "À" if imported as a module.
I see. Well, I trust you in this area (it's been too long since I wrote the original code for all that :-).
And there is a question about the null byte. Currently compile(), exec() and eval() raise an exception if the script contains a null byte. Formerly they accepted it, but the null byte ended the script.
I like erroring out better.
The behavior of the file parser is weirder still. A null byte makes the parser ignore the end of the line, including the newline byte [1]. E.g. "#\0\nprint('a')" is interpreted as "#print('a')". This differs from PyPy (and maybe other implementations), which interprets the null byte just as an ordinary character.
Yeah, this is just poorly written old code that didn't expect anyone to care about null bytes. IMO here too the null byte should be an error.
The question is whether we should support the null byte in Python sources at all.
Good luck, and thanks for working on this!
participants (10)
- Chris Angelico
- Guido van Rossum
- Hrvoje Niksic
- M.-A. Lemburg
- MRAB
- Nick Coghlan
- Paul Moore
- Random832
- Ryan Gonzalez
- Serhiy Storchaka