Future PEP: Include Fine Grained Error Locations in Tracebacks

Hi there,

We are preparing a PEP and we would like to start some early discussion about one of its main aspects. The work we are preparing is to allow the interpreter to produce more fine-grained error messages, pointing to the source associated with the instructions that are failing. For example:

Traceback (most recent call last):
  File "test.py", line 14, in <module>
    lel3(x)
    ^^^^^^^
  File "test.py", line 12, in lel3
    return lel2(x) / 23
           ^^^^^^^
  File "test.py", line 9, in lel2
    return 25 + lel(x) + lel(x)
                ^^^^^^
  File "test.py", line 6, in lel
    return 1 + foo(a,b,c=x['z']['x']['y']['z']['y'], d=e)
                         ^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not subscriptable

The cost of this is having the start column number and end column number information for every bytecode instruction, and this is what we want to discuss (there is also some stack cost to re-raise exceptions, but that's not a big problem in any case). Given that column numbers are not very big compared with line numbers, we plan to store these as unsigned chars or unsigned shorts. We ran some experiments over the standard library and we found that the overhead for all pyc files is:

* If we use shorts, the total overhead is ~3% (total size 28 MB and the extra size is 0.88 MB).
* If we use chars, the total overhead is ~1.5% (total size 28 MB and the extra size is 0.44 MB).

One of the disadvantages of using chars is that we can only report columns from 1 to 255, so if an error happens in a column bigger than that we would have to exclude it (and not show the highlighting) for that frame. Unsigned shorts allow values from 0 to 65535. Unfortunately these numbers are not easily compressible, as every instruction would have very different offsets.

There is also the possibility of gating this on some build flag or on -O so users can opt out, but given that these numbers can be quite useful to other tools such as coverage tools, tracers, profilers and the like, adding conditional logic in many places would complicate the implementation considerably and would potentially reduce the usability of those tools, so we prefer not to have the conditional logic. We believe this extra cost is very much worth the better error reporting, but we understand and respect other points of view.

Does anyone see a better way to encode this information **without significantly complicating the implementation**? What are people's thoughts on the feature?

Thanks in advance,

Regards from cloudy London,
Pablo Galindo Salgado
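To make the per-instruction storage cost concrete, here is a minimal sketch of what a start/end column table using unsigned chars could look like; the helper names and layout are purely illustrative and do not describe the format the PEP eventually settled on:

    import struct

    def build_column_table(positions):
        """positions: iterable of (start_col, end_col) pairs, one per instruction."""
        out = bytearray()
        for start, end in positions:
            # Clamp columns above 255 here for simplicity; the proposal above
            # would instead omit the highlighting for such a frame entirely.
            out += struct.pack("BB", min(start, 255), min(end, 255))
        return bytes(out)

    def lookup(table, instruction_index):
        """Return the (start_col, end_col) pair for one instruction."""
        return struct.unpack_from("BB", table, 2 * instruction_index)

With two bytes per instruction (four with unsigned shorts), the table grows linearly with the number of instructions, which is where the ~1.5% / ~3% pyc overhead figures above come from.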

On 5/7/21 2:45 PM, Pablo Galindo Salgado wrote:
Are lnotab entries required to be a fixed size? If not:

    if column < 255:
        lnotab.write_one_byte(column)
    else:
        lnotab.write_one_byte(255)
        lnotab.write_two_bytes(column)

I might even write four bytes instead of two in the latter case.

//arry/
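For illustration, the escape-byte idea above could be written out roughly like this in Python (encode_column/decode_column are invented names, not an existing CPython API):

    def encode_column(column):
        """Encode as one byte, or as an escape byte (255) followed by two bytes."""
        if column < 255:
            return bytes([column])
        return bytes([255]) + column.to_bytes(2, "little")

    def decode_column(data, offset):
        """Return (column, next_offset) for the entry starting at *offset*."""
        first = data[offset]
        if first < 255:
            return first, offset + 1
        return int.from_bytes(data[offset + 1:offset + 3], "little"), offset + 3

The cost is that entries are no longer a fixed size, so random access by instruction index requires walking the table (or keeping an auxiliary index).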

You can certainly get fancy and apply delta encoding + entropy compression, such as is done in Parquet, a high-performance data storage format: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-enco... (the linked paper from Lemire and Boytsov gives a lot of ideas). But it would be weird to apply that level of engineering when we never bothered compressing docstrings. Regards Antoine. On Fri, 7 May 2021 23:30:46 +0100 Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
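As a toy illustration of the delta-encoding part only (not the Parquet format itself, and not anything proposed for pyc files):

    def delta_encode(columns):
        """Store the first value, then only differences between neighbours."""
        deltas, previous = [], 0
        for col in columns:
            deltas.append(col - previous)
            previous = col
        return deltas

    def delta_decode(deltas):
        columns, total = [], 0
        for d in deltas:
            total += d
            columns.append(total)
        return columns

    # Neighbouring instructions often have nearby columns, so the deltas cluster
    # around small values, which a subsequent entropy coder can exploit.
    assert delta_decode(delta_encode([4, 8, 8, 12, 40])) == [4, 8, 8, 12, 40]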

On Fri, 7 May 2021 22:45:38 +0100 Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
More generally, if some people in 2021 are still concerned with the size of pyc files (why not), how about introducing a new version of the pyc format with built-in LZ4 compression? LZ4 decompression is extremely fast on modern CPUs (several GB/s) and vendoring the C library should be simple. https://github.com/lz4/lz4 Regards Antoine.
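A rough sketch of what that round trip could look like, assuming the third-party lz4 PyPI bindings (pip install lz4); the helper names are illustrative and this is not an existing pyc format feature, nor does it show any pyc framing:

    import lz4.frame

    def write_compressed(payload):
        return lz4.frame.compress(payload)

    def read_compressed(blob):
        return lz4.frame.decompress(blob)

    table = bytes(range(200)) * 100        # stand-in for a marshalled code section
    assert read_compressed(write_compressed(table)) == table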

On 2021-05-07, Pablo Galindo Salgado wrote:
Technically the main concern may be the size of the unmarshalled pyc files in memory, more than the storage size on disk.
It would be cool if we could mmap the pyc files and have the VM run code without an unmarshal step. One idea is something similar to the Facebook "not another freeze" PR but with a twist. Their approach was to dump out code objects so they could be loaded as if they were statically defined structures. Instead, could we dump out the pyc data in a format similar to Cap'n Proto? That way no unmarshal is needed. The VM would have to be extensively changed to run code in that format. That's the hard part. The benefit would be faster startup times. The unmarshal step is costly. It would mostly solve the concern about these larger linenum/colnum tables. We would only load that data into memory if the table is accessed.
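For the mmap half of that idea, a bare-bones sketch of mapping a pyc file read-only; the file name is hypothetical and nothing here reflects an actual no-unmarshal format:

    import mmap

    def map_pyc(path):
        """Map a .pyc read-only; pages are only faulted in when touched."""
        with open(path, "rb") as f:
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    view = map_pyc("example.cpython-310.pyc")   # hypothetical file name
    header = view[:16]   # PEP 552 pyc header: magic, flags, source mtime/hash
    # A column table stored later in the file would only hit the disk once a
    # traceback (or another tool) actually sliced that region of the mapping.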

On Fri, May 7, 2021 at 8:14 PM Neil Schemenauer <nas-python@arctrix.com> wrote:
A simpler version would be to pack just the docstrings/lnotab/column numbers into a separate part of the .pyc, and store a reference to the file + offset to load them lazily on demand. No need for mmap. Could also store them in memory, but with some cheap compression applied, and decompress on access. None of these get accessed often. -n -- Nathaniel J. Smith -- https://vorpus.org
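A minimal sketch of that lazy-reference approach (the class and field names are invented for the example):

    class LazyColumnTable:
        """Hold only (path, offset, length); read the bytes on first access."""

        def __init__(self, path, offset, length):
            self._path, self._offset, self._length = path, offset, length
            self._data = None

        def get(self):
            if self._data is None:              # first access pays the I/O cost
                with open(self._path, "rb") as f:
                    f.seek(self._offset)
                    self._data = f.read(self._length)
            return self._data

As discussed further down the thread, re-opening the file later does assume the pyc on disk still matches what was originally imported.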

I really like this idea Nathaniel. We already have a section considering lazy-loading column information in the draft PEP but obviously it has a pretty high implementation complexity since it necessitates a change in the pyc format and the importlib machinery. Long term this might be the way to go for column information and line information to alleviate the memory burden.

On Sat, 8 May 2021 02:58:40 +0000 Neil Schemenauer <nas-python@arctrix.com> wrote:
What happens if another process mutates or truncates the file while the CPython VM is executing code from the mapped file? Crash?
Instead, could we dump out the pyc data in a format similar to Cap'n Proto? That way no unmarshal is needed.
How do you freeze PyObjects in Cap'n Proto so that no unmarshal is needed when loading them?
The benefit would be faster startup times. The unmarshal step is costly.
How costly? Do we have numbers?
Memory-mapped files are accessed with page granularity (4 kB on x86), so I'm not sure it's that simple. You would have to make sure to store those tables in separate sections distant from the actual code areas. Regards Antoine.

Antoine Pitrou wrote:
On Sat, 8 May 2021 02:58:40 +0000 Neil Schemenauer nas-python@arctrix.com wrote:
Why would this be any different than whatever happens now? Just because it is easier for another process to get (exclusive) access to the file if there is a longer delay between loading the first part of the file and going back for the docstrings and lnotab? -jJ

On 5/8/21 10:16 PM, Jim J. Jewett wrote:
I think the issue being pointed out is that currently, when Python opens the .pyc file for reading and keeps the file handle open, that will block any other process from opening the file for writing, and thus the contents can't change under it. Once it is all done, it can release the lock as it won't need to read it again. If it mapped the file into its address space, it would need a similar sort of lock, but would need to keep it for the FULL execution of the program, so that no other process could change the contents behind its back. I think normal mmapping doesn't do this, but if that sort of lock is available, then it probably should be used. -- Richard Damon

On Sun, May 9, 2021 at 9:13 AM Antoine Pitrou <antoine@python.org> wrote:
concurrent mutation isn't even what I was talking about. We don't protect against that today as that isn't a concern. But POSIX semantics on the bulk of systems where this would ever matter do software updates by moving new files into place, because that is an idempotent inode change. So the existing open file already in the process of being read is not changed. But as soon as you do a new open call on the pathname you get a different file than the last time that path was opened. This is not theoretical. I've seen production problems as a result (zipimport - https://bugs.python.org/issue19081) making the incorrect assumption that they can reopen a file that they've read once at a later point in time. So if we do open files later, we must code defensively and assume they might not contain what we thought. We already have this problem with source code lines displayed in tracebacks today, as those are read on demand. But as that is debugging information only, the wrong source lines being shown next to the filename + linenumber in a traceback is something people just learn to ignore in these situations. We have the data to prevent this, we just never have. https://bugs.python.org/issue44091 filed to track that. Given this context, M.-A. Lemburg's alternative idea could have some merit as it would synchronize our source skew behavior with our additional debugging information behavior. My initial reaction is that it's falling into the trap of bundling too much into one place though. quoting M.-A. Lemburg:
Realistically: This is going to take more disk space in the common case because, in addition to the py, pyc, pyc.opt-1, pyc.opt-2 that some distros apparently include all of today, there'd be a new pyc.debuginfo to go alongside it. The only benefit is that it isn't resident in ram. And someone *could* choose to filter these out of their distro or container or whatever-the-heck-their-package-format-is. But I really doubt that'll be the default. Not having debugging information when a problem you're trying to hunt down and reproduce only happens once in a blue moon is extraordinarily frustrating. Which is why people who value engineering time deploy with debugging info. There are environments where people intentionally do not deploy source code. But they do want to get debugging data from tracebacks that they can then correlate to their sources later for analysis (they're tracking exactly which versions of pycs from which versions of sources were deployed). It'd be a shame to exclude column information for this scenario. -gps

On Fri, May 7, 2021 at 2:50 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
An additional cost to this is things that parse text tracebacks not knowing how to handle it and things that log tracebacks generating additional output. We should provide a way for people to disable the feature on a process as part of this while they address tooling and logging issues. (via the usual set of command line flag + python env var + runtime API) The cost of this is having the start column number and end column number
Neither of those is large. While I'd lean towards uint8_t instead of uint16_t because not even humans can understand a 255 character line so why bother being pretty about such a thing... Just document the caveat and move on with the lower value. A future pyc format could change it if a compelling argument were ever found.
A compromise if you want to handle longer lines: A single uint16_t. Represent the start column in the 9 bits and width in the other 7 bits. (or any variations thereof) it's all a matter of what tradeoff you want to make for space reasons. encoding as start + width instead of start + end is likely better anyways if you care about compression as the width byte will usually be small and thus be friendlier to compression. I'd personally ignore compression entirely. Overall doing this is going to be a big win for developer productivity! -Greg
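A minimal sketch of that 9-bit/7-bit packing, purely for illustration (the field widths and helper names come from the suggestion above, not from any agreed format):

    def pack_location(start, width):
        """Pack a start column (0-511) and a width (0-127) into one 16-bit value."""
        assert 0 <= start < 512 and 0 <= width < 128
        return (start << 7) | width

    def unpack_location(value):
        return value >> 7, value & 0x7F

    assert unpack_location(pack_location(300, 42)) == (300, 42)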

Thanks a lot Gregory for the comments! An additional cost to this is things that parse text tracebacks not knowing
how to handle it and things that log tracebacks generating additional output.
We should provide a way for people to disable the feature on a process as
part of this while they address tooling and logging issues. (via the usual set of command line flag + python env var + runtime API)
Absolutely! We were thinking about that and that's easy enough as that is a single conditional on the display function + the extra init configuration. Neither of those is large. While I'd lean towards uint8_t instead of
I very much agree with you here, but it is worth noting that I have heard the counter-argument that the longer the line is, the more important it may be to distinguish what part of the line is wrong. A compromise if you want to handle longer lines: A single uint16_t.
I would personally prefer not to implement very tricky compression algorithms because tools may need to parse this and I don't want to complicate the logic a lot. Handling lnotab is already a bit painful and when bugs occur it makes debugging very tricky. Having the possibility to index something based on the index of the instruction is quite a good API in my opinion. Overall doing this is going to be a big win for developer productivity! Thanks! We think that this has a lot of potential indeed! :) Pablo

Thanks, Irit for your comment!
Is it really every instruction? Or only those that can raise exceptions?
Technically only the ones that can raise exceptions, but the majority can, and optimizing this to restrict it to the set that can raise exceptions has the danger that the mapping needs to be maintained for new instructions, and that if some instruction starts raising exceptions while it didn't before then it can introduce subtle bugs. On the other hand, I think the stronger argument for doing this on every instruction is that there are a lot of tools that can find this information quite useful, such as coverage tools, profilers, state inspection tools and more. For example, a coverage tool will be able to tell you what part of x = f(x) if g(x) else y(x) actually was executed, while currently it will highlight the full line. Although in this case these instructions can raise exceptions and would be covered, the distinction is different and both criteria could lead to a different subset. In short, that may be an optimization, but I think I would prefer to avoid that complexity, taking into account the other problems that can arise and the extra complication. On Fri, 7 May 2021 at 23:21, Irit Katriel <iritkatriel@googlemail.com> wrote:

One thought: could the stored column position not include the indentation? Would that help?
The compiler doesn't have easy access to the source, unfortunately, so we don't know how much of the line is indentation. This can make life a bit harder for other tools, although it can make it easier for reporting the exception, as the current traceback display removes indentation. On Fri, 7 May 2021 at 23:37, MRAB <python@mrabarnett.plus.com> wrote:

On Sat, 8 May 2021, 8:53 am Pablo Galindo Salgado, <pablogsal@gmail.com> wrote:
If the lnotab format (or a new data structure on the code object) could store a line indent offset for each line, each instruction within a line would only need to record the offset from the end of the indentation. If we assume "deeply indented code" is the most likely source of excessively long lines rather than "long expressions and other one line statements produced by code generators" it may be worth it, but I'm not sure that's actually true. If we instead assume long lines are likely to come from code generators, then we can impose the 255 column limit, and breaking lines at 255 code points to improve tracebacks would become a quality of implementation issue for code generators. The latter assumption seems more likely to be true to me, and if the deep indentation case does come up, the line offset idea could be pursued later. Cheers, Nick.

Some update on the numbers. We have made a draft implementation to corroborate the numbers with some more realistic tests and it seems that our original calculations were wrong. The actual increase in size is quite a bit bigger than previously advertised. Using a bytes object to encode the final table and marshalling that to disk (so using uint8_t as the underlying type):

BEFORE:

    ❯ ./python -m compileall -r 1000 Lib > /dev/null
    ❯ du -h Lib -c --max-depth=0
    70M     Lib
    70M     total

AFTER:

    ❯ ./python -m compileall -r 1000 Lib > /dev/null
    ❯ du -h Lib -c --max-depth=0
    76M     Lib
    76M     total

So that's an increase of 8.56 % over the original value. This is storing the start offset and end offset with no compression whatsoever. On Fri, 7 May 2021 at 22:45, Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

Although we were originally not sympathetic with it, we may need to offer an opt-out mechanism for those users that care about the impact of the overhead of the new data in pyc files and in in-memory code objects, as was suggested by some folks (Thomas, Yury, and others). For this, we could propose that the functionality will be deactivated, along with the extra information, when Python is executed in optimized mode (``python -O``) and therefore pyo files will not have the overhead associated with the extra required data. Notice that Python already strips docstrings in this mode, so it would be "aligned" with the current mechanism of optimized mode. Although this complicates the implementation, it certainly is still much easier than dealing with compression (and more useful for those that don't want the feature). Notice that we also expect pessimistic results from compression, as offsets would be quite random (although predominantly in the range 10 - 120). On Sat, 8 May 2021 at 01:56, Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

On Fri, May 7, 2021 at 7:31 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
Just to be clear, .pyo files have not existed for a while: https://www.python.org/dev/peps/pep-0488/.
This only kicks in at the -OO level.
I personally prefer the idea of dropping the data with -OO since if you're stripping out docstrings you're already hurting introspection capabilities in the name of memory. Or one could go as far as to introduce -Os to do -OO plus dropping this extra data. As for .pyc file size, I personally wouldn't worry about it. If someone is that space-constrained they either aren't using .pyc files or are only shipping a single set of .pyc files under -OO and skipping source code. And .pyc files are an implementation detail of CPython so there shouldn't be too much of a concern for other interpreters. -Brett

Hi Brett, Just to be clear, .pyo files have not existed for a while:
Whoops, my bad, I wanted to refer to the pyc files that are generated with -OO, which have the "opt-2" prefix. This only kicks in at the -OO level. I will correct the PEP so it reflects this more exactly. I personally prefer the idea of dropping the data with -OO since if you're
This is indeed the plan, sorry for the confusion. The opt-out mechanism is using -OO, precisely as we are already dropping other data. Thanks for the clarifications! On Sat, 8 May 2021 at 19:41, Brett Cannon <brett@python.org> wrote:

On Sat, May 8, 2021 at 11:58 AM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
We can't piggy back on -OO as the only way to disable this, it needs to have an option of its own. -OO is unusable as code that relies on "doc"strings as application data such as http://www.dabeaz.com/ply/ply.html exists. -gps

-OO is the only sensible way to disable the data. There are two things to disable:

* The data in pyc files
* Printing the exception highlighting

Printing the exception highlighting can be disabled via a combination of an environment variable / -X option, but collecting the data can only be disabled by -OO. The reason is that this will end up in pyc files, so when the data is not there a different kind of pyc file needs to be produced, and I really don't want to have another set of pyc file extensions just to deactivate this. Notice also that a configure-time variable won't work because it will cause crashes when reading pyc files produced by an interpreter compiled without the flag. On Sat, 8 May 2021 at 21:13, Gregory P. Smith <greg@krypto.org> wrote:

On Sat, May 8, 2021 at 1:32 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
nit: I wouldn't choose the word "sensible" given that -OO is already fundamentally unusable without knowing if any code in your entire transitive dependencies might depend on the presence of docstrings...
I don't think the optional existence of column number information needs a different kind of pyc file. Just a flag in a pyc file's header at most. It isn't a new type of file.

That could work, but in my personal opinion, I would prefer not to do that as it complicates things and I think is overkill.
Let me expand on this: I recognize the problem that -OO can be quite unusable if some of your dependencies depend on docstrings, and that it would be good to separate this from that option, but I am afraid of the following:

- New APIs in the marshal module and other places to pass down whether or not to read/write the extra information.
- Complication of the pyc format with more entries in the header.
- Complication of the implementation.

Given that the reasons to deactivate this option exist, but I expect them to be very rare, I would prefer to maximize simplicity and maintainability. On Sat, 8 May 2021 at 21:50, Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

Greg, what do you think if instead of not writing it to the pyc file with -OO or adding a header entry to decide to read/write, we place None in the field? That way we can leverage the option that we intend to add to deactivate displaying the traceback new information to reduce the data in the pyc files. The only problem is that there will be still a tiny bit of overhead: an extra object per code object (None), but that's much much better than something that scales with the number of instructions :) What's your opinion on this? On Sat, 8 May 2021 at 21:45, Gregory P. Smith <greg@krypto.org> wrote:

On Sat, May 8, 2021 at 5:08 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
What if I want to keep asserts and docstrings but don't want the extra data? Or actually, consider this. I *need* to keep asserts (because rightly or wrongly, I have a dependency, or my own code, that relies on them), but I *don't* want docstrings (because they're huge and I don't want the overhead in production), and I *don't* want the extra data in production either. Now what? I think what this illustrates is that the entire concept of optimizations in Python needs a complete rethink. It's already fundamentally broken for someone who wants to keep asserts but remove docstrings. Adding a third layer to this is a perfect opportunity to reconsider the whole paradigm. I'm getting off-topic here, and this should probably be a thread of its own, but perhaps what we should introduce is a compiler directive, similar to future statements but not that, that one can place at the top of a source file to tell the compiler "this file depends on asserts, don't optimize them out". Same for each thing that can be optimized that has a runtime behavior effect, including docstrings. This would be minimally disruptive since we can then stay at only two optimization levels and put column info at whichever level we feel makes sense, but (provided the compiler directives are used properly) information a particular file requires to function correctly will never be removed from that file even if the process-wide optimization level calls for it. I see no reason code with asserts in one file and optimized code without asserts in another file can't interact, and no reason code with docstrings and optimized code without docstrings can't interact. Soft keywords would make this compiler directive much easier, as it doesn't have to be shoehorned into the import syntax (to suggest a bikeshed color, perhaps "retain asserts, docstrings"?)

On Sat, May 8, 2021 at 2:40 PM Jonathan Goble <jcgoble3@gmail.com> wrote:
Reconsidering "the whole paradigm" is always possible, but is a much larger effort. It should not be something that blocks this enhancement from happening. We have discussed the -O mess before, on list and at summits and sprints. -OO and the __pycache__ and longer .pyc names and versioned names were among the results of that. But we opted not to try and make life even more complicated by expanding the test matrix of possible generated bytecode even larger. I'm getting off-topic here, and this should probably be a thread of its
This idea has merit. Worth keeping in mind for the future. But agreed, this goes beyond this thread's topic so I'll leave it at that.

On Sat, May 8, 2021 at 2:09 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
exactly my theme. our existing -O and -OO already don't serve all user needs. (I've witnessed people who need asserts but don't want docstrings wasting ram jump through hacky hoops to do that). Complicating these options further by combining additional actions with them doesn't help. The reason we have -O and -OO generate their own special opt-1 and opt-2 pyc files is because both of those change the generated bytecode and overall flow of the program by omitting instructions and data. code using those will run slightly faster as there are fewer instructions. The change we're talking about here doesn't do that. It just adds additional metadata to whatever instructions are generated. So it doesn't feel -O related. While some people aren't going to like the overhead, I'm happy not offering the choice. traceback new information to reduce the data in the pyc files. The only problem
I don't understand the pyc structure enough to comment on how that works, but that sounds fine as a way to store less data, if these are stored as a side table rather than intermingled with each instruction itself. *If anyone even cares about storing less data.*

I would not activate generation of that in py_compile and compileall based on the -X flag to disable display of tracebacks though. A flag changing a setting of the current runtime regarding traceback printing detail level should not change the metadata in pyc files it emits. I realize -O and -OO behave this way, but I don't view those as a great example. We're not writing new uniquely named pyc files; I suggest making this an explicit option for py_compile and compileall if we're going to support generation of pyc files without column data at all.

I'm unclear on what the specific goals are with all of these option possibilities. Who non-hypothetically cares about a 22% pyc file size increase? I don't think we should be concerned. I'm in favor of always writing them and the 20% size increase that results in. If pyc size is an issue, that should be its own separate enhancement PEP. When it comes to pyc files there is more data we may want to store in the future for performance reasons - I don't see them shrinking without an independent effort.

Caring about additional data retained in memory at runtime makes more sense to me, as ram cost is much greater than storage cost and is paid repeatedly per process. Storing an additional reference to None on code objects in place of a column information table is perfectly fine. That can be a -X style interpreter startup option. It isn't something that needs to be impacted by the pyc files. Pass that option to the interpreter, and it just discards column info tables on code objects after loading them or compiling them. If people want to optimize for a shared pyc situation with memory mapping techniques, that is also something that should be a separate enhancement PEP and not involved here. People writing code to use the column information should always check it for None first; that'd be something we document with the new feature. -gps
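As an illustration of that "check it for None first" convention (the attribute name co_columns is invented for the example; the thread had not settled on any API at this point):

    def column_range(code, instruction_index):
        """Return (start, end) columns for an instruction, or None if stripped."""
        table = getattr(code, "co_columns", None)   # hypothetical attribute
        if table is None:                           # data stripped or disabled
            return None
        return table[instruction_index]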

Thanks Greg for the great, detailed response. I think I now understand your proposal better, and I think it is a good idea that I would like to explore. I have some questions:

* One problem I see is that this will make the constructor of the code object depend on global options in the interpreter. Someone using the C-API and passing down that attribute will be surprised to find that it was modified by a global. I am not saying it is bad, but I can see some problems with that.
* The alternative is to modify all calls to the code object constructor. This is easy to do in the compiler because code objects are constructed very close to where the metadata is created, but it is going to be a pain in other places, because code objects are constructed in places where we would either need new APIs or have to hide global state as in the previous point.
* Another alternative is to walk the graph and strip the fields, but that's going to have a performance impact.

I think that if we decide to offer an opt-out, this is actually one of the best options, but I am still slightly concerned about the extra complexity, potential new APIs and maintainability. On Sat, 8 May 2021, 22:55 Gregory P. Smith, <greg@krypto.org> wrote:

On Sat, May 8, 2021 at 2:59 PM Gregory P. Smith <greg@krypto.org> wrote:
While I'm the opposite. 😄 Metadata that is not necessary for CPython to function and whose primary driver is better exception tracebacks totally falls into the same camp as "I don't need docstrings" to me.
Code to read a .pyc file and use it: https://github.com/python/cpython/blob/a0bd9e9c11f5f52c7ddd19144c8230da016b5... (I'd explain more but it is the weekend and I technically shouldn't be reading this thread 😉). -Brett

On 08.05.2021 23:55, Gregory P. Smith wrote:
I do care about both the increase in PYC size as well as the increase in memory usage. When using Python in containers both are relevant, and so I'd like an option to switch this whole mechanism off that's independent from optimization settings. This idea is more about debugging during development and doesn't really have much to do with the optimization used for production use of Python, so a separate flag or perhaps use of -v would be the more intuitive approach.

Alternative idea: Create a new file format which supports enhanced debugging. This would include the source code in an indexed format, the AST and mappings between byte code, AST node, lines and columns. Python would then only use and load this file when it needs to print a traceback - much like it does today with the source code. The advantage is that you can add even more useful information for debugging while not making the default code distribution format take more memory (both disk and RAM).

BTW: For better readability, I'd also not output the ^^^^ lines for every stack level in the traceback, but just the last one, since it's usually clear where the call to the next stack level happens in the upper ones. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, May 09 2021)

On 09.05.2021 14:22, Larry Hastings wrote:
I'm mostly thinking of tracebacks which go >10 levels deep, which is rather common in larger applications. For those tracebacks, the top entries are mostly noise you never look at when debugging. The proposal now adds another 10 extra lines to jump over :-) PS: It looks like the discussion has wandered off to Discourse now. Should we continue there? -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, May 10 2021)

On Mon, 10 May 2021 at 08:34, M.-A. Lemburg <mal@egenix.com> wrote:
[...]
PS: It looks like the discussion has wandered off to Discourse
Pablo seems to want to redirect the discussion there, yes, in particular to: https://discuss.python.org/t/pep-657-include-fine-grained-error-locations-in... On Sun, 9 May 2021 at 16:25, Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

On Mon, May 10, 2021 at 05:34:12AM -0400, Terry Reedy wrote:
That's great for people using IDLE, but for those using the vanilla Python interpreter, M-A.L makes a good point about increasing the vertical size of the traceback, which will almost always be ignored. It's especially the case for beginners. It's hard enough to get newbies to read *any* of the traceback. Anything which increases the visual noise of that is going to make it harder. -- Steve

That is going to be very hard to read, unfortunately, especially when the line is not simple. Highlighting the range is quite a fundamental part of the proposal and is driven by the very positive reception of highlighting ranges for syntax errors; many users have reached out to say that they find it extremely useful as a visual feature. Also, people with vision problems have mentioned how important having a highlighted section under the code is for quickly understanding the problem. On Mon, 10 May 2021 at 11:46, Irit Katriel via Python-Dev <python-dev@python.org> wrote:

On 5/10/2021 6:07 AM, Steven D'Aprano wrote:
The vanilla interpreter could be updated to recognize when it is running on a simulated 35-year-old terminal that implements ansi-vt100 color codes rather than a simulated 40+-year-old black-and-white teletype-like terminal. Making the enhancement available to nonstandard python-coded interfaces is a separate issue.
-- Terry Jan Reedy

On Mon, May 10, 2021 at 09:44:05PM -0400, Terry Reedy wrote:
This is what is called "scope creep", although in this case perhaps "scope gallop" is more appropriate *wink* Supporting coloured output out of the box would be nice but if we want to do it properly, we would have to support at least ANSI-compatible terminals and Windows. And once we support it in tracebacks, you know people will say "if Python can print coloured text in a traceback, why can't I print coloured text in my own output?" and so that's going to rapidly end up needing something like colorama. -- Steve

On 5/11/21 1:57 AM, Baptiste Carvello wrote:
The first ANSI standard supported underlined text, didn't it? The VT100 did. That would make it part of the 40+ year old subset from the late 70's. While color might stand out more, underline suits the problem well, also without increasing the line count. There are a number of terminal emulators that support rich text copies, but not all of them. This is added information however, so it not being copy-pastable everywhere shouldn't be a blocking requirement imho. -Mike

On Tue, May 11, 2021 at 3:33 PM Mike Miller <python-dev@mgmiller.net> wrote:
Fancier REPL frontends have supported things like highlighting and such in their tracebacks; I expect they'll adopt column information and render it as such. There's a difference between tracebacks dumped as plain text (utf-8) by traceback.print_exc() appearing on stderr or directed into log files and what can be displayed within a terminal. It is highly unusual to emit terminal control characters into log files. -G

On Tue, May 11, 2021 at 4:07 PM Gregory P. Smith <greg@krypto.org> wrote:
And yet it happens all the time. :-( Let's not risk that happening. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Fri, May 7, 2021 at 5:44 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
To know what compression methods might be effective, I’m wondering if it could be useful to see separate histograms of, say, the start column number and width over the code base. Or for people that really want to dig in, maybe access to the set of all pairs could help. (E.g. maybe a histogram of pairs could also reveal something.) —Chris
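For anyone who wants to gather that distribution, a rough sketch using the stdlib ast module, which already exposes col_offset/end_col_offset on nodes; treating every expression node as a proxy for an instruction is a simplification for illustration only:

    import ast
    from collections import Counter

    def column_histograms(source):
        """Histogram start columns and widths of expression nodes in one file."""
        starts, widths = Counter(), Counter()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.expr) and node.end_col_offset is not None:
                starts[node.col_offset] += 1
                widths[node.end_col_offset - node.col_offset] += 1
        return starts, widths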

On Fri, May 07, 2021 at 06:02:51PM -0700, Chris Jerdonek wrote:
I think this is over-analysing. Do we need to micro-optimize the compression algorithm? Let's make the choice simple: live with the size increase, or swap to LZ4 compression as Antoine suggested. Analysis paralysis is a real risk here. If there are implementations which cannot support either (MicroPython?) they should be free to continue doing things the old way. In other words, "fine grained error messages" should be a quality of implementation feature rather than a language guarantee. I understand that the plan is to make this feature optional in any case, to allow third-party tools to catch up. If people really want to do that histogram analysis so that they can optimize the choice of compression algorithm, of course they are free to do so. But the PEP authors should not feel that they are obliged to do so, and we should avoid the temptation to bikeshed over compressors. (For what it's worth, I like this proposed feature, I don't care about a 20-25% increase in pyc file size, but if this leads to adding LZ4 compression to the stdlib, I like it even more :-) -- Steve

On Fri, May 7, 2021 at 6:39 PM Steven D'Aprano <steve@pearwood.info> wrote:
I'm not sure why you're sounding so negative. Pablo asked for ideas in his first message to the list: On Fri, May 7, 2021 at 2:53 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
Does anyone see a better way to encode this information **without significantly complicating the implementation**?
Maybe a large gain can be made with a simple tweak to how the pair is encoded, but there's no way to know without seeing the distribution. Also, my reply wasn't about the pyc files on disk but about their representation in memory, which Pablo later said may be the main concern. So it's not compression algorithms like LZ4 so much as a method of encoding. --Chris

Hi Chris, On Fri, May 07, 2021 at 07:13:16PM -0700, Chris Jerdonek wrote:
I'm not sure why you're sounding so negative. Pablo asked for ideas in his first message to the list:
I know that Pablo asked for ideas, but that doesn't mean that we are obliged to agree with every idea. This is a discussion list which means we discuss ideas, both to agree and disagree. I don't think I'm being negative. I'm very positive about this proposal, and I don't want to see it get bogged down with bike-shedding about the precise compression/encoding algorithm used. If Pablo, or any other volunteer such as yourself, wants to go down that track to investigate the data distribution, I'm not going to tell them that they must not. Go for it! But I'd rather not make this a mandatory prerequisite for the PEP. [...]
Okay, thanks for the clarification. -- Steve

On 2021-05-08 01:43, Pablo Galindo Salgado wrote:
[snip] I'm wondering if it's possible to compromise with one position that's not as complete but still gives a good hint. For example:

  File "test.py", line 6, in lel
    return 1 + foo(a,b,c=x['z']['x']['y']['z']['y'], d=e)
                                         ^
TypeError: 'NoneType' object is not subscriptable

That at least tells you which subscript raised the exception. Another example:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    print(1 / x + 1 / y)
            ^
ZeroDivisionError: division by zero

as distinct from:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    print(1 / x + 1 / y)
                    ^
ZeroDivisionError: division by zero

I'm wondering if it's possible to compromise with one position that's not as complete but still gives a good hint:
Even if it is possible, it will be quite a lot less useful (a lot of users wanted us to highlight full ranges for syntax errors, and that change was very well received when we announced it in 3.10) and, most importantly, it will render the feature much less useful for other tools such as profilers, coverage tools, and the like. It will also make the feature less useful for people that want to display even more information, such as error reporting tools, IDEs...etc. On Sat, 8 May 2021 at 02:41, MRAB <python@mrabarnett.plus.com> wrote:

What are people's thoughts on the feature?
I'm +1, this level of detail in the bytecode is very useful. My main interest is actually from the AST though. :) In order to be in the bytecode, one assumes it must first be in the AST. That information is incredibly useful for refactoring tools like https://github.com/ssbr/refex (n.b. author=me) or https://github.com/gristlabs/asttokens (which refex builds on). Currently, asttokens actually attempts to re-discover that kind of information after the fact, which is error-prone and difficult. This could also be useful for finer-grained code coverage tracking and/or debugging. One can actually imagine highlighting the spans of code which were only partially executed: e.g. if only x() were ever executed in "x() and y()" . Ned Batchelder once did wild hacks in this space, and maybe this proposal could lead in the future to something non-hacky? https://nedbatchelder.com/blog/200804/wicked_hack_python_bytecode_tracing.ht... I say "in the future" because it doesn't just automatically work, since as I understand it, coverage currently doesn't track spans, but lines hit by the line-based debugger. Something else is needed to be able to track which spans were hit rather than which lines, and it may be similarly hacky if it's isolated to coveragepy. If, for example, enough were exposed to let a debugger skip to bytecode for the next different (sub) span, then this would be useful for both coverage and actual debugging as you step through an expression. This is probably way out of scope for your PEP, but even so, the feature may be laying some useful ground work here. -- Devin On Fri, May 7, 2021 at 2:52 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

On Sat, 8 May 2021 at 10:00, Devin Jeanpierre (<jeanpierreda@gmail.com>) wrote:
The AST already has column offsets ( https://docs.python.org/3.10/library/ast.html#ast.AST.col_offset).
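A quick example of reading those existing attributes (the exact numbers depend on the source line, so they are indicative only):

    import ast

    tree = ast.parse("x = f(a) if g(a) else h(a)")
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            # col_offset / end_col_offset are 0-based columns within the line.
            print(ast.unparse(node), node.col_offset, node.end_col_offset)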

On 5/7/21 2:45 PM, Pablo Galindo Salgado wrote:
Are lnotab entries required to be a fixed size? If not: if column < 255: lnotab.write_one_byte(column) else: lnotab.write_one_byte(255) lnotab.write_two_bytes(column) I might even write four bytes instead of two in the latter case, //arry/

You can certainly get fancy and apply delta encoding + entropy compression, such as done in Parquet, a high-performance data storage format: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-enco... (the linked paper from Lemire and Boytsov gives a lot of ideas) But it would be weird to apply such level of engineering when we never bothered compressing docstrings. Regards Antoine. On Fri, 7 May 2021 23:30:46 +0100 Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

On Fri, 7 May 2021 22:45:38 +0100 Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
More generally, if some people in 2021 are still concerned with the size of pyc files (why not), how about introducing a new version of the pyc format with built-in LZ4 compression? LZ4 decompression is extremely fast on modern CPUs (several GB/s) and vendoring the C library should be simple. https://github.com/lz4/lz4 Regards Antoine.

On 2021-05-07, Pablo Galindo Salgado wrote:
Technically the main concern may be the size of the unmarshalled pyc files in memory, more than the storage size of disk.
It would be cool if we could mmap the pyc files and have the VM run code without an unmarshal step. One idea is something similar to the Facebook "not another freeze" PR but with a twist. Their approach was to dump out code objects so they could be loaded as if they were statically defined structures. Instead, could we dump out the pyc data in a format similar to Cap'n Proto? That way no unmarshal is needed. The VM would have to be extensively changed to run code in that format. That's the hard part. The benefit would be faster startup times. The unmarshal step is costly. It would mostly solve the concern about these larger linenum/colnum tables. We would only load that data into memory if the table is accessed.

On Fri, May 7, 2021 at 8:14 PM Neil Schemenauer <nas-python@arctrix.com> wrote:
A simpler version would be to pack just the docstrings/lnotab/column numbers into a separate part of the .pyc, and store a reference to the file + offset to load them lazily on demand. No need for mmap. Could also store them in memory, but with some cheap compression applied, and decompress on access. None of these get accessed often. -n -- Nathaniel J. Smith -- https://vorpus.org

I really like this idea Nathaniel. We already have a section considering lazy-loading column information in the draft PEP but obviously it has a pretty high implementation complexity since it necessitates a change in the pyc format and the importlib machinery. Long term this might be the way to go for column information and line information to alleviate the memory burden.

On Sat, 8 May 2021 02:58:40 +0000 Neil Schemenauer <nas-python@arctrix.com> wrote:
What happens if another process mutates or truncates the file while the CPython VM is executing code from the mapped file? Crash?
Instead, could we dump out the pyc data in a format similar to Cap'n Proto? That way no unmarshal is needed.
How do you freeze PyObjects in Cap'n Proto so that no unmarshal is needed when loading them?
The benefit would be faster startup times. The unmarshal step is costly.
How costly? Do we have numbers?
Memory-mapped files are accessed with page granularity (4 kB on x86), so I'm not sure it's that simple. You would have to make sure to store those tables in separate sections distant from the actual code areas. Regards Antoine.

Antoine Pitrou wrote:
On Sat, 8 May 2021 02:58:40 +0000 Neil Schemenauer nas-python@arctrix.com wrote:
Why would this be any different than whatever happens now? Just because it is easier for another process to get (exclusive) access to the file if there is a longer delay between loading the first part of the file and going back for the docstrings and lnotab? -jJ

On 5/8/21 10:16 PM, Jim J. Jewett wrote:
I think the issue being pointed out is that currently, when Python opens the .pyc file for reading and keeps the file handle open, that will block any other process from opening the file for writing, and thus can't change the contents under it. Once it is all done, it can release the lock as it won't need to read it again. if it mapped the file into its address space, it would need a similar sort of lock, but need to keep if for the FULL execution of the program, so that no other process could change the contents behind its back. I think normal mmapping doesn't do this, but if that sort of lock is available, then it probably should be used. -- Richard Damon

On Sun, May 9, 2021 at 9:13 AM Antoine Pitrou <antoine@python.org> wrote:
concurrent mutation isn't even what I was talking about. We don't protect against that today as that isn't a concern. But POSIX semantics on the bulk of systems where this would ever matter do software updates by moving new files into place. Because that is an idempotent inode change. So the existing open file already in the process of being read is not changed. But as soon as you do a new open call on the pathname you get a different file than the last time that path was opened. This is not theoretical. I've seen production problems as a result (zipimport - https://bugs.python.org/issue19081) making the incorrect assumption that they can reopen a file that they've read once at a later point in time. So if we do open files later, we must code defensively and assume they might not contain what we thought. We already have this problem with source code lines displayed in tracebacks today as those are read on demand. But as that is debugging information only the wrong source lines being shown next to the filename + linenumber in a traceback is something people just learn to ignore in these situations. We have the data to prevent this, we just never have. https://bugs.python.org/issue44091 filed to track that. Given this context, M.-A. Lemburg's alternative idea could have some merit as it would synchronize our source skew behavior with our additional debugging information behavior. My initial reaction is that it's falling into the trap of bundling too into one place though. quoting M.-A. Lemburg:
Realistically: This is going to take more disk space in the common case because in addition to the py, pyc, pyc.opt-1, pyc.opt-2 that some distros apparently include all of today, there'd be a new pyc.debuginfo to go along side it. The only benefit is that it isn't resident in ram. And someone *could* choose to filter these out of their distro or container or whatever-the-heck-their-package-format-is. But I really doubt that'll be the default. Not having debugging information when a problem you're trying to hunt down and reproduce but only happens once in a blue moon is extraordinarily frustrating. Which is why people who value engineering time deploy with debugging info. There are environments where people intentionally do not deploy source code. But do want to get debugging data from tracebacks that they can then correlate to their sources later for analysis (they're tracking exactly which versions of pycs from which versions of sources were deployed). It'd be a shame to exclude column information for this scenario. -gps

On Fri, May 7, 2021 at 2:50 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
An additional cost to this is things that parse text tracebacks not knowing how to handle it and things that log tracebacks generating additional output. We should provide a way for people to disable the feature on a process as part of this while they address tooling and logging issues. (via the usual set of command line flag + python env var + runtime API) The cost of this is having the start column number and end column number
Neither of those is large. While I'd lean towards uint8_t instead of uint16_t because not even humans can understand a 255 character line so why bother being pretty about such a thing... Just document the caveat and move on with the lower value. A future pyc format could change it if a compelling argument were ever found.
A compromise if you want to handle longer lines: A single uint16_t. Represent the start column in the 9 bits and width in the other 7 bits. (or any variations thereof) it's all a matter of what tradeoff you want to make for space reasons. encoding as start + width instead of start + end is likely better anyways if you care about compression as the width byte will usually be small and thus be friendlier to compression. I'd personally ignore compression entirely. Overall doing this is going to be a big win for developer productivity! -Greg

Thanks a lot Gregory for the comments! An additional cost to this is things that parse text tracebacks not knowing
how to handle it and things that log tracebacks generating additional output.
We should provide a way for people to disable the feature on a process as
part of this while they address tooling and logging issues. (via the usual set of command line flag + python env var + runtime API)
Absolutely! We were thinking about that and that's easy enough as that is a single conditional on the display function + the extra init configuration. Neither of those is large. While I'd lean towards uint8_t instead of
I very much agree with you here but is worth noting that I have heard the counter-argument that the longer the line is, the more important may be to distinguish what part of the line is wrong. A compromise if you want to handle longer lines: A single uint16_t.
I would personally prefer not to implement very tricky compression algorithms because tools may need to parse this and I don't want to complicate the logic a lot. Handling lnotab is already a bit painful and when bugs ocur it makes debugging very tricky. Having the possibility to index something based on the index of the instruction is quite a good API in my opinion. Overall doing this is going to be a big win for developer productivity! Thanks! We think that this has a lot of potential indeed! :) Pablo

Thanks, Irit for your comment!
Is it really every instruction? Or only those that can raise exceptions?
Technically only the ones that can raise exceptions, but the majority can and optimizing this to only restrict to the set that can raise exceptions has the danger than the mapping needs to be maintained for new instructions and that if some instruction starts raising exceptions while it didn't before then it can introduce subtle bugs. On the other hand I think the stronger argument to do this on every instruction is that there is a lot of tools that can find this information quite useful such as coverage tools, profilers, state inspection tools and more. For example, a coverage tool will be able to tell you what part of x = f(x) if g(x) else y(x) actually was executed, while currently, it will highlight the full line. Although in this case these instructions can raise exceptions and would be covered, the distinction is different and both criteria could lead to a different subset. In short, that may be an optimization but I think I would prefer to avoid that complexity taking into account the other problems that can raise and the extra complication On Fri, 7 May 2021 at 23:21, Irit Katriel <iritkatriel@googlemail.com> wrote:

One thought: could the stored column position not include the indentation? Would that help?
The compiler doesn't have access easy access to the source unfortunately so we don't know how much is the indentation. This can make life a bit harder for other tools, although it can make it easier for reporting the exception as the current traceback display removes indentation. On Fri, 7 May 2021 at 23:37, MRAB <python@mrabarnett.plus.com> wrote:

On Sat, 8 May 2021, 8:53 am Pablo Galindo Salgado, <pablogsal@gmail.com> wrote:
If the lnotab format (or a new data structure on the code object) could store a line indent offset for each line, each instruction within a line would only need to record the offset from the end of the indentation. If we assume "deeply indented code" is the most likely source of excessively long lines rather than "long expressions and other one line statements produced by code generators" it may be worth it, but I'm not sure that's actually true. If we instead assume long lines are likely to come from code generators, then we can impose the 255 column limit, and breaking lines at 255 code points to improve tracebacks would become a quality of implementation issue for code generators. The latter assumption seems more likely to be true to me, and if the deep indentation case does come up, the line offset idea could be pursued later. Cheers, Nick.

Some update on the numbers. We have made some draft implementation to corroborate the numbers with some more realistic tests and seems that our original calculations were wrong. The actual increase in size is quite bigger than previously advertised: Using bytes object to encode the final object and marshalling that to disk (so using uint8_t) as the underlying type: BEFORE: ❯ ./python -m compileall -r 1000 Lib > /dev/null ❯ du -h Lib -c --max-depth=0 70M Lib 70M total AFTER: ❯ ./python -m compileall -r 1000 Lib > /dev/null ❯ du -h Lib -c --max-depth=0 76M Lib 76M total So that's an increase of 8.56 % over the original value. This is storing the start offset and end offset with no compression whatsoever. On Fri, 7 May 2021 at 22:45, Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

Although we were originally not sympathetic with it, we may need to offer an opt-out mechanism for those users that care about the impact of the overhead of the new data in pyc files and in in-memory code objectsas was suggested by some folks (Thomas, Yury, and others). For this, we could propose that the functionality will be deactivated along with the extra information when Python is executed in optimized mode (``python -O``) and therefore pyo files will not have the overhead associated with the extra required data. Notice that Python already strips docstrings in this mode so it would be "aligned" with the current mechanism of optimized mode. Although this complicates the implementation, it certainly is still much easier than dealing with compression (and more useful for those that don't want the feature). Notice that we also expect pessimistic results from compression as offsets would be quite random (although predominantly in the range 10 - 120). On Sat, 8 May 2021 at 01:56, Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

On Fri, May 7, 2021 at 7:31 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
Just to be clear, .pyo files have not existed for a while: https://www.python.org/dev/peps/pep-0488/.
This only kicks in at the -OO level.
I personally prefer the idea of dropping the data with -OO since if you're stripping out docstrings you're already hurting introspection capabilities in the name of memory. Or one could go as far as to introduce -Os to do -OO plus dropping this extra data. As for .pyc file size, I personally wouldn't worry about it. If someone is that space-constrained they either aren't using .pyc files or are only shipping a single set of .pyc files under -OO and skipping source code. And .pyc files are an implementation detail of CPython so there shouldn't be too much of a concern for other interpreters. -Brett

Hi Brett, Just to be clear, .pyo files have not existed for a while:
Whoops, my bad, I wanted to refer to the pyc files that are generated with -OO, which have the "opt-2" prefix. This only kicks in at the -OO level. I will correct the PEP so it reflects this more exactly. I personally prefer the idea of dropping the data with -OO since if you're
This is indeed the plan, sorry for the confusion. The opt-out mechanism is using -OO, precisely as we are already dropping other data. Thanks for the clarifications! On Sat, 8 May 2021 at 19:41, Brett Cannon <brett@python.org> wrote:

On Sat, May 8, 2021 at 11:58 AM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
We can't piggyback on -OO as the only way to disable this; it needs to have an option of its own. -OO is unusable because code that relies on "doc"strings as application data, such as http://www.dabeaz.com/ply/ply.html, exists. -gps

-OO is the only sensible way to disable the data. There are two things to disable:

* The data in pyc files
* Printing the exception highlighting

Printing the exception highlighting can be disabled via a combination of an environment variable and a -X option, but collecting the data can only be disabled by -OO. The reason is that this data will end up in pyc files, so when the data is not there a different kind of pyc file needs to be produced, and I really don't want to have another set of pyc file extensions just to deactivate this. Notice also that a configure-time variable won't work because it would cause crashes when reading pyc files produced by an interpreter compiled without the flag. On Sat, 8 May 2021 at 21:13, Gregory P. Smith <greg@krypto.org> wrote:

On Sat, May 8, 2021 at 1:32 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
nit: I wouldn't choose the word "sensible" given that -OO is already fundamentally unusable without knowing if any code in your entire transitive dependencies might depend on the presence of docstrings...
I don't think the optional existence of column number information needs a different kind of pyc file. Just a flag in a pyc file's header at most. It isn't a new type of file.
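A rough sketch of what such a flag could look like, assuming the existing PEP 552 header layout (4-byte magic, 4-byte little-endian flags word, 8 bytes of source metadata); the specific bit chosen below is invented purely for illustration:

    import struct

    HYPOTHETICAL_HAS_COLUMN_TABLE = 1 << 2  # bits 0 and 1 are already defined by PEP 552

    def pyc_has_column_table(path):
        # Read the 16-byte pyc header and test the (made-up) flag bit.
        with open(path, "rb") as f:
            magic = f.read(4)
            (flags,) = struct.unpack("<I", f.read(4))
        return bool(flags & HYPOTHETICAL_HAS_COLUMN_TABLE)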

That could work, but in my personal opinion, I would prefer not to do that as it complicates things and I think it is overkill.
Let me expand on this: I recognize the problem that -OO can be quite unusable if some of your dependencies depend on docstrings, and that it would be good to separate this from that option, but I am afraid of the following:

- New APIs in the marshal module and other places to pass down whether to read/write the extra information.
- Complication of the pyc format with more entries in the header.
- Complication of the implementation.

Given that the reasons to deactivate this option exist, but I expect them to be very rare, I would prefer to maximize simplicity and maintainability. On Sat, 8 May 2021 at 21:50, Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

Greg, what do you think if instead of not writing it to the pyc file with -OO, or adding a header entry to decide whether to read/write it, we place None in the field? That way we can leverage the option that we intend to add for deactivating the display of the new traceback information to also reduce the data in the pyc files. The only problem is that there will still be a tiny bit of overhead: an extra object per code object (None), but that's much, much better than something that scales with the number of instructions :) What's your opinion on this? On Sat, 8 May 2021 at 21:45, Gregory P. Smith <greg@krypto.org> wrote:

On Sat, May 8, 2021 at 5:08 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
What if I want to keep asserts and docstrings but don't want the extra data? Or actually, consider this. I *need* to keep asserts (because, rightly or wrongly, I have a dependency, or my own code, that relies on them), but I *don't* want docstrings (because they're huge and I don't want the overhead in production), and I *don't* want the extra data in production either. Now what?

I think what this illustrates is that the entire concept of optimizations in Python needs a complete rethink. It's already fundamentally broken for someone who wants to keep asserts but remove docstrings. Adding a third layer to this is a perfect opportunity to reconsider the whole paradigm.

I'm getting off-topic here, and this should probably be a thread of its own, but perhaps what we should introduce is a compiler directive, similar to future statements but not that, which one can place at the top of a source file to tell the compiler "this file depends on asserts, don't optimize them out". The same goes for each thing that can be optimized out and has a runtime behavior effect, including docstrings.

This would be minimally disruptive since we can then stay at only two optimization levels and put column info at whichever level we feel makes sense, but (provided the compiler directives are used properly) information a particular file requires to function correctly will never be removed from that file, even if the process-wide optimization level calls for it. I see no reason code with asserts in one file and optimized code without asserts in another file can't interact, and no reason code with docstrings and optimized code without docstrings can't interact. Soft keywords would make this compiler directive much easier, as it doesn't have to be shoehorned into the import syntax (to suggest a bikeshed color, perhaps "retain asserts, docstrings"?)

On Sat, May 8, 2021 at 2:40 PM Jonathan Goble <jcgoble3@gmail.com> wrote:
Reconsidering "the whole paradigm" is always possible, but is a much larger effort. It should not be something that blocks this enhancement from happening. We have discussed the -O mess before, on list and at summits and sprints. -OO and the __pycache__ and longer .pyc names and versioned names were among the results of that. But we opted not to try and make life even more complicated by expanding the test matrix of possible generated bytecode even larger. I'm getting off-topic here, and this should probably be a thread of its
This idea has merit. Worth keeping in mind for the future. But agreed, this goes beyond this thread's topic so I'll leave it at that.

On Sat, May 8, 2021 at 2:09 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
Exactly my theme. Our existing -O and -OO already don't serve all user needs. (I've witnessed people who need asserts but don't want docstrings wasting RAM jump through hacky hoops to do that.) Complicating these options more by combining additional actions on them doesn't help. The reason we have -O and -OO generate their own special opt-1 and opt-2 pyc files is that both of those change the generated bytecode and overall flow of the program by omitting instructions and data. Code using those will run slightly faster as there are fewer instructions. The change we're talking about here doesn't do that. It just adds additional metadata to whatever instructions are generated. So it doesn't feel -O related. While some people aren't going to like the overhead, I'm happy not offering the choice. traceback new information to reduce the data in the pyc files. The only problem
I don't understand the pyc structure enough to comment on how that works, but that sounds fine as a way to store less data, if these are stored as a side table rather than intermingled with each instruction itself. *If anyone even cares about storing less data.*

I would not activate generation of that in py_compile and compileall based on the -X flag to disable display of tracebacks though. A flag changing a setting of the current runtime regarding traceback printing detail level should not change the metadata in pyc files it emits. I realize -O and -OO behave this way, but I don't view those as a great example. We're not writing new uniquely named pyc files, so I suggest making this an explicit option for py_compile and compileall if we're going to support generation of pyc files without column data at all.

I'm unclear on what the specific goals are with all of these option possibilities. Who non-hypothetically cares about a 22% pyc file size increase? I don't think we should be concerned. I'm in favor of always writing them and the 20% size increase that results in. If pyc size is an issue, that should be its own separate enhancement PEP. When it comes to pyc files there is more data we may want to store in the future for performance reasons - I don't see them shrinking without an independent effort.

Caring about additional data retained in memory at runtime makes more sense to me, as RAM cost is much greater than storage cost and is paid repeatedly per process. Storing an additional reference to None on code objects where a column information table would otherwise live is perfectly fine. That can be a -X style interpreter startup option. It isn't something that needs to be impacted by the pyc files. Pass that option to the interpreter, and it just discards column info tables on code objects after loading them or compiling them. If people want to optimize for a shared pyc situation with memory mapping techniques, that is also something that should be a separate enhancement PEP and not involved here.

People writing code to use the column information should always check it for None first; that'd be something we document with the new feature. -gps
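A sketch of the consumer-side contract Greg mentions; "co_col_table" is a placeholder name invented here, since the actual attribute/API was not settled at this point in the discussion:

    def instruction_columns(code, instruction_index):
        # Column data is optional: a startup option may have dropped it,
        # so tools must degrade gracefully when it is missing.
        table = getattr(code, "co_col_table", None)
        if table is None:
            return None
        return table[instruction_index]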

Thanks Greg for the great, detailed response. I think I now understand your proposal better, and I think it is a good idea that I would like to explore. I have some questions:

* One problem I see is that this will make the constructor of the code object depend on global options in the interpreter. Someone using the C-API and passing down that attribute will be surprised to find that it was modified by a global. I am not saying this is bad, but I can see some problems with it.
* The alternative is to modify all calls to the code object constructor. This is easy to do in the compiler because code objects are constructed very close to where the metadata is created, but it is going to be a pain in other places, because code objects are constructed in places where we would either need new APIs or have to hide global state, as in the previous point.
* Another alternative is to walk the graph and strip the fields, but that's going to have a performance impact.

I think that if we decide to offer an opt out, this is actually one of the best options, but I am still slightly concerned about the extra complexity, potential new APIs and maintainability. On Sat, 8 May 2021, 22:55 Gregory P. Smith, <greg@krypto.org> wrote:
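For the third bullet, a sketch of what walking the graph could look like; CodeType.replace() is real (Python 3.8+), but the column-table field it would need to clear does not exist yet, so only the recursive walk over nested code objects is shown:

    import types

    def strip_column_info(code):
        # Recursively rebuild nested code objects found in co_consts; a real
        # implementation would also pass something like co_col_table=None here,
        # but that keyword is hypothetical at this point.
        new_consts = tuple(
            strip_column_info(c) if isinstance(c, types.CodeType) else c
            for c in code.co_consts
        )
        return code.replace(co_consts=new_consts)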

On Sat, May 8, 2021 at 2:59 PM Gregory P. Smith <greg@krypto.org> wrote:
While I'm the opposite. 😄 Metadata that is not necessary for CPython to function and whose primary driver is better exception tracebacks totally falls into the same camp as "I don't need docstrings" to me.
Code to read a .pyc file and use it: https://github.com/python/cpython/blob/a0bd9e9c11f5f52c7ddd19144c8230da016b5... (I'd explain more but it is the weekend and I technically shouldn't be reading this thread 😉). -Brett

On 08.05.2021 23:55, Gregory P. Smith wrote:
I do care about both the increase in PYC size as well as the increase in memory usage. When using Python in containers, both are relevant, and so I'd like an option to switch this whole mechanism off that's independent from optimization settings. This idea is more about debugging during development and doesn't really have much to do with the optimization used for production use of Python, so a separate flag or perhaps use of -v would be the more intuitive approach.

Alternative idea: Create a new file format which supports enhanced debugging. This would include the source code in an indexed format, the AST and mappings between bytecode, AST nodes, lines and columns. Python would then only use and load this file when it needs to print a traceback - much like it does today with the source code. The advantage is that you can add even more useful information for debugging while not making the default code distribution format take more memory (both disk and RAM).

BTW: For better readability, I'd also not output the ^^^^ lines for every stack level in the traceback, but just the last one, since it's usually clear where the call to the next stack level happens in the upper ones.

-- Marc-Andre Lemburg

On 09.05.2021 14:22, Larry Hastings wrote:
I'm mostly thinking of tracebacks which go >10 levels deep, which is rather common in larger applications. For those tracebacks, the top entries are mostly noise you never look at when debugging. The proposal now adds another 10 extra lines to jump over :-) PS: It looks like the discussion has wandered off to Discourse now. Should we continue there? -- Marc-Andre Lemburg

On Mon, 10 May 2021 at 08:34, M.-A. Lemburg <mal@egenix.com> wrote:
[...]
PS: It looks like the discussion has wandered off to Discourse
Pablo seems to want to redirect the discussion there, yes, in particular to: https://discuss.python.org/t/pep-657-include-fine-grained-error-locations-in... On Sun, 9 May 2021 at 16:25, Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

On Mon, May 10, 2021 at 05:34:12AM -0400, Terry Reedy wrote:
That's great for people using IDLE, but for those using the vanilla Python interpreter, M-A.L makes a good point about increasing the vertical size of the traceback, which will almost always be ignored. It's especially the case for beginners. It's hard enough to get newbies to read *any* of the traceback. Anything which increases the visual noise of that is going to make it harder. -- Steve

That is going to be very hard to read, unfortunately, especially when the line is not simple. Highlighting the range is quite a fundamental part of the proposal, and it is driven by the very positive reception of highlighted ranges for syntax errors: many users have reached out to say that they find it extremely useful as a visual feature. Also, people with vision problems have mentioned how important having a highlighted section under the code is for quickly understanding the problem. On Mon, 10 May 2021 at 11:46, Irit Katriel via Python-Dev < python-dev@python.org> wrote:

On 5/10/2021 6:07 AM, Steven D'Aprano wrote:
The vanilla interpreter could be updated to recognize when it is running on a simulated 35-year-old terminal that implements ANSI/VT100 color codes rather than a simulated 40+-year-old black-and-white teletype-like terminal. Making the enhancement available to nonstandard python-coded interfaces is a separate issue.
-- Terry Jan Reedy

On Mon, May 10, 2021 at 09:44:05PM -0400, Terry Reedy wrote:
This is what is called "scope creep", although in this case perhaps "scope gallop" is more appropriate *wink* Supporting coloured output out of the box would be nice but if we want to do it properly, we would have to support at least ANSI-compatible terminals and Windows. And once we support it in tracebacks, you know people will say "if Python can print coloured text in a traceback, why can't I print coloured text in my own output?" and so that's going to rapidly end up needing something like colorama. -- Steve

On 5/11/21 1:57 AM, Baptiste Carvello wrote:
The first ANSI standard supported underlined text, didn't it? The VT100 did. That would make it part of the 40+-year-old subset from the late '70s. While color might stand out more, underline suits the problem well, also without increasing the line count. There are a number of terminal emulators that support rich-text copies, but not all of them. This is added information, however, so it not being copy-pastable everywhere shouldn't be a blocking requirement imho. -Mike
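For anyone who wants to experiment, the escape sequence involved is the standard SGR underline code (the column range below is made up for illustration):

    # SGR codes: 4 = underline, 0 = reset; understood by VT100-era terminals
    # and virtually every modern emulator.
    line = "    return 1 + foo(a, b, c=x['z']['x'])"
    start, end = 15, 39  # hypothetical column range reported for this frame
    print(line[:start] + "\x1b[4m" + line[start:end] + "\x1b[0m" + line[end:])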

On Tue, May 11, 2021 at 3:33 PM Mike Miller <python-dev@mgmiller.net> wrote:
Fancier REPL frontends have supported things like highlighting and such in their tracebacks; I expect they'll adopt column information and render it as such. There's a difference between tracebacks dumped as plain text (utf-8) by traceback.print_exc() appearing on stderr or directed into log files, and what can be displayed within a terminal. It is highly unusual to emit terminal control characters into log files. -G

On Tue, May 11, 2021 at 4:07 PM Gregory P. Smith <greg@krypto.org> wrote:
And yet it happens all the time. :-( Let's not risk that happening. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Fri, May 7, 2021 at 5:44 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
To know what compression methods might be effective, I’m wondering if it could be useful to see separate histograms of, say, the start column number and width over the code base. Or for people that really want to dig in, maybe access to the set of all pairs could help. (E.g. maybe a histogram of pairs could also reveal something.) —Chris
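A rough way to get that distribution, using AST nodes as a proxy for the per-instruction offsets (the AST already carries the columns); multi-line nodes will skew the width numbers a little, but it is enough for a first look:

    import ast
    import collections
    import pathlib

    def column_histograms(root):
        # Histograms of start columns and span widths over a code base.
        starts, widths = collections.Counter(), collections.Counter()
        for path in pathlib.Path(root).rglob("*.py"):
            try:
                tree = ast.parse(path.read_text(encoding="utf-8"))
            except (SyntaxError, UnicodeDecodeError, ValueError):
                continue
            for node in ast.walk(tree):
                start = getattr(node, "col_offset", None)
                end = getattr(node, "end_col_offset", None)
                if start is not None and end is not None:
                    starts[start] += 1
                    widths[end - start] += 1
        return starts, widths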

On Fri, May 07, 2021 at 06:02:51PM -0700, Chris Jerdonek wrote:
I think this is over-analysing. Do we need to micro-optimize the compression algorithm? Let's make the choice simple: live with the size increase, or swap to LZ4 compression as Antoine suggested. Analysis paralysis is a real risk here. If there are implementations which cannot support either (MicroPython?) they should be free to continue doing things the old way. In other words, "fine grained error messages" should be a quality of implementation feature rather than a language guarantee. I understand that the plan is to make this feature optional in any case, to allow third-party tools to catch up. If people really want to do that histogram analysis so that they can optimize the choice of compression algorithm, of course they are free to do so. But the PEP authors should not feel that they are obliged to do so, and we should avoid the temptation to bikeshed over compressors. (For what it's worth, I like this proposed feature, I don't care about a 20-25% increase in pyc file size, but if this leads to adding LZ4 compression to the stdlib, I like it even more :-) -- Steve

On Fri, May 7, 2021 at 6:39 PM Steven D'Aprano <steve@pearwood.info> wrote:
I'm not sure why you're sounding so negative. Pablo asked for ideas in his first message to the list: On Fri, May 7, 2021 at 2:53 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:
Does anyone see a better way to encode this information **without complicating a lot the implementation**?
Maybe a large gain can be made with a simple tweak to how the pair is encoded, but there's no way to know without seeing the distribution. Also, my reply wasn't about the pyc files on disk but about their representation in memory, which Pablo later said may be the main concern. So it's not compression algorithms like LZ4 so much as a method of encoding. --Chris

Hi Chris, On Fri, May 07, 2021 at 07:13:16PM -0700, Chris Jerdonek wrote:
I'm not sure why you're sounding so negative. Pablo asked for ideas in his first message to the list:
I know that Pablo asked for ideas, but that doesn't mean that we are obliged to agree with every idea. This is a discussion list which means we discuss ideas, both to agree and disagree. I don't think I'm being negative. I'm very positive about this proposal, and I don't want to see it get bogged down with bike-shedding about the precise compression/encoding algorithm used. If Pablo, or any other volunteer such as yourself, wants to go down that track to investigate the data distribution, I'm not going to tell them that they must not. Go for it! But I'd rather not make this a mandatory prerequisite for the PEP. [...]
Okay, thanks for the clarification. -- Steve

On 2021-05-08 01:43, Pablo Galindo Salgado wrote:
[snip] I'm wondering if it's possible to compromise with one position that's not as complete but still gives a good hint. For example:

  File "test.py", line 6, in lel
    return 1 + foo(a,b,c=x['z']['x']['y']['z']['y'], d=e)
                                                  ^
TypeError: 'NoneType' object is not subscriptable

That at least tells you which subscript raised the exception. Another example:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    print(1 / x + 1 / y)
            ^
ZeroDivisionError: division by zero

as distinct from:

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    print(1 / x + 1 / y)
                    ^
ZeroDivisionError: division by zero

I'm wondering if it's possible to compromise with one position that's not as complete but still gives a good hint:
Even if it is possible, it would be quite a bit less useful (a lot of users wanted full ranges highlighted for syntax errors, and that change was very well received when we announced it in 3.10) and, most importantly, it would render the feature much less useful for other tools such as profilers, coverage tools, and the like. It would also make the feature less useful for people that want to display even more information, such as error reporting tools, IDEs...etc. On Sat, 8 May 2021 at 02:41, MRAB <python@mrabarnett.plus.com> wrote:

What are people thoughts on the feature?
I'm +1, this level of detail in the bytecode is very useful. My main interest is actually from the AST though. :) In order to be in the bytecode, one assumes it must first be in the AST. That information is incredibly useful for refactoring tools like https://github.com/ssbr/refex (n.b. author=me) or https://github.com/gristlabs/asttokens (which refex builds on). Currently, asttokens actually attempts to re-discover that kind of information after the fact, which is error-prone and difficult.

This could also be useful for finer-grained code coverage tracking and/or debugging. One can actually imagine highlighting the spans of code which were only partially executed: e.g. if only x() were ever executed in "x() and y()". Ned Batchelder once did wild hacks in this space, and maybe this proposal could lead in the future to something non-hacky? https://nedbatchelder.com/blog/200804/wicked_hack_python_bytecode_tracing.ht...

I say "in the future" because it doesn't just automatically work, since as I understand it, coverage currently doesn't track spans, but lines hit by the line-based debugger. Something else is needed to be able to track which spans were hit rather than which lines, and it may be similarly hacky if it's isolated to coveragepy. If, for example, enough were exposed to let a debugger skip to bytecode for the next different (sub)span, then this would be useful for both coverage and actual debugging as you step through an expression. This is probably way out of scope for your PEP, but even so, the feature may be laying some useful groundwork here. -- Devin On Fri, May 7, 2021 at 2:52 PM Pablo Galindo Salgado <pablogsal@gmail.com> wrote:

On Sat, 8 May 2021 at 10:00, Devin Jeanpierre (<jeanpierreda@gmail.com>) wrote:
The AST already has column offsets ( https://docs.python.org/3.10/library/ast.html#ast.AST.col_offset).
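For example (real API; the end offsets have also been available since Python 3.8):

    import ast

    source = "1 + foo(a, b, c=x['z']['x'])"
    call = ast.parse(source, mode="eval").body.right    # the foo(...) Call node
    print(call.col_offset, call.end_col_offset)         # 4 28
    print(source[call.col_offset:call.end_col_offset])  # foo(a, b, c=x['z']['x'])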
participants (26):
- Ammar Askar
- Antoine Pitrou
- Baptiste Carvello
- Barry
- Brett Cannon
- Chris Jerdonek
- Devin Jeanpierre
- Ethan Furman
- Gregory P. Smith
- Guido van Rossum
- Henk-Jaap Wagenaar
- Irit Katriel
- Jelle Zijlstra
- Jim J. Jewett
- Jonathan Goble
- Larry Hastings
- M.-A. Lemburg
- Mike Miller
- MRAB
- Nathaniel Smith
- Neil Schemenauer
- Nick Coghlan
- Pablo Galindo Salgado
- Richard Damon
- Steven D'Aprano
- Terry Reedy