PEP proposal to limit various aspects of a Python program to one million.
Hi Everyone,

I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.

Any thoughts or feedback?

The PEP: https://github.com/markshannon/peps/blob/one-million/pep-1000000.rst

Cheers,
Mark.

Full text
*********

PEP: 1000000
Title: The one million limit
Author: Mark Shannon <mark@hotpy.org>
Status: Active
Type: Enhancement
Content-Type: text/x-rst
Created: 03-Dec-2019
Post-History:

Abstract
========

This PEP proposes a limit of one million (1 000 000) for various aspects of Python code and its implementation.

The Python language does not specify limits for many of its features. Not having any limit on these values seems to enhance programmer freedom, at least superficially, but in practice the CPython VM and other Python virtual machines have implicit limits or are forced to assume that the limits are astronomical, which is expensive.

This PEP lists a number of features which are to have a limit of one million. If a language feature is not listed but appears unlimited and must be finite, for physical reasons if no other, then a limit of one million should be assumed.

Motivation
==========

There are many values that need to be represented in a virtual machine. If no limit is specified for these values, then the representation must either be inefficient or vulnerable to overflow. The CPython virtual machine represents values like line numbers, stack offsets and instruction offsets by 32 bit values. This is inefficient, and potentially unsafe.

It is inefficient as actual values rarely need more than a dozen or so bits to represent them.

It is unsafe as malicious or poorly generated code could cause values to exceed 2\ :sup:`32`.

For example, line numbers are represented by 32 bit values internally. This is inefficient, given that modules almost never exceed a few thousand lines. Despite being inefficient, it is still vulnerable to overflow, as it is easy for an attacker to create a module with billions of newline characters.

Memory access is usually a limiting factor in the performance of modern CPUs. Better packing of data structures enhances locality and reduces memory bandwidth, at a modest increase in ALU usage (for shifting and masking). Being able to safely store important values in 20 bits would allow memory savings in several data structures including, but not limited to:

* Frame objects
* Object headers
* Code objects

There is also the potential for a more efficient instruction format, speeding up interpreter dispatch.

Rationale
=========

Imposing a limit on values such as lines of code in a module, and the number of local variables, has significant advantages for ease of implementation and efficiency of virtual machines. If the limit is sufficiently large, there is no adverse effect on users of the language.

By selecting a fixed but large limit for these values, it is possible to have both safety and efficiency whilst causing no inconvenience to human programmers and only very rare problems for code generators.

One million
-----------

The Java Virtual Machine (JVM) [1]_ specifies a limit of 2\ :sup:`16`-1 (65535) for many program elements similar to those covered here. This limit enables limited values to fit in 16 bits, which is a very efficient machine representation. However, this limit is quite easily exceeded in practice by code generators, and the author is aware of existing Python code that already exceeds 2\ :sup:`16` lines of code.

A limit of one million fits into 20 bits which, although not as convenient for machine representation, is still reasonably compact. Three signed values in the range -1,000,000 to +1,000,000 can fit into a 64 bit word. A limit of one million is small enough for efficiency advantages (only 20 bits), but large enough not to impact users (no one has ever written a module of one million lines).

The value "one million" is very easy to remember.
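For illustration only (this packing scheme is a sketch, not part of the specification, and not how CPython would necessarily implement it), three such signed values can be packed into and recovered from a single 64 bit word using 21 bits per field, with a bias of one million keeping each stored field non-negative:

.. code-block:: python

    BIAS = 1_000_000          # shifts the range -1,000,000..+1,000,000 to 0..2,000,000
    FIELD_BITS = 21           # 2,000,000 < 2**21, and 3 * 21 = 63 <= 64
    MASK = (1 << FIELD_BITS) - 1

    def pack3(a, b, c):
        assert all(-BIAS <= v <= BIAS for v in (a, b, c))
        return ((a + BIAS)
                | ((b + BIAS) << FIELD_BITS)
                | ((c + BIAS) << (2 * FIELD_BITS)))

    def unpack3(word):
        return ((word & MASK) - BIAS,
                ((word >> FIELD_BITS) & MASK) - BIAS,
                ((word >> (2 * FIELD_BITS)) & MASK) - BIAS)

    word = pack3(-42, 1_000_000, 7)
    assert word < 2**64
    assert unpack3(word) == (-42, 1_000_000, 7)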
Isn't this "640K ought to be enough for anybody" again?
-------------------------------------------------------

The infamous 640K memory limit was a limit on machine usable resources. The proposed one million limit is a limit on human generated code.

While it is possible that generated code could exceed the limit, it is easy for a code generator to modify its output to conform. The author has hit the 64K limit in the JVM on at least two occasions when generating Java code. The workarounds were relatively straightforward and probably wouldn't have been necessary with a limit of one million bytecodes or lines of code.

Specification
=============

This PEP proposes that the following language features and runtime values be limited to one million.

* The number of source code lines in a module.
* The number of bytecode instructions in a code object.
* The sum of local variables and stack usage for a code object.
* The number of distinct names in a code object.
* The number of constants in a code object.
* The number of classes in a running interpreter.
* The number of live coroutines in a running interpreter.

The advantages for CPython of imposing these limits
----------------------------------------------------

Lines of code in a module and code object restrictions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When compiling source code to bytecode or modifying bytecode for profiling or debugging, an intermediate form is required. By limiting line numbers and operands to 20 bits, instructions can be represented in a compact 64 bit form, allowing very fast passes over the instruction sequence.

Having 20 bit operands (21 bits for relative branches) allows instructions to fit into 32 bits without needing additional ``EXTENDED_ARG`` instructions. This improves dispatch, as the operand is strictly local to the instruction. Using super-instructions would make the 32 bit format almost as compact as the 16 bit format, and significantly faster.

Total number of classes in a running interpreter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This limit has the potential to reduce the size of object headers considerably.

Currently objects have a two word header, for objects without references (int, float, str, etc.), or a four word header for objects with references. By reducing the maximum number of classes, the space for the class reference can be reduced from 64 bits to fewer than 32 bits, allowing a much more compact header.

For example, a super-compact header format might look like this:

.. code-block::

    struct header {
        uint32_t gc_flags:6;    /* Needs finalisation, might be part of a cycle, etc. */
        uint32_t class_id:26;   /* Can be efficiently mapped to an address by ensuring suitable alignment of classes */
        uint32_t refcount;      /* Limited memory or saturating */
    }

This format would reduce the size of a Python object without slots, on a 64 bit machine, from 40 to 16 bytes.
Note that there are two ways to use a 32 bit refcount on a 64 bit machine. One is to limit each sub-interpreter to 32 GB of memory. The other is to use a saturating reference count, which would be a little bit slower, but allow unlimited memory allocation.

Backwards Compatibility
=======================

It is hypothetically possible that some machine generated code exceeds one or more of the above limits. The author believes that to be highly unlikely and easily fixed by modifying the output stage of the code generator.

Security Implications
=====================

Minimal. This reduces the attack surface of any Python virtual machine by a small amount.

Reference Implementation
========================

None, as yet. This will be implemented in CPython, once the PEP has been accepted.

Rejected Ideas
==============

None, as yet.

Open Issues
===========

None, as yet.

References
==========

.. [1] The Java Virtual Machine specification
   https://docs.oracle.com/javase/specs/jvms/se8/jvms8.pdf

Copyright
=========

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
I would not have such a small limit. I can envisage generating code from a log and then evaluating that code; 1 million lines could be small, given the speed of the interpreter on modern machines. One might want to generate data as a Python file rather than a pickle and load that as a module. There might well be alternative methods, but if the simple approach works and is quickly debugged at small scale, then this puts a possible extra barrier in the way. Python is still a scripting language; why limit a "quick hack"?

On Tue, Dec 3, 2019, 4:22 PM Mark Shannon <mark@hotpy.org> wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
The PEP: https://github.com/markshannon/peps/blob/one-million/pep-1000000.rst
Cheers, Mark.
On Wed, Dec 4, 2019 at 3:20 AM Mark Shannon <mark@hotpy.org> wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
The main justification for these is performance, if I'm reading this correctly. Have you measured anything to find out just how much you gain? Arbitrary limits are always annoying when you hit them, so it would be nice to see how much benefit there is first. ChrisA
On 03Dec2019 0815, Mark Shannon wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
I assume you're aiming for acceptance in just under four months? :)
Any thoughts or feedback?
It's actually not an unreasonable idea, to be fair. Picking an arbitrary limit less than 2**32 is certainly safer for many reasons, and very unlikely to impact real usage. We already have some real limits well below 10**6 (such as if/else depth and recursion limits). That said, I don't really want to impact edge-case usage, and I'm all too familiar with other examples of arbitrary limits (no file system would need a path longer than 260 characters, right? :o) ). Some comments on the specific items, assuming we're not just going to reject this out of hand.
Specification
=============
This PEP proposes that the following language features and runtime values be limited to one million.
* The number of source code lines in a module
This one feels the most arbitrary. What if I have a million blank lines or comments? We still need the correct line number to be stored, which means our lineno fields still have to go beyond 10**6. Limiting total lines in a module to 10**6 is certainly too small.
* The number of bytecode instructions in a code object.
Seems reasonable.
* The sum of local variables and stack usage for a code object.
I suspect our effective limit is already lower than 10**6 here anyway - do we know what it actually is?
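One way to probe the current ceiling (a rough sketch; the generated function shape and the chosen n are arbitrary, and one could push n upward until compilation fails or becomes impractical) is to synthesise a function with many locals and inspect its code object:

    def make_function(n):
        # Build "def f(): v0 = 0; v1 = 1; ...; return v0" with n local variables.
        body = "\n".join(f"    v{i} = {i}" for i in range(n))
        src = f"def f():\n{body}\n    return v0\n"
        namespace = {}
        exec(compile(src, "<generated>", "exec"), namespace)
        return namespace["f"]

    f = make_function(10_000)
    print(f.__code__.co_nlocals, f.__code__.co_stacksize)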
* The number of distinct names in a code object
SGTM.
* The number of constants in a code object.
SGTM.
* The number of classes in a running interpreter.
I'm a little hesitant on this one, but perhaps there's a way to use a sentinel for class_id (in your later struct) for when someone exceeds this limit? The benefits seem worthwhile here even without the rest of the PEP.
* The number of live coroutines in a running interpreter.
SGTM. At this point we're probably putting serious pressure on kernel wait objects/FDs anyway, and if you're not waiting then you're probably not efficiently using coroutines anyway.
Having 20 bit operands (21 bits for relative branches) allows instructions to fit into 32 bits without needing additional ``EXTENDED_ARG`` instructions. This improves dispatch, as the operand is strictly local to the instruction. Using super-instructions would make the 32 bit format almost as compact as the 16 bit format, and significantly faster.
We can measure this - how common are EXTENDED_ARG instructions? ISTR we checked this when switching to 16-bit instructions and it was worth it, but I'm not sure whether we also considered 32-bit instructions at that time.
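A rough way to measure it (a sketch, not a benchmark; it compiles every .py file under a directory and counts opcodes, recursing into nested code objects):

    import dis
    import sys
    import types
    from pathlib import Path

    def iter_code_objects(code):
        yield code
        for const in code.co_consts:
            if isinstance(const, types.CodeType):
                yield from iter_code_objects(const)

    def count_extended_args(root):
        total = extended = 0
        for path in Path(root).rglob("*.py"):
            try:
                source = path.read_text(encoding="utf-8", errors="replace")
                top = compile(source, str(path), "exec")
            except (SyntaxError, ValueError):
                continue
            for code in iter_code_objects(top):
                for instr in dis.get_instructions(code):
                    total += 1
                    if instr.opname == "EXTENDED_ARG":
                        extended += 1
        return total, extended

    if __name__ == "__main__":
        total, extended = count_extended_args(sys.argv[1] if len(sys.argv) > 1 else ".")
        print(f"{extended} EXTENDED_ARG out of {total} instructions "
              f"({100.0 * extended / max(total, 1):.3f}%)")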
Total number of classes in a running interpreter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This limit has the potential to reduce the size of object headers considerably.
This would be awesome, and I *think* it's ABI compatible (as the affected fields are all put behind the PyObject* that gets returned, right?). If so, I think it's worth calling that out in the text, as it's not immediately obvious. Cheers, Steve
On Tue, Dec 3, 2019, at 12:22, Steve Dower wrote:
* The number of constants in a code object.
SGTM.
Two things... First, for this one in particular, the number of constants in a code object is hard to predict. For example, recently (I want to say 3.7), the number of constants generated for most code was reduced by removing duplicate constants and intermediate constants that are optimized away by constant-folding. Adding the "x in [list or set literal]" optimization introduces a tuple or frozenset constant (and removes the constants for the items) You also mentioned names - it's easy to imagine an implementation whose version of a code object shares names and constants in a single array with a single limit, or which has constants split up per type [and maybe doesn't implement tuple and frozenset constants at all] and shares names and string constants. Or one which does not store names at all for local variables in optimized code. I also don't think characteristics of cpython-specific structures like code objects should be where limits are defined. One of the purposes of specifying a formal limit is to give other implementations a clear target for minimum support. In C, there's a long list of limits like these [some of which are quite a bit smaller than what typical implementations support], but all of them are for characteristics that can actually be determined by looking at the source code for a program.
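For instance, the membership-test optimisation can be seen directly in co_consts (a sketch; the exact contents shown in the comments are typical of recent CPython and may vary by version):

    import dis

    def f(x):
        return x in [1, 2, 3]

    def g(x):
        return x in {1, 2, 3}

    # On recent CPython the list literal is folded into a single tuple constant
    # and the set literal into a single frozenset constant, so neither function
    # stores 1, 2 and 3 as separate constants.
    print(f.__code__.co_consts)   # typically (None, (1, 2, 3))
    print(g.__code__.co_consts)   # typically (None, frozenset({1, 2, 3}))
    dis.dis(f)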
On 03/12/2019 5:22 pm, Steve Dower wrote:
On 03Dec2019 0815, Mark Shannon wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
I assume you're aiming for acceptance in just under four months? :)
Why not? I'm an optimist at heart :)
Any thoughts or feedback?
It's actually not an unreasonable idea, to be fair. Picking an arbitrary limit less than 2**32 is certainly safer for many reasons, and very unlikely to impact real usage. We already have some real limits well below 10**6 (such as if/else depth and recursion limits).
That said, I don't really want to impact edge-case usage, and I'm all too familiar with other examples of arbitrary limits (no file system would need a path longer than 260 characters, right? :o) ).
Some comments on the specific items, assuming we're not just going to reject this out of hand.
Specification
=============
This PEP proposes that the following language features and runtime values be limited to one million.
* The number of source code lines in a module
This one feels the most arbitrary. What if I have a million blank lines or comments? We still need the correct line number to be stored, which means our lineno fields still have to go beyond 10**6. Limiting total lines in a module to 10**6 is certainly too small.
* The number of bytecode instructions in a code object.
Seems reasonable.
* The sum of local variables and stack usage for a code object.
I suspect our effective limit is already lower than 10**6 here anyway - do we know what it actually is?
* The number of distinct names in a code object
SGTM.
* The number of constants in a code object.
SGTM.
* The number of classes in a running interpreter.
I'm a little hesitant on this one, but perhaps there's a way to use a sentinel for class_id (in your later struct) for when someone exceeds this limit? The benefits seem worthwhile here even without the rest of the PEP.
* The number of live coroutines in a running interpreter.
SGTM. At this point we're probably putting serious pressure on kernel wait objects/FDs anyway, and if you're not waiting then you're probably not efficiently using coroutines anyway.
From my limited googling, linux has a hard limit of about 600k file descriptors across all processes. So, 1M is well past any reasonable per-process limit. My impression is that the limits are lower on Windows, is that right?
Having 20 bit operands (21 bits for relative branches) allows instructions to fit into 32 bits without needing additional ``EXTENDED_ARG`` instructions. This improves dispatch, as the operand is strictly local to the instruction. Using super-instructions would make the 32 bit format almost as compact as the 16 bit format, and significantly faster.
We can measure this - how common are EXTENDED_ARG instructions? ISTR we checked this when switching to 16-bit instructions and it was worth it, but I'm not sure whether we also considered 32-bit instructions at that time.
The main benefit of 32 bit instructions is super-instructions, but removing EXTENDED_ARG does streamline instruction decoding a bit.
Total number of classes in a running interpreter
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This limit has the potential to reduce the size of object headers considerably.
This would be awesome, and I *think* it's ABI compatible (as the affected fields are all put behind the PyObject* that gets returned, right?). If so, I think it's worth calling that out in the text, as it's not immediately obvious.
Cheers, Steve
On Thu, Dec 5, 2019 at 5:38 AM Mark Shannon <mark@hotpy.org> wrote:
From my limited googling, linux has a hard limit of about 600k file descriptors across all processes. So, 1M is well past any reasonable per-process limit. My impression is that the limits are lower on Windows, is that right?
Linux does limit the total number of file descriptors across all processes, but the limit is configurable at runtime. 600k is the default limit, but you can always make it larger (and people do). In my limited experimentation with Windows, it doesn't seem to impose any a priori limit on how many sockets you can have open. When I wrote a simple process that opens as many sockets as it can in a loop, I didn't get any error; eventually the machine just locked up. (I guess this is another example of why it can be better to have explicit limits!) -n -- Nathaniel J. Smith -- https://vorpus.org
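For reference, the relevant knobs are easy to inspect from Python (a Unix-only sketch; the paths and defaults differ across systems):

    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"per-process fd limit: soft={soft}, hard={hard}")

    # System-wide ceiling on Linux, adjustable at runtime via sysctl fs.file-max.
    try:
        with open("/proc/sys/fs/file-max") as f:
            print("system-wide fd limit:", f.read().strip())
    except OSError:
        print("no /proc/sys/fs/file-max on this platform")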
On 03/12/2019 16:15, Mark Shannon wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
The PEP: https://github.com/markshannon/peps/blob/one-million/pep-1000000.rst
Cheers, Mark.
A limit of one million fits into 20 bits which, although not as convenient for machine representation, is still reasonably compact. Three signed values in the range -1,000,000 to +1,000,000 can fit into a 64 bit word. A limit of one million is small enough for efficiency advantages (only 20 bits), but large enough not to impact users (no one has ever written a module of one million lines).
OK, let me stop you here. If you have twenty bits of information, you'll be fitting them into a 32-bit word anyway. Anything else will be more or less inefficient to access, depending on your processor. You aren't going to save anything there. If you have plans to use the spare bits for something else, please don't. I've seen this done in two major architectures (status flags for both the IBM System/370 and ARM 2 and 3 architectures lived in the top bits of the program counter), and it was acknowledged to be a major mistake both times. Aside from limiting your expansion (Who would ever want more than 24 bits of address space? Everyone, it turns out :-), every access you make to that word is going to need to mask out some bits of the word. You would take an efficiency hit on every access.
Isn't this "640K ought to be enough for anybody" again?
-------------------------------------------------------
The infamous 640K memory limit was a limit on machine usable resources. The proposed one million limit is a limit on human generated code.
While it is possible that generated code could exceed the limit, it is easy for a code generator to modify its output to conform. The author has hit the 64K limit in the JVM on at least two occasions when generating Java code. The workarounds were relatively straightforward and probably wouldn't have been necessary with a limit of one million bytecodes or lines of code.
I can absolutely guarantee that this will come back and bite you. Someone out there will be doing something more complicated than you think is plausible, and eventually someone will hit your limits. It may not take as long as you think, either. -- Rhodri James *-* Kynesim Ltd
Oddly, I did not get Mark's original e-mail, but am seeing replies here. Piggybacking off of James' email here...

On 03/12/2019 16:15, Mark Shannon wrote:
> Hi Everyone,
>
> I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.

My main concern about this PEP is it doesn't specify the behavior when a given limit is exceeded. Whether you choose 10 lines or 10 billion lines as the rule, someone annoying (like me) is going to want to know what's going to happen if I break the rule. Non-exhaustively, you could:

1. Say the behavior is implementation defined
2. Physically prohibit the limit from being exceeded (limited by construction/physics)
3. Generate a warning
4. Raise an exception early (during parse/analysis/bytecode generation)
5. Raise an exception during runtime

The first two will keep people who hate limits happy, but essentially give the limit no teeth. The last three are meaningful but will upset people when a previously valid program breaks.

1. The C and C++ standards are littered with limits (many of which you have to violate to create a real-world program) that ultimately specify that the resulting behavior is "implementation defined." Most general-purpose compilers have reasonable implementations (e.g. I can actually end my file without a newline and not have it call abort() or execve("/usr/bin/nethack"), behaviors both allowed by the C99 standard). You could go this route, but the end result isn't much better than not having done the PEP in the first place (beyond having an Ivory Tower to sit upon and taunt the unwashed masses, "I told you so," when you do decide to break their code). Don't go this route unless absolutely necessary. Of course, the C/C++ standard isn't for an implementation; this PEP has the luxury of addressing a single implementation (CPython).

2. Many of Java's limits are by construction. You can't exceed 2**16 bytecode instructions for a method because they only allocated a uint16_t (u2 in the classfile spec) for the program counter in various places. (Bizarrely, the size of the method itself is stored as a uint32_t/u4.) I believe these limits are less useful because you'll never hit them in a running program; you simply can't create an invalid program. This would be like saying the size of Python bytecode is limited to the number of particles in the universe (~10**80). You don't have to specify the consequences because physics won't let you violate them. This is more useful for documenting format limits, but probably doesn't achieve what you're trying to achieve.

3. Realistically, this is probably what you'd have to do in the first version for PEP adoption to get non-readers of python-dev@ ready, but, again, it doesn't achieve what you're setting out to do. We'd still accept programs that exceed these limits, and whatever optimizations depend on these limits being in place wouldn't work.

Which brings us to the real meat, 4 & 5. Some limits don't really distinguish between these cases. Exceeding the total bytecode size for a module, for example, would have to fail at bytecode generation time (ignoring truly irrational behavior like silently truncating the bytecode). But others aren't so cut-and-dry. For example, a module that is compliant except for a single function that contains too many local variables.
Whether you do 4 or 5 isn't so obvious.

Pros of choosing 4 (exception at load):

* I'm alerted of errors early, before I start a 90-hour compute job, only to have it crash in the write_output() function.
* Don't have to keep a poisoned function that your optimizers have to special case.

Pros of choosing 5 (exception at runtime):

* If I never call that function (maybe it's something in a library I don't use), I don't get penalized.
* In line with other Python (mis-)behaviors, e.g. raising NameError() at runtime if you typo a variable name.

On Tue 12/03/19, 10:05 AM, "Rhodri James" <rhodri@kynesim.co.uk> wrote:

On 03/12/2019 16:15, Mark Shannon wrote:
> Isn't this "640K ought to be enough for anybody" again?
> -------------------------------------------------------
>
> The infamous 640K memory limit was a limit on machine usable resources.
> The proposed one million limit is a limit on human generated code.
>
> While it is possible that generated code could exceed the limit,
> it is easy for a code generator to modify its output to conform.
> The author has hit the 64K limit in the JVM on at least two occasions
> when generating Java code.
> The workarounds were relatively straightforward and
> probably wouldn't have been necessary with a limit of one million
> bytecodes or lines of code.

I can absolutely guarantee that this will come back and bite you. Someone out there will be doing something more complicated than you think is plausible, and eventually someone will hit your limits. It may not take as long as you think, either.

I'm in between Rhodri and Mark here. I've also been bitten by the 64k JVM bytecode limit when generating code, but I did *not* find it so easy to work around. What was a dumb translator suddenly had to get a lot more smarts.

Having predictable behavior *is* important, though, and having limits with specified behavior when those limits are exceeded helps. Keep in mind that I'm going to be annoyed when I hit those limits, so having an engineering justification for why the limit was set to a certain value will go a long way toward buying you credibility. One million does not feel credible -- that's "we're setting a limit because we couldn't be bothered to figure out what the limit should be." OTOH, 16,777,215 (2**24-1) does feel credible -- that's "no processor is capable of holding this many TLB entries in the level 2 cache with retpolines active without introducing extreme swapping on write-limited SSDs, but you can get around it if you're willing to adjust this constant and recompile." Or whatever. (Ok, don't BS us like I just did, but you get the idea. :-) )

Dave
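To make the difference between options 4 and 5 above concrete, here is a hypothetical sketch (LIMIT, LimitExceeded and the decorator names are invented for illustration, not anything proposed by the PEP) of enforcing one of the proposed limits at definition/import time versus at call time:

    import functools

    LIMIT = 1_000_000                      # hypothetical value

    class LimitExceeded(Exception):        # hypothetical exception
        pass

    def check_at_import(func):
        # Option 4: reject the function as soon as it is defined or imported.
        code = func.__code__
        if code.co_nlocals + code.co_stacksize > LIMIT:
            raise LimitExceeded(f"{func.__name__} exceeds the locals+stack limit")
        return func

    def check_at_call(func):
        # Option 5: accept the function, but refuse to execute it.
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            code = func.__code__
            if code.co_nlocals + code.co_stacksize > LIMIT:
                raise LimitExceeded(f"{func.__name__} exceeds the locals+stack limit")
            return func(*args, **kwargs)
        return wrapper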
I'm going to second Chris' comment about efficiency. The purposes of this PEP (as I read it) are: (1) Security (less chance of code intentionally or accidentally exceeding low-level machine limits that allow a security exploit); (2) Improved memory use; (3) And such improved memory use will lead to faster code. 1 and 2 seem to be obviously true, but like Chris, I think it's a bit much to expect us to take 3 on faith until after the PEP is accepted:
Reference Implementation
========================
None, as yet. This will be implemented in CPython, once the PEP has been accepted.
I think the change you are asking for is akin to asking us to accept the GILectomy on the promise that "trust me, it will speed up CPython, no reference implementation is needed". It's a big thing to ask. Most of us are Python programmers, not experts on the minutiae of the interaction between C code and CPU cache locality and prediction etc. Speaking for myself, all I know is that it is *really hard* to predict what will and won't be faster: improving memory locality will speed things up, but doing more work on every access will slow things down, so I'd like to see something more than just an assertion that this will speed things up.

Another question: how will this change affect CPython on less common CPUs like Atom etc?

As for the other objections I've seen so far, I think they are specious. (Sorry guys, I just think you are being knee-jerk naysayers. Convince me I am wrong.) Obviously none of them is going to apply to hand-written code. Despite Rhodri's skepticism I don't think there is any real question of hand-written code hitting the limits of a million constants or a million local variables *per function*. I just grepped the 3.8 stdlib for "class" and came up with fewer than 22000 instances of the word:

[steve@ando cpython]$ grep -r "class" Lib/ | wc -l
21522

That includes docstrings, comments, the word "subclass" etc, but let's pretend that they're all actual classes. And let's round it up to 25000, and assume that there's another 25000 classes built into the interpreter, AND then *quadruple* that number to allow for third party libraries; that comes to just 250,000 classes. So we could load the entire stdlib and all our third-party libraries at once, and still be able to write 750,000 classes in your own code before hitting the limit.

Paddy: if you are generating a script from a log file, and it hits the million line boundary, it's easy to split it across multiple files. Your objection "why limit a quick hack" has a simple answer: limiting that quick hack allows Python code to be quicker, more memory efficient and safer. If the cost of this is that you have to generate "quick hack" machine-generated Python scripts as multiple million-line files instead of one ginormous file, then that's a tradeoff well worth making.

Random832, I think the intention of this PEP is to specify limits for *CPython* specifically, not other implementations. Mark, can you clarify?

I don't understand why Steve Dower raises the issue of a million blank lines or comments. Machine generated code surely doesn't need blank lines. Blank lines could be stripped out; comments could be moved to another file. I see no real difficulty here.

-- Steven
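On the "split it across multiple files" point, a rough sketch of what a code generator might do (the chunk size, naming scheme and star-import re-export are arbitrary choices, and it assumes each chunk is a self-contained run of statements):

    from pathlib import Path

    MAX_LINES = 1_000_000

    def write_generated(lines, out_dir, stem="generated"):
        out_dir = Path(out_dir)
        out_dir.mkdir(parents=True, exist_ok=True)
        parts = []
        for i in range(0, len(lines), MAX_LINES):
            name = f"{stem}_{i // MAX_LINES:03d}"
            (out_dir / f"{name}.py").write_text("\n".join(lines[i:i + MAX_LINES]) + "\n")
            parts.append(name)
        # Re-export every chunk from the package so callers import a single name.
        init = "\n".join(f"from .{name} import *" for name in parts) + "\n"
        (out_dir / "__init__.py").write_text(init)
        return parts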
Steven D'Aprano wrote:
I'm going to second Chris' comment about efficiency. The purposes of this PEP (as I read it) are: (1) Security (less chance of code intentionally or accidentally exceeding low-level machine limits that allow a security exploit); (2) Improved memory use; (3) And such improved memory use will lead to faster code. 1 and 2 seem to be obviously true, but like Chris, I think it's a bit much to expect us to take 3 on faith until after the PEP is accepted:
Reference Implementation
None, as yet. This will be implemented in CPython, once the PEP has been accepted.

I think the change you are asking for is akin to asking us to accept the GILectomy on the promise that "trust me, it will speed up CPython, no reference implementation is needed". It's a big thing to ask. Most of us are Python programmers, not experts on the minutiae of the interaction between C code and CPU cache locality and prediction etc.
While I personally am okay putting in limits where necessary for security, or if there's a clear performance win and the margin for people is high enough, I agree with Steven that the PEP currently doesn't lay that out beyond conjecture that this should have some benefit.

I can also see the argument for having a language-level standard, versus something that's per-interpreter, so one can know their code will run everywhere. But I would also argue most auto-generated code is probably not for a library that's going to approach any limit being proposed; it's more likely to be app code, which is going to be more tied to a specific interpreter. And in that case I would rather let the interpreters manage their own limits, as their performance characteristics will be different and these caps may bring them no benefit. Artificial constraints in the name of interpreter consistency go against "practicality beats purity".
The number of live coroutines in a running interpreter.
Could you further elaborate on what is meant by "live coroutines"? My guesses (roughly from most likely to least likely) would be:

1) All known coroutine objects in a state of either CORO_RUNNING or CORO_SUSPENDED, but *not* CORO_CREATED or CORO_CLOSED.
2) Coroutine objects that are currently running, in a state of CORO_RUNNING.
3) Coroutine objects being awaited, in a state of CORO_SUSPENDED.
4) All known coroutine objects in a state of either CORO_CREATED, CORO_RUNNING, or CORO_SUSPENDED.
5) All known coroutine objects in any state.
6) Total declared coroutines.

Just so we're all on the same page, I'm referring to a "coroutine object" as the object returned from the call `coro()`, whereas a "coroutine" is the coroutine function/method `async def coro` (or the deprecated generator-based coroutines).

It probably wouldn't be as much of a concern to only allow 1M running coroutines at one time, but it might be an issue to only allow 1M known coroutine objects in any state within a given interpreter, particularly for servers that run nearly indefinitely and handle a significant number of concurrent requests.
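For reference, the states in question can be observed with inspect.getcoroutinestate(); a minimal sketch (driving the coroutine by hand rather than through an event loop):

    import inspect

    class Suspend:
        def __await__(self):
            yield                     # suspend the awaiting coroutine once

    async def coro():
        await Suspend()

    c = coro()
    print(inspect.getcoroutinestate(c))   # CORO_CREATED
    c.send(None)                          # run up to the first suspension point
    print(inspect.getcoroutinestate(c))   # CORO_SUSPENDED
    try:
        c.send(None)                      # resume; the coroutine then returns
    except StopIteration:
        pass
    print(inspect.getcoroutinestate(c))   # CORO_CLOSED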
On Tue, Dec 3, 2019 at 11:24 AM Mark Shannon <mark@hotpy.org> wrote:

Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
The PEP: https://github.com/markshannon/peps/blob/one-million/pep-1000000.rst
Cheers, Mark.
On Tue, Dec 3, 2019 at 8:21 AM Mark Shannon <mark@hotpy.org> wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
The PEP: https://github.com/markshannon/peps/blob/one-million/pep-1000000.rst
Cheers, Mark.
Full text *********
PEP: 1000000 Title: The one million limit Author: Mark Shannon <mark@hotpy.org> Status: Active Type: Enhancement Content-Type: text/x-rst Created: 03-Dec-2019 Post-History:
Abstract ======== This PR proposes a limit of one million (1 000 000) for various aspects of Python code and its implementation.
The Python language does not specify limits for many of its features. Not having any limit to these values seems to enhance programmer freedom, at least superficially, but in practice the CPython VM and other Python virtual machines have implicit limits or are forced to assume that the limits are astronomical, which is expensive.
This PR lists a number of features which are to have a limit of one million. If a language feature is not listed but appears unlimited and must be finite, for physical reasons if no other, then a limit of one million should be assumed.
Motivation ==========
There are many values that need to be represented in a virtual machine. If no limit is specified for these values, then the representation must either be inefficient or vulnerable to overflow. The CPython virtual machine represents values like line numbers, stack offsets and instruction offsets by 32 bit values. This is inefficient, and potentially unsafe.
It is inefficient as actual values rarely need more than a dozen or so bits to represent them.
It is unsafe as malicious or poorly generated code could cause values to exceed 2\ :sup:`32`.
For example, line numbers are represented by 32 bit values internally. This is inefficient, given that modules almost never exceed a few thousand lines. Despite being inefficent, is is still vulnerable to overflow as it is easy for an attacker to created a module with billions of newline characters.
Memory access is usually a limiting factor in the performance of modern CPUs. Better packing of data structures enhances locality and reduces memory bandwith, at a modest increase in ALU usage (for shifting and masking). Being able to safely store important values in 20 bits would allow memory savings in several data structures including, but not limited to:
* Frame objects * Object headers * Code objects
There is also the potential for a more efficient instruction format, speeding up interpreter dispatch.
Rationale =========
Imposing a limit on values such as lines of code in a module, and the number of local variables, has significant advantages for ease of implementation and efficiency of virtual machines. If the limit is sufficiently large, there is no adverse effect on users of the language.
By selecting a fixed but large limit for these values, it is possible to have both safety and efficiency whilst causing no inconvience to human programmers and only very rare problems for code generators.
One million -----------
The Java Virtual Machine (JVM) [1]_ specifies a limit of 2\ :sup:`16`-1 (65535) for many program elements similar to those covered here. This limit enables limited values to fit in 16 bits, which is a very efficient machine representation. However, this limit is quite easily exceeded in practice by code generators and the author is aware of existing Python code that already exceeds 2\ :sup:`16` lines of code.
A limit of one million fits into 20 bits which, although not as convenient for machine representation, is still reasonably compact. Three signed values in the range -1_000_000 to +1_000_000 can fit into a 64 bit word. A limit of one million is small enough for efficiency advantages (only 20 bits), but large enough not to impact users (no one has ever written a module of one million lines).
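As a rough illustration of that packing claim (not part of the PEP; the 21-bit two's-complement field width is an assumption chosen to cover the signed range), three such values can be packed into a single 64 bit word like this:

    MASK21 = (1 << 21) - 1          # 21 bits comfortably holds -1_000_000..+1_000_000

    def pack3(a, b, c):
        # Pack three signed values into one 64-bit word (3 * 21 = 63 bits).
        assert all(-1_000_000 <= v <= 1_000_000 for v in (a, b, c))
        return ((a & MASK21) << 42) | ((b & MASK21) << 21) | (c & MASK21)

    def unpack3(word):
        def signed(u):
            # Undo the two's-complement wrap of a 21-bit field.
            return u - (1 << 21) if u >= (1 << 20) else u
        return (signed((word >> 42) & MASK21),
                signed((word >> 21) & MASK21),
                signed(word & MASK21))

    assert unpack3(pack3(-1_000_000, 0, 999_999)) == (-1_000_000, 0, 999_999)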
The value "one million" is very easy to remember.
Isn't this "640K ought to be enough for anybody" again? -------------------------------------------------------
The infamous 640K memory limit was a limit on machine usable resources. The proposed one million limit is a limit on human generated code.
While it is possible that generated code could exceed the limit, it is easy for a code generator to modify its output to conform. The author has hit the 64K limit in the JVM on at least two occasions when generating Java code. The workarounds were relatively straightforward and probably wouldn't have been necessary with a limit of one million bytecodes or lines of code.
Specification =============
This PEP proposes that the following language features and runtime values be limited to one million.
* The number of source code lines in a module.
* The number of bytecode instructions in a code object.
* The sum of local variables and stack usage for a code object.
* The number of distinct names in a code object.
* The number of constants in a code object.
* The number of classes in a running interpreter.
* The number of live coroutines in a running interpreter.
The advantages for CPython of imposing these limits: ----------------------------------------------------
Line of code in a module and code object restrictions. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When compiling source code to bytecode or modifying bytecode for profiling or debugging, an intermediate form is required. By limiting line numbers and operands to 20 bits, instructions can be represented in a compact 64 bit form allowing very fast passes over the instruction sequence.
Having 20 bit operands (21 bits for relative branches) allows instructions to fit into 32 bits without needing additional ``EXTENDED_ARG`` instructions. This improves dispatch, as the operand is strictly local to the instruction. Using super-instructions would make the 32 bit format almost as compact as the 16 bit format, and significantly faster.
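A minimal sketch of the kind of fixed-width encoding this enables (purely illustrative; the 8-bit opcode / 20-bit operand split is an assumption, not CPython's actual wordcode format):

    OPERAND_BITS = 20
    OPERAND_MASK = (1 << OPERAND_BITS) - 1

    def encode(opcode, operand):
        # One instruction per 32-bit word: no EXTENDED_ARG chaining is needed
        # as long as operands stay below 2**20.
        assert 0 <= opcode < 256 and 0 <= operand <= OPERAND_MASK
        return (opcode << OPERAND_BITS) | operand

    def decode(word):
        return word >> OPERAND_BITS, word & OPERAND_MASK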
Total number of classes in a running interpreter ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This limit has the potential to reduce the size of object headers considerably.
Currently objects have a two word header, for objects without references (int, float, str, etc.) or a four word header for objects with references. By reducing the maximum number of classes, the space for the class reference can be reduced from 64 bits to fewer than 32 bits allowing a much more compact header.
For example, a super-compact header format might look like this:
.. code-block::
   struct header {
       uint32_t gc_flags:6;    /* Needs finalisation, might be part of a cycle, etc. */
       uint32_t class_id:26;   /* Can be efficiently mapped to address by ensuring suitable alignment of classes */
       uint32_t refcount;      /* Limited memory or saturating */
   };
This format would reduce the size of a Python object without slots, on a 64 bit machine, from 40 to 16 bytes.
Note that there are two ways to use a 32 bit refcount on a 64 bit machine. One is to limit each sub-interpreter to 32Gb of memory. The other is to use a saturating reference count, which would be a little bit slower, but allow unlimited memory allocation.
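A sketch of what "saturating" means here (illustrative only): once the count reaches its maximum it sticks there, so such an object is simply never freed, trading a bounded leak for overflow safety:

    RC_MAX = (1 << 32) - 1   # assumed 32-bit reference count field

    def incref(rc):
        return rc if rc == RC_MAX else rc + 1

    def decref(rc):
        # A saturated count never decreases, so the object becomes immortal.
        return rc if rc == RC_MAX else rc - 1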
Backwards Compatibility =======================
It is hypothetically possible that some machine generated code exceeds one or more of the above limits. The author believes that to be highly unlikely and easily fixed by modifying the output stage of the code generator.
Security Implications =====================
Minimal. This reduces the attack surface of any Python virtual machine by a small amount.
Reference Implementation ========================
None, as yet. This will be implemented in CPython, once the PEP has been accepted.
Rejected Ideas ==============
None, as yet.
Open Issues ===========
None, as yet.
References ==========
.. [1] The Java Virtual Machine specification https://docs.oracle.com/javase/specs/jvms/se8/jvms8.pdf
Copyright =========
This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.
Overall I *like* the idea of limits... *But...* in my experience, limits like this tend to impact generated source code or generated bytecode, and thus any program that transitively uses those.
Hard limits within the Javaish world have been *a major pain* on the Android platform for example. I wouldn't call workarounds straightforward when it comes to the total number of classes or methods in a process.
If we're to adopt limits where there were previously none, we need to do it via a multi-release deprecation cycle feedback loop to give people a way to find and report use cases that exceed the limits in real world practical applications, so the limits can be reconsidered or the recommended workarounds tested and agreed upon.
-gps
On Wed, Dec 4, 2019 at 1:31 PM Gregory P. Smith <greg@krypto.org> wrote:
Overall I like the idea of limits... But... in my experience, limits like this tend to impact generated source code or generated bytecode, and thus any program that transitively uses those.
Overall, I *dislike* the idea of limits, but accept them when there's a true and demonstrable benefit :)
I've worked with a lot of systems - languages (or interpreters/compilers), file formats, etc, etc, etc - that have arbitrary limits in them. The usual problem is that, a few years down the track, what used to be "wow that's crazy huge" becomes "ugh now I'm hitting this silly limit". For instance, PostgreSQL is limited to 1600 columns per table. Is that an insanely high limit that you'll never hit, or an actual real limiting factor?
Integer sizes are a classic example of this. Is it acceptable to limit your integers to 2^16? 2^32? 2^64? Python made the choice to NOT limit its integers, and I haven't heard of any non-toy examples where an attacker causes you to evaluate 2**2**100 and eats up all your RAM. OTOH, being able to do arbitrary precision arithmetic and not worry about an arbitrary limit to your precision is a very good thing.
IMO the limit is, in itself, a bad thing. If it's guaranteeing that some exploit can't bring down your system, sure. If it permits a significant and measurable performance benefit, sure. But the advantage isn't the limit itself - it's what the limit enables. Which I'd like to see more evidence of. :)
ChrisA
On Wed, Dec 04, 2019 at 01:47:53PM +1100, Chris Angelico wrote:
Integer sizes are a classic example of this. Is it acceptable to limit your integers to 2^16? 2^32? 2^64? Python made the choice to NOT limit its integers, and I haven't heard of any non-toy examples where an attacker causes you to evaluate 2**2**100 and eats up all your RAM.
Do self-inflicted attacks count? I've managed to bring down a production machine, causing data loss, *twice* by thoughtlessly running something like 10**100**100 at the interactive interpreter. (Neither case was a server, just a desktop machine, but the data loss was still very real.)
OTOH, being able to do arbitrary precision arithmetic and not worry about an arbitrary limit to your precision is a very good thing.
I'll remind you of Guido's long-ago experience with ABC, which used arbitrary precision rationals (fractions) as their numeric type. That sounds all well and good, until you try doing a bunch of calculations and your numbers start growing to unlimited size. Do you really want a hundred billion digits of precision for a calculation based on measurements made to one decimal place? -- Steven
On Wed, Dec 4, 2019 at 3:16 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Dec 04, 2019 at 01:47:53PM +1100, Chris Angelico wrote:
Integer sizes are a classic example of this. Is it acceptable to limit your integers to 2^16? 2^32? 2^64? Python made the choice to NOT limit its integers, and I haven't heard of any non-toy examples where an attacker causes you to evaluate 2**2**100 and eats up all your RAM.
Does self-inflicted attacks count? I've managed to bring down a production machine, causing data loss, *twice* by thoughtlessly running something like 10**100**100 at the interactive interpreter. (Neither case was a server, just a desktop machine, but the data loss was still very real.)
Hmm, and you couldn't Ctrl-C it? I tried and was able to. There ARE a few situations where I'd rather get a simple and clean MemoryError than have it drive my system into the swapper, but there are at least as many situations where you'd rather be able to use virtual memory instead of being forced to manually break a job up. But even there, you can't enshrine a limit in the language definition, since the actual threshold depends on the running system. (And can be far better enforced externally, at least on a Unix-like OS.)
OTOH, being able to do arbitrary precision arithmetic and not worry about an arbitrary limit to your precision is a very good thing.
I'll remind you of Guido's long-ago experience with ABC, which used arbitrary precision rationals (fractions) as their numeric type. That sounds all well and good, until you try doing a bunch of calculations and your numbers start growing to unlimited size. Do you really want a hundred billion digits of precision for a calculation based on measurements made to one decimal place?
Sometimes, yes! But if I don't, it's more likely that I want to choose a limit within the program, rather than run into a hard limit defined by the language. I've done a lot of work with fractions.Fraction and made good use of its immense precision. The Python float type gives a significant tradeoff in terms of performance vs precision. But decimal.Decimal lets you choose exactly how much precision to retain, rather than baking it into the language as "no more than 1,000,000 digits of precision, ever". The solution to "do you really want a hundred billion digits of precision" is "use the decimal context to choose", not "hard-code a limit". The value of the hard-coded limit in a float is that floats are way WAY faster than Decimals. ChrisA
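For reference, the context mechanism Chris is pointing at is already in the standard library; the program, not the language, picks the precision:

    from decimal import Decimal, getcontext

    getcontext().prec = 50               # keep 50 significant digits
    print(Decimal(1) / Decimal(7))       # 1/7 to 50 digits, no language-wide cap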
On Wed, 4 Dec 2019 at 05:41, Chris Angelico <rosuav@gmail.com> wrote:
On Wed, Dec 4, 2019 at 3:16 PM Steven D'Aprano <steve@pearwood.info> wrote:
On Wed, Dec 04, 2019 at 01:47:53PM +1100, Chris Angelico wrote:
Integer sizes are a classic example of this. Is it acceptable to limit your integers to 2^16? 2^32? 2^64? Python made the choice to NOT limit its integers, and I haven't heard of any non-toy examples where an attacker causes you to evaluate 2**2**100 and eats up all your RAM.
Does self-inflicted attacks count? I've managed to bring down a production machine, causing data loss, *twice* by thoughtlessly running something like 10**100**100 at the interactive interpreter. (Neither case was a server, just a desktop machine, but the data loss was still very real.)
Hmm, and you couldn't Ctrl-C it? I tried and was able to.
I don't know if this is OS-dependent but I think that maybe there has been an improvement in recent CPython (3.8?) for using Ctrl-C in these cases. Certainly in the past I've seen situations where creating an absurdly large integer cannot be interrupted before it is too late and the system needs a hard reboot. This is actually a common source of bugs in SymPy e.g.: https://github.com/sympy/sympy/issues/17609#issuecomment-531327039
Those bugs in SymPy can be fixed in SymPy, which is uniquely in a position to be able to represent large exponent operations without actually evaluating them in dense integer format. I would have thought though that on the spectrum of Python usage SymPy would be very much at the end that really wants to use enormous integers, so the fact that it needs to be limited there makes me wonder who does really want to evaluate them.
Note that CPython's implementation of large integers is not as optimised as gmp, so I think that if someone was using Python for incredibly large integer calculations then they would be well advised not to use plain int in their calculations anyway (SymPy will try to use gmpy/gmpy2 if available).
There ARE a few situations where I'd rather get a simple and clean MemoryError than have it drive my system into the swapper, but there are at least as many situations where you'd rather be able to use virtual memory instead of being forced to manually break a job up. But even there, you can't enshrine a limit in the language definition, since the actual threshold depends on the running system. (And can be far better enforced externally, at least on a Unix-like OS.)
Another possibility is to have a configurable limit like the recursion limit so that users can increase it when they want to. The default limit can be something larger than most people would ever want but small enough that on typical hardware you can't bork the system in a single arithmetic operation. Then the default level and configurability of the limit can be implementation defined. -- Oscar
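The recursion limit mentioned above already works like that and could serve as the model for any new limit: a modest default that a program can raise (or lower) explicitly:

    import sys

    print(sys.getrecursionlimit())   # 1000 by default in CPython
    sys.setrecursionlimit(10_000)    # opt in to a higher limit when you know you need it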
Abdur-Rahmaan Janhangeer http://www.pythonmembers.club | https://github.com/Abdur-rahmaanJ Mauritius On Wed, 4 Dec 2019, 06:52 Chris Angelico, <rosuav@gmail.com> wrote:
Python made the choice to NOT limit its integers, and I haven't heard of any non-toy examples where an attacker causes you to evaluate 2**2**100 and eats up all your RAM.
This happened with an IRC bot of mine that allowed calculations (hey, you need to keep rolling out new features). Someone crashed the bot with a calculation that overstepped that limit (that was the day I learnt about that feature of Python).
On 04/12/2019 2:31 am, Gregory P. Smith wrote:
On Tue, Dec 3, 2019 at 8:21 AM Mark Shannon <mark@hotpy.org <mailto:mark@hotpy.org>> wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
[snip]
Overall I /like/ the idea of limits... /But.../ in my experience, limits like this tend to impact generated source code or generated bytecode, and thus any program that transitively uses those.
Hard limits within the Javaish world have been *a major pain* on the Android platform for example. I wouldn't call workarounds straightforward when it comes to total number of classes or methods in a process.
Do you have any numbers? 1M is a lot bigger than 64K, but real world numbers would be helpful.
If we're to adopt limits where there were previously none, we need to do it via a multi-release deprecation cycle feedback loop to give people a way to find report use cases that exceed the limits in real world practical applications. So the limits can be reconsidered or the recommended workarounds tested and agreed upon.
-gps
On Thu, Dec 5, 2019, 5:53 PM Mark Shannon <mark@hotpy.org> wrote:
Do you have any numbers? 1M is a lot bigger than 64K, but real world numbers would be helpful.
I guess the relevant case in question is with Facebook patching the limit of 65,000 classes in Android : https://m.facebook.com/notes/facebook-engineering/under-the-hood-dalvik-patc...
On 05/12/2019 12:45 pm, Karthikeyan wrote:
On Thu, Dec 5, 2019, 5:53 PM Mark Shannon <mark@hotpy.org <mailto:mark@hotpy.org>> wrote:
Do you have any numbers? 1M is a lot bigger than 64K, but real world numbers would be helpful.
I guess the relevant case in question is with Facebook patching the limit of 65,000 classes in Android : https://m.facebook.com/notes/facebook-engineering/under-the-hood-dalvik-patc...
Is that the correct link? That seems to be an issue with an internal buffer size, not the limit on the number of classes.
On Thu, Dec 5, 2019 at 6:23 PM Mark Shannon <mark@hotpy.org> wrote:
Is that the correct link? That seems to be an issue with an internal buffer size, not the limit on the number of classes.
Sorry, it should have been about the number-of-methods limit in Android, which is around 65,000 methods: https://developer.android.com/studio/build/multidex . I guess Facebook worked around the limit but I couldn't find a reliable source for it. There was also a post on the Facebook iOS app having a very large number of classes, though not actually hitting a limit on the iOS platform: https://quellish.tumblr.com/post/126712999812/how-on-earth-the-facebook-ios-... . I guess that's the number being referred to, but I could be mistaken.
-- Regards, Karthikeyan S
On Tue, Dec 3, 2019 at 8:20 AM Mark Shannon <mark@hotpy.org> wrote:
The Python language does not specify limits for many of its features. Not having any limit to these values seems to enhance programmer freedom, at least superficially, but in practice the CPython VM and other Python virtual machines have implicit limits or are forced to assume that the limits are astronomical, which is expensive.
The basic idea makes sense to me. Well-defined limits that can be supported properly are better than vague limits that are supported by wishful thinking.
This PR lists a number of features which are to have a limit of one million. If a language feature is not listed but appears unlimited and must be finite, for physical reasons if no other, then a limit of one million should be assumed.
This language is probably too broad... for example, there's certainly a limit on how many objects can be alive at the same time due to the physical limits of memory, but that limit is way higher than a million.
This PR proposes that the following language features and runtime values be limited to one million.
* The number of source code lines in a module * The number of bytecode instructions in a code object. * The sum of local variables and stack usage for a code object. * The number of distinct names in a code object * The number of constants in a code object.
These are all attributes of source files, so sure, a million is plenty, and the interpreter spends a ton of time manipulating tables of these things.
* The number of classes in a running interpreter.
This one isn't as obvious to me... classes are basically just objects of type 'type', and there is definitely code out there that creates classes dynamically. A million still seems like a lot, and I'm not saying I'd *recommend* a design that involves creating millions of different type objects, but it might exist already.
* The number of live coroutines in a running interpreter.
I don't get this one. I can't think of any motivation (the interpreter doesn't track live coroutines differently from any other object), and the limit seems dangerously low. A million coroutines only requires a few gigabytes of RAM, and there are definitely people who run single process systems with >1e6 concurrent tasks (random example: https://goroutines.com/10m). I don't know if there's anyone doing this in Python right now, due to Python's performance limitations, but it's nowhere near as silly as a function with a million local variables.
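For a sense of scale, a toy script along these lines (illustrative only, and assuming a few gigabytes of free RAM) holds over a million live coroutines at once without doing anything exotic:

    import asyncio

    async def worker(i):
        await asyncio.sleep(60)          # stays alive, i.e. a "live coroutine"

    async def main():
        # ~1.1 million concurrent tasks, comfortably over the proposed limit.
        await asyncio.gather(*(worker(i) for i in range(1_100_000)))

    asyncio.run(main())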
Total number of classes in a running interpreter ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This limit has the potential to reduce the size of object headers considerably.
Currently objects have a two word header, for objects without references (int, float, str, etc.) or a four word header for objects with references. By reducing the maximum number of classes, the space for the class reference can be reduced from 64 bits to fewer than 32 bits allowing a much more compact header.
For example, a super-compact header format might look like this:
.. code-block::
   struct header {
       uint32_t gc_flags:6;    /* Needs finalisation, might be part of a cycle, etc. */
       uint32_t class_id:26;   /* Can be efficiently mapped to address by ensuring suitable alignment of classes */
       uint32_t refcount;      /* Limited memory or saturating */
   };
This format would reduce the size of a Python object without slots, on a 64 bit machine, from 40 to 16 bytes.
In this example, I can't figure out how you'd map your 26 bit class_id to a class object. On a 32-bit system it would be fine, you just need 64 byte alignment, but you're talking about 64-bit systems, so... I know you aren't suggesting classes should have 2**(64 - 26) = ~3x10**11 byte alignment :-) -n -- Nathaniel J. Smith -- https://vorpus.org
* The number of classes in a running interpreter.
* The number of live coroutines in a running interpreter.
These two can be (and coroutines always are) dynamically generated - and it is not hard to imagine scenarios where one million for these would easily be beaten. I don't know the data structures needed for those, but it would be much saner to keep both limited to 2**32 (for the reasons about artificially limiting the word length put forward by others), or at least a much higher count.
I am overwhelmed by this thread (and a few other things in real life) but here are some thoughts.
1. It seems the PEP doesn't sufficiently show that there is a problem to be solved. There are claims of inefficiency but these aren't substantiated and I kind of doubt that e.g. representing line numbers in 32 bits rather than 20 bits is a problem.
2. I have handled complaints in the past about existing (accidental) limits that caused problems for generated code. People occasionally generate *really* wacky code (IIRC the most recent case was a team that was generating Python code from machine learning models they had developed using other software) and as long as it works I don't want to limit such applications.
3. Is it easy to work around a limit? Even if it is, it may be a huge pain. I've heard of a limit of 65,000 methods in Java on Android, and my understanding was that it was actually a huge pain for both the toolchain maintainers and app developers (IIRC the toolchain had special tricks to work around it, but those required app developers to change their workflow). Yes, 65,000 is a lot smaller than a million, but in a different context the same concern applies.
4. What does Python currently do if you approach or exceed one of these limits? I tried a simple experiment, eval(str(list(range(2000000)))), and this completes in a few seconds, even though the source code is a single 16 Mbyte-long line.
5. On the other hand, the current parser cannot handle more than 100 nested parentheses, and I've not heard complaints about this. I suspect the number of nested indent levels is similarly constrained by the parser. The default function call recursion limit is set to 1000 and bumping it significantly risks segfaults. So clearly some limits exist and are apparently acceptable.
6. In Linux and other UNIX-y systems, there are many per-process or per-user limits, and they can be tuned -- the user (using sudo) can change many of those limits, the sysadmin can change the defaults within some range, and sometimes the kernel can be recompiled with different absolute limits (not an option for most users or even sysadmins). These limits are also quite varied -- the maximum number of open file descriptors is different than the maximum pipe buffer size. This is of course as it should be -- the limits exist to protect the OS and other users/processes from runaway code and intentional attacks on resources. (And yet, fork bombs exist, and it's easy to fill up a filesystem...) I take from this that limits are useful, may have to be overridable, and should have values that make sense given the resource they guard.
--Guido van Rossum (python.org/~guido)
Hi Guido, On 04/12/2019 3:51 pm, Guido van Rossum wrote:
I am overwhelmed by this thread (and a few other things in real life) but here are some thoughts.
1. It seems the PEP doesn't sufficiently show that there is a problem to be solved. There are claims of inefficiency but these aren't substantiated and I kind of doubt that e.g. representing line numbers in 32 bits rather than 20 bits is a problem.
Fundamentally this is not about the immediate performance gains, but about the potential gains from not having to support huge, vaguely defined limits that are never needed in practice. Regarding line numbers, decoding the line number table for exception tracebacks, profiling and debugging is expensive and the cost is linear in the size of the code object. So, the performance benefit would be largest for the code that is nearest to the limits.
2. I have handled complaints in the past about existing (accidental) limits that caused problems for generated code. People occasionally generate *really* wacky code (IIRC the most recent case was a team that was generating Python code from machine learning models they had developed using other software) and as long as it works I don't want to limit such applications.
The key word here is "occasionally". How much do we want to increase the costs of every Python user for the very rare code generator that might bump into a limit?
3. Is it easy to work around a limit? Even if it is, it may be a huge pain. I've heard of a limit of 65,000 methods in Java on Android, and my understanding was that it was actually a huge pain for both the toolchain maintainers and app developers (IIRC the toolchain had special tricks to work around it, but those required app developers to change their workflow). Yes, 65,000 is a lot smaller than a million, but in a different context the same concern applies.
64k *methods* is much, much less than 1M *classes*. At 6 methods per class, it is 100 times less. The largest Python code bases, that I am aware of, are at JP Morgan, with something like 36M LOC and Bank of America with a similar number. Assuming a code base of 50M loc, *and* that all the code would be loaded into a single application (I sincerely hope that isn't the case) *and* that each class is only 100 lines, even then there would only be 500,000 classes. If a single application has 500k classes, I don't think that a limit of 1M classes would be its biggest problem :)
4. What does Python currently do if you approach or exceed one of these limits? I tried a simple experiment, eval(str(list(range(2000000)))), and this completes in a few seconds, even though the source code is a single 16 Mbyte-long line.
You can have lines as long as you like :)
5. On the other hand, the current parser cannot handle more than 100 nested parentheses, and I've not heard complaints about this. I suspect the number of nested indent levels is similarly constrained by the parser. The default function call recursion limit is set to 1000 and bumping it significantly risks segfaults. So clearly some limits exist and are apparently acceptable.
6. In Linux and other UNIX-y systems, there are many per-process or per-user limits, and they can be tuned -- the user (using sudo) can change many of those limits, the sysadmin can change the defaults within some range, and sometimes the kernel can be recompiled with different absolute limits (not an option for most users or even sysadmins). These limits are also quite varied -- the maximum number of open file descriptors is different than the maximum pipe buffer size. This is of course as it should be -- the limits exist to protect the OS and other users/processes from runaway code and intentional attacks on resources. (And yet, fork bombs exist, and it's easy to fill up a filesystem...) I take from this that limits are useful, may have to be overridable, and should have values that make sense given the resource they guard.
Being able to dynamically *reduce* a limit from one million seems like a good idea.
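One hypothetical shape for that knob (names invented here purely for illustration; nothing like this is specified in the PEP or in CPython) would be a per-interpreter limit that can only be tightened, never raised above the built-in maximum:

    HARD_MAX = 1_000_000
    _limits = {"classes": HARD_MAX, "coroutines": HARD_MAX}

    def set_limit(name, value):
        # Allow reducing a limit only; the compiled-in maximum stays authoritative.
        if not 0 < value <= HARD_MAX:
            raise ValueError(f"{name} limit must be in 1..{HARD_MAX}")
        _limits[name] = value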
Assuming a code base of 50M loc, *and* that all the code would be loaded into a single application (I sincerely hope that isn't the case) *and* that each class is only 100 lines, even then there would only be 500,000 classes. If a single application has 500k classes, I don't think that a limit of 1M classes would be its biggest problem :)
It is more like 1 million calls to `type` adding some linear combination of attributes to a base class. Think of a persistently running server that creates dynamic named tuples lazily. (I am working on code that does that, but currently with 5-6 attributes - that gives me up to 64 classes - whereas with 20 attributes the code could hit that limit, if one used the lib in a persistent server, that is :-) ) Anyway, not happening soon - I am just writing to say that one million classes does not mean 1 million hard-coded 100 LoC classes; rather, it is 1 million calls to "namedtuple".
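If such a server ever did approach the limit, a sketch of one workaround (helper names invented here) is to create one namedtuple class per distinct attribute set instead of one per call:

    from collections import namedtuple
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def record_class(field_names):
        # One class per distinct tuple of field names, however many records exist.
        return namedtuple("Record", field_names)

    def make_record(**attributes):
        cls = record_class(tuple(sorted(attributes)))
        return cls(**attributes)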
On Tue, Dec 3, 2019 at 6:23 PM Mark Shannon <mark@hotpy.org> wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
My two cents:
I find the arguments for security (malicious code) and ease of implementation compelling. I find the point about efficiency, on the other hand, to be almost a red herring in this case. In other words, there are great reasons to consider this regardless of efficiency, and IMO those should guide this.
I do find the 1 million limit low, especially for bytecode instructions and lines of code. I think 1 billion / 2^32 / 2^31 (we can choose the bikeshed color later) would be much more reasonable, and the effect on efficiency compared to 2^20 should be negligible.
I like the idea of making these configurable, e.g. adding a compilation option to increase the limit to 10^18 / 2^64 / 2^63.
Mark, I say go for it, write the draft PEP, and try to get a wider audience to tell whether they know of cases where these limits would have been hit.
- Tal Einat
Hi, On 05/12/2019 12:54 pm, Tal Einat wrote:
On Tue, Dec 3, 2019 at 6:23 PM Mark Shannon <mark@hotpy.org <mailto:mark@hotpy.org>> wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
My two cents:
I find the arguments for security (malicious code) and ease of implementation compelling. I find the point about efficiency, on the other hand, to be almost a red herring in this case. In other words, there are great reasons to consider this regardless of efficiency, and IMO those should guide this.
I do find the 1 million limit low, especially for bytecode instructions and lines of code. I think 1 billion / 2^32 / 2^31 (we can choose the bikeshed color later) would be much more reasonable, and the effect on efficiency compared to 2^20 should be negligible.
The effect of changing the bytecode size limit from 2^20 to 2^31 would be significant. Bytecode must incorporate jumps, and a lower limit on bytecode size means that a fixed size encoding is possible. A fixed size encoding simplifies and localizes opcode decoding, which impacts the critical path of the interpreter.
I like the idea of making these configurable, e.g. adding a compilation option to increase the limit to 10^18 / 2^64 / 2^63.
While theoretically possible, it would be awkward to implement and very hard to test effectively.
Mark, I say go for it, write the draft PEP, and try to get a wider audience to tell whether they know of cases where these limits would have been hit.
- Tal Einat
Although I am cautiously and tentatively in favour of setting limits if the benefits Mark suggests are correct, I have thought of at least one case where a million classes may not be enough.
I've seen people write code like this:
    for attributes in list_of_attributes:
        obj = namedtuple("Spam", "fe fi fo fum")(*attributes)
        values.append(obj)
not realising that every obj is a singleton instance of a unique class. They might end up with a million dynamically created classes, each with a single instance, when what they wanted was a single class with a million instances.
Could there be people doing this deliberately? If so, it must be nice to have so much RAM that we can afford to waste it so prodigiously: a namedtuple with ten items uses 64 bytes, but the associated class uses 444 bytes, plus the sizes of the methods etc. But I suppose there could be a justification for such a design.
(Quoted sizes on my system running 3.5; YMMV.)
-- Steven
On Fri, 6 Dec 2019 at 09:33, Steven D'Aprano <steve@pearwood.info> wrote:
Although I am cautiously and tentatively in favour of setting limits if the benefits Mark suggests are correct, I have thought of at least one case where a million classes may not be enough.
I've seen people write code like this:
for attributes in list_of_attributes:
    obj = namedtuple("Spam", "fe fi fo fum")(*attributes)
    values.append(obj)
not realising that every obj is a singleton instance of a unique class. They might end up with a million dynamically created classes, each with a single instance, when what they wanted was a single class with a million instances.
But isn't that the point here? A limit would catch this and prompt them to rewrite the code as
    cls = namedtuple("Spam", "fe fi fo fum")
    for attributes in list_of_attributes:
        obj = cls(*attributes)
        values.append(obj)
Could there be people doing this deliberately? If so, it must be nice to have so much RAM that we can afford to waste it so prodigiously: a namedtuple with ten items uses 64 bytes, but the associated class uses 444 bytes, plus the sizes of the methods etc. But I suppose there could be a justification for such a design.
You're saying that someone might have a justification for deliberately creating a million classes, based on an example that on the face of it is a programmer error (creating multiple classes when a single shared class would be better) and presuming that there *might* be a reason why this isn't an error? Well, yes - but I could just as easily say that someone might have a justification for creating a million classes in one program, and leave it at that. Without knowing (roughly) what the justification is, there's little we can take from this example. Having said that, I don't really have an opinion on this change. Basically, I feel that it's fine, as long as it doesn't break any of my code (which I can't imagine it would) but that's not very helpful! https://xkcd.com/1172/ ("Every change breaks someone's workflow") comes to mind here. Paul
On Fri, Dec 6, 2019 at 12:14 PM Paul Moore <p.f.moore@gmail.com> wrote:
But isn't that the point here? A limit would catch this and prompt them to rewrite the code as
cls = namedtuple("Spam", "fe fi fo fum")
for attributes in list_of_attributes:
    obj = cls(*attributes)
    values.append(obj)
This assumes two things: you actually hit the limit as you are testing or developing the code (how likely is that?), and the person hitting the limit has control over the code that does this. If it's in a library where it's usually not a problem, a library you don't directly control but that you're using in a way that triggers the limit -- for example, mixing the library with *other* library code you don't control -- the limit is just a burden.
-- Thomas Wouters <thomas@python.org> Hi! I'm an email virus! Think twice before sending your email to help me spread!
On Fri, Dec 06, 2019 at 11:10:49AM +0000, Paul Moore wrote: [...]
They might end up with a million dynamically created classes, each with a single instance, when what they wanted was a single class with a million instances.
But isn't that the point here? A limit would catch this and prompt them to rewrite the code as
cls = namedtuple("Spam", "fe fi fo fum")
for attributes in list_of_attributes:
    obj = cls(*attributes)
    values.append(obj)
Indeed, and maybe in the long term their code would be better for it, but in the meantime code that was working is now broken. It's a backwards-incompatible change. This PEP isn't about forcing people to write better code against their wishes :-)
This leads me to conclude that:
(1) Regardless of what we do for the other resources, "number of classes" may have to be excluded from the PEP.
(2) Any such limit on classes needs to be bumped up.
(3) Or we need a deprecation period before adding a limit: in release 3.X, the interpreter raises a *warning* when the number of classes reaches a million; in release 3.X+2 or X+5 or whenever, that gets changed to an error. It will need a long deprecation period for the reasons Thomas mentions: the person seeing the warnings might not be the developer who can do anything about it. We have to give people plenty of time to see the warnings and hassle the developers into fixing it.
For classes, it might be better for the PEP to increase the desired limit from a million to, let's say, 2**32 (4 billion). Most people are going to run into other limits before they hit that: a bare class inheriting from type uses about 500 bytes on a 32-bit system, and twice that on a 64-bit system:
    py> sys.getsizeof(type('Klass', (), {}))
    1056
so a billion classes uses about a terabyte. In comparison, Windows 10 Home only supports 128 GB memory in total, and while Windows Server 2016 supports up to 26 TB, we surely can agree that we aren't required to allow Python scripts to fill the entire RAM with nothing but classes :-)
I think Mark may have been too optimistic to hope for a single limit that is suitable for all seven of his listed resources:
* The number of source code lines in a module.
* The number of bytecode instructions in a code object.
* The sum of local variables and stack usage for a code object.
* The number of distinct names in a code object.
* The number of constants in a code object.
* The number of classes in a running interpreter.
* The number of live coroutines in a running interpreter.
A million seems reasonable for lines of source code, if we're prepared to tell people using machine generated code to split their humongous .py files into multiple scripts. A small imposition on a small subset of Python users, for the benefit of all. I'm okay with that.
Likewise, I guess a million is reasonable for the next four resources, but I'm not an expert and my guess is probably worthless :-)
A million seems like it might be too low for the number of classes; perhaps 2**32 is acceptable. And others have suggested that a million is too low for coroutines.
Could there be people doing this deliberately? If so, it must be nice to have so much RAM that we can afford to waste it so prodigiously: a namedtuple with ten items uses 64 bytes, but the associated class uses 444 bytes, plus the sizes of the methods etc. But I suppose there could be a justification for such a design.
You're saying that someone might have a justification for deliberately creating a million classes,
Yes. I *personally* cannot think of an example that wouldn't (in my opinion) be better written another way, but I don't think I'm quite knowledgeable enough to categorically state that ALL such uses are bogus. -- Steven
On Sat, 7 Dec 2019 at 06:29, Steven D'Aprano <steve@pearwood.info> wrote:
A million seems reasonable for lines of source code, if we're prepared to tell people using machine generated code to split their humongous .py files into multiple scripts. A small imposition on a small subset of Python users, for the benefit of all. I'm okay with that.
I recently hit on a situation that created a one million line code file: https://github.com/pytest-dev/pytest/issues/4406#issuecomment-439629715

The original file (which is included in SymPy) has 3000 lines averaging 500 characters per line so that the total file is 1.5MB. Since it is a test file, pytest rewrites the corresponding pyc file and adds extra lines to annotate the intermediate results in the large expressions. The pytest-rewritten code has just over a million lines.

When I first tried pytest with this file it led to a CPython segfault. It seems that the crash in CPython was fixed in 3.7.1 though, so subsequent versions can work fine with this (although it is slow).

The tests in the file are skipped anyway so I just made sure that the file was blacklisted in SymPy's pytest configuration.

-- Oscar
Hi Oscar, Thanks for the feedback. On 07/12/2019 7:37 pm, Oscar Benjamin wrote:
On Sat, 7 Dec 2019 at 06:29, Steven D'Aprano <steve@pearwood.info> wrote:
A million seems reasonable for lines of source code, if we're prepared to tell people using machine generated code to split their humongous .py files into multiple scripts. A small imposition on a small subset of Python users, for the benefit of all. I'm okay with that.
I recently hit on a situation that created a one million line code file: https://github.com/pytest-dev/pytest/issues/4406#issuecomment-439629715
The original file (which is included in SymPy) has 3000 lines averaging 500 characters per line so that the total file is 1.5MB. Since it is a test file pytest rewrites the corresponding pyc file and adds extra lines to annotate the intermediate results in the large expressions. The pytest-rewritten code has just over a million lines.
There are two possible solutions here (in the context of PEP 611):

1. Split the original SymPy test file into two or more files and the test function into many smaller functions.
2. Up the line limit to two million and the bytecode limit to many million.

Note that changing pytest to output fewer lines won't work as we will just hit the bytecode limit instead.

1. Is this difficult? I wouldn't expect it to be so.
2. Simple, but with a performance impact.

The simplest solution appears to be to just up the limits, but the problem with that is any costs we incur are incurred by all Python programs forever. Fixing the test is a one off cost.
When I first tried pytest with this file it led to a CPython segfault. It seems that the crash in CPython was fixed in 3.7.1 though so subsequent versions can work fine with this (although it is slow).
The tests in the file are skipped anyway so I just made sure that the file was blacklisted in SymPy's pytest configuration.
-- Oscar
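Mark's point that fewer lines would just hit the bytecode limit instead can be seen with a toy comparison (this is only an illustration, not what pytest does; the generated statements are a made-up stand-in): packing the same statements onto fewer source lines does not reduce the number of bytecode instructions.

    import dis

    stmts = ["x%d = %d + %d" % (i, i, i) for i in range(1000)]
    many_lines = "\n".join(stmts)   # one statement per line
    one_line = "; ".join(stmts)     # the same statements crammed onto one line

    for name, src in [("many lines", many_lines), ("one line", one_line)]:
        code = compile(src, "<generated>", "exec")
        n_instr = sum(1 for _ in dis.get_instructions(code))
        print(name, ":", src.count("\n") + 1, "lines,", n_instr, "instructions")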
On Mon, 9 Dec 2019 at 14:10, Mark Shannon <mark@hotpy.org> wrote:
On 07/12/2019 7:37 pm, Oscar Benjamin wrote:
On Sat, 7 Dec 2019 at 06:29, Steven D'Aprano <steve@pearwood.info> wrote:
A million seems reasonable for lines of source code, if we're prepared to tell people using machine generated code to split their humongous .py files into multiple scripts. A small imposition on a small subset of Python users, for the benefit of all. I'm okay with that.
I recently hit on a situation that created a one million line code file: https://github.com/pytest-dev/pytest/issues/4406#issuecomment-439629715
The original file (which is included in SymPy) has 3000 lines averaging 500 characters per line so that the total file is 1.5MB. Since it is a test file pytest rewrites the corresponding pyc file and adds extra lines to annotate the intermediate results in the large expressions. The pytest-rewritten code has just over a million lines.
There are two possible solutions here (in the context of PEP 611)
1. Split the original SymPy test file into two or more files and the test function into many smaller functions.
In this particular situation I think that it isn't necessary for the file to be an imported .py file. It could be a newline delimited text file that is read by the test suite rather than imported. However if we are using this to consider the PEP then note that the file has orders of magnitude fewer lines than the one million limit proposed.
2. Up the line limit to two million and the bytecode limit to many million.
It sounds like a bytecode limit of one million is a lot more restrictive than a one million limit on lines.
Note that changing pytest to output fewer lines won't work as we will just hit the bytecode limit instead.
I'm not sure. I think that pytest should have some kind of limit on what it produces in this situation. The rewriting is just an optimistic attempt to produce more detailed information in the test failure traceback. There's no reason it can't just be disabled if it happens to produce overly long output. I think that point was briefly discussed in the pytest issue but there isn't a clear answer for how to define the limits. With the PEP it could have been a little clearer, e.g. something like "definitely don't produce more than a million lines". In that sense these limits can be useful for people doing code generation. -- Oscar
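For example, a code generator could guard its output along these lines; this is purely hypothetical - LINE_LIMIT, emit_module and the fallback behaviour are made up for illustration, not anything pytest actually implements:

    LINE_LIMIT = 1_000_000  # hypothetical, matching the PEP's proposed limit

    def emit_module(rewritten_lines, original_source):
        # If the rewritten module would exceed the limit, fall back to the
        # plain, unannotated source rather than risk an uncompilable file.
        if len(rewritten_lines) > LINE_LIMIT:
            return original_source
        return "\n".join(rewritten_lines)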
On Tue, 10 Dec 2019 00:59:09 +0000 Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
Note that changing pytest to output fewer lines won't work as we will just hit the bytecode limit instead.
I'm not sure. I think that pytest should have some kind of limit on what it produces in this situation.
Agreed. I think pytest just needs to be fixed not to do this stupid thing. Regards Antoine.
On Sat, Dec 07, 2019 at 07:37:58PM +0000, Oscar Benjamin wrote:
I recently hit on a situation that created a one million line code file: https://github.com/pytest-dev/pytest/issues/4406#issuecomment-439629715
The original file (which is included in SymPy) has 3000 lines averaging 500 characters per line so that the total file is 1.5MB. Since it is a test file pytest rewrites the corresponding pyc file and adds extra lines to annotate the intermediate results in the large expressions. The pytest-rewritten code has just over a million lines.
If I'm reading you correctly, you're saying that, on average, pytest annotates each line of source code with over 300 additional lines of code.
When I first tried pytest with this file it led to a CPython segfault. It seems that the crash in CPython was fixed in 3.7.1 though so subsequent versions can work fine with this (although it is slow).
Thanks, this is a good practical anecdote of real-life experience. -- Steven
On Tue, 10 Dec 2019 at 00:00, Steven D'Aprano <steve@pearwood.info> wrote:
On Sat, Dec 07, 2019 at 07:37:58PM +0000, Oscar Benjamin wrote:
I recently hit on a situation that created a one million line code file: https://github.com/pytest-dev/pytest/issues/4406#issuecomment-439629715
The original file (which is included in SymPy) has 3000 lines averaging 500 characters per line so that the total file is 1.5MB. Since it is a test file pytest rewrites the corresponding pyc file and adds extra lines to annotate the intermediate results in the large expressions. The pytest-rewritten code has just over a million lines.
If I'm reading you correctly, you're saying that, on average, pytest annotates each line of source code with over 300 additional lines of code.
To be clear that's what happens with this particular file but is not otherwise typical of pytest. The idea is to rewrite something like

    assert f(x+y) == z

as

    tmp1 = x+y
    tmp2 = f(tmp1)
    tmp3 = z
    if tmp2 != tmp3:
        # Print information showing the intermediate expressions tmp1, tmp2, tmp3

This rewriting is normally useful and harmless but it explodes when used with complicated mathematical expressions like this: https://github.com/sympy/sympy/blob/d670689ae212c4f0ad4549eda17a111404694a27... -- Oscar
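A rough way to see why this explodes (this only estimates the size of such a rewrite, it is not pytest's actual algorithm, and the "big" expression is a made-up stand-in for the SymPy expressions) is to count the sub-expressions that would each need a temporary, using the ast module:

    import ast

    small = "f(x + y) == z"
    big = "(" + " + ".join("a%d*b%d" % (i, i) for i in range(500)) + ") == z"

    for label, expr in [("small", small), ("big", big)]:
        tree = ast.parse(expr, mode="eval")
        n_subexpr = sum(isinstance(node, ast.expr) for node in ast.walk(tree))
        print(label, ": ~", n_subexpr, "sub-expressions to annotate")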
One aspect of scripting is being able to throw something together to create a correct solution to an immediate problem. If the proprietary software that you script around takes over 300 Gigs to lay out a CPU, and delays are hugely expensive, then I don't want to waste time on optimisations to get around arbitrary limits. It sounds analogous to wanting to move to X-bit integers to save a little here and there.

If you are thinking of making limits then you might think of what limit, say, 4 terabytes of RAM would impose on the smallest object, and propose that. Remember, 2020 means trials of 2-chip, 256-thread AMD servers with 500 Gigs of RAM. Now if I can get Python sub-interpreters working on that! Bliss. 😊

On Fri, Dec 6, 2019, 9:37 AM Steven D'Aprano <steve@pearwood.info> wrote:
Although I am cautiously and tentatively in favour of setting limits if the benefits Mark suggests are correct, I have thought of at least one case where a million classes may not be enough.
I've seen people write code like this:
for attributes in list_of_attributes:
    obj = namedtuple("Spam", "fe fi fo fum")(*attributes)
    values.append(obj)
not realising that every obj is a singleton instance of a unique class. They might end up with a million dynamically created classes, each with a single instance, when what they wanted was a single class with a million instances.
Could there be people doing this deliberately? If so, it must be nice to have so much RAM that we can afford to waste it so prodigiously: a namedtuple with ten items uses 64 bytes, but the associated class uses 444 bytes, plus the sizes of the methods etc. But I suppose there could be a justification for such a design.
(Quoted sizes on my system running 3.5; YMMV.)
-- Steven
On 03/12/2019 17:15, Mark Shannon wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
The PEP: https://github.com/markshannon/peps/blob/one-million/pep-1000000.rst
Cheers, Mark.
Shortened the mail - as I want my comment to be short. There are many longish ones, and I have not gotten through them all. One guiding principle I learned from a professor (forgot his name sadly): a program has exactly zero (0) of something, one (1) of something, or infinitely many. The moment it gets set to X, the case for X+1 appears. Since we are not talking about zero or one - I guess my comment is: make sure it can be used to infinity. Regards, Michael. p.s. If this has already been suggested - my apologies for any noise.
* The number of live coroutines in a running interpreter: Implicitly limited by operating system limits until at least 3.11.

Does the O.S. limit anything on a coroutine? What for? As far as I know it is a minimal Python-only object, unless you have each coroutine holding a reference to a TCP socket - but that has nothing to do with Python's limits itself: a coroutine by itself is a small Python object with no external resources referenced - unlike a thread - and code with tens of thousands of coroutines can run perfectly well without a single glitch.

Of all the limits mentioned in the PEP, this is the one I find no reason to exist, and that could eventually lead to needless breaking of otherwise perfectly harmless code. (The limit on the number of classes also is strange for me, as I've written in other mails.)

On Fri, 6 Dec 2019 at 13:39, Michael <aixtools@felt.demon.nl> wrote:
On 03/12/2019 17:15, Mark Shannon wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
The PEP: https://github.com/markshannon/peps/blob/one-million/pep-1000000.rst
Cheers, Mark.
Shortened the mail - as I want my comment to be short. There are many longish ones, and have not gotten through them all.
One guiding principle I learned from a professor (forgot his name sadly).
A program has exactly - zero (0) of something, one (1) of something, or infinite. The moment it gets set to X, the case for X+1 appears.
Since we are not talking about zero, or one - I guess my comment is make sure it can be used to infinity.
Regards,
Michael
p.s. If this has already been suggested - my apologies for any noise.
On Sat, 7 Dec 2019 at 02:42, Michael <aixtools@felt.demon.nl> wrote:
A program has exactly - zero (0) of something, one (1) of something, or infinite. The moment it gets set to X, the case for X+1 appears.
Since we are not talking about zero, or one - I guess my comment is make sure it can be used to infinity.
I suspect the professor saying this hadn't worked on any industrial systems where it was critically important to degrade gracefully under load, or done much in the way of user experience design (which is often as much about managing the way things fail, to help guide users back towards the successful path, as it is about managing how the system behaves when things go well).

One of the first systems I ever designed involved allocating small modular audio processing components across a few dozen different digital signal processors. I designed that system so that the only limits on each DSP were the total amount of available memory and the number of audio inputs and outputs. Unfortunately, this turned out to be a mistake, as it made it next to impossible to design a smart scheduling engine, since we didn't have enough metadata about how much memory each component would need, nor enough live information about how much memory fragmentation each DSP was experiencing. So the management server resorted to a lot of "just try it and see if it works" logic, which made the worst case behaviour of the system under significant load incredibly hard to predict.

CPython's own recursion limit is a similar case - there's an absolute limit imposed by the C stack, where if we go over it, we'll get an unrecoverable failure (a segfault/memory access violation). So instead of doing that, we impose our own arbitrarily lower limit where we throw a *recoverable* error, before we hit the unrecoverable one.

So I'm broadly in favour of the general principle of the PEP. However, I also agree with the folks suggesting that the "10**6 for all the limits" approach may be *too* simplified.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
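Nick's point about the existing recursion limit is easy to see directly; a tiny example (the default limit and the exact error message vary between CPython versions):

    import sys

    print(sys.getrecursionlimit())   # typically 1000 by default

    def down(n):
        return down(n + 1)

    try:
        down(0)
    except RecursionError as exc:
        # CPython raises a recoverable error well before the C stack overflows.
        print("recovered:", exc)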
On 12/6/19 11:08 AM, Michael wrote:
On 03/12/2019 17:15, Mark Shannon wrote:
Hi Everyone,
I am proposing a new PEP, still in draft form, to impose a limit of one million on various aspects of Python programs, such as the lines of code per module.
Any thoughts or feedback?
The PEP: https://github.com/markshannon/peps/blob/one-million/pep-1000000.rst
Cheers, Mark. Shortened the mail - as I want my comment to be short. There are many longish ones, and have not gotten through them all.
One guiding principle I learned from a professor (forgot his name sadly).
A program has exactly - zero (0) of something, one (1) of something, or infinite. The moment it gets set to X, the case for X+1 appears.
Since we are not talking about zero, or one - I guess my comment is make sure it can be used to infinity.
Regards,
Michael
p.s. If this has already been suggested - my apologies for any noise.
The version of this philosophy that I have heard is normally: Zero, One, Many; or sometimes Zero, One, Two, Many; and occasionally Zero, One, Two, Three, Many.

The idea is that the handling of zero of something is obviously a different case from having some of it. Having just one of something can often be treated differently than multiple of it, and sometimes it makes sense to only allow one of the thing. Sometimes, having just two of the things allows for some useful extra interactions, and can be simpler than an arbitrary number, so sometimes you can allow just two, but not many. Similarly, there are some rarer cases where allowing just three and not more can make sense. In general, for larger values, if you allow M, then there isn't a good reason to not allow M+1 (until you hit practical resource limits).

I wouldn't extend that to 'infinity', as there is a big categorical difference between an arbitrary 'many' and 'infinite': computers, being finite machines, CAN'T actually have infinitely many of something without special-casing it (and if you special-case infinity, you might not make the effort to handle large values of many).

-- Richard Damon
There is value in saying "These are things that might be limited by the implementation." There is great value in documenting the limits that CPython in particular currently chooses to enforce. Users may want to see the numbers, and other implementations may wish to match or exceed these minimums as part of their compatibility efforts. This is particularly true if it affects bytecode validity, since other implementations often try to support bytecode as well as source code.

There is value in saying "A conforming implementation will support at least X", but X should be much smaller -- I don't want to declare micropython non-conformant just because it sets limits more reasonable for its use case.

I don't know that there is enough value in using a human-memorable number (like a million), or in using the same limit across resources. For example, if the number of local variables, distinct names, and constants may be limited to 1,000,000 total instead of 1,000,000 each, I think that should be a quality-of-implementation issue instead of a language change.

There may well be value in changing the limits supported by CPython (or at least CPython in default mode), or its bytecode format, but those should be phrased as clearly a CPython implementation PEP (or bytecode PEP) rather than a language change PEP.
participants (25)
- Abdur-Rahmaan Janhangeer
- Antoine Pitrou
- Brett Cannon
- Chris Angelico
- David Cuthbert
- Gregory P. Smith
- Guido van Rossum
- Jim J. Jewett
- Joao S. O. Bueno
- Karthikeyan
- Kyle Stanley
- Mark Shannon
- Michael
- Nathaniel Smith
- Nick Coghlan
- Oscar Benjamin
- Paddy McCarthy
- Paul Moore
- Random832
- Rhodri James
- Richard Damon
- Steve Dower
- Steven D'Aprano
- Tal Einat
- Thomas Wouters