PEP 611: The one million limit.
Hi Everyone,

Thanks for all your feedback on my proposed PEP. I've edited the PEP in light of all your comments and it is now hopefully more precise and better justified.

https://github.com/python/peps/pull/1249

Cheers,
Mark.
Apologies again for commenting in the wrong place. On 05/12/2019 16:38, Mark Shannon wrote:
Memory access is usually a limiting factor in the performance of modern CPUs. Better packing of data structures enhances locality and reduces memory bandwidth, at a modest increase in ALU usage (for shifting and masking).
I don't think this assertion holds much water:

1. Caching makes memory access much less of a limit than you would expect.
2. Non-aligned memory accesses vary from inefficient to impossible depending on the processor.
3. Shifting and masking isn't free, and again on some processors can be very expensive.

Mark wrote:
There is also the potential for a more efficient instruction format, speeding up interpreter dispatch. I replied: This is the ARM/IBM mistake all over again. Mark challenged: Could you elaborate? Please bear in mind that this is software dispatching and decoding, not hardware.
Hardware generally has a better excuse for instruction formats, because for example you know that an ARM only has sixteen registers, so you only need four bits for any register operand in an instruction. Except that when they realised that they needed the extra address bits in the PC after all, they had to invent a seventeenth register to hold the status bits, and had to pretend it was a co-processor to get opcodes to access it. Decades later, status manipulation on modern ARMs is, in consequence, neither efficient nor pretty.

You've talked some about not making the 640k mistake (and all the others we could and have pointed to) and that one million is a ridiculous limit. You don't seem to have taken on board that when those limits were set, they *were* ridiculous. I remember when we couldn't source 20Mb hard discs any more, and were worried that 40Mb was far too much... to share between twelve computers. More recently there were serious discussions of how to manage transferring terabyte-sized datasets (by van, it turned out).

Sizes in computing projects have a habit of going up by orders of magnitude. Text files were just a few kilobytes, so why worry about only using sixteen-bit sizes? Then flabby word processors turned that into megabytes, audio put another order of magnitude or two on that, video is up in the multiple gigabytes, and the amount of data involved in the Human Genome Project is utterly ludicrous.

Have we hit the limit with Big Data? I'm not brave enough to say that, and when you start looking at the numbers involved, one million anythings doesn't look so ridiculous at all.

--
Rhodri James *-* Kynesim Ltd
On 7/12/19 2:54 am, Rhodri James wrote:
You've talked some about not making the 640k mistake
I think it's a bit unfair to call it a "mistake". They had a 1MB address space limit to work with, and that was a reasonable place to put the division between RAM and display/IO space. If anything is to be criticised, it's Intel's decision to only add 4 more address bits when going from an 8-bit to a 16-bit architecture. -- Greg
On Sat, Dec 7, 2019 at 9:58 AM Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
On 7/12/19 2:54 am, Rhodri James wrote:
You've talked some about not making the 640k mistake
I think it's a bit unfair to call it a "mistake". They had a 1MB address space limit to work with, and that was a reasonable place to put the division between RAM and display/IO space. If anything is to be criticised, it's Intel's decision to only add 4 more address bits when going from an 8-bit to a 16-bit architecture.
And to construct a bizarre segmented system that means that 16 + 16 = 20, thus making it very hard to improve on later. If it hadn't been for the overlapping segment idea, it would have been easy to go to 24 address lines later, and eventually 32. But since the 16:16 segmented system was built the way it was, every CPU afterwards had to remain compatible with it.

Do you know when support for the A20 gate was finally dropped? 2013. Yes. THIS DECADE. If they'd decided to go for 32-bit addressing (even with only 20 address lines), it would have been far easier to improve on it later.

I'm sure there were good reasons for what they did (and hey, it did mean TSRs could be fairly granular in their memory requirements), but it's still a lesson in not unnecessarily restricting something that follows Moore's Law.

ChrisA
I'd prefer it if we stayed on topic here... On Fri, Dec 6, 2019 at 3:15 PM Chris Angelico <rosuav@gmail.com> wrote:
> [snip]
On Fri, 6 Dec 2019 13:54:13 +0000 Rhodri James <rhodri@kynesim.co.uk> wrote:
Apologies again for commenting in the wrong place.
On 05/12/2019 16:38, Mark Shannon wrote:
Memory access is usually a limiting factor in the performance of modern CPUs. Better packing of data structures enhances locality and reduces memory bandwidth, at a modest increase in ALU usage (for shifting and masking).
I don't think this assertion holds much water:
1. Caching makes memory access much less of a limit than you would expect.
2. Non-aligned memory accesses vary from inefficient to impossible depending on the processor.
3. Shifting and masking isn't free, and again on some processors can be very expensive.
I think your knowledge is outdated. Shifts and masks are extremely fast on modern CPUs, and unaligned loads are fast as well (when served from the CPU cache). Moreover, modern CPUs are superscalar with many different execution units, so those instructions can be executed in parallel with other independent instructions.

However, as soon as you load from main memory because of a cache miss, you take a hit of several hundred cycles. Basically, computations are almost free compared to the cost of memory accesses.

In any case, this will have to be judged on benchmark numbers, once Mark (or someone else) massages the interpreter to experiment with those runtime memory footprint reductions.

Regards
Antoine.
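To make the shift-and-mask cost concrete, here is a minimal sketch of decoding a hypothetical packed 32-bit instruction word with an 8-bit opcode and a 24-bit operand. The layout and names are assumptions invented for this example, not CPython's actual bytecode format:

    #include <stdint.h>

    /* Hypothetical packed instruction word: 8-bit opcode in the top
     * byte, 24-bit operand below it. The layout is an assumption for
     * illustration only, not CPython's real format. */
    #define OPERAND_BITS 24
    #define OPERAND_MASK ((1u << OPERAND_BITS) - 1u)

    static inline uint8_t opcode_of(uint32_t word)
    {
        return (uint8_t)(word >> OPERAND_BITS);           /* one shift */
    }

    static inline uint32_t operand_of(uint32_t word)
    {
        return word & OPERAND_MASK;                       /* one AND */
    }

    static inline uint32_t pack_insn(uint8_t opcode, uint32_t operand)
    {
        return ((uint32_t)opcode << OPERAND_BITS) | (operand & OPERAND_MASK);
    }

Each accessor compiles to a single shift or AND, typically one cycle on a current x86-64 or ARM64 core, which is exactly the "almost free" computation Antoine contrasts with a multi-hundred-cycle cache miss.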
On 11/12/2019 21:35, Antoine Pitrou wrote:
In any case, this will have to be judged on benchmark numbers, once Mark (or someone else) massages the interpreter to experiment with those runtime memory footprint reductions.
This I absolutely agree with. Without evidence we're just waving our prejudices and varied experiences at one another (for example, my experience that the "modern processors" you were talking about have massive caches, so memory access isn't as much of an issue as you might think vs your experience that memory access matters more than doing a shift-and-mask all the time). I've seen no hard evidence of any actual improvement of any size, and without that there really isn't a decision to be made. -- Rhodri James *-* Kynesim Ltd
On Thu, 12 Dec 2019 11:43:58 +0000 Rhodri James <rhodri@kynesim.co.uk> wrote:
On 11/12/2019 21:35, Antoine Pitrou wrote:
In any case, this will have to be judged on benchmark numbers, once Mark (or someone else) massages the interpreter to experiment with those runtime memory footprint reductions.
This I absolutely agree with. Without evidence we're just waving our prejudices and varied experiences at one another (for example, my experience that the "modern processors" you were talking about have massive caches, so memory access isn't as much of an issue as you might think vs your experience that memory access matters more than doing a shift-and-mask all the time).
But massive caches are not that fast. L1 cache is typically very fast (3 or 4 cycles latency) but small (on the order of 64 kiB). L2 cache varies, but is generally significantly slower (typically 15 cycles latency) and medium-sized (256 or 512 kiB, perhaps). L3 cache is often massive (16 MiB is not uncommon) but actually quite slow (several dozen cycles), by virtue of being large, further away and generally shared between all cores.

If you have a 4-way superscalar CPU (which is an approximate description, since the level of allowed parallelism is not the same in all pipeline stages and depends on the instruction mix), then during a single 15-cycle L2 cache access your CPU can issue at most 4 x 15 = 60 instructions. And during an L3 cache access there's an even larger number of instructions that can be scheduled and executed.

Regards
Antoine.
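A rough way to see those latency cliffs for yourself is a pointer-chasing micro-benchmark. The sketch below is illustrative rather than rigorous (TLB misses, frequency scaling and timer resolution all muddy real measurements); it chases a random permutation consisting of one long cycle, so the hardware prefetcher cannot hide the miss latency:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>

    #define N     (1u << 24)   /* 16M entries * 4 bytes = 64 MiB: larger than L3 */
    #define STEPS (1u << 26)

    static uint64_t rng = 88172645463325252ull;
    static uint64_t xorshift64(void)   /* Marsaglia xorshift; fine for a sketch */
    {
        rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
        return rng;
    }

    int main(void)
    {
        uint32_t *next = malloc(N * sizeof *next);
        if (!next) return 1;

        for (uint32_t i = 0; i < N; i++)
            next[i] = i;
        /* Sattolo's algorithm yields a single N-long cycle, so the chase
         * visits every slot instead of looping in a short, cache-hot orbit. */
        for (uint32_t i = N - 1; i > 0; i--) {
            uint32_t j = (uint32_t)(xorshift64() % i);
            uint32_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        clock_t t0 = clock();
        uint32_t p = 0;
        for (uint32_t s = 0; s < STEPS; s++)
            p = next[p];       /* each load depends on the previous one */
        clock_t t1 = clock();

        /* Printing p keeps the compiler from deleting the loop. */
        printf("%.2f ns per load (p=%u)\n",
               (t1 - t0) * 1e9 / CLOCKS_PER_SEC / STEPS, p);
        free(next);
        return 0;
    }

Shrink N to around 1 << 12 so the array fits in L1/L2 and the per-load time collapses, which makes the gap Antoine describes directly visible.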
I am not qualified to comment on much of this, but one simple one: 1 million is a nice, round, easy-to-remember number. But you can fit 2 million into 21 bits, and still fit three such fields into 64 bits, so why not? I also noticed this:
Reference Implementation
========================

None, as yet. This will be implemented in CPython, once the PEP has been accepted.
As already discussed, we really can't know the benefits without some benchmarking, so I don't expect the PEP will be accepted without at least a partial reference implementation.

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
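On the 21-bit arithmetic above: 2**21 = 2,097,152, so a 21-bit field holds a little over two million values, and three such fields occupy 3 * 21 = 63 of 64 bits. A minimal sketch of that packing, with hypothetical field names invented purely for illustration:

    #include <stdint.h>

    /* Three 21-bit indices packed into one 64-bit word, leaving one
     * spare bit. Field names are hypothetical, not from the PEP. */
    #define FIELD_BITS 21
    #define FIELD_MAX  ((1ull << FIELD_BITS) - 1)   /* 2,097,151 */

    static inline uint64_t pack3(uint64_t lines, uint64_t classes, uint64_t coroutines)
    {
        return (lines & FIELD_MAX)
             | (classes & FIELD_MAX) << FIELD_BITS
             | (coroutines & FIELD_MAX) << (2 * FIELD_BITS);
    }

    static inline uint64_t field3(uint64_t word, unsigned which)   /* which: 0..2 */
    {
        return (word >> (which * FIELD_BITS)) & FIELD_MAX;
    }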
On Thu, 2019-12-05 at 16:38 +0000, Mark Shannon wrote:
Hi Everyone,
Thanks for all your feedback on my proposed PEP. I've edited the PEP in light of all your comments and it is now hopefully more precise and better justified.
Other programming languages have limits in their standards. For example, values for #line in the C preprocessor: "If lineno is 0 or greater than 32767 (until C99) 2147483647 (since C99), the behavior is undefined." https://en.cppreference.com/w/c/preprocessor/line

Similar for C++'s preprocessor (but for C++11): https://en.cppreference.com/w/cpp/preprocessor/line

(These days I maintain GCC's location-tracking code, and we have a number of implementation-specific limits and heuristics for packing file/line/column data into a 32-bit type; see https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=libcpp/include/line-map.h and in particular LINE_MAP_MAX_LOCATION_WITH_COLS, LINE_MAP_MAX_LOCATION, LINE_MAP_MAX_COLUMN_NUMBER, etc.)

Hope this is constructive,
Dave
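As a concrete illustration of the kind of packing David describes (this is not GCC's actual code; the real scheme in libcpp/include/line-map.h is adaptive and more elaborate), a 32-bit source location might budget its bits between line and column like this, where the 20/12 split is an assumption for the sketch:

    #include <stdint.h>

    typedef uint32_t location_t;   /* illustrative, not GCC's definition */

    #define COLUMN_BITS 12
    #define COLUMN_MASK ((1u << COLUMN_BITS) - 1)
    #define MAX_LINE    ((1u << (32 - COLUMN_BITS)) - 1)   /* 1,048,575 */

    static inline location_t make_location(uint32_t line, uint32_t column)
    {
        /* Saturate out-of-range values, echoing the "past this point we
         * stop tracking columns" style of heuristic David mentions. */
        if (column > COLUMN_MASK) column = 0;
        if (line > MAX_LINE)      line = MAX_LINE;
        return (line << COLUMN_BITS) | column;
    }

    static inline uint32_t location_line(location_t loc)   { return loc >> COLUMN_BITS; }
    static inline uint32_t location_column(location_t loc) { return loc & COLUMN_MASK; }

It is a nice coincidence that a 20-bit line field tops out just above one million, the same scale as the limit the PEP proposes.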
As PEP 611 reads to me, there is a lack of clarity as to whether you are proposing a Python-the-language limit or a CPython-the-implementation limit. I think your intent is the latter, but if so, please be very clear about that in the abstract, title, and motivation.

The "Other implementations" section could be clearer as well. It may not even be possible to impose some of those limits in current or future alternative implementations, so if you are proposing limits baked into the language specification (specifically, the Python Language Reference, i.e. https://docs.python.org/3/reference/index.html), then the PEP needs to state that, and feedback from other implementation developers should be requested.

Cheers,
-Barry
On Dec 5, 2019, at 08:38, Mark Shannon <mark@hotpy.org> wrote:
Hi Everyone,
Thanks for all your feedback on my proposed PEP. I've edited the PEP in light of all your comments and it is now hopefully more precise and better justified.
https://github.com/python/peps/pull/1249
Cheers,
Mark.
participants (10)

- Antoine Pitrou
- Barry Warsaw
- Chris Angelico
- Chris Barker
- David Malcolm
- Ethan Furman
- Greg Ewing
- Gregory P. Smith
- Mark Shannon
- Rhodri James