Looking for people interested in a Python register virtual machine project

Back in the late 90s (!) I worked on a reimagining of the Python virtual machine as a register-based VM based on 1.5.2. I got part of the way with that, but never completed it. In the early 2010s, Victor Stinner got much further using 3.4 as a base. The idea (and dormant code) has been laying around in my mind (and computers) these past couple decades, so I took another swing at it starting in late 2019 after retirement, mostly as a way to keep my head in the game. While I got a fair bit of the way, it stalled. I've picked it up and put it down a number of times in the past year, often needing to resolve conflicts because of churn in the current Python virtual machine. Though I kept getting things back in sync, I realize this is not a one-person project, at least not this one person. There are several huge chunks of Python I've ignored over the past 20 years, and not just the internals. (I've never used async anything, for example.) If it is ever to truly be a viable demonstration of the concept, I will need help. I forked the CPython repo and have a branch (register2) of said fork which is currently synced up with the 3.10 (currently master) branch: https://github.com/smontanaro/cpython/tree/register2 I started on what could only very generously be called a PEP which you can read here. It includes some of the history of this work as well as details about what I've managed to do so far: https://github.com/smontanaro/cpython/blob/register2/pep-9999.rst If you think any of this is remotely interesting (whether or not you think you'd like to help), please have a look at the "PEP". Because this covers a fair bit of the CPython implementation, chances to contribute in a number of areas exist, even if you have never delved into Python's internals. Questions/comments/pull requests welcome. Skip Montanaro

The Parrot project was also intended to be the same thing, and for a while had a fair number of contributors. Unfortunately, it never obtained the performance wins that were hoped for.

On Sat, Mar 20, 2021, 11:55 AM Skip Montanaro <skip.montanaro@gmail.com> wrote:
Skip Montanaro _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/IUKZPH... Code of Conduct: http://python.org/psf/codeofconduct/

Yes, I remember Parrot. As I understand it, their original goal was a language-agnostic virtual machine, which might have complicated things. I will do a bit of reading and add some text to the "PEP."

Skip

On Sat, Mar 20, 2021, 11:36 AM David Mertz <mertz@gnosis.cx> wrote:

It was (is). It was a VM idea, taken from a 2001 April Fool's Day joke about Python and Perl merging. The goal of optimizing a register-based VM independently of the grammars compiled to it seems smart. For a certain time our wonderful Allison Randal was even its lead. The Python grammar must be several versions old, and not reflect the new parser. On the other hand, if it parses Perl, it must be a very flexible grammar spec.

The fact that the register-based machine didn't wind up showing any big win was disappointing to me. I don't know any deep details, and am certainly not saying it might not ever help. But the benchmarks, when I followed them, were pretty much the same as those of the stack-based Python, Perl, and Ruby VMs. Yes, this or that micro-benchmark did better or worse, but nothing dramatic overall.

On Sat, Mar 20, 2021, 12:57 PM Skip Montanaro <skip.montanaro@gmail.com> wrote:

Hello, On Sat, 20 Mar 2021 10:54:10 -0500 Skip Montanaro <skip.montanaro@gmail.com> wrote:
Back in the late 90s (!) I worked on a reimagining of the Python virtual machine as a register-based VM based on 1.5.2. I got part of the way with that, but never completed it. In the early 2010s, Victor Stinner got much further using 3.4 as a base. The idea (and dormant code) has been laying around in my mind (and computers) these past couple decades, so I took another swing at it starting in late 2019 after retirement, mostly as a way to keep my head in the game. While I got a fair bit of the way, it stalled. I've picked it up and put it down a number of times in the past year, often needing to resolve conflicts because of churn in the current Python virtual machine.
I guess it would be a good idea to state the scope of this project - is it a research project or a "production" one? If it's research, why be concerned with the churn of bleeding-edge CPython versions? Wouldn't it be better to use some scalable, incremental implementation approach which would allow forward-porting it to a newer version, if it ever comes to that? Otherwise, if it's "production", who's the "customer", and how do they "compensate" you for doing work (chasing a moving target) which is clearly of little interest to you and conflicts with the goal of the project?

[]
I started on what could only very generously be called a PEP which you can read here. It includes some of the history of this work as well as details about what I've managed to do so far:
https://github.com/smontanaro/cpython/blob/register2/pep-9999.rst
If you think any of this is remotely interesting (whether or not you think you'd like to help), please have a look at the "PEP".
Some comments on it:

1. I find it to be rather weak on the motivational part. It starts with a phrase like:
This PEP proposes the addition of register-based instructions to the existing Python virtual machine, with the intent that they eventually replace the existing stack-based opcodes.
Sorry, what? The purpose of register-based instructions is just to replace stack-based instructions? That's not what I'd like to hear as the intro phrase. You presumably want to replace one with the other because register-based ones offer some benefit - faster execution, perhaps? That's what I'd like to hear, instead of "deciphering" it between the lines.
They [2 instruction sets] are almost completely distinct.
That doesn't correspond to the mental image I would have. In my view, the 2 sets would be exactly the same, except that stack-based instructions encode argument locations implicitly, while register-based ones encode them explicitly. It would be interesting to read (in the following "pep" sections) what makes them "almost completely distinct".
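Paul's point can be made concrete with a toy sketch (illustrative only; neither of these is CPython's actual implementation): the same `c = a + b` computation, once with implicit stack operands and once with explicit register operands.

```python
def run_stack(code, slots):
    """Stack form: operands are implicit (always the top of stack)."""
    stack = []
    for op, arg in code:
        if op == "LOAD_FAST":        # push a local onto the stack
            stack.append(slots[arg])
        elif op == "BINARY_ADD":     # pop two operands, push the result
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "STORE_FAST":     # pop the result into a local
            slots[arg] = stack.pop()
    return slots

def run_register(code, regs):
    """Register form: every operand location is explicit in the instruction."""
    for op, *args in code:
        if op == "BINARY_ADD_REG":   # dst, src1, src2 encoded in the instruction
            dst, s1, s2 = args
            regs[dst] = regs[s1] + regs[s2]
    return regs

# c = a + b, with a in slot 0 and b in slot 1, c in slot 2
stack_prog = [("LOAD_FAST", 0), ("LOAD_FAST", 1),
              ("BINARY_ADD", None), ("STORE_FAST", 2)]
reg_prog = [("BINARY_ADD_REG", 2, 0, 1)]   # one instruction instead of four

print(run_stack(stack_prog, [3, 4, None]))     # [3, 4, 7]
print(run_register(reg_prog, [3, 4, None]))    # [3, 4, 7]
```

The operation performed (adding two objects) is identical; only the operand addressing differs, which is the sense in which the two sets might be "the same except for encoding".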
Within a single function only one set of opcodes or the other will be used at any one time.
That would be the opposite of the "scalable, incremental" development approach mentioned above. Why not allow the 2 sets to freely co-exist, and migrate code generation / implement code translation gradually?
## Motivation
I'm not sure the content of the section corresponds much to its title. It jumps from background survey of the different Python VM optimizations to (some) implementation details of register VM - leaving "motivation" somewhere "between the lines".
Despite all that effort, opcodes which do nothing more than move data onto or off of the stack (LOAD_FAST, LOAD_GLOBAL, etc) still account for nearly half of all opcodes executed.
... And - you intend to change that with a register VM? In which way and how? As an example, LOAD_GLOBAL isn't going anywhere - it loads a variable by *symbolic* name into a register.
Running Pyperformance using a development version of Python 3.9 showed that the five most frequently executed pure stack opcodes (LOAD_FAST, STORE_FAST, POP_TOP, DUP_TOP and ROT_TWO) accounted for 35% of all executed instructions.
And you intend to change that with a register VM? How? A quick Google search leads to https://www.strchr.com/x86_machine_code_statistics (yeah, that's not a VM, it's an RM (real machine); stats over different VMs would definitely be welcome):
The most popular instruction is MOV (35% of all instructions).
So, is the plan to replace the 35% of "five most frequently executed pure stack opcodes" with 35% of register-register move instructions? If not, why would it be different, and how would you achieve that?
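A rough (static, not dynamic) version of the numbers quoted above can be gathered with the stdlib `dis` module. Note this counts instructions in the compiled code, not executed instructions as in the Pyperformance measurement, and the opcode names are from the 3.9/3.10 instruction set (later versions rename or fuse some of them):

```python
import dis
from collections import Counter

# Opcodes that only shuffle the evaluation stack (3.9/3.10 names).
PURE_STACK = {"LOAD_FAST", "STORE_FAST", "POP_TOP", "DUP_TOP", "ROT_TWO"}

def stack_op_share(func):
    """Return (pure-stack count, total count) for a function's bytecode."""
    ops = Counter(i.opname for i in dis.get_instructions(func))
    total = sum(ops.values())
    pure = sum(n for name, n in ops.items() if name in PURE_STACK)
    return pure, total

def example(a, b):
    t = a + b
    u = t * a
    return u - b

pure, total = stack_op_share(example)
print(f"{pure}/{total} instructions are pure stack traffic")
```

Even for this tiny function, a substantial fraction of the instructions do nothing but move values on and off the stack, which is the phenomenon the "PEP" points at.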
They are low-cost instructions (compared with CALL_FUNCTION for example), but still eat up time and space in the virtual machine
But that's the problem of any VM - it's slow by definition. There can be less slow and more slow VMs, but VMs can't be fast. So, what's the top-level motivation - is it "making CPython fast" or "making CPython a little bit less slow"? By how much?
Consider the layout of the data section of a Frame object: All those LOAD_FAST and STORE_FAST instructions just copy pointers between chunks of RAM which are just a few bytes away from each other in memory.
Ok, but LOAD_DEREF and STORE_DEREF instructions also just copy pointers (with extra dereferencing, but that's a detail). It's unclear why you ignore them ("cell" registers), putting "locals" and "stack" registers ahead of them. The actual register instruction implementation would just treat any frame slot as a register with continuous numbering, allowing access to all of the locals, cells, and stack slots in the same way. In that regard, trying to rearrange the 3 groups at this stage seems like rather unneeded implementation complexity with no clear motivation.
Instead, registers should be cleared upon last reference.
Worth discussing how to handle that. Apparently, only an approach with an explicit DECREF instruction would scale, but that shows that a register-based VM not only decreases the number of generated instructions, but also increases it in other areas. The overall tally may not be what's expected.
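The lifetime issue can be demonstrated minimally. This sketch relies on CPython's reference counting, and `CLEAR_REG` is a hypothetical instruction name, not one from the branch:

```python
import weakref

class Obj:
    pass

regs = [None] * 4    # a toy register file
o = Obj()
w = weakref.ref(o)   # lets us observe when the object dies
regs[0] = o          # the value now lives in a register ...
del o                # ... so dropping the original name keeps it alive
assert w() is not None

# A stack VM would have POPped (and thereby decref'd) the value once it
# was consumed; a register VM needs an explicit instruction, e.g. a
# hypothetical CLEAR_REG, to drop the last reference:
regs[0] = None       # CLEAR_REG r0
assert w() is None   # under CPython refcounting, the object is gone
```

These explicit clears are exactly the extra instructions Paul predicts will eat into the register VM's instruction-count savings.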
Implemented ... some CALL_FUNCTION instructions
One of the significant omissions in the "pep" is the lack of discussion of a "register based" calling convention.
most container-related BUILD instructions
Other "arbitrary number of arguments" instructions beyond CALL_FUNCTION* are the next hard case worth discussing.
OTOH, maybe RVM opcode names should look more like traditional assembler instructions. (The author is getting on in years and finds something which looks more like assembler attractive, given his initial experience programming computers in the dark ages.) Instead of BINARY_ADD_REG, you might call it BAR.
IMHO that's as retrograde as it can get. I'd suggest re-evaluating the reasons why "traditional assemblers" were made the way they were. The reasons might be: a) the desire of one vendor not to fall to "intellectual property" claims of another vendor; b) minor to the previous, the desire to vendor-lock users. Those are the reasons why vendors went out of their way to obfuscate their instruction names and introduce as much variability as possible for simple things like "move" or "add". Most modern platform-independent assemblers (IRs, though really ILs) follow the syntax of (mostly) normal programming languages (of course with flat, instead of structured, syntax). E.g. LLVM IR or Mypyc IR: https://github.com/python/mypy/blob/master/mypyc/test-data/irbuild-basic.tes...

Back to the actual topic, I'd guess just suffixing existing instruction names with "_R" should be enough, e.g. BINARY_ADD -> BINARY_ADD_R. And of course, have the LHS first. E.g. "BINARY_ADD_R a, b, c" means "a = b + c".
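The suggested LHS-first convention can be sketched as a trivial formatter and "explainer" (all instruction names here are hypothetical, following the `_R` suffix idea above):

```python
def fmt(op, dst, *srcs):
    """Render a register instruction in the suggested LHS-first style."""
    return f"{op} " + ", ".join(f"r{n}" for n in (dst, *srcs))

# Map hypothetical opcode names to infix operators for readable output.
OPS = {"BINARY_ADD_R": "+", "BINARY_SUBTRACT_R": "-", "BINARY_MULTIPLY_R": "*"}

def explain(op, dst, s1, s2):
    """Render the same instruction as pseudo-source, LHS first."""
    return f"r{dst} = r{s1} {OPS[op]} r{s2}"

print(fmt("BINARY_ADD_R", 2, 0, 1))       # BINARY_ADD_R r2, r0, r1
print(explain("BINARY_ADD_R", 2, 0, 1))   # r2 = r0 + r1
```

Keeping the destination first makes a register-VM disassembly read like the flat, language-like IRs (LLVM, Mypyc) mentioned above, rather than like a vendor assembler.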
Because this covers a fair bit of the CPython implementation, chances to contribute in a number of areas exist, even if you have never delved into Python's internals. Questions/comments/pull requests welcome.
Skip Montanaro
[] -- Best regards, Paul mailto:pmiscml@gmail.com

Thanks for the response. I will try to address your comments inline.
I guess it should be a good idea to answer what's the scope of this project - is it research one or "production" one? If it's research one, why be concerned with the churn of over-modern CPython versions? Wouldn't it be better to just use some scalable, incremental implementation which would allow to forward-port it to a newer version, if it ever comes to that?
The motivation for revisiting this idea was/is largely personal. As I indicated, I first messed around with it over 20 years ago and it's been in the back of my mind ever since. Somehow I never lost the code, even though I'm not sure how many computers came and went and the code was never uploaded to any sort of distributed version control system. I decided to pick things up again as a way to keep my head in the game after I retired. So, neither "research" nor "production" seems to be a correct descriptor. Still, if taken to functional completion — functional enough for performance testing and application to more than just toy scripts — I realized pretty quickly that I'd need help.
Otherwise, if it's "production", who's the "customer" and how they "compensate" you for doing work (chasing the moving target) which is clearly of little interest to you and conflicts with the goal of the project?
Nobody is compensating me. I have no desire to try and turn it into something I do for hire. Maybe I misunderstood your question?
This PEP proposes the addition of register-based instructions to the existing Python virtual machine, with the intent that they eventually replace the existing stack-based opcodes.
Sorry, what? The purpose of register-based instructions is to just replace stack-based instructions? That's not what's I'd like to hear as the intro phrase. You probably want to replace one with the other because register-based ones offer some benefit, faster execution perhaps? That's what I'd like to hear instead of "deciphering" that between the lines.
Replacing stack-based instructions would be a reasonable initial goal, I think. Victor reported performance improvements in his implementation (also a translator). As I indicated in the "PEP" (I use that term rather loosely, as I have no plans at the moment to submit it for consideration, certainly not in its current, incomplete state), a better ultimate way to go would be to generate register instructions directly from the AST. The current translation scheme allows me to write simple test case functions, generate register instructions, then confirm that, when called, the two produce the same result.
They [2 instruction sets] are almost completely distinct.
That doesn't correspond to the mental image I would have. In my list, the 2 sets would be exactly the same, except that stack-based encode argument locations implicitly, while register-based - explicitly. Would be interesting to read (in the following "pep" sections) what makes them "almost completely distinct".
Well, sure. The main difference is the way the two instructions in a pair (say, BINARY_ADD vs BINARY_ADD_REG) get their operands and save their results. You still have to be able to add two objects, call functions, etc.
Within a single function only one set of opcodes or the other will be used at any one time.
That would be the opposite of "scalable, incremental" development approach mentioned above. Why not allow 2 sets to freely co-exist, and migrate codegeneration/implement code translation gradually?
The fact that I treat the current frame's stack space as registers makes it pretty much impossible to execute both stack and register instructions within the same frame. Victor's implementation did things differently in this regard. I believe he just allocated extra space for 256 registers at the end of each frame, so (in theory, I suppose), you could have instructions from both executed in the same frame.
## Motivation
I'm not sure the content of the section corresponds much to its title. It jumps from background survey of the different Python VM optimizations to (some) implementation details of register VM - leaving "motivation" somewhere "between the lines".
Despite all that effort, opcodes which do nothing more than move data onto or off of the stack (LOAD_FAST, LOAD_GLOBAL, etc) still account for nearly half of all opcodes executed.
... And - you intend to change that with a register VM? In which way and how? As an example, LOAD_GLOBAL isn't going anywhere - it loads a variable by *symbolic* name into a register.
Certainly, if you have data which isn't already on the stack, you are going to have to move data. As the appendix shows though, a fairly large chunk of the current virtual machine does nothing more than manipulate the stack (LOAD_FAST, STORE_FAST, POP_TOP, etc).
Running Pyperformance using a development version of Python 3.9 showed that the five most frequently executed pure stack opcodes (LOAD_FAST, STORE_FAST, POP_TOP, DUP_TOP and ROT_TWO) accounted for 35% of all executed instructions.
And you intend to change that with a register VM? How?
I modified the frame so that (once again) the current local variable space adjoins the current stack space (which, again, I treat as registers). The virtual machine can thus access local variables in place. In retrospect, I suspect it might not have been necessary. During the current phase, where I've yet to implement any *_DEREF_REG instructions, it's a moot point though. Still, I'm not sure the cell/free slots have the same semantics as locals/stack (an area where my expertise is lacking). Isn't there an extra level of indirection there? In any case, if the cell/free slots are semantically just like locals, then it would be straightforward for me to restore the order of the data blocks in frames.
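The layout Skip describes can be sketched like this (names are illustrative, not CPython's): locals and stack temporaries share one contiguous register file, so a local is addressed directly by register number, with no copy onto a stack.

```python
class Frame:
    """Toy frame: one contiguous register file = locals then temporaries."""

    def __init__(self, nlocals, ntemps):
        self.nlocals = nlocals
        self.regs = [None] * (nlocals + ntemps)   # | locals | temporaries |

    def local(self, i):
        """Register number i *is* local i - no LOAD_FAST copy needed."""
        return self.regs[i]

    def temp(self, i):
        """Temporaries are numbered after the locals."""
        return self.regs[self.nlocals + i]

f = Frame(nlocals=2, ntemps=2)
f.regs[0], f.regs[1] = 3, 4                      # arguments arrive in place
f.regs[f.nlocals + 0] = f.regs[0] + f.regs[1]    # like BINARY_ADD_REG r2, r0, r1
print(f.temp(0))                                 # 7
```

Extending the same continuous numbering to cell/free slots (as Paul suggests) would work only if those slots had the same direct-value semantics as locals, which is exactly the open question above.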
Quick google search leads to https://www.strchr.com/x86_machine_code_statistics (yeah, that's not VM, it's RM (real machine), stats over different VMs would be definitely welcome):
The most popular instruction is MOV (35% of all instructions).
So, is the plan to replace 35% of "five most frequently executed pure stack opcodes" with 35% of register-register move instructions? If not, why it would be different and how would you achieve that?
I have clearly not explained myself very well in the "PEP". I will rework that section. Still though, local variables and stack space are adjacent in my implementation, so local variables can be addressed directly without needing to first copy them onto the stack (or into a register). Clearly, global variables must still be copied into and out of registers to manipulate. I may well replicate the code object's constants in the frame as well so they can also be treated as (read-only) registers. Since FrameObjects are cached for reuse with the same code object, the cost to copy them should be bearable.
They are low-cost instructions (compared with CALL_FUNCTION for example), but still eat up time and space in the virtual machine
But that's the problem of any VM - it's slow by definition. There can be less slow and more slow VMs, but VMs can't be fast. So, what's the top-level motivation - is it "making CPython fast" or "making CPython a little bit less slow"? By how much?
The top-level motivation is to have fun. Need there be anything more ambitious? Still, point taken. Had I continued to pursue this back in the early 2000s, or had Victor succeeded in getting his implementation into the core, we might be much further along. The current problem is made all the more difficult by the fact that the virtual machine has grown so much in the past 20+ years.
Consider the layout of the data section of a Frame object: All those LOAD_FAST and STORE_FAST instructions just copy pointers between chunks of RAM which are just a few bytes away from each other in memory.
Ok, but LOAD_DEREF and STORE_DEREF instructions also just copy pointers (with extra dereferencing, but that's a detail). It's unclear why you try to ignore them ("cell" registers), putting ahead "locals" and "stack" registers. The actual register instructions implementation would just treat any frame slot as a register with continuous numbering, allowing to access all of locals, cells, and stack locs in the same way. In that regard, trying to rearrange 3 groups at this stage seems like rather unneeded implementation complexity with no clear motivation.
I haven't even looked at LOAD_DEREF or STORE_DEREF yet. I think that extra dereferencing will be more than a simple detail, though. That makes the semantics of cell/free slots different from locals/register slots (more like globals). If true, then my reordering of the frame data is worthwhile, I think.

Skip

Hello Skip, On Mon, 22 Mar 2021 17:13:19 -0500 Skip Montanaro <skip.montanaro@gmail.com> wrote:
Thanks for the response. I will try to address your comments inline.
I guess it should be a good idea to answer what's the scope of this project - is it research one or "production" one? If it's research one, why be concerned with the churn of over-modern CPython versions? Wouldn't it be better to just use some scalable, incremental implementation which would allow to forward-port it to a newer version, if it ever comes to that?
The motivation for revisiting this idea was/is largely personal.
Thanks for putting it like that, that's the impression I also got (and I've been "familiar" with (heard about) this project for a few years) ;-). But you're now looking for potential contributors, so it may be a good idea to (better) explain in the "pep" what would motivate them to join in.
As I indicated, I first messed around with it over 20 years ago and it's been in the back of my mind ever since. Somehow I never lost the code despite I'm not sure how many computers came and went and that the code was never uploaded to any sort of distributed version control system. I decided to pick things up again as a way to mostly keep my head in the game after I retired. So, neither "research" nor "production" seems to be a correct descriptor.
I guess "research" fits right in, unless you really want to categorize it as a "personal quest". But then, it's unclear how a request for contributors fits in with that.
Still, if taken to functional completion — functional enough for performance testing and application to more than just toy scripts — I realized pretty quickly that I'd need help.
Otherwise, if it's "production", who's the "customer" and how they "compensate" you for doing work (chasing the moving target) which is clearly of little interest to you and conflicts with the goal of the project?
Nobody is compensating me. I have no desire to try and turn it into something I do for hire. Maybe I misunderstood your question?
Yes, I'm essentially trying to hint that if: a) you're working with the CPython bleeding edge; b) you find that (the bleeding edge) an extra chore; c) nobody told you to work on the bleeding edge (neither a boss, nor a maintainer who said "I'll merge it once you're done"); then why complicate your task by working on the bleeding edge? You could take a not-too-old CPython version, e.g. 3.8/3.9, and work with that instead.
This PEP proposes the addition of register-based instructions to the existing Python virtual machine, with the intent that they eventually replace the existing stack-based opcodes.
Sorry, what? The purpose of register-based instructions is to just replace stack-based instructions? That's not what's I'd like to hear as the intro phrase. You probably want to replace one with the other because register-based ones offer some benefit, faster execution perhaps? That's what I'd like to hear instead of "deciphering" that between the lines.
Replacing stack-based instructions would be a reasonable initial goal, I think. Victor reported performance improvements in his implementation (also a translator).
Btw, from the "pep" alone, it's unclear if, and how much, you reuse Victor's work. If not, why not? (The answer is useful to contributors - you ask them to "reuse" your code; how does that square with your reuse of the code of the folks who came before you?)
As I indicated in the "PEP" (I use that term rather loosely, as I have no plans at the moment to submit it for consideration, certainly not in its current, incomplete state),
Sure, I understand. If anything, I find big similarities between your situation and mine - I also have some ideas (but different) regarding changes (improvements I think) to Python runtime (not just VM), and I also write "pseudoPEPs" about them seeking for feedback, etc.
a better ultimate way to go would be to generate register instructions directly from the AST.
Yeah, potentially better ultimate way. Note that advanced (JIT, etc) Java JVMs do well by starting from standard stack-based Java bytecode. []
explicitly. Would be interesting to read (in the following "pep" sections) what makes them "almost completely distinct".
Well, sure. The main difference is the way two pairs of instructions (say, BINARY_ADD vs BINARY_ADD_REG) get their operands and save their result. You still have to be able to add two objects, call functions, etc.
I'd note that your reply skips answering the question about the calling convention for a register-based VM, and that's again one of the most important questions (and one I'd be genuinely interested to hear about). []
The fact that I treat the current frame's stack space as registers makes it pretty much impossible to execute both stack and register instructions within the same frame.
I don't see how that would be true (in general; I understand that you may have constraints in that regard, but that's exactly why I bring it up - why do you have constraints like that?). Even the existing Python VM allows both to be used in the same frame, e.g. LOAD_FAST: it takes the value of a register and puts it on the stack. Do you mean details like the need to translate stack-based instructions into 2 (or more) instructions: a) the actual register-register instruction and b) a stack pointer adjustment, so stack-based instructions keep working? Yes, you would need to do that, until you fully switch to register-based ones. But then there are 2 separate tasks: 1. Make the register VM work. (Should be medium complexity.) 2. Make it fast. (Likely will be hard.) If you want to achieve both right from the start - oh-oh, that may be double-hard.
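The translation being discussed can be sketched as a toy (an assumption-laden illustration, not Skip's actual translator): because the compiler knows the evaluation-stack depth at every instruction, each stack opcode maps to a register instruction whose operands are computed from that static depth, with the stack mapped onto registers numbered from `nlocals` up.

```python
def translate(stack_code, nlocals):
    """Translate a straight-line stack program into register form.

    The evaluation stack occupies registers nlocals, nlocals+1, ...;
    'depth' tracks the stack depth statically, so no runtime stack
    pointer is needed in the output.
    """
    depth = 0
    out = []
    for op, arg in stack_code:
        if op == "LOAD_FAST":                    # push local arg
            out.append(("MOVE_REG", nlocals + depth, arg))
            depth += 1
        elif op == "STORE_FAST":                 # pop into local arg
            depth -= 1
            out.append(("MOVE_REG", arg, nlocals + depth))
        elif op == "BINARY_ADD":                 # pop 2, push 1
            depth -= 1
            out.append(("BINARY_ADD_REG",
                        nlocals + depth - 1,     # dst = new top of stack
                        nlocals + depth - 1,     # lhs = old second-from-top
                        nlocals + depth))        # rhs = old top
    return out

# c = a + b  (a=r0, b=r1, c=r2; stack starts at r3)
code = [("LOAD_FAST", 0), ("LOAD_FAST", 1),
        ("BINARY_ADD", None), ("STORE_FAST", 2)]
for instr in translate(code, nlocals=3):
    print(instr)
```

The naive output still contains the MOVE_REG shuffles; a follow-up copy-propagation pass would be needed to collapse them into the single `BINARY_ADD_REG r2, r0, r1`, which is task 2 ("make it fast") layered on task 1 ("make it work").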
Victor's implementation did things differently in this regard. I believe he just allocated extra space for 256 registers at the end of each frame, so (in theory, I suppose), you could have instructions from both executed in the same frame.
I hope you have a plan of how to deal with more than 256 registers, etc. Register VM adds a lot of accidental implementation complexity ;-).
## Motivation
I'm not sure the content of the section corresponds to its title. It jumps from a background survey of different Python VM optimizations to (some) implementation details of the register VM - leaving the "motivation" somewhere "between the lines".
Despite all that effort, opcodes which do nothing more than move data onto or off of the stack (LOAD_FAST, LOAD_GLOBAL, etc) still account for nearly half of all opcodes executed.
... And - you intend to change that with a register VM? In which way and how? As an example, LOAD_GLOBAL isn't going anywhere - it loads a variable by *symbolic* name into a register.
Certainly, if you have data which isn't already on the stack, you are going to have to move data.
Even if you have data in registers, you still may need to move it around to accommodate special conventions of some instructions.
As the appendix shows though, a fairly large chunk of the current virtual machine does nothing more than manipulate the stack (LOAD_FAST, STORE_FAST, POP_TOP, etc).
Running Pyperformance using a development version of Python 3.9 showed that the five most frequently executed pure stack opcodes (LOAD_FAST, STORE_FAST, POP_TOP, DUP_TOP and ROT_TWO) accounted for 35% of all executed instructions.
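A static (not dynamic) variant of this kind of count can be reproduced with the `dis` module. This is only a sketch: it tallies opcodes as compiled for a single function, not as executed across Pyperformance, and opcode names vary between Python versions (newer releases fold some of these into superinstructions):

```python
import dis
from collections import Counter

def example(a, b):
    c = a + b
    d = c * a
    return d

counts = Counter(ins.opname for ins in dis.get_instructions(example))
# The "pure stack" set from the measurement above; on 3.11+ some of
# these names no longer exist or appear fused into superinstructions.
pure_stack = {"LOAD_FAST", "STORE_FAST", "POP_TOP", "DUP_TOP", "ROT_TWO"}
moves = sum(n for op, n in counts.items() if op in pure_stack)
print(moves, "of", sum(counts.values()), "compiled opcodes are pure stack shuffles")
```

On 3.9/3.10 the majority of `example`'s opcodes fall into that set, which is the point being made.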
And you intend to change that with a register VM? How?
I modified the frame so that (once again) the current local variable space adjoins the current stack space (which, again, I treat as registers). The virtual machine can thus access local variables in place.
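A toy model of that layout, with invented names (nothing here is CPython code): the locals and the former stack space share one contiguous slot array, so a register number can address a local in place.

```python
class Frame:
    def __init__(self, nlocals, nregs):
        # slots [0, nlocals) are the local variables,
        # slots [nlocals, nlocals + nregs) are the "registers"
        # (the space the stack machine used for its value stack).
        self.slots = [None] * (nlocals + nregs)

    def load_reg(self, n):
        # A register instruction addresses any slot directly;
        # a local needs no LOAD_FAST-style copy first.
        return self.slots[n]

    def store_reg(self, n, value):
        self.slots[n] = value

f = Frame(nlocals=2, nregs=2)
f.store_reg(0, 10)                 # local variable 0
f.store_reg(2, f.load_reg(0) + 5)  # register 2 = local 0 + 5, in place
```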
For simple instructions, yes. What about instructions with an arbitrary number of arguments, like CALL_FUNCTION*/CALL_METHOD*, BUILD_*, etc.? (That's the same question as asked in the first mail and not answered here.)
In retrospect, I suspect it might not have been necessary. During the current phase where I've yet to implement any *_DEREF_REG instructions, it would be a moot point though. Still, I'm not sure the cell/free slots have the same semantics as locals/stack (an area of my expertise which is lacking). Isn't there an extra level of indirection there? In any case, if the cell/free slots are semantically just like locals, then it would be straightforward for me to restore the order of the data blocks in frames.
Yes, that's the point - semantically they're just locals, even though they are usually accessed with an extra level of indirection.
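That indirection is observable from pure Python: a closed-over variable lives in a cell object on the function's `__closure__`, and the value sits one dereference away:

```python
def outer():
    x = 42
    def inner():
        return x      # compiled as LOAD_DEREF, not LOAD_FAST
    return inner

fn = outer()
cell = fn.__closure__[0]
print(type(cell).__name__, cell.cell_contents)  # cell 42
```

So reading `x` inside `inner` goes frame slot -> cell -> value, versus frame slot -> value for a plain local.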
A quick Google search leads to https://www.strchr.com/x86_machine_code_statistics (yeah, that's not a VM, it's an RM (real machine); stats over different VMs would definitely be welcome):
The most popular instruction is MOV (35% of all instructions).
So, is the plan to replace the 35% of "five most frequently executed pure stack opcodes" with 35% register-register move instructions? If not, why would it be different, and how would you achieve that?
I have clearly not explained myself very well in the "PEP".
Well, it seems to be written with the idea that a reader is already familiar with the benefits of register-based VMs. As a fresh reader, I tried to point out that fact. I also happen to be familiar with those benefits, and with the fact that "on average" register-based VMs are faster than stack-based ones. But that only makes me ask why you think those "on average" benefits would apply to the Python VM's case, and ask for additional details regarding the more complex cases (which IMHO would noticeably cancel any RVM benefits, unless you have a cunning, well-grounded plan to deal with them).
I will rework that section. Still though, local variables and stack space are adjacent in my implementation, so local variables can be addressed directly without needing to first copy them onto the stack (or into a register).
Only for simple operations. Where the idyllic picture for RISC CPUs with their uniform register files breaks down is function calling (Python analog: CALL_FUNCTION). For not-purely-RISC CPUs, a further complication is instructions with ad hoc register constraints (the Python analog would be BUILD_LIST, etc.).
Clearly, global variables must still be copied into and out of registers to manipulate. I may well replicate the code object's constants in the frame as well so they can also be treated as (read-only) registers. Since FrameObjects are cached for reuse with the same code object, the cost to copy them should be bearable.
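A sketch of the constants idea, with invented names (this is not how CPython lays out frames): copy co_consts into the head of the frame's slot array at frame creation, so a LOAD_CONST-style register instruction becomes a plain indexed read into the same array as the locals and registers.

```python
class Frame:
    def __init__(self, consts, nlocals, nregs):
        # Layout: [consts | locals | registers], one contiguous array.
        # The consts portion is treated as read-only by convention.
        self.const_base = 0
        self.local_base = len(consts)
        self.slots = list(consts) + [None] * (nlocals + nregs)

    def read(self, n):
        # Constants, locals, and registers are all addressed uniformly.
        return self.slots[n]

f = Frame(consts=(None, 2, "hi"), nlocals=1, nregs=2)
f.slots[f.local_base] = f.read(1) * 10   # local 0 = const 2 * 10
```

The copy happens once per frame, which is why caching frames per code object keeps the cost bearable.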
They are low-cost instructions (compared with CALL_FUNCTION, for example), but they still eat up time and space in the virtual machine.
But that's the problem of any VM - it's slow by definition. There can be less slow and more slow VMs, but VMs can't be fast. So, what's the top-level motivation - is it "making CPython fast" or "making CPython a little bit less slow"? By how much?
The top-level motivation is to have fun. Need there be anything more ambitious?
Well, my point is that if you ask for contributors, it's fair to be fair with them ;-). If you ask them to join the fun, it's fair, and then it makes sense to maximize level of fun and minimize level of chore (no bleeding edge chasing, minimize changes overall, use incremental development, where you always have something working, not "we'll get that working in a year if we pull well every day"). Otherwise, maybe it would be useful to have more objective criteria/goal, like "let's try to make Python faster", and then it makes sense to explain why you think register VM would be faster in general, and in Python case specifically. (The work process can still be fun-optimized, chore-minimizing ;-) ).
Still, point taken. Had I continued to pursue this back in the early 2000s, or had Victor succeeded in getting his implementation into the core, we might be much further along. The current problem is made all the more difficult by the fact that the virtual machine has grown so much in the past 20+ years.
Consider the layout of the data section of a Frame object: All those LOAD_FAST and STORE_FAST instructions just copy pointers between chunks of RAM which are just a few bytes away from each other in memory.
Ok, but LOAD_DEREF and STORE_DEREF instructions also just copy pointers (with extra dereferencing, but that's a detail). It's unclear why you try to ignore them ("cell" registers), putting the "locals" and "stack" registers ahead of them. The actual register instruction implementation would just treat any frame slot as a register with continuous numbering, allowing access to all of the locals, cells, and stack slots in the same way. In that regard, trying to rearrange the 3 groups at this stage seems like rather unneeded implementation complexity with no clear motivation.
I haven't even looked at LOAD_DEREF or STORE_DEREF yet. I think that extra dereferencing will be more than a simple detail though. That makes the semantics of cell/free slots different than locals/registers slots (more like globals). If true, then my reordering of the frame data is worthwhile, I think.
Until you have benchmarking data that proves that on a *specific CPU* one layout is 0.39% faster than the other (will be different and even opposite on another arch/CPU), it's all guesswork, I'm afraid. And extra changed code to maintain, which spoils the fun ;-).
Skip
-- Best regards, Paul mailto:pmiscml@gmail.com

a) You're working with the CPython bleeding edge. b) You find that (the bleeding edge) adds extra chore. c) Nobody told you to work on the bleeding edge (neither a boss, nor a maintainer who said "I'll merge it once you're done").
Then: why complicate your task by working on the bleeding edge? You could take a not-too-old CPython version, e.g. 3.8/3.9, instead, and work with that.
I started this in the 3.9 alpha timeframe. Staying up-to-date wasn't too difficult. It never occurred to me that the virtual machine would undergo so much churn for 3.10, so I just stuck with main/master and paid the price when the changes started to arrive in earnest. When the 3.10 branch is created I will take that off-ramp this time.
Btw, from just the "pep", it's unclear if, and how much, you reuse Victor's work. If not, why? (The answer is useful to contributors - you ask them to "reuse" your code - how is it with your own reuse of the code of folks who came before you?)
Both Victor's and my earlier work took place in the dim dark past. There have been so many functional changes to the virtual machine that directly reusing either old code base wasn't feasible. I do have a copy of Victor's work though which I have referred to from time-to-time. I just never tried to merge it with something recent, like 3.9.
explicitly. Would be interesting to read (in the following "pep" sections) what makes them "almost completely distinct".
Well, sure. The main difference is the way two pairs of instructions (say, BINARY_ADD vs BINARY_ADD_REG) get their operands and save their result. You still have to be able to add two objects, call functions, etc.
I'd note that your reply skips answering the question about the calling convention for a register-based VM, and that's again one of the most important questions (and one I'd be genuinely interested to hear).
I've not attempted to make any changes to calling conventions. It occurred to me that the LOAD_METHOD/CALL_METHOD pair could perhaps be merged into a single opcode, but I haven't really thought about that. Perhaps there's a good reason the method is looked up before the arguments are pushed onto the stack (call_function()?). In a register-based VM there's no need to do things in that order.
The fact that I treat the current frame's stack space as registers makes it pretty much impossible to execute both stack and register instructions within the same frame.
I don't see how that would be true in general. (I understand you may have constraints in that regard, but that's exactly why I bring it up - why do you have constraints like that?) Even the existing Python VM allows both in the same frame, e.g. LOAD_FAST: it takes the value of a register and puts it on the stack.
Sure, but that's because it must. All operands must be on the stack. My code does have a step where it tries to remove LOAD_FAST_REG and STORE_FAST_REG opcodes. It's not very good though. Pesky implicit references cause problems. Still, I am able to remove some of them. This should get better over time. And, it is possible that at some point I decide to add back in some stack space for stuff like calling functions, constructing lists, etc.
Do you mean details like needing to translate stack-based instructions into 2 (or more) instructions: a) the actual register-register instruction and b) a stack pointer adjustment, so the stack-based instructions keep working?
Yes, but I see (I think) what you're getting at. If I continued to maintain the stack pointer, in theory stack opcodes could exist along with register opcodes.
Yes, you would need to do that until you fully switch to register-based ones. But then there are 2 separate tasks:
1. Make register VM work. (Should be medium complexity.) 2. Make it fast. (Likely will be hard.)
I'm not too worried about #2 yet. :-) And as demonstrated by the current project's incompleteness, either I'm not up to medium complexity tasks anymore or it's harder than you think. :-) Some of the second step isn't too hard, I don't think. I already mentioned eliding generated fast LOAD/STORE instructions, and in my previous email mentioned copying constants from the code object to the frame object on creation. I also think opcode prediction and fast dispatch should be straightforward. I just haven't bothered with that yet.
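The "fast dispatch" part at least has a familiar shape. A dispatch table indexed by opcode avoids a long if/elif chain; CPython does this with computed gotos in C, and the sketch below is only the idea in miniature, with invented opcodes and encoding:

```python
# Toy register VM with table-based dispatch. Each instruction is a
# (opcode, a, b) triple; opcode numbers and semantics are made up.
LOAD_REG, ADD_REG, HALT = 0, 1, 2

def run(code, regs):
    handlers = {
        LOAD_REG: lambda r, a, b: r.__setitem__(a, b),          # r[a] = b
        ADD_REG:  lambda r, a, b: r.__setitem__(a, r[a] + r[b]),  # r[a] += r[b]
    }
    for op, a, b in code:
        if op == HALT:
            break
        handlers[op](regs, a, b)   # one indexed lookup, no if/elif chain
    return regs

regs = run([(LOAD_REG, 0, 7),
            (LOAD_REG, 1, 5),
            (ADD_REG, 0, 1),
            (HALT, 0, 0)], [None, None])
print(regs)  # [12, 5]
```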
If you want to achieve both right from the start - oh-oh, that may be double-hard.
Victor's implementation did things differently in this regard. I believe he just allocated extra space for 256 registers at the end of each frame, so (in theory, I suppose), you could have instructions from both executed in the same frame.
I hope you have a plan of how to deal with more than 256 registers, etc. Register VM adds a lot of accidental implementation complexity ;-).
One of the reasons I just reused the current stack space as my register space harkens back to a thread with Tim Peters back in the earliest days of my original work. He indicated that it was possible to use no more space than the allocated stack space. That worked for me then. (I spent a few hours one day looking for that thread, but never found it.)
## Motivation
I'm not sure the content of the section corresponds to its title. It jumps from a background survey of different Python VM optimizations to (some) implementation details of the register VM - leaving the "motivation" somewhere "between the lines".
Despite all that effort, opcodes which do nothing more than move data onto or off of the stack (LOAD_FAST, LOAD_GLOBAL, etc) still account for nearly half of all opcodes executed.
... And - you intend to change that with a register VM? In which way and how? As an example, LOAD_GLOBAL isn't going anywhere - it loads a variable by *symbolic* name into a register.
Certainly, if you have data which isn't already on the stack, you are going to have to move data.
Even if you have data in registers, you still may need to move it around to accommodate special conventions of some instructions.
Yes. In particular, function calling and the construction of lists (and similar collections) require arguments to be in order in contiguous locations.
As the appendix shows though, a fairly large chunk of the current virtual machine does nothing more than manipulate the stack (LOAD_FAST, STORE_FAST, POP_TOP, etc).
Running Pyperformance using a development version of Python 3.9 showed that the five most frequently executed pure stack opcodes (LOAD_FAST, STORE_FAST, POP_TOP, DUP_TOP and ROT_TWO) accounted for 35% of all executed instructions.
And you intend to change that with a register VM? How?
I modified the frame so that (once again) the current local variable space adjoins the current stack space (which, again, I treat as registers). The virtual machine can thus access local variables in place.
For simple instructions, yes. What about instructions with an arbitrary number of arguments, like CALL_FUNCTION*/CALL_METHOD*, BUILD_*, etc.? (That's the same question as asked in the first mail and not answered here.)
CALL_FUNCTION and one or two variants as well as BUILD_* are done and work. I have to be careful when removing unneeded LOAD/STORE instructions that I pay attention to these implicit references (e.g., "%r1 and the next three slots" instead of "%r1, %r2, %r3, %r4").
In retrospect, I suspect it might not have been necessary. During the current phase where I've yet to implement any *_DEREF_REG instructions, it would be a moot point though. Still, I'm not sure the cell/free slots have the same semantics as locals/stack (an area of my expertise which is lacking). Isn't there an extra level of indirection there? In any case, if the cell/free slots are semantically just like locals, then it would be straightforward for me to restore the order of the data blocks in frames.
Yes, that's the point - semantically they're just locals, even though they are usually accessed with an extra level of indirection.
A quick Google search leads to https://www.strchr.com/x86_machine_code_statistics (yeah, that's not a VM, it's an RM (real machine); stats over different VMs would definitely be welcome):
The most popular instruction is MOV (35% of all instructions).
So, is the plan to replace the 35% of "five most frequently executed pure stack opcodes" with 35% register-register move instructions? If not, why would it be different, and how would you achieve that?
I have clearly not explained myself very well in the "PEP".
Well, it seems to be written with the idea that a reader is already familiar with the benefits of register-based VMs. As a fresh reader, I tried to point out that fact. I also happen to be familiar with those benefits, and with the fact that "on average" register-based VMs are faster than stack-based ones. But that only makes me ask why you think those "on average" benefits would apply to the Python VM's case, and ask for additional details regarding the more complex cases (which IMHO would noticeably cancel any RVM benefits, unless you have a cunning, well-grounded plan to deal with them).
Yeah, that's a definite shortcoming. I will try to add some more introductory material. Also, as I was implementing things and stumbling on things like implicit register references I was not going back and adding content to the document. I'm going to stop here and see if I can improve the document while some things are fresh in my mind. I will leave the rest for another day. You've given me plenty to think about. Thanks, Skip
I will rework that section. Still though, local variables and stack space are adjacent in my implementation, so local variables can be addressed directly without needing to first copy them onto the stack (or into a register).
Only for simple operations. Where the idyllic picture for RISC CPUs with their uniform register files breaks down is function calling (Python analog: CALL_FUNCTION). For not-purely-RISC CPUs, a further complication is instructions with ad hoc register constraints (the Python analog would be BUILD_LIST, etc.).
Clearly, global variables must still be copied into and out of registers to manipulate. I may well replicate the code object's constants in the frame as well so they can also be treated as (read-only) registers. Since FrameObjects are cached for reuse with the same code object, the cost to copy them should be bearable.
They are low-cost instructions (compared with CALL_FUNCTION, for example), but they still eat up time and space in the virtual machine.
But that's the problem of any VM - it's slow by definition. There can be less slow and more slow VMs, but VMs can't be fast. So, what's the top-level motivation - is it "making CPython fast" or "making CPython a little bit less slow"? By how much?
The top-level motivation is to have fun. Need there be anything more ambitious?
Well, my point is that if you ask for contributors, it's fair to be fair with them ;-). If you ask them to join the fun, it's fair, and then it makes sense to maximize level of fun and minimize level of chore (no bleeding edge chasing, minimize changes overall, use incremental development, where you always have something working, not "we'll get that working in a year if we pull well every day").
Otherwise, maybe it would be useful to have more objective criteria/goal, like "let's try to make Python faster", and then it makes sense to explain why you think register VM would be faster in general, and in Python case specifically. (The work process can still be fun-optimized, chore-minimizing ;-) ).
Still, point taken. Had I continued to pursue this back in the early 2000s, or had Victor succeeded in getting his implementation into the core, we might be much further along. The current problem is made all the more difficult by the fact that the virtual machine has grown so much in the past 20+ years.
Consider the layout of the data section of a Frame object: All those LOAD_FAST and STORE_FAST instructions just copy pointers between chunks of RAM which are just a few bytes away from each other in memory.
Ok, but LOAD_DEREF and STORE_DEREF instructions also just copy pointers (with extra dereferencing, but that's a detail). It's unclear why you try to ignore them ("cell" registers), putting the "locals" and "stack" registers ahead of them. The actual register instruction implementation would just treat any frame slot as a register with continuous numbering, allowing access to all of the locals, cells, and stack slots in the same way. In that regard, trying to rearrange the 3 groups at this stage seems like rather unneeded implementation complexity with no clear motivation.
I haven't even looked at LOAD_DEREF or STORE_DEREF yet. I think that extra dereferencing will be more than a simple detail though. That makes the semantics of cell/free slots different than locals/registers slots (more like globals). If true, then my reordering of the frame data is worthwhile, I think.
Until you have benchmarking data that proves that on a *specific CPU* one layout is 0.39% faster than the other (will be different and even opposite on another arch/CPU), it's all guesswork, I'm afraid. And extra changed code to maintain, which spoils the fun ;-).

On Tue, Mar 23, 2021 at 12:40 PM Skip Montanaro <skip.montanaro@gmail.com> wrote:
I've not attempted to make any changes to calling conventions. It occurred to me that the LOAD_METHOD/CALL_METHOD pair could perhaps be merged into a single opcode, but I haven't really thought about that. Perhaps there's a good reason the method is looked up before the arguments are pushed onto the stack (call_function()?). In a register-based VM there's no need to do things in that order.
IIRC the reason is that Python's language reference promises left-to-right evaluation here. So o.m(f()) needs to evaluate o.m (which may have a side effect if o overrides __getattr__) before it calls f(). -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

Skip Montanaro writes:
So, neither "research" nor "production" seems to be a correct descriptor.
Not even government funding agencies distinguish between "research" and "itch-scratching" as long as you self-promote enough! :-) I agree with Paul, feel free to call it "research"! Steve

So, neither "research" nor "production" seems to be a correct descriptor.
Not even government funding agencies distinguish between "research" and "itch-scratching" as long as you self-promote enough! :-) I agree with Paul, feel free to call it "research"!
Along similar lines, when deciding what to write in the Type header of the PEP, I could choose between Informational, Standards Track and Process. None of those seemed all that perfect of a fit, though in the end I entered Standards Track. Maybe for his April Fool's joke this year, Barry Warsaw can add a Just Messin' Around type to PEP 1. :-) Skip

In the "Object Lifetime" section you say "registers should be cleared upon last reference". That isn't safe, since there can be hidden dependencies on side effects of __del__, e.g.:

    process_objects = create_pipeline()
    output_process = process_objects[-1]
    return output_process.wait()

If the process class terminates the process in __del__ (PyQt5's QProcess does), then implicitly deleting process_objects after the second line will break the code.

On Mon, Mar 22, 2021 at 7:49 AM Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
In the "Object Lifetime" section you say "registers should be cleared upon last reference". That isn't safe, since there can be hidden dependencies on side effects of __del__, e.g.:
    process_objects = create_pipeline()
    output_process = process_objects[-1]
    return output_process.wait()
If the process class terminates the process in __del__ (PyQt5's QProcess does), then implicitly deleting process_objects after the second line will break the code.
Hang on hang on hang on. After the second line, there are two references to the last object, and one to everything else. (If create_pipeline returns two objects, one for each end of the pipe, then there are two references to the second one, and one to the first.)

Even if you dispose of process_objects itself on the basis that it's not used any more (which I would disagree with, since it's very difficult to manage that well), it shouldn't terminate the process, because one of the objects is definitely still alive.

This is nothing to do with a register-based VM and everything to do with standard Python semantics, so this can't change.

ChrisA
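Chris's point can be checked directly (the timing below relies on CPython's reference counting, so it's illustrative rather than guaranteed by the language spec): dropping the list finalizes only the element that nothing else references.

```python
# Toy stand-in for the process example: __del__ logs finalization.
import gc

log = []

class Proc:
    def __init__(self, name):
        self.name = name
    def __del__(self):
        log.append(self.name)

objs = [Proc("p0"), Proc("p1")]
last = objs[-1]   # second reference to p1
del objs          # p0 is now unreferenced and is finalized;
                  # p1 survives because `last` still references it
gc.collect()
print(log)        # ['p0']
```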

On Sun, Mar 21, 2021 at 3:35 PM Chris Angelico <rosuav@gmail.com> wrote:
On Mon, Mar 22, 2021 at 7:49 AM Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
In the "Object Lifetime" section you say "registers should be cleared upon last reference". That isn't safe, since there can be hidden dependencies on side effects of __del__, e.g.:
    process_objects = create_pipeline()
    output_process = process_objects[-1]
    return output_process.wait()
If the process class terminates the process in __del__ (PyQt5's QProcess does), then implicitly deleting process_objects after the second line will break the code.
Hang on hang on hang on. After the second line, there are two references to the last object, and one to everything else. (If create_pipeline returns two objects, one for each end of the pipe, then there are two references to the second one, and one to the first.) Even if you dispose of process_objects itself on the basis that it's not used any more (which I would disagree with, since it's very difficult to manage that well), it shouldn't terminate the process, because one of the objects is definitely still alive.
In the hypothetical scenario, presumably create_pipeline() returns a list of process objects, where the process class somehow kills the process when it is finalized. In that case dropping the last reference to process_objects[0] would kill the first process in the pipeline. I don't know if that's good API design, but Ben states that PyQt5 does this, and it could stand in for any number of other APIs that legitimately destroy an external resource when the last reference is dropped. (E.g., stdlib temporary files.) Curiously, this is about the opposite problem that we would have if we were to replace the current reference counting scheme with some kind of traditional garbage collection system.
This is nothing to do with a register-based VM and everything to do with standard Python semantics, so this can't change.
Exactly. The link with a register-based VM is that if we replaced the value stack with temporary local variables (as the "register-based" VM scheme does), we'd have to decide on the lifetimes of those temporary variables. Finalizing them when the function returns might extend the lifetimes of some objects compared to the stack-based scheme. But finalizing all local variables (temporary or not) as soon as they are no longer needed by subsequent code could *shorten* the lifetimes, as in the above example.

A reasonable solution would be to leave the lifetimes of explicitly named locals alone, but use the proposed ("as soon as possible") scheme for temporary variables. This would appear to match how values on the value stack are treated. Honestly that's how I read the quoted section of Skip's proto-PEP (since it explicitly mentions registers).

A version of the example that exhibits the same questionable behavior would be this:

    return create_pipeline()[-1].wait()

Presumably this would not work correctly with the PyQt5 process class. -- --Guido van Rossum (python.org/~guido) *Pronouns: he/him **(why is my pronoun here?)* <http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>

On Mon, Mar 22, 2021 at 3:14 PM Guido van Rossum <guido@python.org> wrote:
On Sun, Mar 21, 2021 at 3:35 PM Chris Angelico <rosuav@gmail.com> wrote:
On Mon, Mar 22, 2021 at 7:49 AM Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
In the "Object Lifetime" section you say "registers should be cleared upon last reference". That isn't safe, since there can be hidden dependencies on side effects of __del__, e.g.:
    process_objects = create_pipeline()
    output_process = process_objects[-1]
    return output_process.wait()
If the process class terminates the process in __del__ (PyQt5's QProcess does), then implicitly deleting process_objects after the second line will break the code.
Hang on hang on hang on. After the second line, there are two references to the last object, and one to everything else. (If create_pipeline returns two objects, one for each end of the pipe, then there are two references to the second one, and one to the first.) Even if you dispose of process_objects itself on the basis that it's not used any more (which I would disagree with, since it's very difficult to manage that well), it shouldn't terminate the process, because one of the objects is definitely still alive.
In the hypothetical scenario, presumably create_pipeline() returns a list of process objects, where the process class somehow kills the process when it is finalized. In that case dropping the last reference to process_objects[0] would kill the first process in the pipeline. I don't know if that's good API design, but Ben states that PyQt5 does this, and it could stand in for any number of other APIs that legitimately destroy an external resource when the last reference is dropped. (E.g., stdlib temporary files.)
The question is really whether process_objects ceases to exist after the last time it's referenced. I may have misinterpreted the thin example here, but let's just focus on process_objects[0] (hereunder "po0" for simplicity), and assume that there are at least two elements in the list.

A list begins to exist somewhere inside create_pipeline(), and at the point where that list is returned, it has a reference to po0. That list is returned, and assigned to process_objects, which we assume is a function-local variable. So the function's locals reference process_objects, which references po0. So far, so good.

Then we get a new variable output_process, and we lift something unrelated from the list. Then we call a method on an unrelated object, and return from the function. At what point does the process_objects list cease to be referenced? After the last visible use of it, or at the end of the function?

My understanding of Python's semantics is that the list object MUST continue to exist all the way up until the function exits, or, wording that another way, that the function's call frame has a reference to ALL of its locals, not just the ones that can visibly be seen to be used. Allowing an object to be disposed of early if there are no future uses of it would be quite surprising.

It would be different if, before the return statement, "process_objects = None" were inserted. Then the list would cease to be referenced, and po0 would cease to be referenced, and regardless of the exact type of GC being used, it would be legit to ditch it before the wait() call. If *that* version is broken, then there's a problem with the objects in the list depending on each other in a non-Python-visible way, and that's a bug in the library. Can a PyQt user clarify, please?

ChrisA

On Sun, Mar 21, 2021 at 11:10 PM Chris Angelico <rosuav@gmail.com> wrote:
At what point does the process_objects list cease to be referenced? After the last visible use of it, or at the end of the function?
In Python as it stands, at the end of the function, as you say.

Skip Montanaro's PEP suggested that in his register machine, locals would be dereferenced after their last visible use. I don't think that's intrinsically a bad idea, but it's not backward compatible. The thing with the process objects was just an example of currently working code that would break.

The example has nothing to do with PyQt5 really. I just happen to know that QProcess objects kill the controlled process when they're collected. I think it's a bad design, but that's the way it is.

Another example would be something like

    td = tempfile.TemporaryDirectory()
    p = subprocess.Popen([..., td.name, ...], ...)
    p.wait()

where the temporary directory will hang around until the process exits with current semantics, but not if td is deleted after the second line. Of course you should use a with statement in this kind of situation, but there's probably a lot of code that doesn't.

On Mon, Mar 22, 2021 at 5:37 PM Ben Rudiak-Gould <benrudiak@gmail.com> wrote:
Thanks for the clarification. I think the tempfile example will be a lot easier to explain this with, especially since it requires only the stdlib and doesn't imply that there's broken code in a third-party library.

I don't like this. In a bracey language (e.g. C++), you can declare that a variable should expire before the end of the function by enclosing it in a set of braces; in Python, you can't do that, and the normal idiom is to reassign the variable or 'del' it. Changing the semantics of when variables cease to be referenced could potentially break a LOT of code. Maybe, if Python were a brand new language today, you could define the semantics that way (and require "with" blocks for anything that has user-visible impact, reserving __del__ for resource disposal ONLY), but as it is, that's a very sneaky change that will break code in subtle and hard-to-debug ways.

(Not sure why this change needs to go alongside the register-based VM, as it seems to my inexpert mind to be quite orthogonal to it; but whatever, I guess there's a good reason.)

ChrisA

As I wrote, Skip's proto-PEP is not proposing to delete locals that are not used in the rest of the function, only registers. So the concerns voiced here don't apply. On Sun, Mar 21, 2021 at 23:59 Chris Angelico <rosuav@gmail.com> wrote:
-- --Guido (mobile)

In the "Object Lifetime" section you say "registers should be cleared upon last reference". That isn't safe, since there can be hidden dependencies on side effects of __del__, e.g.:
    process_objects = create_pipeline()
    output_process = process_objects[-1]
    return output_process.wait()
If the process class terminates the process in __del__ (PyQt5's QProcess does), then implicitly deleting process_objects after the second line will break the code.
Yeah, that is old writing, so is probably less clear (no pun intended) than it should be. In frame_dealloc, Py_CLEAR is called for stack/register slots instead of just Py_XDECREF. Might not be necessary. Skip
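Ben's hazard can be reproduced without PyQt5 using a stand-in class whose __del__ has a visible side effect (all names here are hypothetical; the print stands in for QProcess terminating its child process):

```python
class FakeProcess:
    """Stand-in for a process wrapper that kills its child in __del__."""
    def __init__(self, name):
        self.name = name

    def __del__(self):
        # Side effect on collection, like QProcess terminating its child.
        print(f"{self.name} terminated")

    def wait(self):
        return 0

def create_pipeline():
    return [FakeProcess("stage0"), FakeProcess("stage1")]

def run():
    process_objects = create_pipeline()
    output_process = process_objects[-1]
    # Under current semantics the list (and stage0) survive until run()
    # returns, so stage0 is still running during the wait() below.
    # Dropping process_objects after its last visible use would
    # terminate stage0 before wait() completes.
    return output_process.wait()

rc = run()
assert rc == 0
```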

Yeah, that is old writing, so is probably less clear (no pun intended) than it should be. In frame_dealloc, Py_CLEAR is called for stack/register slots instead of just Py_XDECREF. Might not be necessary.
Also, the intent is not to change any semantics here. The implementation of RETURN_VALUE_REG still Py_INCREFs the to-be-returned value. It's not like the data can get reclaimed before the caller receives it. S

Hi Skip, thanks for the proto-PEP, which makes for interesting reading. While reading it, I had these questions:

1) I understand the goal is to make CPython faster, but this is not stated explicitly. Is there a way to make this more explicit in the beginning, and also how this would be achieved?

2) You write: "the five most frequently executed pure stack opcodes [...] accounted for 35% of all executed instructions." and "They are low-cost instructions". If the only optimisation is to get rid of 35% of instructions, but they are quite fast, the overall speed gain would be at most 35%, and probably much less. Is there a way to measure not only execution counts of such instructions, but also the time spent, in a typical benchmark?

3) Would the RVM allow for additional optimisations, besides the ones in the preceding point?

4) Did you run some time benchmark with your current implementation?

5) You write: "It failed for a number of reasons" (Victor Stinner's project) -> It would be interesting to quickly write down the main reasons why it failed, and how your proposed approach would prevent these failures.

Regards, S.

On Sat, Mar 20, 2021 at 4:54 PM Skip Montanaro <skip.montanaro@gmail.com> wrote:
Back in the late 90s (!) I worked on a reimagining of the Python virtual machine as a register-based VM based on 1.5.2. I got part of the way with that, but never completed it. In the early 2010s, Victor Stinner got much further using 3.4 as a base. The idea (and dormant code) has been laying around in my mind (and computers) these past couple decades, so I took another swing at it starting in late 2019 after retirement, mostly as a way to keep my head in the game. While I got a fair bit of the way, it stalled. I've picked it up and put it down a number of times in the past year, often needing to resolve conflicts because of churn in the current Python virtual machine. Though I kept getting things back in sync, I realize this is not a one-person project, at least not this one person. There are several huge chunks of Python I've ignored over the past 20 years, and not just the internals. (I've never used async anything, for example.) If it is ever to truly be a viable demonstration of the concept, I will need help. I forked the CPython repo and have a branch (register2) of said fork which is currently synced up with the 3.10 (currently master) branch:
https://github.com/smontanaro/cpython/tree/register2
I started on what could only very generously be called a PEP which you can read here. It includes some of the history of this work as well as details about what I've managed to do so far:
https://github.com/smontanaro/cpython/blob/register2/pep-9999.rst
If you think any of this is remotely interesting (whether or not you think you'd like to help), please have a look at the "PEP". Because this covers a fair bit of the CPython implementation, chances to contribute in a number of areas exist, even if you have never delved into Python's internals. Questions/comments/pull requests welcome.
Skip Montanaro _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-leave@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/IUKZPH... Code of Conduct: http://python.org/psf/codeofconduct/
-- Stefane Fermigier - http://fermigier.com/ - http://twitter.com/sfermigier - http://linkedin.com/in/sfermigier Founder & CEO, Abilian - Enterprise Social Software - http://www.abilian.com/ Chairman, National Council for Free & Open Source Software (CNLL) - http://cnll.fr/ Founder & Organiser, PyParis & PyData Paris - http://pyparis.org/ & http://pydata.fr/
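Stéfane's second question can be sanity-checked with a back-of-the-envelope Amdahl-style bound. Only the 35% execution-count figure comes from the proto-PEP; the relative per-instruction cost below is an assumed number, not a measurement:

```python
# Amdahl-style upper bound on the speedup from eliminating the pure
# stack opcodes. Illustrative only.
executed_fraction = 0.35   # share of executed instructions (from the proto-PEP)
relative_cost = 0.5        # ASSUMED: each costs half an average instruction

# Fraction of total runtime spent in those opcodes:
time_share = (executed_fraction * relative_cost) / (
    executed_fraction * relative_cost + (1 - executed_fraction))

# Best case: that time goes to zero, everything else is unchanged.
max_speedup = 1 / (1 - time_share)

print(f"time share ~{time_share:.1%}, max speedup ~{max_speedup:.2f}x")
# With these numbers: time share ~21.2%, max speedup ~1.27x
```

This is exactly why measuring time spent (not just execution counts) matters: if the eliminated opcodes are cheaper than average, the realizable gain is well under 35%.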
participants (8)
- Ben Rudiak-Gould
- Chris Angelico
- David Mertz
- Guido van Rossum
- Paul Sokolovsky
- Skip Montanaro
- Stephen J. Turnbull
- Stéfane Fermigier