Global name lookup schemes
Okay, i spent another afternoon drawing silly pictures full of boxes and arrows. I swear, i'm going to be seeing pointers in my dreams tonight. Here are figures representing my current understanding of the various schemes on the table: Jeremy 1: the dlict scheme http://lfw.org/python/jeremy1.gif http://lfw.org/python/jeremy1.tif http://lfw.org/python/jeremy1.ai Jeremy, i think i'm still somewhat unclear -- notice the two question marks in the figure. What kind of animal is the cache? I assumed that the invalidation info lives in an array parallel to the dlict's array. Is this right? Guido 1: the original cellptr/objptr scheme http://lfw.org/python/guido1.gif http://lfw.org/python/guido1.tif http://lfw.org/python/guido1.ai Ping 1: guido1 + a tweak to always use two dereferencing steps http://lfw.org/python/ping1.gif http://lfw.org/python/ping1.tif http://lfw.org/python/ping1.ai Tim 1: timdicts and cells with shadow flags http://lfw.org/python/tim1.gif http://lfw.org/python/tim1.tif http://lfw.org/python/tim1.ai GIFs are small versions, TIFs are big versions, AIs are Adobe Illustrator source files. Please examine, send me corrections, discuss, enjoy... :) Still to do: Guido 2: the globals_vector scheme Skip 1: the global-tracking scheme (I don't actually know yet what in this diagram would be different from the way things work today. Statically, Skip's picture is mostly the same; it's the runtime behaviour that's different. Still, it's probably good to have a reference picture of today's data structures anyway.) -- ?!ng
[Very nice pictures] Way cool, Ping ! Does AI provide tools for simplifying these kind of diagrams or did you do it all by hand ? Perhaps Guido ought to add these to the PEP as external reference ?! -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/
On Mon, 18 Feb 2002, M.-A. Lemburg wrote:
Way cool, Ping ! Does AI provide tools for simplifying these kind of diagrams or did you do it all by hand ?
I spent all day on it. Illustrator was not much help, sadly. It's so lame that it can't even keep arrowheads stuck to arrows, and its grid-snapping behaviour is mysterious and unpredictable. Unfortunately it's the only tool i have that's remotely good enough for the job (works with my Wacom tablet, fast navigation, multiple undo).
Perhaps Guido ought to add these to the PEP as external reference ?!
I would like him to, once we have made sure they are accurate. -- ?!ng
Ka-Ping Yee wrote:
On Mon, 18 Feb 2002, M.-A. Lemburg wrote:
Way cool, Ping ! Does AI provide tools for simplifying these kind of diagrams or did you do it all by hand ?
I spent all day on it. Illustrator was not much help, sadly.
Ouch... and I thought someone has finally come up with a great tool for doing technical diagrams. Oh well; I'll stick with Corel Draw then.
It's so lame that it can't even keep arrowheads stuck to arrows, and its grid-snapping behaviour is mysterious and unpredictable. Unfortunately it's the only tool i have that's remotely good enough for the job (works with my Wacom tablet, fast navigation, multiple undo).
Hmm, you sure did a great job on the diagrams given this environment. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/
On Monday, February 18, 2002, at 02:49 , M.-A. Lemburg wrote:
Ka-Ping Yee wrote:
On Mon, 18 Feb 2002, M.-A. Lemburg wrote:
Way cool, Ping ! Does AI provide tools for simplifying these kind of diagrams or did you do it all by hand ?
I spent all day on it. Illustrator was not much help, sadly.
Ouch... and I thought someone has finally come up with a great tool for doing technical diagrams. Oh well; I'll stick with Corel Draw then.
I've heard very good things of OmniGraffle. Never used it myself, but their web browser OmniWeb absolutely rooooooooooolz! Check out www.omnigroup.com. MacOSX only, of course. -- - Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack - - If I can't dance I don't want to be part of your revolution -- Emma Goldman -
Ka-Ping Yee wrote:
On Mon, 18 Feb 2002, M.-A. Lemburg wrote:
Way cool, Ping ! Does AI provide tools for simplifying these kind of diagrams or did you do it all by hand ?
I spent all day on it. Illustrator was not much help, sadly. It's so lame that it can't even keep arrowheads stuck to arrows, and its grid-snapping behaviour is mysterious and unpredictable.
FYI, a language that I really enjoyed working with for illustrations is MetaPost: http://cm.bell-labs.com/who/hobby/MetaPost.html http://www.tug.org/metapost.html It's a really cool language, and like a lot of drawing languages, debugging is pretty. =) I've wanted to bridge Python and MetaPost, but never found the time (or frankly, a good excuse). --da
On Mon, 18 Feb 2002, David Ascher wrote:
FYI, a language that I really enjoyed working with for illustrations is MetaPost:
I'd rather discuss the *diagrams* on this list than the diagram-making tools. :) (You can send your suggestions for better tools to me individually, if you like, and i'll summarize later if there's interest.) So what do you all think of the various global name lookup proposals? -- ?!ng
[Ping]
I'd rather discuss the *diagrams* on this list than the diagram-making tools. :)
Unless you write a new tool in Python <wink>.
... So what do you all think of the various global name lookup proposals?
I expect reality has once again managed to extinguish most post-Conference euphoria. I spent 200% of my "free time" this weekend doing research for PSF board issues, and still haven't gotten to even reading about Oren's (IIRC) dict gimmicks. Guido is off traveling. Jeremy is in the midst of moving. Skip is too busy approving posts of mine that stinking SpamCop rejects since I involuntarily switched ISPs. So I'm glad we harassed Guido into at least starting a PEP ... once-upon-a-time-ly y'rs - tim
I've been working on Skip's rattlesnake and have made some progress. Right now I'm trying to hack the compiler package and am looking for a good reference on code generation. I have the "New Dragon book" as well as "Essentials of Programming Lanuages" but neither one seem to be telling me want I want to know. Any suggestions? Neil
Neil Schemenauer:
good reference on code generation. I have the "New Dragon book" as well as "Essentials of Programming Lanuages" but neither one seem to be telling me want I want to know. Any suggestions?
You could try Appel: Modern Compiler Implementation in {C,ML,Java}. -- "How should I know if it works? That's what beta testers are for. I only coded it." (Attributed to Linus Torvalds, somewhere in a posting)
On Mon, 18 Feb 2002, Simon Cozens wrote:
Neil Schemenauer:
good reference on code generation. I have the "New Dragon book" as well as "Essentials of Programming Lanuages" but neither one seem to be telling me want I want to know. Any suggestions?
You could try Appel: Modern Compiler Implementation in {C,ML,Java}.
When you get to optimizations, you want Advanced Compiler Design and Implementation by Muchnick. And/Or Building an Optimizing Compiler by Morgan.
Daniel Berlin:
You could try Appel: Modern Compiler Implementation in {C,ML,Java}.
When you get to optimizations, you want Advanced Compiler Design and Implementation by Muchnick.
And/Or Building an Optimizing Compiler by Morgan.
Yeah. See also http://www.perldoc.com/readinglist.pl Don't-be-put-off-by-the-domain-name-ly yrs, Simon -- You are in a maze of little twisting passages, all different.
On Tuesday, February 19, 2002, at 09:51 AM, Neil Schemenauer wrote:
Daniel Berlin wrote:
When you get to optimizations, you want Advanced Compiler Design and Implementation by Muchnick.
Right now I'm not planning to do any optimizations (except perhaps limiting the number of registers used).
This is, of course, a tricky optimization to do. Limiting registers used involves splitting live ranges at the right places, etc. --Dan
On Tue, 19 Feb 2002, Daniel Berlin wrote:
On Tuesday, February 19, 2002, at 09:51 AM, Neil Schemenauer wrote:
Daniel Berlin wrote:
When you get to optimizations, you want Advanced Compiler Design and Implementation by Muchnick.
Right now I'm not planning to do any optimizations (except perhaps limiting the number of registers used).
This is, of course, a tricky optimization to do. Limiting registers used involves splitting live ranges at the right places, etc.
Why limit the number of registers at all? So long as they fit in L1 cache you are golden. If not, no great loss. Of course, this does mean that you will want to have the ability to heap-allocate large register files, though I suspect that frame objects do this already for fast locals (of course, I haven't looked). -Kevin -- Kevin Jacobs The OPAL Group - Enterprise Systems Architect Voice: (216) 986-0710 x 19 E-mail: jacobs@theopalgroup.com Fax: (216) 986-0714 WWW: http://www.theopalgroup.com
On Tuesday, February 19, 2002, at 11:01 AM, Kevin Jacobs wrote:
On Tue, 19 Feb 2002, Daniel Berlin wrote:
On Tuesday, February 19, 2002, at 09:51 AM, Neil Schemenauer wrote:
Daniel Berlin wrote:
When you get to optimizations, you want Advanced Compiler Design and Implementation by Muchnick.
Right now I'm not planning to do any optimizations (except perhaps limiting the number of registers used).
This is, of course, a tricky optimization to do. Limiting registers used involves splitting live ranges at the right places, etc.
Why limit the number of registers at all? So long as they fit in L1 cache you are golden.
Err, what makes you think this? The largest problem on architectures like x86 is the number of registers. You end up with about 4 usable registers. (hardware register renaming only helps eliminate instruction dependencies, before someone mentions it). Performance quickly drops when you start spilling registers to the stack. In fact, i've seen multiple SPEC regressions of 15% or more caused by a single extra spilled register. Why? Because you have to save it and reload it multiple times. These *kill* pipelines, and instruction scheduling. It's also *much* harder to model the cache hierarchy properly so that you can make sure they'd fit in the l1 cache, than it is to make sure they stay in registers where needed in the first place. Try taking a performance critical loop entirely in registers, and change it to save to and load from memory into a register on every iteration. See how much slower it gets. --Dan
Daniel Berlin wrote:
The largest problem on architectures like x86 is the number of registers. You end up with about 4 usable registers. (hardware register renaming only helps eliminate instruction dependencies, before someone mentions it). Performance quickly drops when you start spilling registers to the stack.
I'm not going to be using hardware registers. Bytecode will be generated to run on a virtual machine. I can use a many registers as I want. However, I suspect it would be better to reuse registers rather than have one for every intermediate result. Neil
[Neil Schemenauer]
I'm not going to be using hardware registers. Bytecode will be generated to run on a virtual machine. I can use a many registers as I want. However, I suspect it would be better to reuse registers rather than have one for every intermediate result.
I think your intuition there is darned good. Within a basic block, "the obvious" greedy scheme is provably optimal wrt minimizing the # of temp registers needed by the block: start the block with an initially empty set of temp registers. March down the instructions one at a time. For each input temp register whose contained value's last use is in the current instruction, return that temp register to the set of free temp registers. Then for each output temp register needed by the current instruction, take one (any) from the set of free temp registers; else if the set is empty invent a new temp register out of thin air (bumping a block high-water mark is an easy way to do this). That part is easy. What's hard (too expensive to be practical, in general) is provably minimizing the overall number of temps *across* basic blocks. Still, look up "graph coloring register assignment" on google and you'll find lots of effective heuristics. For a first cut, punt: just store everything still alive into memory at the end of a basic block. If you're retaining Rattlesnake's idea of treating "the register file" as a superset of the local vrbl vector, mounds of such "stores" will be nops. What's also hard is selecting instructions in such a way as to minimize the number of temp registers needed, ditto ordering instructions toward that end. When you think about those, you realize that minimizing the number of temps is actually a tradeoff, not an absolute good, both affecting and affected by other decisions. A mountain of idiosyncratic heuristics follows soon after <wink>. but=you-don't-have-to-solve-everything-at-the-start-ly y'rs - tim
Tim Peters wrote:
Within a basic block, "the obvious" greedy scheme is provably optimal wrt minimizing the # of temp registers needed by the block
I already had this part of the plan mostly figured out. Thanks for verifying my thinking however.
What's hard (too expensive to be practical, in general) is provably minimizing the overall number of temps *across* basic blocks.
This was the part that was worrying me.
Still, look up "graph coloring register assignment" on google and you'll find lots of effective heuristics. For a first cut, punt: just store everything still alive into memory at the end of a basic block.
Okay, that's easy. I suspect it will work fairly well in practice since most functions have a small number of basic blocks and that increasing the number of registers is cheap.
If you're retaining Rattlesnake's idea of treating "the register file" as a superset of the local vrbl vector, mounds of such "stores" will be nops.
I'm planning to keep this idea. There seems to be no good reason to treat local variables any differently than registers. I suppose it would be fairly easy to add a simple peep-hole optimizer that would clean out the redundant stores. When you talked about flexible intermediate code did you have anything in mind? Hmm, perhaps constants can be handled in a similar way. The only way I can think of doing it at the moment is to copy the list of constants into registers when the frame is created. That seems like it could easily end up as a net loss though.
What's also hard is selecting instructions in such a way as to minimize the number of temp registers needed, ditto ordering instructions toward that end.
But is there really any freedom to do reordering? For example, a BINARY_ADD opcode to end up doing practically anything. Neil
[Tim]
Within a basic block, "the obvious" greedy scheme is ...
[Neil Schemenauer]
I already had this part of the plan mostly figured out. Thanks for verifying my thinking however.
You're welcome. Note that you've been pretty mysterious about what it is you do want to know, so I'm more pleased that I channeled you than that I was slightly helpful <wink>.
What's hard (too expensive to be practical, in general) is provably minimizing the overall number of temps *across* basic blocks.
This was the part that was worrying me.
It can worry you later just as well. Python isn't C, and the dynamic semantics make it very much harder to prove that a subexpression is, e.g., a loop invariant, or that an instance of "+" won't happen to change the binding of every global, etc (ha! now I see you pointed that out yourself later -- good chanelling on your end too <wink>). For that reason there's less need to get gonzo at the start. IIRC, the primary motivation for Rattlesnake was to cut eval loop overhead, and it's enough to aim for just that much at the start.
If you're retaining Rattlesnake's idea of treating "the register file" as a superset of the local vrbl vector, mounds of such "stores" will be nops.
I'm planning to keep this idea. There seems to be no good reason to treat local variables any differently than registers.
Not now. If you're looking at going on to generate native machine code someday, well, this isn't the only simplification that will bite.
I suppose it would be fairly easy to add a simple peep-hole optimizer that would clean out the redundant stores. When you talked about flexible intermediate code did you have anything in mind?
That's why I urged you to keep it in Python at the start. IRs come in all sorts of flavors, and AFAICT people more or less stumble into one that works well for them based on what they've done so far (and that was my experience too in my previous lives). You have to expect to rework it as ambitions gorw. Note Daniel Berlin's helpful comment that gcc is moving toward 3 IRs now; that's the way these things always go. At the start, I expect I'd represent a Python basic block as an object containing a plain Python list of instruction objects. Then you've got the whole universe of zippy Python builtin list ops to build on when mutating the instruction stream. Note that my focus is to minimize *your* time wrestling with this stuff: implementing fancy data structures is a waste of time at the start. I'd also be delighted to let cyclic gc clean up dead graph structures, so wouldn't spend any time trying, e.g., to craft a gimmick out of weakrefs to avoid hard cycles. You may or may not want to do a survey of popular IRs. Here's a nice *brief* page with links to follow: http://www.math.tau.ac.il/~guy/pa/ir.html I think a lot of this is a matter of taste. For example, some people swear by the "Value Dependence Graphs" that came out of Microsoft Research. I never liked it, and it's hard to explain why ("too complicated"). Static Single Assignment form is more to my tastes, but again it's hard to explain why ("almost but not quite too simple"). Regardless, you can get a lot done at the Rattlesnake level just following your gut intuitions, prodded by what turns out to be too clumsy. As with most other things, reading papers is much more useful after you've wrestled with the problems on your own. There's only one way to do it, but what that way is remains a mystery to me.
... But is there really any freedom to do reordering? For example, a BINARY_ADD opcode to end up doing practically anything.
That's right, and that's why you should gently file away but ignore almost all the advice you'll get <wink>. Skip kept discovering over and over again just how little his peephole optimizer could get away with doing on Python bytecode. Collapsing jumps to jumps is one of the few safe things you can get away with. BTW, the JUMP_IF_TRUE and JUMP_IF_FALSE opcodes peek at the top of the stack instead of consuming it. As a result, both the fall-through and target instructions are almost always the no-bang-for-the-buck POP_TOP. This always irritated me to death (it's *useful* behavior, IIRC, only for chained comparisons). If Rattlesnake cleans up just that much, it will be a huge win in my eyes, depite not being measurable <ahem>.
[Neil Schemenauer]
I've been working on Skip's rattlesnake and have made some progress.
Cool! I encourage this. That and 2 dollars will buy you a cup of coffee.
Right now I'm trying to hack the compiler package and am looking for a good reference on code generation. I have the "New Dragon book" as well as "Essentials of Programming Lanuages" but neither one seem to be telling me want I want to know. Any suggestions?
Write it in Python because you won't be happy with your first 2 or 3 attempts. Compiler books have historically been very heavy on front end issues, in large part because the theory of parsing is well understood. What relatively little you can find on back end issues tends to be sketchy and severely limited by the author's personal experience. In large part this is because almost no interesting optimization problem can be solved in linear time (whether it's optimal instruction selection, optimal register assignment, optimal instruction ordering, ...), so real-life back ends are a mountain of idiosyncratic heuristics. Excellent advice that almost nobody follows <0.5 wink>: choose a flexible intermediate representation, then structure all your transformations as independent passes, such that the output of every pass is acceptable as the input to every pass. Then keep each pass focused, as simple as possible (for example, if a transformation may create regions of dead code, don't dare try to clean it up in the same pass, or contort the logic even a little bit to try to avoid creating dead code -- instead let it create all the dead code it wants, and (re)invoke a "remove dead code" pass afterwards). *Because* back ends are a mountain of idiosyncratic heuristics, this design lets you add new ones, remove old ones, and reorder them with minimal pain. One compiler I read about (but didn't have the pleasure of using) actually allowed you to specify the sequence of back end transformations on the cmdline, using a regular expression notation where, e.g. (hoist dead)+ meant "run the hoist pass followed by the dead code removal pass, one or more times, until a fixed point is reached". Since none of that told you want to know either <wink>, what do you want to know? Sounds like a legit topic for python-dev.
On Mon, 18 Feb 2002, Tim Peters wrote:
[Neil Schemenauer]
I've been working on Skip's rattlesnake and have made some progress.
Cool! I encourage this. That and 2 dollars will buy you a cup of coffee.
Right now I'm trying to hack the compiler package and am looking for a good reference on code generation. I have the "New Dragon book" as well as "Essentials of Programming Lanuages" but neither one seem to be telling me want I want to know. Any suggestions?
Write it in Python because you won't be happy with your first 2 or 3 attempts.
Compiler books have historically been very heavy on front end issues, in large part because the theory of parsing is well understood. What relatively little you can find on back end issues tends to be sketchy and severely limited by the author's personal experience. In large part this is because almost no interesting optimization problem can be solved in linear time (whether it's optimal instruction selection, optimal register assignment, optimal instruction ordering, ...), so real-life back ends are a mountain of idiosyncratic heuristics.
This is true. In fact, it's actually worse than "can be solved in linear time", it's "are currently thought/proved to be in NP". For graph coloring register allocation algorithms, it's even worse (if you thought that was possible). You can't even approximate the chromatic number of the graph (IE, the number of colors, and therefore, registers, it would take to color it) to more than a certain degree in an absurd time bound. However, you've missed the middle end, where a lot of interesting optimizations *can* be done in linear time or n log n time, and where most people now concentrate their time. On an SSA graph, you can do at least the following in linear time or n log n time: Partial Redundancy Elimination Conditional Constant Propagation Copy propagation Dead store elimination Dead load elimination Global code motion Global value numbering Store motion Load motion Dead code elimination Lots of loop optimizations Lots of memory hiearchy optimizations I've ignored interprocedural optimizations, including various pointer analyses that are linear time or close to it, because it would be harder to apply them to python.
Excellent advice that almost nobody follows <0.5 wink>: choose a flexible intermediate representation, then structure all your transformations as independent passes, such that the output of every pass is acceptable as the input to every pass.
Everyone tries to do this these days, actually. At least, from my working on gcc and looking at the source to tons of compilers each year. You really need more than one level of IR to do serious optimization. Tradeoff between losing valueable info (such as array indexing operations) vs. simplicity of writing optimization passes usually causes people to do some types optimization on higher level IR's (particularly, loop optimizations), while other optimization passes on lower IR's. GCC is moving towards 3 IR's, a language independent tree IR, a mid-level RTL, and the current low-level RTL.
Then keep each pass focused, as simple as possible (for example, if a transformation may create regions of dead code, don't dare try to clean it up in the same pass, or contort the logic even a little bit to try to avoid creating dead code -- instead let it create all the dead code it wants, and (re)invoke a "remove dead code" pass afterwards).
Usually you don't hit this problem inside a single pass like DCE, because they iterate until nothing changes.
One compiler I read about (but didn't have the pleasure of using) actually allowed you to specify the sequence of back end transformations on the cmdline, using a regular expression notation where, e.g.
(hoist dead)+
meant "run the hoist pass followed by the dead code removal pass, one or more times, until a fixed point is reached".
Since none of that told you want to know either <wink>, what do you want to know? Sounds like a legit topic for python-dev.
Tim> Excellent advice that almost nobody follows <0.5 wink>: choose a Tim> flexible intermediate representation, then structure all your Tim> transformations as independent passes, such that the output of Tim> every pass is acceptable as the input to every pass. I did this with my peephole optimizer. It worked great. Each peephole optimization is a very simple subclass of an OptimizeFilter base class. The IR is essentially the bytecode split into basic blocks, with each basic block a list of (opcode argument) tuples. Jump targets are represented as simple indexes into the block list. (In fact, my Rattlesnake converter was just a "peephole optimizer" named InstructionSetConverter.) As Tim mentioned about KISS, this means you sometimes have to run particular optimizations or groups of optimizations multiple times. I want to get it checked into the sandbox where others can play with it, but time has shifted its foundation a tad and a couple optimizations don't work any longer. Skip
participants (10)
-
Daniel Berlin -
David Ascher -
Jack Jansen -
Ka-Ping Yee -
Kevin Jacobs -
M.-A. Lemburg -
Neil Schemenauer -
Simon Cozens -
Skip Montanaro -
Tim Peters