[OT] stats on type-inferencing atomic types in local variables in the stdlib
My thesis (which, for those who don't know, was to come up with a way to do type inferencing in the compiler without requiring any semantic changes; basically type inferencing atomic types assigned to local variables) is now far enough long that I have the algorithm done and I can generate statistics on what opcodes are called with the most common types that I can specifically infer (can also do static type checking on occasion; only triggers were actual unit tests making sure TypeError was raised for certain things like ``~4.2`` and such). Thought some of you might get a kick out of this since the numbers are rather blatent for certain opcodes and methods. To read the stats, the number to the left is the number of times the opcode was compiled (not executed!) with the specific type(s) known for the opcode (if it took two args, then both types are listed; order was considered irrelevant). Now they are listed as integers, so here is the conversion:: Basestring 4 IntegralType 8 FloatType 16 ImagType 32 DictType 64 ListType 128 TupleType 256 For the things named "meth_<something>" that is the method being called immediately on the type. Now realize these numbers are only for opcodes where I could definitely infer the type; ones where it could be more than one type, regardless if those possibilities were very specific, I just ignored it and did not include in the stats. I also tweaked some opcodes knowing how they are more often used. So, for instance, BINARY_MODULO checks specifically for the case of when the left side is a basestring and then just doesn't worry about the other args. Other ones I just didn't bother with all the args since it was not interesting to me in terms of deciding what type-specific opcodes I want to come up with. Anyway, here are the numbers on Lib sans Lib/test (129,814 lines according to SLOCCount) for the ones above 100:: (101, ('BINARY_MULTIPLY', (8, 4))), (106, ('BINARY_SUBSCR', 128)), (118, ('GET_ITER', 128)), (124, ('BINARY_MODULO', None)), (195, ('meth_join', 4)), (204, ('BINARY_ADD', (8, 8))), (331, ('BINARY_ADD', (4, 4))), (513, ('BINARY_LSHIFT', (8, 8))), (840, ('meth_append', 128)), (1270, ('PRINT_ITEM', 4)), (1916, ('BINARY_MODULO', 4)), (12302, ('STORE_SUBSCR', 64))] We sure like our dictionaries (for those that don't know, dictionaries are created by making an empty dict and then basically doing an indivual assignment for each value). We also seem to love to use string interpolation, and printing stuff. Using list.append is also popular. Now the BINARY_LSHIFT is rather interesting, and that ties into the whole issue of how much I can actually infer; since binary work tends to be with all constants I can infer it really easily and so its frequency is rather high. Its actual frequency of use, though, compared to other things probably is not high, though. Plus I doubt Plone, for instance, uses ``<<`` very often so I suspect the opcode will get weeded out when I incorporate stats from the other apps I am taking stats from. As for the stuff I cut out, the surprising thing from those numbers was how few mathematical expressions could be inferred. I checked my numbers with grep and there really is only 3 times where a float constant is divided by a float constant (and they are all in colorsys). I was not expecting that at all. Guess global variables or object attributes tend to have them or I just can't infer the values. Either way I just wasn't expecting that. Anyway, as I said I just thought some people might find this interesting. Don't read into this too much since I am just using these numbers as guidelines for type-specific opcodes to write for use as a quantifiable measurement of the usefulness of type inferencing like this. -Brett P.S.: anyone who is *really* interested I can send you the full stats for the apps I have run my modified version of compile.c against.
Hello Brett, On Thu, Sep 30, 2004 at 01:02:34PM -0700, Brett C. wrote:
(101, ('BINARY_MULTIPLY', (8, 4))), (106, ('BINARY_SUBSCR', 128)), (118, ('GET_ITER', 128)), (124, ('BINARY_MODULO', None)), (195, ('meth_join', 4)), (204, ('BINARY_ADD', (8, 8))), (331, ('BINARY_ADD', (4, 4))), (513, ('BINARY_LSHIFT', (8, 8))), (840, ('meth_append', 128)), (1270, ('PRINT_ITEM', 4)), (1916, ('BINARY_MODULO', 4)), (12302, ('STORE_SUBSCR', 64))]
I was surprized at first to see so few operations involving integers. There isn't even LOAD_SUBSCR for a list and an integer, though I believe it to be a very common operation. It makes me guess that your type-inferencer does not recognize 'i' to be an integer in constructions like for i in range(...) ? If so, it might be worth considering that some built-ins (most likely range(), len(), enumerate()) should be treated specially. Remember there was also discussion in this list at some point about the compiler producing opcodes like BUILTIN_LEN. This means that it might be acceptable to break the precise semantics (which involve looking up the name 'len' or 'range' in the globals first) and just special-case these common built-ins. Armin
Armin Rigo wrote:
Hello Brett,
On Thu, Sep 30, 2004 at 01:02:34PM -0700, Brett C. wrote:
(101, ('BINARY_MULTIPLY', (8, 4))), (106, ('BINARY_SUBSCR', 128)), (118, ('GET_ITER', 128)), (124, ('BINARY_MODULO', None)), (195, ('meth_join', 4)), (204, ('BINARY_ADD', (8, 8))), (331, ('BINARY_ADD', (4, 4))), (513, ('BINARY_LSHIFT', (8, 8))), (840, ('meth_append', 128)), (1270, ('PRINT_ITEM', 4)), (1916, ('BINARY_MODULO', 4)), (12302, ('STORE_SUBSCR', 64))]
I was surprized at first to see so few operations involving integers. There isn't even LOAD_SUBSCR for a list and an integer, though I believe it to be a very common operation.
It's picked up and listed above; 106 times. I already type check that lists are being indexed by integer or an object so I didn't bother to keep separate stats for those two cases.
It makes me guess that your type-inferencer does not recognize 'i' to be an integer in constructions like
for i in range(...)
?
Right, I don't infer those situations.
If so, it might be worth considering that some built-ins (most likely range(), len(), enumerate()) should be treated specially. Remember there was also discussion in this list at some point about the compiler producing opcodes like BUILTIN_LEN. This means that it might be acceptable to break the precise semantics (which involve looking up the name 'len' or 'range' in the globals first) and just special-case these common built-ins.
Not going to break semantics for my thesis (the whole was not to), but this is all going to be mentioned in the Future Work section on what can be done to improve the situation for type inferencing. It would definitely help if built-ins were known at compile-time, but since I have limited time I had to limit myself to what could be done without adding features to the compiler beyond just type inferencing. -Brett
participants (2)
-
Armin Rigo
-
Brett C.