So I was saying to someone the other day "Gee, I wonder why Python doesn't use tagged integers, it seems like it would be a lot faster than allocating new objects all the time.", and they said "Cause you'd have to change everything, too much work!" and I said "Nah, you only need to change a few things to use macros, it'd only take a few hours, mostly query-replace".

So, of course, I had to do it then, and it only took a couple hours, and appears to be at least somewhat faster. On the test that probably puts my change in the most positive light possible: "x = 0; while x < 50000000: x = x + 1", it achieves about a 50% increase in speed. More normal integer-heavy things seem to be at most 20% faster.

My implementation works as follows:

- Assumes at least 32-bit words.
- The lower 2 bits of the machine word are the tag. Tag 0 == normal object pointer (all object pointers are already word aligned, so their lower 2 bits will be 0). Tag 1 == integer. Tags 2 and 3 are unassigned.
- Integers that fit in 30 bits are stored in the word directly, not in allocated memory. For example, the integer 5 would be stored as (5 << 2) | 1 == 0x15.
- Thus, an arbitrary "PyObject *" can no longer be relied upon to actually be a pointer. Sometimes it will be tagged data instead.
- Thus, no code can directly access the object fields ->ob_refcnt or ->ob_type. I introduced Py_GETTYPE/Py_GETREF macros and search&replaced code in Python to use them. These macros check the tag bits, and do the right thing if it is tagged data.
- I kept tagged integers as the same class as allocated int objects, as it simplified the implementation (nothing outside intobject knows the difference). It would probably be nicer to do away with heap-allocated int objects and overflow directly to long objects after 30 bits, but, IIRC, there are various bits of Python that require integers up to the platform's full integer width to fit in an instance of PyInt_Type. Also, subclasses of int must be heap allocated too, of course.
- intobject.c cannot directly access ->ob_ival. Instead, I made it use PyInt_AS_LONG, which is modified to know about tagged integers.
- That's about it!

So, why doesn't Python already use tagged integers? Surely someone's thought to "just do it" before? I only see discussion of it in relation to PyPy. A couple things:

- It's almost certainly not fully portable -- that's fine! Keep using heap-allocated int objects on architectures it doesn't work with.
- It does cause incompatibility with external C modules. Not only binary incompatibility -- the source also needs to be modified to not use ->ob_type/->ob_refcnt directly. ob_refcnt is hardly used outside of the INCREF/DECREF macros already, so that's good. ob_type is used a lot in extension modules' PyXXX_Check macros.

Here's the patch I have against Python-2.3.3. Please note this is just a couple-hour hack; it may have errors. Most of the diff is quite boring and repetitious. http://fuhm.net/~jknight/python233-tagint.diff.gz

So, any thoughts? Worth continuing on with this? If this is something people are interested in actually doing, I could update the patch against latest CVS and put the changes in #ifdefs so it's compile-time selectable.

Thoughts for future development: there is space available for 2 more tagged data types. Could they be productively used? Perhaps one for single-element tuples. Perhaps one for single-character unicode strings. Dunno if those are easily doable and would actually increase performance.

James

PS: Here are the interesting portions of the changes.
Yes, I realize typeof() and ({ are GCC extensions, but this was the most straightforward way to implement inline expression macros that may use their arguments more than once. Maybe they should be inline functions instead of macros?

===== object.h

#define Py_OBJ_TAG(op) (((Py_uintptr_t)(op)) & 0x03)

#define Py_GETTYPE(op) ({typeof((op)) __op = (op); \
    (!Py_OBJ_TAG(__op)) ? __op->ob_type : Py_tagged_types[Py_OBJ_TAG(__op)]; })

#define Py_GETREF(op) ({typeof((op)) __op = (op); \
    Py_OBJ_TAG(__op) ? 1 : __op->ob_refcnt; })

#define Py_SETREF(op, val) ({typeof((op)) __op = (op); \
    if (!Py_OBJ_TAG(__op)) \
        __op->ob_refcnt = (val); \
    })

#define Py_ADDREF(op, amt) ({typeof((op)) ___op = (op); \
    if (!Py_OBJ_TAG(___op)) \
        ___op->ob_refcnt += (amt); \
    Py_GETREF(___op); \
    })

#define Py_INCREF(op) ( \
    _Py_INC_REFTOTAL _Py_REF_DEBUG_COMMA \
    Py_ADDREF(op, 1), (void)0)

#define Py_DECREF(op) \
    if (_Py_DEC_REFTOTAL _Py_REF_DEBUG_COMMA \
        Py_ADDREF(op, -1) != 0) \
        _Py_CHECK_REFCNT(op) \
    else \
        _Py_Dealloc((PyObject *)(op))

===== intobject.h

#define PyInt_AS_LONG(op) ({typeof((op)) __op = (op); \
    Py_OBJ_TAG(__op) ? (((long)__op) >> 2) : ((PyIntObject *)__op)->ob_ival; })
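[Editor's note: for concreteness, here is a minimal sketch -- not part of the patch -- of the encoding side that matches the PyInt_AS_LONG decoding above. The 30-bit range check and the heap fallback name are illustrative assumptions.]

/* Sketch only: encode a C long as a tagged PyObject * when it fits in
 * 30 bits, using the (value << 2) | 1 layout described in the post;
 * e.g. 5 becomes 0x15.  Relies on two's-complement shifts, as the
 * patch's PyInt_AS_LONG does.  heap_int_from_long is a hypothetical
 * stand-in for the existing allocating path. */
static PyObject *
tagged_int_from_long(long ival)
{
    if (ival >= -(1L << 29) && ival < (1L << 29))
        return (PyObject *)(Py_uintptr_t)((ival << 2) | 1);
    return heap_int_from_long(ival);  /* hypothetical fallback */
}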
James Y Knight wrote:
So I was saying to someone the other day "Gee, I wonder why Python doesn't use tagged integers, it seems like it would be a lot faster than allocating new objects all the time.", and they said "Cause you'd have to change everything, too much work!" and I said "Nah, you only need to change a few things to use macros, it'd only take a few hours, mostly query-replace".
So, of course, I had to do it then, and it only took a couple hours, and appears to be at least somewhat faster.
On the test that probably puts my change in the most positive light possible: "x = 0; while x < 50000000: x = x + 1", it achieves about a 50% increase in speed. More normal integer-heavy things seem to be at most 20% faster.
Interesting. I would have thought that hackery like this would result in more significant speedups. Looks like the Python implementation is darn fast already :-)

Note that Python shares small integers and uses a free list for the rest, so most of the time the implementation can create objects using already-allocated memory.

I'd rather not like to see hackery like misused pointers in the core, so you can count me as -1 on this.

-- Marc-Andre Lemburg, eGenix.com
I'd rather not like to see hackery like misused pointers in the core, so you can count me as -1 on this.
-1000 here. In a past life (the ABC implementation) we used this and it was a horrible experience. Basically, there were too many places where you had to make sure you didn't have an integer before dereferencing a pointer, and finding the omissions one core dump at a time was a nightmare. --Guido van Rossum (home page: http://www.python.org/~guido/)
At 08:40 AM 7/14/04 -0700, Guido van Rossum wrote:
I'd rather not like to see hackery like misused pointers in the core, so you can count me as -1 on this.
-1000 here. In a past life (the ABC implementation) we used this and it was a horrible experience. Basically, there were too many places where you had to make sure you didn't have an integer before dereferencing a pointer, and finding the omissions one core dump at a time was a nightmare.
That certainly seems like it would be so, except... doesn't every bit of Python code that's doing anything *other* than accessing ob_type or ob_refcnt have to first check if ob_type is the type that code is looking for anyway? And, don't the vast majority of accesses to ob_type and ob_refcnt occur inside Python core API macros and functions anyway?

If in addition to Mr. Knight's patch, ob_type and ob_refcnt were renamed so as to break any existing direct access in the core or extension modules, upgrading them to use Py_GETTYPE and Py_GETREF would be straightforward, because one would "find the omissions" one *compilation error* at a time. Further, renaming those fields to something like 'XXXusePy_GETTYPE_instead' would provide a built-in hint that you weren't supposed to use the field directly. :)

An additional trick: make it so that the macros for defining new object types still create an ob_type field, but let PyObject * point to the structure with the XXX fields. Thus, you can only access ob_type directly if you've already cast to a non-PyObject * type (i.e., you've already used an appropriate PyXXX_Check on that variable and would like to use it as a regular object now).

Naturally, the issues of portability and what impact Mr. Knight's approach has on non-integer code would need to be taken into consideration as well, but it seems to me that this approach *could* be made type-safe, at the cost of some initial pain to update the approximately 640 uses of 'ob_type' in the current source base.
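[Editor's note: a rough sketch of the renaming trick described above; the field names are hypothetical, following the 'XXX' suggestion, and this simplified macro evaluates its argument twice, which the patch's GCC statement-expression versions avoid.]

/* Hypothetical sketch: direct field access no longer compiles, so
 * omissions surface as compile errors rather than core dumps. */
typedef struct _object {
    int XXXusePy_GETREF_instead;                    /* was ob_refcnt */
    struct _typeobject *XXXusePy_GETTYPE_instead;   /* was ob_type */
} PyObject;

#define Py_GETTYPE(op) \
    (Py_OBJ_TAG(op) ? Py_tagged_types[Py_OBJ_TAG(op)] \
                    : ((PyObject *)(op))->XXXusePy_GETTYPE_instead)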
-1000 here. In a past life (the ABC implementation) we used this and it was a horrible experience. Basically, there were too many places where you had to make sure you didn't have an integer before dereferencing a pointer, and finding the omissions one core dump at a time was a nightmare.
That certainly seems like it would be so, except... doesn't every bit of Python code that's doing anything *other* than accessing ob_type or ob_refcnt have to first check if ob_type is the type that code is looking for anyway?
And, don't the vast majority of accesses to ob_type and ob_refcnt occur inside Python core API macros and functions anyway?
If in addition to Mr. Knight's patch, ob_type and ob_refcnt were renamed so as to break any existing direct access in the core or extension modules, upgrading them to use Py_GETTYPE and Py_GETREF would be straightforward because one would "find the omissions" one *compilation error* at a time.
Further, renaming those fields to something like 'XXXusePy_GETTYPE_instead' would provide a built-in hint that you weren't supposed to use the field directly. :)
An additional trick: make it so that the macros for defining new object types still create an ob_type field, but let PyObject * point to the structure with the XXX fields. Thus, you can only access ob_type directly if you've already cast to a non PyObject * type. (i.e., you've already used an appropriate PyXXX_Check on that variable and would like to use it as a regular object now).
Naturally, the issues of portability and what impact Mr. Knight's approach has on non-integer code would need to be taken into consideration as well, but it seems to me that this approach *could* be made type-safe, at the cost of some initial pain to update the approximately 640 uses of 'ob_type' in the current source base.
Sorry, I'm still not convinced that it's worth breaking all the 3rd-party extensions that undoubtedly are doing all sorts of things that, strictly speaking, they shouldn't do.

And what about all the extra code generated for Py_DECREF and Py_INCREF calls? These now all contain an extra jump. Horrors!

--Guido van Rossum (home page: http://www.python.org/~guido/)
On Jul 14, 2004, at 9:42 PM, Guido van Rossum wrote:
Sorry, I'm still not convinced that it's worth breaking all the 3rd-party extensions that undoubtedly are doing all sorts of things that, strictly speaking, they shouldn't do.
There really is a minimal set of things you can do that will not already cause problems. The only thing you can do with an arbitrary PyObject * is access its ob_type or ob_refcnt. Anything else will break with objects today. So, those accesses all need to be cleaned up to use the new macros. The other thing a 3rd-party extension could do that would now break is to access a PyIntObject's ->ob_ival field directly. There's already the PyInt_AS_LONG macro they ought to be using for that.

If the idea to rename the ob_type/ob_refcnt fields is implemented, I really don't see any mysterious runtime failures occurring. The idea of doing this in stages (as Jeff Epler says) is probably a good one.

One question, though (I am somewhat unknowledgeable in this area): are C extensions generally binary compatible between Python major versions anyway? I had thought that they weren't.
And what about all the extra code generated for Py_DECREF and Py_INCREF calls? These now all contain an extra jump. Horrors!
Indeed. That (and the similar branch in Py_GETTYPE) is why some operations are slower. The only question here is: do the speedups outweigh the slowdowns?

Investigating the pybench some more, it seems that the speedup for many of the tests is because of the shortcut for tagged types in Py_INCREF and Py_DECREF. E.g. with the pybench TupleSlicing test, as written, the speed diff is -39.13%. However, changing the tuple to contain strings instead of integers causes the change to be +0.03%.

This makes me think that perhaps using an inline tagged representation for things like True, False, and single-character strings < 256 might be a win as well, even though they don't ever cause memory allocation -- just because of the refcounting. Well, I tried making booleans a tagged type as well: pybench then runs 11% faster than standard Python (vs. the 7% with just tagged integers). The all-important benchmark Parrotbench runs 4.8% faster than standard Python.

James

P.S.: I have also previously experimented with removing refcounting from Python completely and using the Boehm GC, because I theorized that refcounting could very well be slower than a full conservative GC. It mostly worked (I did not finish, so it did not support destructors or weakrefs), but was not a win, because it was no longer possible to use large-block allocation and free-list tricks for integers, and thus integer allocation caused a huge slowdown. I might revisit that now. I do realize using the Boehm GC has pretty much a negative probability of making it into standard Python, though. :)
James Y Knight writes:
Investigating the pybench some more, it seems that speedup for many of the tests is because of the shortcut for tagged types in Py_INCREF and Py_DECREF. E.g. with the pybench TupleSlicing test, as written, the speed diff is -39.13%. However, changing the tuple to contain strings instead of integers causes the change to be +0.03%.
This makes me think that perhaps using an inline tagged representation for things like True, False, and single-character strings < 256 might be a win as well, even though they don't ever cause memory allocation -- just because of the refcounting. Well, I tried making booleans a tagged type as well: pybench then runs 11% faster than standard Python (vs. the 7% with just tagged integers).
Interesting!
P.S.: I have also previously experimented with removing refcounting from Python completely, and using the Boehm GC, because I theorized that refcounting could very well be slower than a full conservative GC. It mostly worked (I did not finish, so it did not support destructors or weakrefs), but was not a win, because it was no longer possible to use large block allocation and free list tricks for integers, and thus integer allocation caused a huge slowdown. I might revisit that now. I do realize using Boehm GC has pretty much a negative probability of making it into standard python, though. :)
The last time I mentioned that the Boehm GC didn't help Python much (on comp.lang.lisp), Hans Boehm showed up and claimed that it was because we weren't using it properly :-) Let us know how it goes...

Cheers, mwh
James Y Knight writes:
The only thing you can do with an arbitrary PyObject * is access its ob_type or ob_refcnt. Anything else will break with objects today.
One other thing that would break is testing whether an object is an integer and then accessing fields of the integer object directly. Not that this seems a likely thing for people to do, though.

-- Greg Ewing, Computer Science Dept, University of Canterbury
At 01:41 PM 7/16/04 +1200, Greg Ewing wrote:
James Y Knight writes:
The only thing you can do with an arbitrary PyObject * is access its ob_type or ob_refcnt. Anything else will break with objects today.
One other thing that would break is testing whether an object is an integer and then accessing fields of the integer object directly. Not that this seems a likely thing for people to do, though.
I guess we'd need to XXX-ify ob_ival as well, then. There's already a PyInt_AS_LONG macro that should be used for that purpose. So far, the -1000 from Guido makes it seem unlikely this'll get accepted, though.

I am sort of curious, though, as to how many extension modules would actually require changes to support this. A bit of grepping through the source of several I use seems to indicate that it isn't that common to use ob_type, and it's most likely to appear:

1) in the setup of a type object, setting the type's type to type :) (by far the most common, and not a usage that needs to be changed)
2) in the definition of a macro that verifies the type of an extension object
3) in allocation/deallocation routines

Only uses 2 and 3 would need to be changed, and they tend to be quite isolated (see the sketch below). By contrast, I didn't find any ob_ival uses, and the only place I saw direct use of ob_refcnt was in some Pyrex-generated tp_dealloc code. Of course, my sampling was hardly statistically valid. I imagine the real extensions to check out would be things like SciPy, Numeric, PIL, wxPython, Boost::Python, win32all, PyCrypto, etc. -- that is, widely used packages of very large size.

If I understand correctly, extensions have to be recompiled to move between Python versions anyway, and compared to some previous C API changes, this one actually seems pretty minor. Extensions like 'kjbuckets' that were written for early 1.x Python versions definitely needed work to make them build with the later 1.x and 2.x versions.
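[Editor's note: to illustrate, here is roughly what updating patterns 2 and 3 might look like in a hypothetical extension; SpamObject and Spam_Type are made-up names, not from any real module.]

/* Pattern 2: a type-check macro.  Before:
 *     #define Spam_Check(op) ((op)->ob_type == &Spam_Type)
 * After, safe in the presence of tagged words: */
#define Spam_Check(op) (Py_GETTYPE(op) == &Spam_Type)

/* Pattern 3: a dealloc routine.  self is known to be a real heap
 * object here, but going through the macro keeps the source uniform. */
static void
Spam_dealloc(SpamObject *self)
{
    Py_GETTYPE(self)->tp_free((PyObject *)self);  /* was self->ob_type->tp_free */
}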
James Y Knight wrote:
So I was saying to someone the other day "Gee, I wonder why Python doesn't use tagged integers, it seems like it would be a lot faster than allocating new objects all the time.", and they said "Cause you'd have to change everything, too much work!" and I said "Nah, you only need to change a few things to use macros, it'd only take a few hours, mostly query-replace".
So, of course, I had to do it then, and it only took a couple hours, and appears to be at least somewhat faster.
It made integer-heavy things 20% faster. How much slower did it make everything else? Did you try running a complete benchmark (e.g. pystone) before and after?

And the lack of portability is a real problem -- it means that every code path must be tested twice AND there might be architectures and problems where tagged integers work but actually make things slower.

I'm skeptical until I see a complete system benchmark, but it's great that you tried this!

Cheers, Brian
Brian Quinlan wrote: ...
And the lack of portability is a real problem - it means that every code path must be tested twice AND there might be architectures and problems where tagged integers work but actually make things slower.
I think it was not the proposal to use tagging on every platform, but just those where it works. Having this as a compile-time feature would make it possible to optimize your Python for your purposes.

Things get interesting when it comes to comparing refcounting cost vs. pointer decoding cost vs. garbage collection cost. I have no idea what the results will be, and I think it is just *great* to try this out, because the effect could be dramatic: if very many tiny, often-used objects are tagged, then very many refcount actions can be saved, the overall object count shrinks, and Boehm GC might become feasible. If we then can write C extensions without having to think of reference counts, code would be simplified quite a lot, and I would save 80 percent of debugging time.

Maybe we have a set of rules which depend on each other:

1) "Tagging is -1000 because of the headaches"
2) "Everything is an object and has a refcount"
3) "Many tiny objects must be allocated as blocks"
4) "Boehm GC disables bulk allocation of tiny objects"

1) is just a claim which I would not sign; see the post about systematically renaming fields to find all places with the compiler. 1) leads to 2), 2) leads to 3), and this disables using 4). By solving 1), it might be possible to break the whole chain, and I would even accept a slightly slower result if the whole system gets *simpler* by dropping refcounts.

So whatever the outcome, I'm happy that somebody took it on to try this out.

ciao - chris
[Christian Tismer]
... Boehm GC might become feasible. If we then can write C extensions without having to think of reference counts, code would be simplified quite much, and I would save 80 percent of debugging time.
There's no escape from gc debugging nightmares at the C level, regardless of gc strategy. With Boehm gc (Bgc) you have to ensure that all live objects are reachable from what Bgc guesses is the current root set, and screwing that up is just as deadly as screwing up refcounts. It's harder to debug, though, because refcounting is very simple to picture, while Bgc is a relatively huge pile of complicated code in its own right. I'm sure Neil Schemenauer still has fond memories of the pleasant days he spent trying to account for Tk crashes last time he tried hooking Bgc to Python <heh>. Bgc has a special entry point to call, to register roots that Bgc can't find on its own -- guess how *you* find out which roots those are? One crash at a time, and in the presence of threads you're damned lucky to get a repeatable test case.
Tim Peters wrote:
[Christian Tismer]
... Boehm GC might become feasible. If we then can write C extensions without having to think of reference counts, code would be simplified quite much, and I would save 80 percent of debugging time.
There's no escape from gc debugging nightmares at the C level, regardless of gc strategy. With Boehm gc (Bgc) you have to ensure that all live objects are reachable from what Bgc guesses is the current root set, and screwing that up is just as deadly as screwing up refcounts. It's harder to debug, though, because refcounting is very simple to picture, while Bgc is a relatively huge pile of complicated code in its own right. I'm sure Neil Schemenauer still has fond memories of the pleasant days he spent trying to account for Tk crashes last time he tried hooking Bgc to Python <heh>. Bgc has a special entry point to call, to register roots that Bgc can't find on its own -- guess how *you* find out which roots those are? One crash at a time, and in the presence of threads you're damned lucky to get a repeatable test case.
Well, I was about to stick with this answer. Then I saw that Prothon is going a different path: They don't use a Boehm gc, but they make every object indirect, over a global object table, and use generational garbage collection with arenas. I guess this would not lead to problems like stack introspection, finding roots etc., because all objects are in the object table. Of course it is another level of indirection everywhere, with the advantage that object bodies become moveable.

I think this is a simple enough gc to shorten the nightmares. No idea how much the performance loss is, or if this is amortized by the missing reference counting? At least an opportunity to try would be nice to have. If not now, we will try it with PyPy, anyway.

ciao - chris
Christian Tismer writes:
Then I saw that Prothon is going a different path: They don't use a Boehm gc, but they make every object indirect, over a global object table, and use generational garbage collection with arenas. I guess this would not lead to problems like stack introspection, finding roots etc., because all objects are in the object table. Of course it is another level of indirection everywhere, with the advantage that object bodies become moveable.
That would break countless extension modules, though. Today, extensions can count on stable object addresses. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com
David Abrahams wrote:
Christian Tismer writes:
Then I saw that Prothon is going a different path: They don't use a Boehm gc, but they make every object indirect, over a global object table, and use generational garbage collection with arenas. I guess this would not lead to problems like stack introspection, finding roots etc., because all objects are in the object table. Of course it is another level of indirection everywhere, with the advantage that object bodies become moveable.
That would break countless extension modules, though. Today, extensions can count on stable object addresses.
That's ok. I don't think this whole thread is about a real change to Python in the near future. It is about testing alternatives and studying how things change then. Sure, this is all impossible to make compatible.

ciao - chris
Christian Tismer writes:
I guess this would not lead to problems like stack introspection, finding roots etc., because all objects are in the object table.
You still need roots to find reachable objects. The object table contains all objects, whether reachable or not.

-- Greg Ewing, Computer Science Dept, University of Canterbury
Greg Ewing wrote:
Christian Tismer writes:
I guess this would not lead to problems like stack introspection, finding roots etc., because all objects are in the object table.
You still need roots to find reachable objects. The object table contains all objects, whether reachable or not.
I was referring to the problems which Tim pointed out. With an explicit object table, I see far fewer problems.

-- Christian Tismer
I tried this back in 2002. http://mail.python.org/pipermail/python-dev/2002-August/027685.html

Excerpt:

[...] Subject: [Python-Dev] Performance (non)optimization: 31-bit ints in pointers
|
| Performance results are mixed.  A small program designed to test the
| speed of all-integer arithmetic comes out faster by 14% (3.38 vs 2.90
| "user" time on my machine) but pystone comes out 5% slower (14124 vs 13358
| "pystones/second").
|
| I don't know if anybody's barked up this tree before, but I think
| these results show that it's almost certainly not worth the effort to
| incorporate this "performance" hack in Python.  I'll keep my tree around
| for awhile, in case anybody else wants to see it, but beware that it
| still has serious issues even in the core:
| >>> 0+0j
| Traceback (most recent call last):
|   File "<stdin>", line 1, in ?
| TypeError: unsupported operand types for +: 'int' and 'complex'
| >>> (0).__class__
| Segmentation fault

Guido replied:

| We used *exactly* this approach in ABC.  I decided not to go with it
| in Python, for two reasons that are essentially what you write up
| here: (1) the changes are very pervasive (in ABC, we kept finding
| places where we had pointer-manipulating code that had to be fixed to
| deal with the small ints), and (2) it wasn't at all clear if it was a
| performance win in the end (all the extra tests and special cases
| may cost as much as you gain).

http://mail.python.org/pipermail/python-dev/2002-August/027686.html

Jeff
On Jul 14, 2004, at 12:28 PM, Jeff Epler wrote:
I tried this back in 2002. http://mail.python.org/pipermail/python-dev/2002-August/027685.html
Interesting that I somehow missed that in my search. I had thought surely someone must've tried it already. Ah well. :) But I'm glad I did it, because I do have somewhat different results:

1) No segfaults or other weird behavior (all tests pass)
2) With tagged integers, it does actually benchmark faster than without (although not by all that much).

The version of the patch I sent earlier was indeed slightly slower at pystone than standard Python. However, I have made a few changes that have since made it faster instead:

1) made Py_tagged_types a const array.
2) removed inlining of creation of small ints for PyInt_FromLong. Turns out that makes it go slower, not faster.

Anyhow, the current result is a 3% overall speedup of pystone: 35791 pystones/second (standard Python 2.3.3) vs. 36900 pystones/second (tagged integers).

I also tried running pybench -w 5. Here are the results, compared to standard Python 2.3.3. Overall speedup of 7.19%. I must admit I find these results quite puzzling, as I certainly didn't expect it to affect list slicing that much. Also, that the Compare* tests are so much slower seems strange as well. Perhaps there's something that can be done about that.

Tests:                       per run    per oper.  diff *)
------------------------------------------------------------------------
BuiltinFunctionCalls:       227.75 ms    0.45 us   -3.72%
BuiltinMethodLookup:        386.30 ms    0.18 us   +3.33%
CompareFloats:              283.50 ms    0.16 us  +57.46%
CompareFloatsIntegers:      343.35 ms    0.19 us   -1.07%
CompareIntegers:            261.90 ms    0.07 us   -0.44%
CompareInternedStrings:     210.10 ms    0.11 us  +19.68%
CompareLongs:               244.45 ms    0.14 us  +30.65%
CompareStrings:             338.05 ms    0.17 us  +11.73%
CompareUnicode:             265.35 ms    0.18 us  +17.70%
ConcatStrings:              456.20 ms    0.76 us   -3.31%
ConcatUnicode:              564.00 ms    0.94 us   -2.92%
CreateInstances:            300.30 ms    1.79 us   -2.52%
CreateStringsWithConcat:    178.35 ms    0.22 us   +4.54%
CreateUnicodeWithConcat:    559.35 ms    0.70 us   +3.84%
DictCreation:               230.50 ms    0.38 us  -12.02%
DictWithFloatKeys:          504.20 ms    0.21 us   -5.86%
DictWithIntegerKeys:        225.10 ms    0.09 us  -22.82%
DictWithStringKeys:         247.20 ms    0.10 us   -7.98%
ForLoops:                   183.85 ms    4.60 us  -18.97%
IfThenElse:                 211.60 ms    0.08 us   -2.78%
ListSlicing:                 81.70 ms    5.84 us  -60.64%
NestedForLoops:             141.95 ms    0.09 us  -12.38%
NormalClassAttribute:       296.05 ms    0.12 us   -4.25%
NormalInstanceAttribute:    266.15 ms    0.11 us  -10.51%
PythonFunctionCalls:        336.45 ms    0.51 us  -10.05%
PythonMethodCalls:          277.80 ms    0.93 us   -1.35%
Recursion:                  233.50 ms    4.67 us  -18.99%
SecondImport:               191.85 ms    1.92 us   +4.44%
SecondPackageImport:        205.75 ms    2.06 us   +3.03%
SecondSubmoduleImport:      255.75 ms    2.56 us   +3.67%
SimpleComplexArithmetic:    206.70 ms    0.23 us   +4.21%
SimpleDictManipulation:     172.50 ms    0.14 us  -14.22%
SimpleFloatArithmetic:      296.15 ms    0.13 us   +8.32%
SimpleIntFloatArithmetic:   183.85 ms    0.07 us  -33.65%
SimpleIntegerArithmetic:    182.00 ms    0.07 us  -32.62%
SimpleListManipulation:     217.50 ms    0.20 us  -10.12%
SimpleLongArithmetic:       161.70 ms    0.25 us   +6.73%
SmallLists:                 562.85 ms    0.55 us  +13.40%
SmallTuples:                468.30 ms    0.49 us  +15.70%
SpecialClassAttribute:      289.00 ms    0.12 us   -5.97%
SpecialInstanceAttribute:   460.00 ms    0.19 us   -1.33%
StringMappings:             319.10 ms    0.63 us   -0.17%
StringPredicates:           297.50 ms    0.27 us   -0.77%
StringSlicing:              270.00 ms    0.39 us   -1.93%
TryExcept:                  203.20 ms    0.03 us   +0.15%
TryRaiseExcept:             217.40 ms    3.62 us  -28.71%
TupleSlicing:               223.55 ms    0.53 us  -42.15%
UnicodeMappings:            250.10 ms    3.47 us   +0.44%
UnicodePredicates:          257.60 ms    0.29 us   +2.79%
UnicodeProperties:          398.55 ms    0.50 us   +6.69%
UnicodeSlicing:             395.95 ms    0.57 us   +1.71%
------------------------------------------------------------------------
Average round time:       15663.00 ms              -7.19%

For comparison, here are the results from standard Python:

Tests:                       per run    per oper.  overhead
------------------------------------------------------------------------
BuiltinFunctionCalls:       236.55 ms    0.46 us    0.50 ms
BuiltinMethodLookup:        373.85 ms    0.18 us    1.00 ms
CompareFloats:              180.05 ms    0.10 us    1.00 ms
CompareFloatsIntegers:      347.05 ms    0.19 us    1.00 ms
CompareIntegers:            263.05 ms    0.07 us    2.00 ms
CompareInternedStrings:     175.55 ms    0.09 us    3.50 ms
CompareLongs:               187.10 ms    0.10 us    1.00 ms
CompareStrings:             302.55 ms    0.15 us    3.50 ms
CompareUnicode:             225.45 ms    0.15 us    2.50 ms
ConcatStrings:              471.80 ms    0.79 us    1.00 ms
ConcatUnicode:              580.95 ms    0.97 us    1.00 ms
CreateInstances:            308.05 ms    1.83 us    1.00 ms
CreateStringsWithConcat:    170.60 ms    0.21 us    1.00 ms
CreateUnicodeWithConcat:    538.65 ms    0.67 us    1.00 ms
DictCreation:               262.00 ms    0.44 us    1.00 ms
DictWithFloatKeys:          535.60 ms    0.22 us    3.50 ms
DictWithIntegerKeys:        291.65 ms    0.12 us    3.50 ms
DictWithStringKeys:         268.65 ms    0.11 us    3.00 ms
ForLoops:                   226.90 ms    5.67 us    0.50 ms
IfThenElse:                 217.65 ms    0.08 us    2.50 ms
ListSlicing:                207.55 ms   14.83 us    0.50 ms
NestedForLoops:             162.00 ms    0.11 us    0.00 ms
NormalClassAttribute:       309.20 ms    0.13 us    1.50 ms
NormalInstanceAttribute:    297.40 ms    0.12 us    1.50 ms
PythonFunctionCalls:        374.05 ms    0.57 us    1.00 ms
PythonMethodCalls:          281.60 ms    0.94 us    0.50 ms
Recursion:                  288.25 ms    5.77 us    0.50 ms
SecondImport:               183.70 ms    1.84 us    0.50 ms
SecondPackageImport:        199.70 ms    2.00 us    0.50 ms
SecondSubmoduleImport:      246.70 ms    2.47 us    0.50 ms
SimpleComplexArithmetic:    198.35 ms    0.23 us    1.00 ms
SimpleDictManipulation:     201.10 ms    0.17 us    0.50 ms
SimpleFloatArithmetic:      273.40 ms    0.12 us    2.00 ms
SimpleIntFloatArithmetic:   277.10 ms    0.10 us    2.00 ms
SimpleIntegerArithmetic:    270.10 ms    0.10 us    2.00 ms
SimpleListManipulation:     242.00 ms    0.22 us    1.00 ms
SimpleLongArithmetic:       151.50 ms    0.23 us    0.50 ms
SmallLists:                 496.35 ms    0.49 us    2.50 ms
SmallTuples:                404.75 ms    0.42 us    1.50 ms
SpecialClassAttribute:      307.35 ms    0.13 us    2.00 ms
SpecialInstanceAttribute:   466.20 ms    0.19 us    2.00 ms
StringMappings:             319.65 ms    0.63 us    1.50 ms
StringPredicates:           299.80 ms    0.27 us    5.00 ms
StringSlicing:              275.30 ms    0.39 us    1.50 ms
TryExcept:                  202.90 ms    0.03 us    3.00 ms
TryRaiseExcept:             304.95 ms    5.08 us    1.00 ms
TupleSlicing:               386.40 ms    0.92 us    0.50 ms
UnicodeMappings:            249.00 ms    3.46 us    1.00 ms
UnicodePredicates:          250.60 ms    0.28 us    6.50 ms
UnicodeProperties:          373.55 ms    0.47 us    6.50 ms
UnicodeSlicing:             389.30 ms    0.56 us    2.00 ms
------------------------------------------------------------------------
Average round time:       16876.00 ms

James
James Y Knight writes:
On the test that probably puts my change in the most positive light possible: "x = 0; while x < 50000000: x = x + 1", it achieves about a 50% increase in speed. More normal integer-heavy things seem to be at most 20% faster.
And you have to ask yourself -- how integer-heavy does typical Python code get? Most Python code I write deals with higher-level things most of the time -- strings, lists, dicts, class instances.

-- Greg Ewing, Computer Science Dept, University of Canterbury
Greg Ewing writes:
James Y Knight writes:
On the test that probably puts my change in the most positive light possible: "x = 0; while x < 50000000: x = x + 1", it achieves about a 50% increase in speed. More normal integer-heavy things seem to be at most 20% faster.
And you have to ask yourself -- how integer-heavy does typical Python code get?
Most Python code I write deals with higher-level things most of the time -- strings, lists, dicts, class instances.
It's kinda hard to work seriously with lists and strings without integers :-)

This was, at the very least, a worthwhile experiment...

Cheers, mwh
Michael Hudson writes:
It's kinda hard to work seriously with lists and strings without integers :-)
That may be true in C, but I think it's far less true in languages like Python that have high-level iteration facilities and string and list manipulation functions. One still uses *some* integers, but my feeling is they're a much smaller proportion of the mix.
This was, at the very least, a worthwhile experiment...
Certainly, but it needs to be measured on typical code as well, not just code artificially designed to exercise integer operations, before judging whether it's worth the effort.

-- Greg Ewing, Computer Science Dept, University of Canterbury
On Wed, Jul 14, 2004 at 02:41:19AM -0400, James Y Knight wrote:
- Thus, no code can directly access object fields ->ob_refcnt, or ->ob_type. Introduced Py_GETTYPE/Py_GETREF macros and search&replaced code in python to use them. These macros check the tag bits, and do the right thing if it is tagged data.
If this looks like a promising direction, Py_GETTYPE and Py_GETREF macros could be added today; in 2.5 or 2.6, ob_refcnt and ob_type would be removed/renamed, and 2.6 or 2.7 would see the first actual use of tagged types. Sure, you couldn't compile a 2.4 extension on 2.6, but will it matter?

By the way, I don't know what optimizers actually do, but given

PyObject *spam(PyObject *o) {
    Py_GETTYPE(o); junk;
    Py_GETREF(o); junk;
    Py_INCREF(o); junk;
    Py_DECREF(o); junk;
}

the compiler is free to notice that "o" can't change (its address is never taken), so the "is it a tagged object" test can be done once, not 4 times.

I assume that tagged integers could be turned off at compile time, for the platforms where they don't work for one reason or another.

I'm showing my ignorance here, but what other types are traditionally tagged, and do any of those translate well to Python? I suppose you could tag some singleton objects, like None, True, False, and (), but those are already shared. Tagging length-1 strings and unicode strings seems like it might have its advantages, though the length-0 and length-1 strings, length-0 unicode strings, and first 256 length-1 unicode strings are already shared too. I can't see much use for a 21-bit-mantissa floating-point type, and I can't see how you could use tagging for any other types in Python.

Jeff
On Jul 14, 2004, at 10:32 PM, Jeff Epler wrote:
By the way, I don't know what optimizers actually do, but given

PyObject *spam(PyObject *o) {
    Py_GETTYPE(o); junk;
    Py_GETREF(o); junk;
    Py_INCREF(o); junk;
    Py_DECREF(o); junk;
}

the compiler is free to notice that "o" can't change (its address is never taken), so the "is it a tagged object" test can be done once, not 4 times.
I was unable to convince GCC 3.3.3 to be this smart. I don't think it even optimized "Py_GETTYPE(o) == X || Py_GETTYPE(o) == Y" to remove the redundant Py_GETTYPE. Perhaps someone versed in compiler-trickery could explain why it refuses to do common-subexpression-elimination there, or else trick it into doing so.
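[Editor's note: lacking compiler help, the tag test can at least be hoisted by hand. A sketch, using Jeff's spam/junk placeholders, of what the desired common-subexpression elimination would effectively produce; the field accesses are the macro expansions after the one-time tag check.]

PyObject *
spam(PyObject *o)
{
    int tag = Py_OBJ_TAG(o);                   /* tag test done once */
    PyTypeObject *tp = tag ? Py_tagged_types[tag]
                           : ((PyObject *)o)->ob_type;
    /* ... junk ... */
    if (!tag)
        o->ob_refcnt++;                        /* Py_INCREF, branch hoisted */
    /* ... junk ... */
    if (!tag && --o->ob_refcnt == 0)           /* Py_DECREF likewise */
        _Py_Dealloc(o);
    (void)tp;                                  /* placeholder use */
    return o;
}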
I'm showing my ignorance here, but what other types are traditionally tagged, and do any of those translate well to Python?
I suppose you could tag some singleton objects, like None, True, False, and (), but those are already shared.
Tagging length-1 strings and unicode strings seems like it might have its advantages, though the length-0 and length-1 strings, length-0 unicode strings, and first 256 length-1 unicode strings are already shared too.
I can't see much use for a 21-bit-mantissa floating-point type, and I can't see how you could use tagging for any other types in Python.
In general, there's not much point in using tagged objects for singleton types like booleans; it's a waste of a tag value. However, see my other post for counter-results in the case of Python.

You've basically got the standard tagged objects covered there: integers and characters. E.g. see http://www.xemacs.org/Documentation/21.5/html/internals_8.html

I had an idea that it might be possible to have a single-element tuple be a tagged object as well, but that may well be not worth the overhead.

On 64-bit computers it could be possible to steal two bits from the exponent of (small) floats to do inline 62-bit floats. I wouldn't want to steal bits from the mantissa; you'd end up with different (non-IEEE-standard) results.

James
One more random tagged-types thought. If we were introducing the test & branch for tagged types in incref/decref anyway (unlikely, I know), wouldn't this let us add another oft-suggested optimization at the same price: for certain immortal objects (None, (), '', etc.) there's no need to actually change the refcount. Make these objects tagged, and those memory accesses go away in favor of a single register test that already exists to provide tagged integers.

Jeff
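[Editor's note: a sketch of what that could look like under the patch's scheme, assuming one of the two spare tag values were given to the singletons; the tag value and encodings are hypothetical.]

/* Hypothetical: park the immortal singletons in spare tag 2.  The
 * existing tag test in Py_ADDREF then skips their refcount writes
 * for free.  One wrinkle: Py_tagged_types maps a tag to a single
 * type, so None/False/True (which have different types) would need
 * Py_GETTYPE to dispatch on the payload bits for this tag. */
#define Py_TAG_SINGLETON 2
#define Py_None  ((PyObject *)(Py_uintptr_t)((0 << 2) | Py_TAG_SINGLETON))
#define Py_False ((PyObject *)(Py_uintptr_t)((1 << 2) | Py_TAG_SINGLETON))
#define Py_True  ((PyObject *)(Py_uintptr_t)((2 << 2) | Py_TAG_SINGLETON))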
participants (11): Brian Quinlan, Christian Tismer, David Abrahams, Greg Ewing, Guido van Rossum, James Y Knight, Jeff Epler, M.-A. Lemburg, Michael Hudson, Phillip J. Eby, Tim Peters