I've experimented with about a dozen ways to improve dictionary performance and found one that benefits some programs by up to 5% without hurting the performance of other programs by more than a single percentage point.

It entails a one line change to dictobject.c resulting in a new schedule of dictionary sizes for a given number of entries:

    Number of            Current size    Proposed size
    Filled Entries       of dictionary   of dictionary
    ------------------   -------------   -------------
    [--   0 to   5 --]         8                8
    [--   6 to  10 --]        16               32
    [--  11 to  21 --]        32               32
    [--  22 to  42 --]        64              128
    [--  43 to  85 --]       128              128
    [--  86 to 170 --]       256              512
    [-- 171 to 341 --]       512              512

The idea is to lower the average sparseness of dictionaries (by 0% to 50% of their current sparseness). This results in fewer collisions, faster collision resolution, fewer memory accesses, and better cache performance. A small side-benefit is halving the number of resize operations as the dictionary grows.

The above table of dictionary sizes shows that odd numbered steps have the same size as the current approach while even numbered steps are twice as large. As a result, small dicts keep their current size and the amortized size of large dicts remains the same. Along the way, some dicts will be a little larger and will benefit from the increased sparseness.

I would like to know what you guys think about the idea and would appreciate your verifying the performance on your various processors and operating systems.

Raymond Hettinger

P.S. The one line patch is:

    *** dictobject.c	17 Mar 2003 19:46:09 -0000	2.143
    --- dictobject.c	25 Apr 2003 22:33:24 -0000
    ***************
    *** 532,538 ****
           * deleted).
           */
          if (mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2) {
    !         if (dictresize(mp, mp->ma_used*2) != 0)
                  return -1;
          }
          return 0;
    --- 532,538 ----
           * deleted).
           */
          if (mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2) {
    !         if (dictresize(mp, mp->ma_used*4) != 0)
                  return -1;
          }
          return 0;
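To see how the one-line change produces the two schedules in the table, here is a small stand-alone simulation (not part of the patch). It assumes the table starts at 8 slots, that a resize fires once the fill count reaches 2/3 of the table (the ma_fill*3 >= (ma_mask+1)*2 test in the patch), that dictresize rounds up to the next power of two above used*factor, and that there are no deleted-entry dummies (ma_fill == ma_used).

    /* Simulate the dict growth schedule for growth factors *2 and *4. */
    #include <stdio.h>

    int main(void)
    {
        int factor;
        for (factor = 2; factor <= 4; factor += 2) {
            long size = 8, used;
            printf("growth factor *%d:\n", factor);
            for (used = 1; used <= 341; used++) {
                if (used * 3 >= size * 2) {        /* fill reached 2/3 of the slots */
                    long newsize = 8;
                    while (newsize <= used * factor)
                        newsize <<= 1;             /* round up to a power of two */
                    printf("  at %3ld entries: %4ld -> %4ld slots\n",
                           used, size, newsize);
                    size = newsize;
                }
            }
        }
        return 0;
    }

Run with factor *2 it reproduces the "Current size" column (resizes at 6, 11, 22, 43, 86, and 171 entries); with *4 it reproduces the "Proposed size" column and resizes only at 6, 22, and 86, which is the halving of resize operations mentioned above.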
I've experimented with about a dozen ways to improve dictionary performance and found one that benefits some programs by up to 5% without hurting the performance of other programs by more than a single percentage point.
It entails a one line change to dictobject.c resulting in a new schedule of dictionary sizes for a given number of entries:
    Number of            Current size    Proposed size
    Filled Entries       of dictionary   of dictionary
    ------------------   -------------   -------------
    [--   0 to   5 --]         8                8
    [--   6 to  10 --]        16               32
    [--  11 to  21 --]        32               32
    [--  22 to  42 --]        64              128
    [--  43 to  85 --]       128              128
    [--  86 to 170 --]       256              512
    [-- 171 to 341 --]       512              512
I suppose there's an "and so on" here, right? I wonder if for *really* large dicts the space sacrifice isn't worth the time saved?
The idea is to lower the average sparseness of dictionaries (by 0% to 50% of their current sparseness). This results in fewer collisions, faster collision resolution, fewer memory accesses, and better cache performance. A small side-benefit is halving the number of resize operations as the dictionary grows.
I think you mean "raise the average sparseness", don't you? (The more sparse something is, the more gaps it has.)

I tried the patch with my new favorite benchmark, startup time for Zope (which surely populates a lot of dicts :-). It did give about 0.13 seconds speedup on a total of around 3.5 seconds, or almost 4% speedup.

--Guido van Rossum (home page: http://www.python.org/~guido/)
I've experimented with about a dozen ways to improve dictionary performance and found one that benefits some programs by up to 5% without hurting the performance of other programs by more than a single percentage point.
You wouldn't have created some handy tables of 'typical' dictionary usage, would you? They would be interesting in general, but very nice for the PEPs doing dict optimizations for symbol tables in particular.

-jack
[jack, master of the mac]
You wouldn't have created some handy tables of 'typical' dictionary usage, would you? They would be interesting in general, but very nice for the PEPs doing dict optimizations for symbol tables in particular.
That path proved fruitless. I studied the usage patterns in about a dozen of my apps and found that there is no such thing as typical. Instead, there are many categories of dictionary usage:

* attribute/method look-up in many small dictionaries
* uniquification apps with many redundant stores and few lookups
* membership testing with few stores and many lookups into small or large dicts
* database-style lookups following Zipf's law for key access in large dicts
* graph explorers that access a few keys frequently and then move on to another set of related nodes
* global/builtin variable access following a failed search of locals

Almost every dictionary tune-up that helped one app would end up hurting another. The only way to test the effectiveness of a change was to time a whole suite of applications. The standard benchmarks were useless in this regard.

Worse still, contrived test programs would not predict the results for real apps. There were several reasons for this:

* there is a special case for handling dicts that only have string keys
* real apps exhibit key access patterns that pull the most frequently accessed entries into the cache; this thwarted attempts to improve cache performance at the expense of more collisions
* any small, fixed set of test keys may have atypical collision anomalies, non-representative access frequencies, or not be characteristic of other dicts with a slightly different number of keys
* some sets of keys have non-random hash patterns, but if you rely on this, it victimizes other sets of keys
* the results are platform dependent (ratio of processor speed to memory speed; size of cache; size of a cache line; cache associativity; write-back vs. write-through; etc.)

I had done some experiments that focused on symbol tables and had some luck with sequential searches into a self-organizing list. Using a list eliminated the holes and allowed more of the entries to fit in a single cache line. No placeholders were needed for deleted entries, and that saves a test in the search loop. The self-organizing property kept the most frequently accessed entries at the head of the list. Using a sequential search had slightly less overhead than the hash table search pattern. Except for the builtin dictionary, most of the symbol tables in my apps have only a handful of entries.

if-only-i-had-had-a-single-valid-dict-performance-predictor-ly yours,

Raymond Hettinger
On Mon, Apr 28, 2003 at 07:50:14PM -0400, Raymond Hettinger wrote:
[jackdiederich]
You wouldn't have created some handy tables of 'typical' dictionary usage, would you? They would be interesting in general, but very nice for the PEPs doing dict optimizations for symbol tables in particular.
That path proved fruitless. I studied the usage patterns in about a dozen of my apps and found that there is no such thing as typical. Instead there are many categories of dictionary usage. [symbol table amongst them]
A good proj would be breaking out the particular cases of dictionary usage and using the right dict for the right job. Module symbol tables are dicts that have a different 'typical' usage than dicts in general. They are likely even regular enough in usage to actually _have_ a typical usage (no finger quotes).

I've looked at aliasing dicts used in symbol (builtin, module, local) tables so they could be specialized from generic dicts in the source, and I get lost in the nuances (esp frame stuff). If someone who did know the code well enough would make the effort, it would allow those of us who are familiar but not intimate with the source to take a shot at optimizing a particular use case (symbol table dicts). Alas, people who are that familiar aren't likely to do it; they have more important things to do.

-jackdied

ps, I've always wanted to try ternary trees as symbol tables. They have worse than O(1) lookup, but in real life are probably OK for symbol tables. They nest beautifully and do cascading caching decently.
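For reference, here is a minimal lookup sketch of the kind of structure jackdied presumably has in mind, assuming he means Bentley/Sedgewick-style ternary search trees: each node splits on one character, and a complete key terminates at a node whose split character is '\0'. Insertion and balancing are omitted; this only shows the shape of the search.

    /* Ternary search tree lookup over NUL-terminated identifier strings. */
    typedef struct tstnode {
        char            splitchar;
        struct tstnode *lo, *eq, *hi;
        void           *value;        /* set only on '\0' (terminal) nodes */
    } tstnode;

    static void *
    tst_search(const tstnode *p, const char *key)
    {
        while (p != NULL) {
            if (*key < p->splitchar)
                p = p->lo;
            else if (*key > p->splitchar)
                p = p->hi;
            else {
                if (*key == '\0')
                    return p->value;  /* matched the whole key */
                key++;                /* consume one character, descend */
                p = p->eq;
            }
        }
        return NULL;                  /* key not present */
    }

Note that no hashing is involved at all; the cost is one node per character along the search path, which is why the "worse than O(1)" caveat above may not matter for short identifier keys.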
Guido van Rossum wrote:
I've experimented with about a dozen ways to improve dictionary performance and found one that benefits some programs by up to 5% without hurting the performance of other programs by more than a single percentage point.
It entails a one line change to dictobject.c resulting in a new schedule of dictionary sizes for a given number of entries:
Perhaps you could share that change ? Or is it on SF somewhere ?
    Number of            Current size    Proposed size
    Filled Entries       of dictionary   of dictionary
    ------------------   -------------   -------------
    [--   0 to   5 --]         8                8
    [--   6 to  10 --]        16               32
    [--  11 to  21 --]        32               32
    [--  22 to  42 --]        64              128
    [--  43 to  85 --]       128              128
    [--  86 to 170 --]       256              512
    [-- 171 to 341 --]       512              512
I suppose there's an "and so on" here, right? I wonder if for *really* large dicts the space sacrifice isn't worth the time saved?
Once upon a time, when I was playing with inlining dictionary tables (now part of the dictionary implementation thanks to Tim), I found that optimizing dictionaries to have a table size of 8 gave the best results. Most dictionaries in a Python application have very few entries (and most of them were instance dictionaries at the time -- not sure whether that's changed).

Another result of my experiments was that reducing the number of resizes made a big difference.

To get some more useful numbers, I suggest instrumenting Python to display the table size of dictionaries and the number of resizes necessary to make them that big. You should also keep a good eye on the overall process size.

I believe that the reason for the speedups you see is that cache sizes and processor optimizations have changed since the time the current resizing implementation was chosen, so maybe we ought to rethink the parameters:

* minimum table size
* first three resize steps

I don't think that large dictionaries should become more sparse -- that's just a waste of memory.
The idea is to lower the average sparseness of dictionaries (by 0% to 50% of their current sparseness). This results in fewer collisions, faster collision resolution, fewer memory accesses, and better cache performance. A small side-benefit is halving the number of resize operations as the dictionary grows.
I think you mean "raise the average sparseness" don't you? (The more sparse something is, the more gaps it has.)
I tried the patch with my new favorite benchmark, startup time for Zope (which surely populates a lot of dicts :-). It did give about 0.13 seconds speedup on a total around 3.5 seconds, or almost 4% speedup.
--
Marc-Andre Lemburg
eGenix.com
Professional Python Software directly from the Source (#1, Apr 29 2003)
Python/Zope Products & Consulting ... http://www.egenix.com/
mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
EuroPython 2003, Charleroi, Belgium: 56 days left
[Raymond]
I've experimented with about a dozen ways to improve dictionary performance and found one that benefits some programs by up to 5% without hurting the performance of other programs by more than a single percentage point.
It entails a one line change to dictobject.c resulting in a new schedule of dictionary sizes for a given number of entries:
[Mark Lemburg]
Perhaps you could share that change ? Or is it on SF somewhere ?
It was in the original post. But SF is better, so I just loaded it to the patch manager: www.python.org/sf/729395

[GvR]
I suppose there's an "and so on" here, right? I wonder if for *really* large dicts the space sacrifice isn't worth the time saved?
Due to the concerns raised about massive dictionaries, I revised the patch to switch back to the old growth schedule for sizes above 100,000 entries (approx 1.2 Mb).

[Mark Lemburg]
I believe that the reason for the speedups you see is that cache sizes and processor optimizations have changed since the time the current resizing implementation was chosen, so maybe we ought to rethink the parameters:
* minimum table size
* first three resize steps
I've done dozens of experiments with changing these parameters and changing the resize ratio (from 2/3 to 4/5, 3/5, 1/2, 3/7, and 4/7) but found that what helped some applications would hurt others. The current tuning remains fairly effective. Changing the resize step from *2 to *4 was the only alteration that yielded across-the-board improvements.

Raymond Hettinger
Raymond Hettinger wrote:
I believe that the reason for the speedups you see is that cache sizes and processor optimizations have changed since the time the current resizing implementation was chosen, so maybe we ought to rethink the parameters:
* minimum table size
* first three resize steps
I've done dozens of experiments with changing these parameters and changing the resize ratio (from 2/3 to 4/5, 3/5, 1/2, 3/7, and 4/7) but found that what helped some applications would hurt others. The current tuning remains fairly effective. Changing the resize step from *2 to *4 was the only alteration that yielded across-the-board improvements.
Ok, but I still fear that using *4 will cause too much memory bloat for dicts which have more than 10-30 entries.

If you instrument Python you'll find that for typical applications, most dictionaries will have only a few entries. Tuning the implementation to those findings is what you really want to do :-)

If you take e.g. Zope, what difference in memory consumption does your patch make ?

--
Marc-Andre Lemburg
eGenix.com
Professional Python Software directly from the Source (#1, Apr 29 2003)
Python/Zope Products & Consulting ... http://www.egenix.com/
mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
EuroPython 2003, Charleroi, Belgium: 56 days left
[M.-A. Lemburg]
... Once upon a time, when I was playing with inlining dictionary tables (now part of the dictionary implementation thanks to Tim),
Thank you!
... I don't think that large dictionaries should become more sparse -- that's just a waste of memory.
Collision resolution is very fast if the dict slots happen to live in cache. When they're out of cache, the apparent speed of the C code is irrelevant: the time is virtually all consumed by the HW (or even OS) resolving the cache misses, and every collision probe is very likely to be a cache miss then (the probe sequence-- by design --jumps all over the slots in (pseudo-)random order).

So when Raymond explained that increasing sparseness helped *most* for large dicts, it made great sense to me. We can likely resolve dozens of collisions in a small dict in the time it takes for one extra probe in a large dict.

Jeremy had a possibly happy idea wrt this: make the collision probe sequence start in a slot adjacent to the colliding slot. That's likely to get sucked into cache "for free", tagging along with the slot that collided. If that's effective, it could buy much of the "large dict" speed gains Raymond saw without increasing the dict size.

If someone wants to experiment with that in lookdict_string(), stick a new

    ++i;

before the for loop, and move the existing

    i = (i << 2) + i + perturb + 1;

to the bottom of that loop. Likewise for lookdict().
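For concreteness, this is roughly the shape the probe loop takes with that change. It is a simplified sketch rather than the real lookdict_string(): dummy entries, the hash comparison, and error handling are left out, and the entry struct and PERTURB_SHIFT here merely mirror their dictobject.c counterparts. Termination relies on the usual invariant that at least one slot is empty.

    #define PERTURB_SHIFT 5

    typedef struct {
        long  me_hash;
        void *me_key;
        void *me_value;
    } entry;

    static entry *
    probe_sketch(entry *ep0, unsigned long mask, long hash, void *key)
    {
        unsigned long i = hash & mask;
        unsigned long perturb;
        entry *ep = &ep0[i];

        if (ep->me_key == NULL || ep->me_key == key)
            return ep;

        ++i;                    /* new: the first retry is the adjacent slot, */
                                /* which likely rode into cache with slot i   */
        for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
            ep = &ep0[i & mask];
            if (ep->me_key == NULL || ep->me_key == key)
                return ep;
            i = (i << 2) + i + perturb + 1;  /* original step, now at loop bottom */
        }
    }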
Jeremy had a possibly happy idea wrt this: make the collision probe sequence start in a slot adjacent to the colliding slot. That's likely to get sucked into cache "for free", tagging along with the slot that collided. If that's effective, it could buy much of the "large dict" speed gains Raymond saw without increasing the dict size.
I worked on similar approaches last month and found them wanting. The concept was that a 64-byte cache line held 5.3 dict entries and that probing those was much less expensive than making a random probe into memory outside of the cache.

The first thing I learned was that the random probes were necessary to reduce collisions. Checking the adjacent space is like a single step of linear chaining; it increases the number of collisions. That would be fine if the cost were offset by decreased memory access time; however, for small dicts, the whole dict is already in cache and having more collisions degrades performance with no compensating gain.

The next bright idea was to have a separate lookup function for small dicts and for larger dictionaries. I set the large dict lookup to search adjacent entries. The good news is that an artificial test of big dicts showed a substantial improvement (around 25%). The bad news is that real programs were worse off than before.

A day of investigation showed the cause. The artificial test accessed keys randomly and showed the anticipated benefit. However, real programs access some keys more frequently than others (I believe Zipf's law applies). Those keys *and* their collision chains are likely already in the cache. So, big dicts had the same limitation as small dicts: you always lose when you accept more collisions in return for exploiting cache locality.

The conclusion was clear: the best way to gain performance was to have fewer collisions in the first place. Hence, I resumed experiments on sparsification.
If someone wants to experiment with that in lookdict_string(), stick a new
++i;
before the for loop, and move the existing
i = (i << 2) + i + perturb + 1;
to the bottom of that loop. Likewise for lookdict().
PyStone gains 1%. PyBench loses 1%.

    timecell     gains 2%   (spreadsheet benchmark)
    timemat      loses 2%   (pure python matrix package benchmark)
    timepuzzle   loses 1%   (class based graph traverser)

Raymond Hettinger

P.S. There is one other way to improve cache behavior but it involves touching code throughout dictobject.c. Move the entry values into a separate array from the key/hash pairs. That way, you get 8 entries per cache line.

P.P.S. One other idea is to use a different search pattern for small dictionaries. Store entries in a self-organizing list with no holes. Dummy fields aren't needed, which saves a test in the linear search loop. When an entry is found, move it one closer to the head of the list so that the most common entries get found instantly. Since there are no holes, all eight cells can be used instead of the current maximum of five. Like the current arrangement, the whole small dict fits into just two cache lines.
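A rough sketch of what the P.P.S. layout might look like, written as plain C rather than as a change to dictobject.c: keys are placeholders (C strings compared with strcmp), overflow into a full hash table is not shown, and "move one closer to the head" is the only self-organizing policy implemented.

    #include <string.h>

    #define SMALL_MAX 8

    typedef struct {
        long        hash;
        const char *key;            /* placeholder key type */
        void       *value;
    } entry_t;

    typedef struct {
        int     used;               /* live entries, 0..SMALL_MAX; no holes */
        entry_t table[SMALL_MAX];
    } smalldict_t;

    static entry_t *
    smalldict_lookup(smalldict_t *sd, const char *key, long hash)
    {
        int i;
        for (i = 0; i < sd->used; i++) {    /* packed, so no dummy test needed */
            entry_t *ep = &sd->table[i];
            if (ep->hash == hash && strcmp(ep->key, key) == 0) {
                if (i > 0) {                /* move one closer to the head */
                    entry_t tmp = sd->table[i - 1];
                    sd->table[i - 1] = *ep;
                    sd->table[i] = tmp;
                    return &sd->table[i - 1];
                }
                return ep;
            }
        }
        return NULL;    /* caller inserts at table[used] if there is room */
    }

With eight packed 12-byte entries the whole table is 96 bytes, so it still spans only two 64-byte cache lines, which is the point of the P.P.S.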
[Raymond Hettinger]
... I worked on similar approaches last month and found them wanting. The concept was that a 64-byte cache line held 5.3 dict entries and that probing those was much less expensive than making a random probe into memory outside of the cache.
The first thing I learned was that the random probes were necessary to reduce collisions. Checking the adjacent space is like a single step of linear chaining, it increases the number of collisions.
Yes, I believe that any regularity will.
That would be fine if the cost were offset by decreased memory access time; however, for small dicts, the whole dict is already in cache and having more collisions degrades performance with no compensating gain.
The next bright idea was to have a separate lookup function for small dicts and for larger dictionaries. I set the large dict lookup to search adjacent entries. The good news is that an artificial test of big dicts showed a substantial improvement (around 25%). The bad news is that real programs were worse-off than before.
You should qualify that to "some real programs", or perhaps "all real programs I've tried". On the other side, I have real programs that access large dicts in random order, so if you tossed those into your mix, a 25% gain on those would more than wipe out the 1-2% losses you saw elsewhere.
A day of investigation showed the cause. The artificial test accessed keys randomly and showed the anticipated benefit. However, real programs access some keys more frequently than others (I believe Zipf's law applies.)
Some real programs do, and, for all I know, most real programs. It's not the case that all real programs do. The counterexamples that sprang instantly to my mind are those using dicts to accumulate stats for testing random number generators. Those have predictable access patterns only when the RNG they're testing sucks <wink>.
Those keys *and* their collision chains are likely already in the cache. So, big dicts had the same limitation as small dicts: You always lose when you accept more collisions in return for exploiting cache locality.
Apart from that "always" ain't always so, I really like that as a summary!
The conclusion was clear, the best way to gain performance was to have fewer collisions in the first place. Hence, I resumed experiments on sparsification.
How many collisions are you seeing?

For int-keyed dicts, all experiments I ran said Python's dicts collided less than a perfectly random hash table would collide (the reason for that is explained in dictobject.c, and comes down to int-keyed dicts tending to use a contiguous range of ints as keys).

For string-keyed dicts, extensive experiments said collision behavior was indistinguishable from a perfectly random hash table.

I never cared enough about other kinds of keys to time 'em, at least not since systematic flaws were fixed in the tuple and float hash functions (e.g., the tuple hash function used to xor the tuple's elements' hash codes, so that all permutations of a given tuple had the same hash code; that's necessary for unordered sets, but tuples are ordered).
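A tiny illustration of the int-key point (not from the thread): CPython hashes a small positive int to itself, so a contiguous run of int keys maps onto a contiguous run of distinct slots in a power-of-two table, and there are no collisions at all until the table has to grow.

    #include <stdio.h>

    int main(void)
    {
        const unsigned long mask = 31;       /* a 32-slot table */
        unsigned long key;

        for (key = 100; key < 120; key++)    /* 20 contiguous int keys */
            printf("key %lu -> slot %lu\n", key, key & mask);
        return 0;                            /* 20 keys, 20 distinct slots */
    }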
If someone wants to experiment with that in lookdict_string(), stick a new
++i;
before the for loop, and move the existing
i = (i << 2) + i + perturb + 1;
to the bottom of that loop. Likewise for lookdict().
PyStone gains 1%. PyBench loses 1%.

    timecell     gains 2%   (spreadsheet benchmark)
    timemat      loses 2%   (pure python matrix package benchmark)
    timepuzzle   loses 1%   (class based graph traverser)
You'll forgive me if I'm skeptical: they're such small differences that, if I saw them, I'd consider them to be a wash -- in the noise. What kind of platform are you running on that has timing behavior repeatable enough to believe 1-2% blips?
P.S. There is one other way to improve cache behavior but it involves touching code throughout dictobject.c.
Heh -- that wouldn't even be considered a minor nuisance to the truly obsessed <wink>.
Move the entry values into a separate array from the key/hash pairs. That way, you get 8 entries per cache line.
What use case would that help? You said before that small dicts are all in cache anyway, so it wouldn't help those. The jumps in large dicts are so extreme that it doesn't seem to matter if the cache line size du jour holds 1 slot or 100. To the contrary, at the end of the large dict lookup, it sounds like it would incur an additional cache miss to fetch the value after the key was found (since that value would no longer ever ride along with the key and hashcode). I can think of a different reason for considering this: sets have no use for the value slot, and wouldn't need to allocate space for 'em.
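For a concrete picture of the layout being weighed here (not code from the thread), the split might look like the sketch below. PyObject is only forward-declared as a stand-in, and the sizes in the comments assume a 32-bit build of the era, where a long and a pointer are 4 bytes each.

    #include <stddef.h>

    typedef struct _object PyObject;     /* stand-in forward declaration */

    typedef struct {                     /* current layout: 12 bytes per entry,  */
        long      me_hash;               /* about 5.3 entries per 64-byte line   */
        PyObject *me_key;
        PyObject *me_value;
    } entry_combined;

    typedef struct {                     /* probed data only: 8 bytes per entry, */
        long      me_hash;               /* exactly 8 entries per 64-byte line   */
        PyObject *me_key;
    } entry_probe;

    typedef struct {
        size_t        mask;              /* table size - 1                       */
        entry_probe  *probe;             /* probe[i] pairs with value[i]         */
        PyObject    **value;             /* touched only after the key matches,  */
                                         /* which is the extra miss noted above  */
    } splitdict;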
P.P.S. One other idea is to use a different search pattern for small dictionaries. Store entries in a self-organizing list with no holes. Dummy fields aren't needed which saves a test in the linear search loop. When an entry is found, move it one closer to the head of the list so that the most common entries get found instantly.
I don't see how more than just one can be found instantly; if "instantly" here means "in no more than a few tries", that's usually true of dicts too -- but is still an odd meaning for "instantly" <wink>.
Since there are no holes, all eight cells can be used instead of the current maximum of five. Like the current arrangement, the whole small dict fits into just two cache lines.
Neil Schemenauer suggested that a few years ago, but I don't recall it going anywhere. I don't know how to predict for which apps it would be faster. If people are keen to optimize very small search tables, think about schemes that avoid the expense of hashing too.
On Wed, Apr 30, 2003 at 12:32:23AM -0400, Tim Peters wrote:
If someone wants to experiment with that in lookdict_string(), stick a new
++i;
before the for loop, and move the existing
i = (i << 2) + i + perturb + 1;
to the bottom of that loop. Likewise for lookdict().
You might also investigate making PyDictEntry a power-of-two bytes big (it's currently 12 bytes) so that they align nicely in the cache, and then use

    i ^= 1;

instead of ++i; so that the second key checked is always in the same (32-byte or bigger) cache line.

Of course, increasing the size of PyDictEntry would also increase the size of all dicts by 33%, so the speed payoff would have to be big. It's also not obvious that ma_smalltable will be 32-byte aligned (since no special effort was made, it's unlikely to be). If it's not, then this optimization would still not pay (compared to i++) for <= MINSIZE dictionaries. (which are the important case?)

A little program indicates that if the table has an 8-byte or better alignment, the xor approach gives same-cache-line results more frequently than the increment approach even with a 12-byte PyDictEntry. This doesn't quite make sense to me. It also indicates that if the alignment is not 32 bytes but the dict is 16 bytes that xor is a loss, which does make sense.

The results (for a 32-byte cache line):

    algorithm  sizeof()  alignment  % in same cache line
    i^=1          12         4           62.5
    i^=1          12         8           75.0
    i^=1          12        16           75.0
    i^=1          12        32           75.0
    i^=1          16         4           50.0
    i^=1          16         8           50.0
    i^=1          16        16           50.0
    i^=1          16        32          100.0
    ++i           12         4           62.5
    ++i           12         8           62.5
    ++i           12        16           62.5
    ++i           12        32           62.5
    ++i           16         4           50.0
    ++i           16         8           50.0
    ++i           16        16           50.0
    ++i           16        32           50.0

So using i^=1 and adding 4 bytes to each dict (if necessary) to get 8-alignment of ma_smalltable would give a 12.5% increase in the hit rate of the second probe compared to i++. Ouch.

When I take into account that each probe accesses me_key (not just me_hash) the results change:

    algorithm  sizeof()  alignment  % in same cache line
    i^=1          16         4           37.5
    ++i           16         4           37.5
    i^=1          12        16           50.0
    i^=1          12        32           50.0
    i^=1          12         4           50.0
    i^=1          12         8           50.0
    i^=1          16        16           50.0
    i^=1          16         8           50.0
    ++i           12        16           50.0
    ++i           12        32           50.0
    ++i           12         4           50.0
    ++i           12         8           50.0
    ++i           16        16           50.0
    ++i           16        32           50.0
    ++i           16         8           50.0
    i^=1          16        32          100.0

You don't beat i++ unless you go to size 16 with alignment 32.

Looking at the # of cache lines accessed on average, the numbers are unsurprising. For the 37.5% items, 1.625 cache lines are accessed for the two probes, 1.5 for the 50% items, and 1.0 for the 100% items. Looking at the number of cache lines accessed for a single probe, 8-or-better alignment gives 1.0 cache lines accessed for 16-byte structures, and 1.125 for all other cases (4-byte alignment or 12-byte structure).

If the "more than 3 probes" case bears optimizing (and I doubt it does), the for(perturb) loop could be unrolled once, with even iterations using ++i or i^=1 and odd iterations using

    i = (i << 2) + i + perturb + 1;

so that the same-cache-line property is used as often as possible. Of course, the code duplication of the rest of the loop body will increase i-cache pressure a bit.

And I'm surprised if you read this far. Summary: i^=1 is not likely to win compared to ++i, unless we increase dict size 33%.

Jeff
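Jeff's "little program" wasn't posted; the following is a guess at the sort of accounting involved. For each table base offset permitted by a given alignment, it checks whether slot i and its second probe (i+1 for ++i, i^1 for i^=1, with wraparound at the table end ignored) start in the same 32-byte cache line, and averages over the 8 slots of a minimum-size table. Under those assumptions it gives the same numbers as the first table above; reproducing the second table would additionally require charging for the me_key field of each probed entry.

    #include <stdio.h>

    #define LINE   32   /* assumed cache line size, in bytes */
    #define NSLOTS 8    /* PyDict_MINSIZE */

    /* Percentage of (base offset, slot) combinations whose slot and
     * second probe start in the same cache line. */
    static double hit_rate(int entsize, int align, int use_xor)
    {
        int base, i, pairs = 0, hits = 0;

        for (base = 0; base < LINE; base += align) {  /* allowed offsets mod 32 */
            for (i = 0; i < NSLOTS; i++) {
                int j = use_xor ? (i ^ 1) : (i + 1);  /* wraparound ignored */
                if ((base + i * entsize) / LINE == (base + j * entsize) / LINE)
                    hits++;
                pairs++;
            }
        }
        return 100.0 * hits / pairs;
    }

    int main(void)
    {
        static const int sizes[]  = {12, 16};
        static const int aligns[] = {4, 8, 16, 32};
        int x, s, a;

        printf("algorithm  sizeof()  alignment  %% in same cache line\n");
        for (x = 1; x >= 0; x--)                      /* i^=1 first, then ++i */
            for (s = 0; s < 2; s++)
                for (a = 0; a < 4; a++)
                    printf("%-9s %9d %10d %12.1f\n",
                           x ? "i^=1" : "++i",
                           sizes[s], aligns[a],
                           hit_rate(sizes[s], aligns[a], x));
        return 0;
    }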
And I'm surprised if you read this far. Summary: i^=1 is not likely to win compared to ++i, unless we increase dict size 33%.
Right! I had tried i^=1 and it had near zero or slightly negative effects on performance. It resulted in more collisions, though the collisions were resolved relatively cheaply.

I had also experimented with changing alignment, but nothing helped. Everything is already word aligned and that takes care of the HW issues. The only benefit to the alignment is that i^=1 guarantees a cache hit. Without alignment, the odds are 4 out of 5.3 will have a hit (since there are 5.3 entries to a line).

Increasing the dict size 33% with unused space doesn't help sparseness and negatively impacts the chance of cache hits you already have with smaller dictionaries.

heyhey-mymy-there's-more-to-the-picture-than-meets-the-eye,

Raymond Hettinger
FYI, for years the dict code had some #ifdef'ed preprocessor gimmick to force cache alignment. I ripped that out a while back because nobody ever reported an improvement when using it.
[Timbot]
FYI, for years the dict code had some #ifdef'ed preprocessor gimmick to force cache alignment. I ripped that out a while back because nobody ever reported an improvement when using it.
Gee, you mean we're not the first ones to have ever thought up dictionary optimizations that didn't pan out?

I've tried square wheels, pentagonal wheels, and gotten even better results with octagonal wheels. Each further subdivision seems to have less and less payoff, so I'm confident that octagonal is close to optimum ;-)

I'm going to write up an informational PEP to summarize the results of research to date. After the first draft, I'm sure the other experimenters will each have lessons to share. In addition, I'll attach a benchmarking suite and dictionary simulator (fully instrumented). That way, future generations can reproduce the results and pick up where we left off.

I've decided that this new process should have a name, something pithy, yet magical sounding, so it shall be dubbed SCIENCE.

Raymond Hettinger
[Raymond Hettinger]
... I'm going to write up an informational PEP to summarize the results of research to date.
I'd suggest instead a text file checked into the Object directory, akin to the existing listsort.txt -- it's only of interest to the small fraction of hardcore developers with an optimizing bent.
After the first draft, I'm sure the other experimenters will each have lessons to share. In addition, I'll attach a benchmarking suite and dictionary simulator (fully instrumented). That way, future generations can reproduce the results and pick up where we left off.
They probably won't, though. The kind of people attracted to this kind of micro-level fiddling revel in recreating this kind of stuff themselves. For example, you didn't look hard enough to find the sequence of dict simulators Christian posted last time he got obsessed with this <wink>. On the chance that they might, a plain text file-- or a Wiki page! --is easier to update than a PEP over time.

The benchmarking suite should also be checked in, and should be very welcome. Perhaps it's time for a "benchmark" subdirectory under Lib/test? It doesn't make much sense even now that pystone and sortperf live directly in the test directory.
I've decided that this new process should have a name, something pithy, yet magical sounding, so it shall be dubbed SCIENCE.
LOL! But I'm afraid it's not real science unless you first write grant proposals, and pay a Big Name to agree to be named as Principal Investigator. I'll write Uncle Don a letter on your behalf <wink>.
[Tim Peters]
The benchmarking suite should also be checked in, and should be very welcome. Perhaps it's time for a "benchmark" subdirectory under Lib/test? It doesn't make much sense even now that pystone and sortperf live directly in the test directory.
Works for me. Can we perhaps decide whether we want to do this in the near future? I am going to be writing up module docs for the test package, and if we are going to end up moving them I would like to get this written into the docs the first time through.

-Brett
[Tim]
The benchmarking suite should also be checked in, and should be very welcome. Perhaps it's time for a "benchmark" subdirectory under Lib/test? It doesn't make much sense even now that pystone and sortperf live directly in the test directory.
Works for me. Can we perhaps decide whether we want to do this in the near future? I am going to be writing up module docs for the test package, and if we are going to end up moving them I would like to get this written into the docs the first time through.
-Brett
Should the benchmarks directory be part of the distribution, or should it be in the nondist part of the CVS tree? --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum <guido@python.org> writes:
[Tim]
The benchmarking suite should also be checked in, and should be very welcome. Perhaps it's time for a "benchmark" subdirectory under Lib/test? It doesn't make much sense even now that pystone and sortperf live directly in the test directory.
Works for me. Can we perhaps decide whether we want to do this in the near future? I am going to be writing up module docs for the test package, and if we are going to end up moving them I would like to get this written into the docs the first time through.
-Brett
Should the benchmarks directory be part of the distribution, or should it be in the nondist part of the CVS tree?
I can't think why you'd want it in nondist, unless they depend on huge input files or something.

Cheers,
M.

--
Those who have deviant punctuation desires should take care of their own perverted needs. -- Erik Naggum, comp.lang.lisp
Should the benchmarks directory be part of the distribution, or should it be in the nondist part of the CVS tree?
I can't think why you'd want it in nondist, unless they depend on huge input files or something.
OK. --Guido van Rossum (home page: http://www.python.org/~guido/)
On Thu, May 01, 2003, Guido van Rossum wrote:
Should the benchmarks directory be part of the distribution, or should it be in the nondist part of the CVS tree?
Given the constant number of arguments in c.l.py about speed, I'd keep it in the distribution unless/until it gets large.

--
Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
"In many ways, it's a dull language, borrowing solid old concepts from many other languages & styles: boring syntax, unsurprising semantics, few automatic coercions, etc etc. But that's one of the things I like about it." --Tim Peters on Python, 16 Sep 93
participants (12)

- Aahz
- Brett Cannon
- Guido van Rossum
- Jack Diederich
- Jeff Epler
- M.-A. Lemburg
- Michael Hudson
- Raymond Hettinger
- Raymond Hettinger
- Terry Reedy
- Tim Peters
- Tim Peters