
[Tim Delaney]
What is the effect on peak memory usage over "average" programs? This might be a worthwhile speedup on small dicts (up to a TBD number of entries) but not worthwhile for large dicts. However, adding this capability would of course add more code to a very common code path (an additional test on the current size to decide what factor to increase by).
Nice (in relative, not absolute terms). Do we have any numbers on memory usage during and after that period? Tim Delaney
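One way numbers like that could be gathered (a hypothetical sketch, not anything actually used for the measurements discussed below) is to interpose on the C allocator and count raw malloc traffic while a benchmark runs; the file name, build line, and counters here are all invented for illustration:

    /* malloc_count.c -- hypothetical LD_PRELOAD shim that counts C-level
     * malloc calls and requested bytes for whatever program loads it.
     * Simplified: not thread-safe, and free/realloc/calloc are not tracked.
     *
     * Build: gcc -shared -fPIC -o malloc_count.so malloc_count.c -ldl
     * Run:   LD_PRELOAD=./malloc_count.so python some_dict_benchmark.py
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void *(*real_malloc)(size_t) = NULL;
    static unsigned long long n_calls = 0;
    static unsigned long long n_bytes = 0;

    void *malloc(size_t size)
    {
        if (real_malloc == NULL)
            real_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
        n_calls++;
        n_bytes += size;
        return real_malloc(size);
    }

    /* Report the totals when the process exits. */
    __attribute__((destructor))
    static void report(void)
    {
        fprintf(stderr, "malloc: %llu calls, %llu bytes requested\n",
                n_calls, n_bytes);
    }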

[Tim Delaney]
> What is the effect on peak memory usage over "average" programs?
[Raymond Hettinger]
Since the amortized growth is the same, the effect is nil on average. Special cases can be contrived with a specific number of records where the memory use is doubled, but in general it is nearly unchanged for average programs.
> This might be a worthwhile speedup on small dicts (up to a TBD number of entries) but not worthwhile for large dicts.
Actually, it helps large dictionaries even more than small dictionaries. Collisions in large dicts are resolved through additional memory probes, which are almost certain not to be in the current cache line.
> However, adding this capability would of course add more code to a very common code path (an additional test on the current size to decide what factor to increase by).
Your intuition is exactly correct. All experiments at special-casing various sizes resulted in decreased performance because they added a tiny amount of code to some of the most heavily exercised paths in Python. Further, it results in an unpredictable branch, which is also not a good thing. [GvR]
[Tim]
> Nice (in relative, not absolute terms). Do we have any numbers on memory usage during and after that period?
I found out that timing dict performance was hard. Capturing memory usage was harder. Of entry space, space plus unused space, calls to PyMalloc, and calls to the OS malloc, only the last is important, but it depends on all kinds of things that are not easily controlled.
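To make the "amortized growth is the same" point concrete, here is a deliberately simplified, hypothetical model of table growth (growth_sim.c, simulate() and next_pow2() are invented names; it ignores deletions, dummy slots, and CPython's small-table handling). It counts how often the 2/3-full resize test fires and what the final table size is when the resize request is used*2 versus the proposed used*4:

    /* growth_sim.c -- simplified, hypothetical model of dict table growth.
     * Not the real CPython logic; it only models the 2/3 fill test and the
     * round-up-to-a-power-of-two behaviour of the resize request.
     */
    #include <stdio.h>
    #include <stddef.h>

    static size_t next_pow2(size_t minsize)
    {
        size_t n = 8;                      /* tables start at 8 slots */
        while (n < minsize)
            n <<= 1;
        return n;
    }

    static void simulate(size_t inserts, size_t factor)
    {
        size_t size = 8, used = 0, resizes = 0;

        for (size_t i = 0; i < inserts; i++) {
            used++;
            if (used * 3 >= size * 2) {    /* same 2/3 fill test as the patch */
                size = next_pow2(used * factor);
                resizes++;
            }
        }
        printf("factor %zu: %zu inserts, %zu resizes, final table %zu slots\n",
               factor, inserts, resizes, size);
    }

    int main(void)
    {
        simulate(1000000, 2);              /* current growth request */
        simulate(1000000, 4);              /* proposed growth request */
        return 0;
    }

Under this model the *4 request roughly halves the number of resizes, while the table just after a resize can be up to about twice as large relative to the number of entries, which matches the "memory use is doubled" special case described above.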

[Tim Delaney]
> What is the effect on peak memory usage over "average" programs?
[Raymond Hettinger]
> Since the amortized growth is the same, the effect is nil on average. Special cases can be contrived with a specific number of records where the memory use is doubled, but in general it is nearly unchanged for average programs.
That doesn't make sense. Dicts can be larger after the patch, but never smaller, so there's nothing opposing the "can be larger" part: on average, allocated address space must be strictly larger than before. Whether that *matters* on average to the average user is something we can answer rigorously just as soon as we find an average user with an average program <wink>. I'm not inclined to worry much about it.
> This might be a worthwhile speedup on small dicts (up to a TBD number of entries) but not worthwhile for large dicts.
That part makes sense. Resizing a large dict is an expensive operation too.
[Raymond Hettinger]
> All experiments at special-casing various sizes resulted in decreased performance because they added a tiny amount of code to some of the most heavily exercised paths in Python.
This part isn't clear: the changed code is in the body of an if() block that normally *isn't* entered (over an ever-growing dict's life, it's entered O(log(len(dict))) times, independent of the number of dict lookups). The change cuts the number of times it's entered by approximately a factor of 2, but it isn't entered often even now.
> Further, it results in an unpredictable branch, which is also not a good thing.
Since the body of the loop isn't entered often, unpredictable one-shot branches within the body shouldn't have a measurable effect. The unpredictable branches when physically resizing the dict will swamp them regardless. The surrounding if-test continues to be predictable in the "branch taken" direction. What could be much worse is that stuffing code into the if-block bloats the code so much as to frustrate lookahead I-stream caching of the normal "branch taken and return 0" path:

    if (mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2) {
        if (dictresize(mp, mp->ma_used*2) != 0)
            return -1;
    }
    return 0;

Rewriting as

    if (mp->ma_used <= n_used || mp->ma_fill*3 < (mp->ma_mask+1)*2)
        return 0;
    return dictresize(mp, mp->ma_used*2) ? -1 : 0;

would help some compilers generate better code for the expected path, especially if the blob after "return 0;" got hairier.

IOW, if fiddling with different growth factors at different sizes slowed things down, we have to look for something that affected the *normal* paths; it's hard to imagine that the guts of the if-block execute often enough to matter (discounting its call to dictresize(), which is an expensive routine).
In my early Cray days, the Cray boxes were batch one-job-at-a-time, and all memory was real. If you had a CPU-bound program, it took the same number of nanoseconds each time you ran it. Benchmarking was hard then too <0.5 wink>.

That last line might as well be

    return dictresize(mp, mp->ma_used*2);  /* Or *4, per Raymond */

Which reminds me, there are two other places where dictresize() is called; shouldn't those be changed to the new fill factor? All in all, I think I'm mildly in favor of this change.

--Guido van Rossum (home page: http://www.python.org/~guido/)
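Putting the two suggestions together, the tail of the insertion path might end up looking something like this (a sketch only, combining the restructuring quoted above with the proposed *4 request; it is not the actual patch text):

    if (mp->ma_used <= n_used || mp->ma_fill*3 < (mp->ma_mask+1)*2)
        return 0;
    return dictresize(mp, mp->ma_used*4);  /* was mp->ma_used*2 */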

participants (4):
- Delaney, Timothy C (Timothy)
- Guido van Rossum
- Raymond Hettinger
- Tim Peters