String hash function multiplier
Does anyone have any issues with changing the hash multiplier for the string and Unicode hash functions? Instead of 1000003, I would like to use Aho's 65599, a prime near 2**16 that is efficiently expressible as (x << 6) + (x << 16) - x. This replaces a multiply with fast adds and shifts (with the two shifts potentially being executed in parallel). Googling for "hash 65599" shows a long history of widespread use and testing without any problems. Raymond Hettinger
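As a quick check of the identity Raymond relies on (this snippet is not part of the original thread): 65599 = 2**16 + 2**6 - 1, so multiplying by it is the same as the shift/add form, and the equality also holds under C's unsigned modular wraparound.

/* Standalone check (not from the thread) that the shift/add form
   really equals multiplication by 65599 = 2**16 + 2**6 - 1. */
#include <assert.h>
#include <stdio.h>

int main(void)
{
    unsigned long x;
    for (x = 0; x < 1000000UL; x++)
        assert(65599UL * x == (x << 6) + (x << 16) - x);
    printf("identity holds\n");
    return 0;
}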
With -O2 -mcpu=i686 or newer, gcc uses "imul" for both 100003 and 65599, rather than shifts and adds. There may be a few people who care about some other processor, but I wouldn't listen to them. (the only non-x86 CPU I program for on a weekly basis doesn't have hardware multiply, but it's also much too small for Python) The current value goes back a long way: http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Objects/stringo... ... all the way back to when Python did string haching instead of hashing. Other than some abstract beauty to 65599, are there some other practical advantages I'm missing? Jeff
On Apr 13, 2004, at 9:09 PM, Jeff Epler wrote:
With -O2 -mcpu=i686 or newer, gcc uses "imul" for both 100003 and 65599, rather than shifts and adds.
There may be a few people who care about some other processor, but I wouldn't listen to them. (the only non-x86 CPU I program for on a weekly basis doesn't have hardware multiply, but it's also much too small for Python)
The current value goes back a long way: http://cvs.sourceforge.net/viewcvs.py/python/python/dist/src/Objects/stringobject.c#rev2.31 ... all the way back to when Python did string haching instead of hashing.
Other than some abstract beauty to 65599, are there some other practical advantages I'm missing?
It's not expected that GCC optimize an integer constant into shifts on its own. Anyways, a practical advantage is that with a sane instruction set, like PPC, it saves you a memory access or some instructions (depending on the compiler I guess). Both 100003 and 65599 are too big to be immediate values in a PPC instruction, but the shift constants are not. I guess the real question for Raymond is, does it really make a measurable difference? And what effect does it have on pickled dicts (or other such hash-using data structures), if any? -bob
On Tue, Apr 13, 2004 at 09:30:28PM -0400, Bob Ippolito wrote:
It's not expected that GCC optimize an integer constant into shifts on its own.
But gcc does! -- with -mcpu=i386, 65599 is optimized into some shifts, and 1000003 is optimized into some very painful use of "lea" (x is in edx):

lea (%edx,%edx,4),%eax   // eax = 5 * edx
lea (%eax,%eax,4),%eax   // eax = 25 * edx
lea (%eax,%eax,4),%eax   // eax = 125 * edx
lea (%eax,%eax,4),%eax   // eax = 625 * edx
lea (%eax,%eax,4),%eax   // eax = 3125 * edx
lea (%eax,%eax,4),%eax   // eax = 15625 * edx
shl $0x5,%eax            // eax = 500000 * edx
add %edx,%eax            // eax = 500001 * edx
lea (%edx,%eax,2),%edx   // edx = 1000003 * edx

On the newer x86 CPUs (starting at i686 / k6) imul is preferred by the optimizer. Here's what 65599 gives, with -mcpu=i386 (x is in edx again):

mov %edx,%eax            // eax = edx
shl $0xa,%eax            // eax = edx * 1024
add %edx,%eax            // eax = edx * 1025
shl $0x6,%eax            // eax = edx * 65600
sub %edx,%eax            // eax = edx * 65599
mov %eax,%edx            // edx = eax

If gcc can emit these tortured instruction sequences, but chooses not to, I have to suspect it knows the properties of the CPU better than me.

Jeff
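To reproduce this kind of inspection, a file like the following is enough (the file and function names here are made up for illustration); compile it with "gcc -S -O2 -mcpu=i386 mulconst.c" and again with -mcpu=i686, then compare the two .s files:

/* mulconst.c -- hypothetical one-function-per-constant file for looking
   at the code gcc emits for a multiply by a compile-time constant. */
long mul65599(long x)
{
    return 65599 * x;
}

long mul1000003(long x)
{
    return 1000003 * x;
}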
[Jeff Epler]
With -O2 -mcpu=i686 or newer, gcc uses "imul" for both 100003 and 65599, rather than shifts and adds.
[Bob Ippolito]
It's not expected that GCC optimize an integer constant into shifts on its own.
Right, the actual diff is:

*** 1145,1151 ****
  	p = (unsigned char *) a->ob_sval;
  	x = *p << 7;
  	while (--len >= 0)
! 		x = (1000003*x) ^ *p++;
  	x ^= a->ob_size;
  	if (x == -1)
  		x = -2;
--- 1152,1158 ----
  	p = (unsigned char *) a->ob_sval;
  	x = *p << 7;
  	while (--len >= 0)
! 		x = (x << 6) + (x << 16) - x + (long)*p++;
  	x ^= a->ob_size;
  	if (x == -1)
  		x = -2;
I guess the real question for Raymond is, does it really make a measurable difference?
Yes, the loop runs 20% faster on my Pentium III. The speedup ought to be much more dramatic on the Pentium IV (where the integer ALU instructions speed up from 1 clock to 0.5 clocks while the latency of integer multiplication grows from 4 clocks to 14-18 clocks).
And what effect does it have on pickled dicts (or other such hash-using data structures), if any?
The test suite runs fine. Dicts may display in a different order than in previous pythons. That may upset some doctests if the writer didn't take Tim's documented advice about making tests that do not depend on display order. Raymond
On Tue, Apr 13, 2004 at 10:16:16PM -0400, Raymond Hettinger wrote:
Yes, the loop runs 20% faster on my Pentium III. The speedup ought to be much more dramatic on the Pentium IV (where the integer ALU instructions speed up from 1 clock to 0.5 clocks while the latency of integer multiplication grows from 4 clocks to 14-18 clocks).
I got different results here, also on a Pentium III (but in a laptop). I'm using gcc, I don't know what you're using. I get the best results from -O3 -mcpu=pentium3 for both MUL values and do worse with CLEVER in most cases. In my test, PyStringObject is not quite the same as in Python, so this could explain different results. Use of a different compiler, or gcc with the implied default -mcpu=i386 (?), might explain why you benchmarked the shifts as better, too. (-O{2,3} -mcpu=i386)

[* = best alternative for these optimizer flags]

-O   -mcpu=i386      -DMUL=100003   1.78
-O   -mcpu=i386      -DMUL=65599    1.37 *
-O   -mcpu=i386      -DCLEVER       1.40
-O   -mcpu=pentium3  -DMUL=100003   1.27 *
-O   -mcpu=pentium3  -DMUL=65599    1.27 *
-O   -mcpu=pentium3  -DCLEVER       1.35
-O2  -mcpu=i386      -DMUL=100003   1.93
-O2  -mcpu=i386      -DMUL=65599    1.54
-O2  -mcpu=i386      -DCLEVER       1.28 *
-O2  -mcpu=pentium3  -DMUL=100003   1.11 *
-O2  -mcpu=pentium3  -DMUL=65599    1.12
-O2  -mcpu=pentium3  -DCLEVER       1.29
-O3  -mcpu=i386      -DMUL=100003   1.69
-O3  -mcpu=i386      -DMUL=65599    1.28
-O3  -mcpu=i386      -DCLEVER       1.10 *
-O3  -mcpu=pentium3  -DMUL=100003   0.90 *
-O3  -mcpu=pentium3  -DMUL=65599    0.90 *
-O3  -mcpu=pentium3  -DCLEVER       1.05
-Os  -mcpu=i386      -DMUL=100003   1.16 *
-Os  -mcpu=i386      -DMUL=65599    1.16 *
-Os  -mcpu=i386      -DCLEVER       1.45
-Os  -mcpu=pentium3  -DMUL=100003   1.05 *
-Os  -mcpu=pentium3  -DMUL=65599    1.05 *
-Os  -mcpu=pentium3  -DCLEVER       1.30

# alternatives.py
OPT = [
    '-O -mcpu=i386',  '-O -mcpu=pentium3',
    '-O2 -mcpu=i386', '-O2 -mcpu=pentium3',
    '-O3 -mcpu=i386', '-O3 -mcpu=pentium3',
    '-Os -mcpu=i386', '-Os -mcpu=pentium3',
]
HASH = ['-DMUL=100003', '-DMUL=65599', '-DCLEVER']

import sys, os
for o in OPT:
    for h in HASH:
        sys.stdout.write("%-20s %-20s " % (o, h))
        sys.stdout.flush()
        os.system("gcc %s %s hashtest.c && TIME=%%U /usr/bin/time ./a.out"
                  % (o, h))
    print

// hashtest.c
#ifndef MUL
#define MUL 100003
#endif

typedef struct {
	long ob_shash;
	unsigned long ob_size;
	unsigned char ob_sval[16];
} PyStringObject;

static long
string_hash(PyStringObject *a)
{
	register int len;
	register unsigned char *p;
	register long x;

	if (a->ob_shash != -1)
		return a->ob_shash;
	len = a->ob_size;
	p = (unsigned char *) a->ob_sval;
	x = *p << 7;
	while (--len >= 0)
#ifdef CLEVER
		x = (x << 6) + (x << 16) - x + (long)*p++;
#else
		x = (MUL*x) ^ *p++;
#endif
	x ^= a->ob_size;
	if (x == -1)
		x = -2;
	a->ob_shash = x;
	return x;
}

int main(void)
{
	PyStringObject s = {-1, 13, "hello raymond"};
	int i;
	for (i = 0; i < 1<<23; i++) {
		string_hash(&s);
		s.ob_shash = -1;
	}
	return 0;
}
I benchmarked this on a Pentium 4 and Raymond's hash does indeed come out ahead, regardless of compiler flags.

Pentium IV, 2.4GHz:
-O2  -mcpu=i386      -DMUL=100003   1.56
-O2  -mcpu=i386      -DMUL=65599    0.78
-O2  -mcpu=i386      -DCLEVER       0.54 *
-O2  -mcpu=pentium3  -DMUL=100003   0.78
-O2  -mcpu=pentium3  -DMUL=65599    0.79
-O2  -mcpu=pentium3  -DCLEVER       0.53 *
-O2  -mcpu=pentium4  -DMUL=100003   0.63
-O2  -mcpu=pentium4  -DMUL=65599    0.65
-O2  -mcpu=pentium4  -DCLEVER       0.50 *

With AMD CPUs, the current multiplier beats both the new multiplier and the version expressed as shifts and adds/subtracts:

On a Duron, 1GHz:
-O2  -mcpu=i386    -DMUL=100003   2.03
-O2  -mcpu=i386    -DMUL=65599    1.04
-O2  -mcpu=i386    -DCLEVER       0.95 *
-O2  -mcpu=athlon  -DMUL=100003   0.92 *
-O2  -mcpu=athlon  -DMUL=65599    1.03
-O2  -mcpu=athlon  -DCLEVER       0.94

On an Athlon XP 2600+:
-O2  -mcpu=i386       -DMUL=100003   0.95
-O2  -mcpu=i386       -DMUL=65599    0.49
-O2  -mcpu=i386       -DCLEVER       0.44 *
-O2  -mcpu=athlon-xp  -DMUL=100003   0.43 *
-O2  -mcpu=athlon-xp  -DMUL=65599    0.48
-O2  -mcpu=athlon-xp  -DCLEVER       0.45

If we want a fast hash function, then we should write one that works a "long" at a time. This probably can't be done if hashes must be equal on different machines, but aren't hashes already different on LP64 machines (because they're 64 bits instead of 32)? (I guess the low-order bits would be identical)

Long-at-a-time hash, Duron, 1GHz:
-O2  -march=athlon-tbird  -DMUL=100003   0.35
-O2  -march=athlon-tbird  -DMUL=65599    0.41
-O2  -march=athlon-tbird  -DCLEVER       0.42

With this code, it would be necessary to allocate strings "rounded up" to 4 bytes, and zero the unused bytes. Or we could do as Tim and Guido suggest: nothing.

Jeff

/* long-at-a-time hashing */
typedef struct {
	long ob_shash;
	unsigned long ob_size;
	union {
		unsigned char ob_sval[16];
		long ob_svall[4];
	};
} PyStringObject;

static long
string_hash(PyStringObject *a)
{
	register int len;
	register long *p;
	register long x;

	if (a->ob_shash != -1)
		return a->ob_shash;
	len = (a->ob_size+3) / 4;
	p = a->ob_svall;
	x = *p << 7;
	while (--len >= 0)
#ifdef CLEVER
		x = (x << 6) + (x << 16) - x + *p++;
#else
		x = (MUL*x) ^ *p++;
#endif
	x ^= a->ob_size;
	if (x == -1)
		x = -2;
	a->ob_shash = x;
	return x;
}
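One possible way around the "rounded up" allocation Jeff mentions (a sketch, not from the thread; hash_tail is a made-up name): copy the final partial word into a zeroed long before mixing it in, so bytes beyond the end of the string never participate.

#include <string.h>

/* Sketch: fold the last 1..sizeof(long)-1 bytes into a word-at-a-time
   hash without requiring the string buffer to be padded with zeros.
   Uses the current multiplier; the same idea works for any multiplier. */
static long
hash_tail(const unsigned char *p, unsigned int nbytes, long x)
{
    long last = 0;               /* bytes past the end stay zero    */
    memcpy(&last, p, nbytes);    /* copy only the bytes that exist  */
    return (1000003 * x) ^ last;
}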
On Wed, 14 Apr 2004, Jeff Epler wrote:
Pentium IV, 2.4GHz: -O2 -mcpu=i386 -DMUL=100003 1.56
{...}
-O2 -mcpu=pentium4 -DMUL=100003 0.63
{...}
With AMD CPUs, the current multiplier beats both the new multipler and the version expressed as shifts and adds/subtracts:
{...}
On an Athlon XP 2600+: -O2 -mcpu=i386 -DMUL=100003 0.95
{...}
-O2 -mcpu=athlon-xp -DMUL=100003 0.43 *
{...}
Long-at-a-time hash, Duron, 1GHz: -O2 -march=athlon-tbird -DMUL=100003 0.35
Ummm... are you showing what you think you're showing here?

As I recall, i386 gcc uses -mcpu and -march differently to most other architectures:
- -mcpu just sets scheduling parameters, but not instruction set;
- -march sets the instruction set.

So most of the timings you show are for the i386 instruction set, but with different scheduling. The exception is the long-at-a-time hash.

--
Andrew I MacIntyre                     "These thoughts are mine alone..."
E-mail: andymac@bullseye.apana.org.au (pref) | Snail: PO Box 370
        andymac@pcug.org.au (alt)            |        Belconnen ACT 2616
Web:    http://www.andymac.org/              |        Australia
On Wed, Apr 14, 2004 at 11:34:13PM +1000, Andrew MacIntyre wrote:
Ummm... are you showing what you think you're showing here? As I recall, i386 gcc uses -mcpu and -march differently to most other architectures:
- -mcpu just sets scheduling parameters, but not instruction set;
- -march sets the instruction set.
I'm pretty sure I am. (I'm not sure why I used -march on the long-at-a-time version, though) imul is on all x86 architectures, but whether to use it or not depends on the characteristics of the target CPU. With -mcpu=i386, imul is considered quite slow and a shift sequence is (almost?) always preferred when one operand is constant. With -mcpu=i686 and newer, imul seems to be preferred. I did actually inspect the generated assembly with -mcpu=i386 and -mcpu=i686, though I didn't look at it for all the combinations I posted. Jeff
On Wed, 14 Apr 2004, Jeff Epler wrote:
imul is on all x86 architectures, but whether to use it or not depends on the characteristics of the target CPU. With -mcpu=i386, imul is considered quite slow and a shift sequence is (almost?) always preferred when one operand is constant. With -mcpu=i686 and newer, imul seems to be preferred.
My misunderstanding. I hadn't really comprehended the full extent of "instruction scheduling" in this context.

--
Andrew I MacIntyre                     "These thoughts are mine alone..."
E-mail: andymac@bullseye.apana.org.au (pref) | Snail: PO Box 370
        andymac@pcug.org.au (alt)            |        Belconnen ACT 2616
Web:    http://www.andymac.org/              |        Australia
Does anyone have any issues with changing the hash multiplier for the string and Unicode hash functions?
Instead of 1000003, I would like to use Aho's 65599, a prime near 2**16 that is efficiently expressible as (x << 6) + (x << 16) - x. This replaces a multiply with fast adds and shifts (with the two shifts potentially being executed in parallel).
Googling for "hash 65599" shows a long history of widespread use and testing without any problems.
It would break the parrot benchmark, which has a re-implementation of the hash in Python and assumes/checks at various points that the two agree. So I'd rather not see this changed before July 30. --Guido van Rossum (home page: http://www.python.org/~guido/)
[Raymond]
Does anyone have any issues with changing the hash multiplier for the string and Unicode hash functions?
Don't touch it unless you can prove major benefits -- it's a remarkable fact of life that the current multiplier hasn't resulted in any real-life (but non-contrived) pathological cases.
Instead of 1000003, I would like to use Aho's 65599, a prime near 2**16 that is efficiently expressible as (x << 6) + (x << 16) - x. This replaces a multiply with fast adds and shifts (with the two shifts potentially being executed in parallel).
It's unclear why that's a good thing. Perhaps you think shifts and adds are faster? I wouldn't -- the imul instruction on modern Pentiums is very fast. It is clear why it may be a bad thing: that it *can* be expressed as just a couple shifts and adds makes it suspect as a good scrambler (read Knuth).
Googling for "hash 65599" shows a long history of widespread use and testing without any problems.
Testing in the context of Python string hashing? I didn't think so <wink>. The current multiplier has been studied extensively *in* Python, both via real-life use and via large-scale focused statistical testing (we got some money to do that during BeOpen.com's brief life -- a surprise that came out of that was that CRC made a terrible string hash function as the number of input strings got large). The right thing to compare Python's string hash to is "the standard" Fowler-Noll-Vo string hash, which was developed independently, but which also ended up using a large multiplier.
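For readers who haven't seen it, here is a sketch of the 32-bit FNV-1 hash Tim refers to (the offset basis and prime are the published FNV constants; the function name is made up, and this is not code from the thread):

#include <stdint.h>
#include <stddef.h>

/* Sketch of 32-bit FNV-1 (Fowler-Noll-Vo): multiply by a large prime,
   then XOR in the next byte -- structurally very close to Python's
   string hash.  The FNV-1a variant XORs before multiplying. */
static uint32_t
fnv1_32(const unsigned char *p, size_t len)
{
    uint32_t h = 2166136261u;          /* FNV offset basis */
    while (len--) {
        h *= 16777619u;                /* FNV prime */
        h ^= *p++;
    }
    return h;
}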
[Raymond]
Does anyone have any issues with changing the hash multiplier for the string and Unicode hash functions?
[Tim]
Don't touch it unless you can prove major benefits -- it's a remarkable fact of life that the current multiplier hasn't resulted in any real-life (but non-contrived) pathological cases.
Will leave it alone.
Perhaps you think shifts and adds are faster? I wouldn't -- the imul instruction on modern Pentiums is very fast.
On the P4, the documented latency went up from 4 cycles to 14 cycles while shifts and adds went down to 0.5 cycles and 1 cycle respectively. Timings confirm the result. It looks like the best bet is to try to speed up the code without changing the multiplier. Intel's software optimization cookbook recommends a partial unrolling and elimination of data dependencies so that a second multiply can start 4 cycles after the previous one started. If practice bears out the theory, the timings could show a three or fourfold speedup without changing the multiplier.
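A minimal sketch of what partial unrolling could look like here (not Raymond's actual patch): handle four characters per pass plus a remainder loop, producing exactly the same hash value. Each multiply still depends on the previous one through x, so how much of Intel's predicted overlap this actually recovers would have to be measured.

/* Sketch only: same hash as the original loop, restructured to process
   four bytes per iteration.  Caller seeds x with (*p << 7) as before. */
static long
hash_loop_unrolled(const unsigned char *p, long len, long x)
{
    while (len >= 4) {
        x = (1000003*x) ^ p[0];
        x = (1000003*x) ^ p[1];
        x = (1000003*x) ^ p[2];
        x = (1000003*x) ^ p[3];
        p += 4;
        len -= 4;
    }
    while (--len >= 0)                  /* remaining 0..3 bytes */
        x = (1000003*x) ^ *p++;
    return x;
}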
(read Knuth).
Of course, I already have :-)
The right thing to compare Python's string hash to is "the standard" Fowler-Noll-Vo string hash
Ditto.

Raymond
Hi, Raymond wrote:
It looks like the best bet is to try to speed up the code without changing the multiplier.
Indeed. And while you are at it, there are other optimizations that seem more promising:

I compiled a recent CVS Python with profiling and here is a list of the top CPU hogs (on a Pentium III, your mileage may vary):

pystone:

CPU%   Function Name
----------------------------
55.44  eval_frame
 7.30  lookdict_string
 4.34  PyFrame_New
 3.73  frame_dealloc
 1.73  vgetargs1
 1.65  PyDict_SetItem
 1.42  string_richcompare
 1.15  PyObject_GC_UnTrack
 1.11  PyObject_RichCompare
 1.08  PyInt_FromLong
 1.08  tupledealloc
 1.04  insertdict

parrotbench:

CPU%   Function Name
----------------------------
23.65  eval_frame
 8.68  l_divmod
 4.43  lookdict_string
 2.95  k_mul
 2.27  PyType_IsSubtype
 2.23  PyObject_Malloc
 2.09  x_add
 2.05  PyObject_Free
 2.05  tupledealloc

Arguably parrotbench is a bit unrepresentative here. And beware: due to automatic inlining of static functions the real offender may be hidden (x_divmod is the hog, not l_divmod).

Anyway, this just confirms that the most important optimization targets are:
1. eval_frame
2. string keyed dictionaries
3. frame handling

I think 3. needs an algorithmic approach (there was some discussion about it a few weeks ago), while 1. and 2. look like a code generation issue.

So I took a look at the machine code that GCC generates on x86 with -O3 for these: uh oh, not a pretty sight. The main problems are bad branch predictions and lack of inlining.

About branch predictions:

The worst offender is the main code path for eval_frame: it gets split across half a dozen segments. The code path for the non-argument bytecodes is in fact the slowest. Similarly the code for lookdict_string branches around like crazy. The most likely code path is not the fastest path.

One solution is to use likely()/unlikely() macros (using __builtin_expect). This is a good solution for small, isolated and performance critical code. It does not work well for eval_frame, though (I tried).

But GCC has more to offer: read the man page entries for -fprofile-arcs and -fbranch-probabilities. Here is a short recipe:

Go to your Python source directory and do this:

$ mkdir build_profile
$ cd build_profile
$ ../configure        # add any options you may need
$ make
$ mv python python_orig

Edit the Makefile and add -fprofile-arcs to OPT.

$ rm Python/ceval.o
$ make
$ mv python python_profile

Run your favourite benchmark(s), but only invoke python *once*:

$ ./python_profile -c 'import test.pystone; test.pystone.main(loops=100000)'

Forget about the performance numbers the benchmark reports. Never use an executable compiled with profiling for comparison. But ... you should now have a new (binary) file called Python/ceval.da that contains the profiled branch probabilities.

Edit the Makefile and replace -fprofile-arcs with -fbranch-probabilities.

$ rm Python/ceval.o
$ make
$ mv python python_opt

Then compare the benchmarks:

$ ./python_orig -c 'import test.pystone; test.pystone.main(loops=100000)'
$ ./python_opt -c 'import test.pystone; test.pystone.main(loops=100000)'

On my machine I get a 10-15% speedup. But we only optimized ceval.c ... So repeat the above steps, but now delete Python/ceval.o and Objects/*.o each time. Now I get about 20-25% speedup for pystone and 10% speedup for parrotbench! Not bad, huh?

Now the bad news: I don't know how to integrate this into the regular build process. So this is not an option for everyone, but ambitious packagers might want to take the trouble to do this by hand.

Oh and of course neither pystone nor parrotbench are representative of the branch probabilities of any particular application.
But for predicting eval_frame and lookdict_string, they are probably good enough.

On a related note: GCC uses random branch probabilities with -O, when no probability information is present (no __builtin_expect or *.da files). Use -fno-guess-branch-probability if you want predictable timings on recompiles.

About inlining:

- The opcode decoding with NEXTOP/HAS_ARG/NEXTARG does not compile well. GCC has trouble inferring the lifetimes for opcode and oparg, too. Ideas:
  - Use a special inline assembly variant for x86. OR
  - Move NEXTARG to each opcode that needs it and make oparg a local block variable.
  - Use NEXTOP directly for the switch and refetch opcode into a local block variable where necessary.

- The Py_Ticker check and the tracing support should be moved out of the core loop. The Py_Ticker check may be put into a subroutine and called from selected opcodes only (haven't checked if that works). I have no (good) idea about how to get rid of the tracing support, though.

- _PyString_Eq() is a good candidate to inline for lookdict_string(). I.e. a new macro in stringobject.h. But wait ...

- GCC generates pretty slow code for _PyString_Eq() (at least on x86). This is mainly due to a bad branch prediction (we should use likely()) but also due to the penalty for 8-bit partial register load/compare generated by the ob_sval dereferences. The latter is completely useless because memcmp() is inlined (with repz cmpsb) and duplicates that comparison. Omitting the *ob_sval comparison is faster.

- Create a macro that inlines the first part of PyObject_Hash(). Useful for PyDict_SetItem() and others.

I bet there are more, but I'm running out of time right now -- sorry.

Bye, Mike
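For reference, the likely()/unlikely() macros Mike mentions are conventionally defined like this (a sketch using the customary names; CPython itself does not define these, and the example function below is invented):

/* Conventional GCC branch-hint macros built on __builtin_expect. */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)   /* expect x to be true  */
#define unlikely(x) __builtin_expect(!!(x), 0)   /* expect x to be false */
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Hypothetical use in a hot path: tell GCC which branch is the common
   case so it lays out the fast path as the straight-line fall-through. */
static int fetch(int cache_valid, int slow_value)
{
    if (unlikely(!cache_valid))
        return slow_value;      /* rare, out-of-line path */
    return 42;                  /* common, fall-through path */
}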
Mike Pall
Hi,
Raymond wrote:
It looks like the best bet is to try to speed up the code without changing the multiplier.
Indeed. And while you are at it, there are other optimizations, that seem more promising:
I compiled a recent CVS Python with profiling and here is a list of the top CPU hogs (on a Pentium III, your mileage may vary):
I played this game recently, but on a G3 ibook, which is probably a much more boring processor from a scheduling point of view.
pystone:
CPU%   Function Name
----------------------------
55.44  eval_frame
 7.30  lookdict_string
 4.34  PyFrame_New
 3.73  frame_dealloc
 1.73  vgetargs1
 1.65  PyDict_SetItem
 1.42  string_richcompare
 1.15  PyObject_GC_UnTrack
 1.11  PyObject_RichCompare
 1.08  PyInt_FromLong
 1.08  tupledealloc
 1.04  insertdict
I saw similar results to this, tho' I don't remember lookdict_string being so high on the list.
parrotbench:
CPU%   Function Name
----------------------------
23.65  eval_frame
 8.68  l_divmod
 4.43  lookdict_string
 2.95  k_mul
 2.27  PyType_IsSubtype
 2.23  PyObject_Malloc
 2.09  x_add
 2.05  PyObject_Free
 2.05  tupledealloc
Arguably parrotbench is a bit unrepresentative here. And beware: due to automatic inlining of static functions the real offender may be hidden (x_divmod is the hog, not l_divmod).
Probably a fine candidate function for rewriting in assembly too...
Anyway, this just confirms that the most important optimization targets are:
1. eval_frame
2. string keyed dictionaries
3. frame handling
[...]
But GCC has more to offer: read the man page entries for -fprofile-arcs and -fbranch-probabilities. Here is a short recipe:
I tried this on the ibook and I found that it made a small difference *on the program you ran to generate the profile data* (e.g. pystone), but made naff all difference for something else. I can well believe that it makes more difference on a P4 or G5. [snippety]
I bet there are more, but I'm running out of time right now -- sorry.
I certainly don't want to discourage people from optimizing Python's current implementation, but... Some months ago (just after I played with -fprofile-arcs, most likely) I wrote a rant about improving Python's performance, which I've finally got around to uploading: http://starship.python.net/crew/mwh/hacks/speeding-python.html Tell me what you think! Cheers, mwh -- Hmmm... its Sunday afternoon: I could do my work, or I could do a Fourier analysis of my computer's fan noise. -- Amit Muthu, ucam.chat (from Owen Dunn's summary of the year)
Hi, mwh wrote:
(x_divmod is the hog, not l_divmod).
Probably a fine candidate function for rewriting in assembly too...
As a data point: I once had the doubtful pleasure of writing a long-integer library for cryptography. Hand-crafted x86 assembler outperforms plain (but carefully optimized) C code by a factor of 2 to 3. But Python's long-int code is a lot slower than e.g. gmp (factor 15-25 for mul/div, factor 100 for modular exponentiation). I assume the difference between C and assembler is less pronounced with other processors. The register pressure issue may soon be a moot point with x86-64, though. It has been shown that 64 bit pointers slow things down a bit, but compilers just love the extra registers (R8-R15).
But GCC has more to offer: read the man page entries for -fprofile-arcs and -fbranch-probabilities. Here is a short recipe:
I tried this on the ibook and I found that it made a small difference *on the program you ran to generate the profile data* (e.g. pystone), but made naff all difference for something else. I can well believe that it makes more difference on a P4 or G5.
For x86 even profiling python -c 'pass' makes a major difference. And the speed-ups are applicable to almost any program, since the branch predictions for eval_frame and lookdict_string affect all Python programs. I'm currently engaged in a private e-mail conversation with Raymond on how to convince GCC to generate good code on x86 without the help of profiling.
I wrote a rant about improving Python's performance, which I've finally got around to uploading:
http://starship.python.net/crew/mwh/hacks/speeding-python.html
Tell me what you think!
About GC: yes, refcounting is the silent killer. But there's a lot to optimize even without discarding refcounting. E.g. the code generated for Py_DECREF is awful (spread across >3500 locations) and PyObject_GC_UnTrack needs some work, too.

About psyco: I think it's wonderful. But, you are right, nobody is using it. Why? Simple: It's not 'on' by default.

About type inference: maybe the right way to go with Python is lazy and pure runtime type inference? Close to what psyco does.

About type declarations: this is too contrary to the Pythonic way of thinking. And before we start to implement this, we should make sure that it's a lot faster than a pure dynamic type inferencing approach.

About PyPy: very interesting, will take a closer look. But there's still a long road ahead ...

Bye, Mike
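To make the Py_DECREF point concrete, here is a deliberately simplified, self-contained sketch (toy types and names, not CPython's actual macro) of what a reference-count decrement expands to at every call site: a decrement, a test, and a conditional call, repeated across thousands of sites in the generated code.

#include <stdio.h>
#include <stdlib.h>

/* Toy object with a refcount, standing in for PyObject. */
typedef struct { long refcnt; } Obj;

static void dealloc(Obj *o) { free(o); }

/* Schematic DECREF: every expansion contributes its own decrement,
   branch, and potential call to the surrounding function's code. */
#define DECREF(op)                          \
    do {                                    \
        if (--(op)->refcnt == 0)            \
            dealloc(op);                    \
    } while (0)

int main(void)
{
    Obj *o = malloc(sizeof(Obj));
    o->refcnt = 2;
    DECREF(o);          /* refcnt -> 1, no dealloc       */
    DECREF(o);          /* refcnt -> 0, dealloc is called */
    printf("done\n");
    return 0;
}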
http://starship.python.net/crew/mwh/hacks/speeding-python.html
I recently met Mario Wolczko who worked on the Self language, and asked him if he had one sentence of advice for Python. His response was: Just-in-time compilation.
About psyco: I think it's wonderful. But, you are right, nobody is using it. Why? Simple: It's not 'on' by default.
I don't think that's the real reason. Last week Jimmy Retzlaff gave a talk at the Bay Piggies meeting on optimization, using a large commercial application he's written as an example. He mentioned that he had tried to use Psyco, and that it had been turned on in some distributions of the code, but that it was currently turned off again. I don't recall the reason he stated for that, but I suspect that it was mostly Psyco's memory bloat, plus the fact that his code was "fast enough" and he was able to code things in C if needed. (He also showed a nice anecdote about Psyco being faster than his hand-coded C in one case, but that didn't last long when he realized the C code was using doubles and the Python code was using ints. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)
On Thu, 2004-04-15 at 10:27, Guido van Rossum wrote:
http://starship.python.net/crew/mwh/hacks/speeding-python.html
I recently met Mario Wolczko who worked on the Self language, and asked him if he had one sentence of advice for Python. His response was: Just-in-time compilation.
It seems like an obvious, but labor intensive solution to me. I didn't get a chance to pursue discussion much with Jim Hugunin or Mike Salib, but it would be interesting to know why they focus on static analysis; Self started out with static analysis but got better performance with runtime type feedback (http://www.cs.ucsb.edu/labs/oocsb/papers/toplas96.shtml). Another question that I'd love to hear the answer to: What's the difference between Pysco and something like this Self implementation or the HotSpot Java implementation? I've never read a clear explanation of what Pysco does or how it compares to other systems that do dynamic compilation. Jeremy
It seems like an obvious, but labor intensive solution to me.
Well, all *other* proposals seem just as labor intensive... --Guido van Rossum (home page: http://www.python.org/~guido/)
Hi, Jeremy Hylton wrote:
Another question that I'd love to hear the answer to: What's the difference between Pysco and something like this Self implementation or the HotSpot Java implementation? I've never read a clear explanation of what Pysco does or how it compares to other systems that do dynamic compilation.
Here is a link for Java HotSpot: http://java.sun.com/products/hotspot/docs/whitepaper/Java_Hotspot_v1.4.1/Jav...

It seems that many advantages gained by static typing are lost since the Java codebase encourages strong polymorphism (in contrast to C++). What can we learn from this for Python: static typing is probably not all that helpful for a JIT compiler (in contrast to popular belief).

About the CLR (.NET and C# target) JIT: http://www.artima.com/intv/choices.html

What I found most interesting: their bytecode is polymorphic because they never targeted an interpreter and the JIT must do type inference anyway. Well, Python bytecode doesn't look that bad now ...

I would be very interested to hear what Armin thinks about static vs. dynamic type inferencing and about semi-static (CLR) vs. hotspot (Java) compilation.

Bye, Mike
Jeremy Hylton
Another question that I'd love to hear the answer to: What's the difference between Pysco and something like this Self implementation or the HotSpot Java implementation? I've never read a clear explanation of what Pysco does or how it compares to other systems that do dynamic compilation.
I don't know the answer to this... I suspect part of it might have to do with 'abstract interpretation'. I'll see Armin tomorrow (at PythonUK) and beat him up until he answers this question himself :-) Cheers, mwh -- please realize that the Common Lisp community is more than 40 years old. collectively, the community has already been where every clueless newbie will be going for the next three years. so relax, please. -- Erik Naggum, comp.lang.lisp
Jeremy Hylton wrote:
Another question that I'd love to hear the answer to: What's the difference between Pysco and something like this Self implementation or the HotSpot Java implementation?
Psyco is specializing, and that is the main difference compared to Java Hotspot. If you have a program

def f(a, b):
    if a > b:
        return a
    else:
        return a - b

f(1, 2)
f(1.0, 2.0)
f([1,2], [3,4])

then psyco generates machine code for *three* functions

int f_int(int a, int b){
    // uses processor int arithmetic throughout
    if(a>b)return a;
    else return a-b;
}

double f_double(double a, double b){
    // uses FPU ops throughout
    if(a>b)return a;
    else return a-b;
}

list f_list(list a, list b){
    // might invoke C functions
    if(list_gt(a, b))return a;
    else raise TypeError("unexpected operands");
}

(it actually generates different specializations)

In Hotspot, the type of f would already be defined in the Java source code, and Hotspot generates native machine instructions for it. The changes over standard JIT appear to be inlining; it also appears to do inlining of virtual functions, combined with a type check to detect cases where a different function should have been called compared to the last time the virtual call was made.

Regards,
Martin
On Thu, 2004-04-15 at 15:07, "Martin v. Löwis" wrote:
Jeremy Hylton wrote:
Another question that I'd love to hear the answer to: What's the difference between Pysco and something like this Self implementation or the HotSpot Java implementation?
Psyco is specializing, and that is the main difference compared to Java Hotspot. If you have a program
That's a helpful clarification, thanks.
In Hotspot, the type of f would already be defined in the Java source code, and Hotspot generates native machine instructions for it.
The changes over standard JIT appear to be inlining; it also appears to do inlining of virtual functions, combined with a type check to detect cases where the a different functions should have been called compared to the last time the virtual call was made.
I'm not sure what you mean by the changes over standard JIT. Do you mean the difference between Hotspot and "standard JIT"? Your descriptions sounds a bit like the Deutsch and Schiffman inline method cache, which doesn't inline the method body but does inline the method lookup. Or does Hotspot actually inline methods? Jeremy
Jeremy Hylton wrote:
I'm not sure what you mean by the changes over standard JIT. Do you mean the difference between Hotspot and "standard JIT"? Your descriptions sounds a bit like the Deutsch and Schiffman inline method cache, which doesn't inline the method body but does inline the method lookup. Or does Hotspot actually inline methods?
I lost the link, but on some page describing it, they actually claimed they do inline methods. They went on to explain that you need "deoptimization" in case a different method should be called; I assume they generate a type check just before the inlined method body. It might be that they keep collecting statistics on how often the type guess was wrong, discard the inlining when it is wrong too often, and come up with a new one.

Hotspot operates by compiling only functions that have been invoked a number of times, and they claim they collect data on where virtual calls go to. So when they eventually do generate machine code, they have the data to do the inlining. A "standard" JIT might choose to compile everything the first time it is invoked, and would then just replace the virtual-call bytecode with the necessary memory fetches and an indirect call.

Regards,
Martin
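A small illustrative sketch (plain C with invented names, not HotSpot's real machinery) of the guarded inlining Martin describes: the expected method body is inlined behind a cheap type check, and a failed guard falls back to the ordinary virtual dispatch.

#include <stdio.h>

struct Obj;
typedef struct Type { int (*get)(const struct Obj *); } Type;
typedef struct Obj { const Type *type; int value; } Obj;

static int int_get(const struct Obj *o) { return o->value; }
static const Type IntType = { int_get };

/* Ordinary virtual call through the type's function pointer. */
static int get_virtual(const Obj *o) { return o->type->get(o); }

/* Speculative version: inline the predicted body behind a type guard,
   and fall back to the generic dispatch ("deoptimize") on a miss. */
static int get_speculative(const Obj *o)
{
    if (o->type == &IntType)    /* guard: is this the predicted type?  */
        return o->value;        /* inlined body of int_get             */
    return get_virtual(o);      /* guess was wrong: take the slow path */
}

int main(void)
{
    Obj o = { &IntType, 7 };
    printf("%d %d\n", get_virtual(&o), get_speculative(&o));   /* 7 7 */
    return 0;
}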
[Raymond]
On the P4, the documented latency went up from 4 cycles to 14 cycles while shifts and adds went down to 0.5 cycles and 1 cycle respectively. Timings confirm the result.
What makes the current stepping of some flavor of P4 overwhelmingly important <0.3 wink>? Tradeoffs at this level change with every chip generation, and within a generation vary across vendors. In particular, they can make int multiply as fast as they want just by throwing more highly regular silicon at it (which is why its relative performance varies so much across chip generations, and is usually worst in the earliest iterations of a new generation).

My overwhelming concern isn't micro-efficiency of the generated code for the string-hash loop, it's that the resulting hash have high quality. I don't know that the "little prime" is vulnerable to bad cases in non-contrived real life, but we have years of evidence suggesting that the "big prime" isn't. If we can develop evidence that the little prime doesn't either, great, then I lose all objections to using it.
It looks like the best bet is to try to speed up the code without changing the multiplier. Intel's software optimization cookbook recommends a partial unrolling and elimination of data dependencies so that a second multiply can start 4 cycles after the previous one started. If practice bears out the theory, the timings could show a three or fourfold speedup without changing the multiplier.
Or leave it alone. Hand unrolling a loop all but guarantees that the next generation of compiler optimizer won't be able to reverse-engineer the original "natural" loop structure, and so won't be able to do the code transformation that works best for the next generation of hardware tricks. IOW, once you start down the path of hand-optimizing for a specific compiler + HW combo, there are two outcomes: (1) someone makes this their job forever after; or, (2) the performance of the code actually gets worse over time (compared to what it would have been if the original code had been left dirt simple).

I happen to have an app (spambayes) that makes very heavy use of string-keyed dicts, and where interning is impractical (so the string hashes can't be avoided). Still, speed up string hashing by a factor of 10, and it won't make enough difference in that app's throughput to mention. I also have a hyper-threaded box today, and the imul stalls in this P4 give great opportunities for other processes to get work done <0.9 wink>.
Tim> What makes the current stepping of some flavor of P4 overwhelmingly
Tim> important <0.3 wink>?

Unrelated to this thread, what does "stepping" mean? A quick Google search for "Pentium stepping glossary" yielded this page

http://processorfinder.intel.com/scripts/help3.asp

but doesn't explain why Intel uses "stepping" instead of "version" or "revision".

Skip
Tim> What makes the current stepping of some flavor of P4 overwhelmingly
Tim> important <0.3 wink>?
Unrelated to this thread, what does "stepping" mean? A quick Google search for "Pentium stepping glossary" yielded this page
http://processorfinder.intel.com/scripts/help3.asp
but doesn't explain why Intel uses "stepping" instead of "version" or "revision".
[Channeling Tim] In this context, "stepping" means all of the idiosyncratic implementation details of processor workflow. The word "stepping" is natural after staring at Gantt charts showing timings, data dependencies, speculative execution, and parallel operations. Tim is advising me against focusing on one processor at one point in history because the "stepping" is guaranteed to change over time.

Raymond Hettinger
Skip Montanaro
Tim> What makes the current stepping of some flavor of P4 overwhelmingly
Tim> important <0.3 wink>?
Unrelated to this thread, what does "stepping" mean? A quick Google search for "Pentium stepping glossary" yielded this page
http://processorfinder.intel.com/scripts/help3.asp
but doesn't explain why Intel uses "stepping" instead of "version" or "revision".
Here is a possible answer: A stepping is the processor hardware equivalent of a new version. In software, we refer to minor version changes as 1.0, 1.1, 1.2, and so on, while in processors we call these minor revisions steppings. Found in http://www.informit.com/articles/article.asp?p=130978&seqNum=22 Thomas
[Skip]
Unrelated to this thread, what does "stepping" mean? A quick Google search for "Pentium stepping glossary" yielded this page
http://processorfinder.intel.com/scripts/help3.asp
but doesn't explain why Intel uses "stepping" instead of "version" or "revision".
It means "minor revision". Major revisions are usually given names (my current P4 was/is known as "Northwood"). I don't know why minor revisions are called steppings -- I've never been a HW guy, although the term has been in common use among HW guys I've known.
participants (13)
- "Martin v. Löwis"
- Andrew MacIntyre
- Bob Ippolito
- Guido van Rossum
- Jeff Epler
- Jeremy Hylton
- Michael Hudson
- Mike Pall
- Raymond Hettinger
- Raymond Hettinger
- Skip Montanaro
- Thomas Heller
- Tim Peters