Hi,
It has taken a fair amount of work but I have mostly gotten tagged pointers working for small ints (currently 63 bits on 64-bit platforms). The configure option is --with-fixedint. I'm trying to release early and often.
The whole test suite runs without crashing, which feels like some kind of milestone to me. The following tests still fail:
test_ctypes test_fcntl test_fileio test_gdb test_inspect test_io
test_repl test_socket test_sqlite test_unicode test_userstring
The latest code is here:
https://github.com/nascheme/cpython/tree/tagged_int
Unfortunately, there is a net slowdown with the fixedint option enabled. Full PGO benchmarks from pyperformance are below. I'm hoping there is still some good work to be done reducing the number of times fixed ints need to be heap allocated. I suspect that is why pickle_list and pickle_dict are slower. I should also try to measure the memory usage difference; the fixedint version should use less RAM.
Here is a Linux perf report for the pickle_list benchmark:
http://python.ca/nas/python/perf-fixedint-pickle-list.txt
The addition of the extra test + jmp instructions to INCREF/DECREF is hurting a fair bit. I'm not sure there is anything to be done there. Based on the Linux perf results, I suspect that the extra instructions for INCREF/DECREF/_Py_TYPE are blowing up the size of _PyEval_EvalFrameDefault. I need to investigate that more.
BTW, Linux perf is amazing. Anyone who does low-level optimization work should study it.
I did consider trying to use a second tag for short strings. I'm not sure it would help much, as some quick analysis shows that only 25% of the strings used by PyDict_GetItem are short enough to fit.
This morning I dreamed up a new idea: analyze normal Python programs and build a list of strings commonly used for PyDict_GetItem. They will be strings like "self", builtin function names, etc. Then use a tagged pointer to hold these common strings, i.e. a tag to denote a string (or interned symbol, in Lisp speak) and an integer which is the offset into the fixed array of interned strings. The savings would have to come from avoiding the INCREF/DECREF accounting of refcounts on those strings. Instead of a fixed set of strings, perhaps we could make the intern process dynamically allocate the tag IDs. We could have a specialized lookdict that works for dicts containing only interned strings.
$ ./python -m perf compare_to -G \
    ../cpython-profile-tagged-off/base4.json fixedint5.json --min-speed 5
Slower (24):
- pickle_list: 3.06 us +- 0.04 us -> 3.74 us +- 0.03 us: 1.22x slower (+22%)
- pickle_dict: 22.2 us +- 0.1 us -> 26.2 us +- 0.2 us: 1.18x slower (+18%)
- raytrace: 501 ms +- 5 ms -> 565 ms +- 6 ms: 1.13x slower (+13%)
- crypto_pyaes: 113 ms +- 1 ms -> 126 ms +- 0 ms: 1.12x slower (+12%)
- logging_silent: 210 ns +- 4 ns -> 234 ns +- 3 ns: 1.11x slower (+11%)
- telco: 6.00 ms +- 0.09 ms -> 6.68 ms +- 0.14 ms: 1.11x slower (+11%)
- float: 111 ms +- 2 ms -> 123 ms +- 1 ms: 1.11x slower (+11%)
- nbody: 122 ms +- 1 ms -> 135 ms +- 2 ms: 1.10x slower (+10%)
- mako: 17.1 ms +- 0.1 ms -> 18.8 ms +- 0.1 ms: 1.10x slower (+10%)
- json_dumps: 12.3 ms +- 0.2 ms -> 13.5 ms +- 0.1 ms: 1.10x slower (+10%)
- scimark_monte_carlo: 103 ms +- 2 ms -> 113 ms +- 1 ms: 1.10x slower (+10%)
- pickle_pure_python: 467 us +- 3 us -> 508 us +- 6 us: 1.09x slower (+9%)
- logging_format: 10.2 us +- 0.1 us -> 11.1 us +- 2.2 us: 1.09x slower (+9%)
- chameleon: 9.27 ms +- 0.09 ms -> 10.1 ms +- 0.1 ms: 1.09x slower (+9%)
- sqlalchemy_imperative: 30.4 ms +- 0.8 ms -> 32.9 ms +- 0.9 ms: 1.08x slower (+8%)
- django_template: 122 ms +- 2 ms -> 131 ms +- 2 ms: 1.08x slower (+8%)
- sympy_str: 184 ms +- 2 ms -> 198 ms +- 5 ms: 1.07x slower (+7%)
- unpickle_pure_python: 368 us +- 5 us -> 394 us +- 9 us: 1.07x slower (+7%)
- sympy_expand: 426 ms +- 10 ms -> 452 ms +- 12 ms: 1.06x slower (+6%)
- sympy_sum: 90.4 ms +- 0.6 ms -> 96.0 ms +- 1.0 ms: 1.06x slower (+6%)
- regex_compile: 181 ms +- 7 ms -> 192 ms +- 7 ms: 1.06x slower (+6%)
- scimark_lu: 173 ms +- 6 ms -> 182 ms +- 5 ms: 1.05x slower (+5%)
- genshi_xml: 62.7 ms +- 0.8 ms -> 66.1 ms +- 0.8 ms: 1.05x slower (+5%)
- pickle: 9.11 us +- 0.13 us -> 9.59 us +- 0.06 us: 1.05x slower (+5%)
Faster (2):
- unpack_sequence: 49.1 ns +- 0.7 ns -> 45.0 ns +- 1.3 ns: 1.09x faster (-8%)
- scimark_sparse_mat_mult: 3.75 ms +- 0.05 ms -> 3.47 ms +- 0.05 ms: 1.08x faster (-8%)
Benchmark hidden because not significant (29): 2to3, chaos,
deltablue, dulwich_log, fannkuch, genshi_text, go, hexiom, html5lib,
json_loads, logging_simple, meteor_contest, nqueens, pathlib,
pidigits, python_startup, python_startup_no_site, regex_dna,
regex_effbot, regex_v8, richards, scimark_fft, scimark_sor,
spectral_norm, sqlite_synth, sympy_integrate, tornado_http,
unpickle, unpickle_list
Ignored benchmarks (5) of ../cpython-profile-tagged-off/base4.json:
sqlalchemy_declarative, xml_etree_generate, xml_etree_iterparse,
xml_etree_parse, xml_etree_process
Ignored benchmarks (4) of fixedint5.json:
xml_etree_pure_python_generate, xml_etree_pure_python_iterparse,
xml_etree_pure_python_parse,
xml_etree_pure_python_process