[capi-sig]fixedint patch, proof of concept working now

Hi,
It has taken a fair amount of work, but I have mostly gotten tagged pointers working for small ints (currently 63 bits on a 64-bit platform). The configure option is --with-fixedint. I'm trying to release early and often.
The whole test suite runs without crashing, which feels like some kind of milestone to me. The following tests still fail:
test_ctypes test_fcntl test_fileio test_gdb test_inspect test_io
test_repl test_socket test_sqlite test_unicode test_userstring
Latest code in here:
https://github.com/nascheme/cpython/tree/tagged_int
Unfortunately, there is a net slowdown with the fixedint option enabled. Full PGO benchmarks from pyperformance are below. I'm hoping there is still some good work to be done reducing the number of times fixed ints need to be heap allocated. I suspect that is why pickle_list and pickle_dict are slower. I should also try to measure the memory usage difference; the fixedint version should use less RAM.
Here is a Linux perf report for the pickle_list benchmark:
http://python.ca/nas/python/perf-fixedint-pickle-list.txt
The addition of the extra test + jmp instructions to INCREF/DECREF are hurting a fair bit. I'm not sure there is anything to be done there. Based on the Linux perf results, I have the suspicion that the extra instructions for INCREF/DECREF/_Py_TYPE are blowing up the size of _PyEval_EvalFrameDefault. I need to investigate that more.
BTW, Linux perf is amazing. Anyone who does low-level optimization work should study it.
I did consider trying to use a second tag for short strings. I'm not sure it will help much, as some quick analysis shows that only 25% of strings used by PyDict_GetItem are short enough to fit.
This morning I dreamed up a new idea: analyze normal Python programs and build a list of strings commonly used for PyDict_GetItem. They will be strings like "self", builtin function names, etc. Then use a tagged pointer to hold these common strings, i.e. a tag to denote a string (or interned symbol, in Lisp speak) and an integer which is the offset into a fixed array of interned strings. The savings would have to come from avoiding the INCREF/DECREF accounting of refcounts on those strings. Instead of a fixed set of strings, perhaps we could make the intern process dynamically allocate the tag IDs. We could have a specialized lookdict that works for dicts containing only interned strings.
$ ./python -m perf compare_to -G \
    ../cpython-profile-tagged-off/base4.json fixedint5.json --min-speed 5
Slower (24):
- pickle_list: 3.06 us +- 0.04 us -> 3.74 us +- 0.03 us: 1.22x slower (+22%)
- pickle_dict: 22.2 us +- 0.1 us -> 26.2 us +- 0.2 us: 1.18x slower (+18%)
- raytrace: 501 ms +- 5 ms -> 565 ms +- 6 ms: 1.13x slower (+13%)
- crypto_pyaes: 113 ms +- 1 ms -> 126 ms +- 0 ms: 1.12x slower (+12%)
- logging_silent: 210 ns +- 4 ns -> 234 ns +- 3 ns: 1.11x slower (+11%)
- telco: 6.00 ms +- 0.09 ms -> 6.68 ms +- 0.14 ms: 1.11x slower (+11%)
- float: 111 ms +- 2 ms -> 123 ms +- 1 ms: 1.11x slower (+11%)
- nbody: 122 ms +- 1 ms -> 135 ms +- 2 ms: 1.10x slower (+10%)
- mako: 17.1 ms +- 0.1 ms -> 18.8 ms +- 0.1 ms: 1.10x slower (+10%)
- json_dumps: 12.3 ms +- 0.2 ms -> 13.5 ms +- 0.1 ms: 1.10x slower (+10%)
- scimark_monte_carlo: 103 ms +- 2 ms -> 113 ms +- 1 ms: 1.10x slower (+10%)
- pickle_pure_python: 467 us +- 3 us -> 508 us +- 6 us: 1.09x slower (+9%)
- logging_format: 10.2 us +- 0.1 us -> 11.1 us +- 2.2 us: 1.09x slower (+9%)
- chameleon: 9.27 ms +- 0.09 ms -> 10.1 ms +- 0.1 ms: 1.09x slower (+9%)
- sqlalchemy_imperative: 30.4 ms +- 0.8 ms -> 32.9 ms +- 0.9 ms: 1.08x slower (+8%)
- django_template: 122 ms +- 2 ms -> 131 ms +- 2 ms: 1.08x slower (+8%)
- sympy_str: 184 ms +- 2 ms -> 198 ms +- 5 ms: 1.07x slower (+7%)
- unpickle_pure_python: 368 us +- 5 us -> 394 us +- 9 us: 1.07x slower (+7%)
- sympy_expand: 426 ms +- 10 ms -> 452 ms +- 12 ms: 1.06x slower (+6%)
- sympy_sum: 90.4 ms +- 0.6 ms -> 96.0 ms +- 1.0 ms: 1.06x slower (+6%)
- regex_compile: 181 ms +- 7 ms -> 192 ms +- 7 ms: 1.06x slower (+6%)
- scimark_lu: 173 ms +- 6 ms -> 182 ms +- 5 ms: 1.05x slower (+5%)
- genshi_xml: 62.7 ms +- 0.8 ms -> 66.1 ms +- 0.8 ms: 1.05x slower (+5%)
- pickle: 9.11 us +- 0.13 us -> 9.59 us +- 0.06 us: 1.05x slower (+5%)
Faster (2):
- unpack_sequence: 49.1 ns +- 0.7 ns -> 45.0 ns +- 1.3 ns: 1.09x faster (-8%)
- scimark_sparse_mat_mult: 3.75 ms +- 0.05 ms -> 3.47 ms +- 0.05 ms: 1.08x faster (-8%)
Benchmark hidden because not significant (29): 2to3, chaos,
deltablue, dulwich_log, fannkuch, genshi_text, go, hexiom, html5lib,
json_loads, logging_simple, meteor_contest, nqueens, pathlib,
pidigits, python_startup, python_startup_no_site, regex_dna,
regex_effbot, regex_v8, richards, scimark_fft, scimark_sor,
spectral_norm, sqlite_synth, sympy_integrate, tornado_http,
unpickle, unpickle_list
Ignored benchmarks (5) of ../cpython-profile-tagged-off/base4.json:
sqlalchemy_declarative, xml_etree_generate, xml_etree_iterparse,
xml_etree_parse, xml_etree_process
Ignored benchmarks (4) of fixedint5.json:
xml_etree_pure_python_generate, xml_etree_pure_python_iterparse,
xml_etree_pure_python_parse,
xml_etree_pure_python_process

Hi Neil,
Le 18/09/2018 à 23:55, Neil Schemenauer a écrit :
Thanks for doing this and thanks for sharing the benchmark numbers. This helps bring reality into the discussion ;-)
Regards
Antoine.

On 2018-09-19, Antoine Pitrou wrote:
Thanks for doing this and thanks for sharing the benchmark numbers. This helps bring reality into the discussion ;-)
It seems to me that CPython has reached a sort of "local maximum" in terms of optimization. Change the implementation in any serious way and it is going to get slower. There are all these small optimization tricks that only work well based on the particulars of the current design. So, I don't feel too bad about my performance numbers. Computing with fixed-size integers seems to matter not so much for modern programs, so I didn't gain much from stuff like fixedint + fixedint. However, there is a whole new set of tricks the tagged pointer support will allow now. Endless fun. ;-P
Cheers,
Neil

On Wed, Sep 19, 2018 at 10:15 AM INADA Naoki <songofacandy@gmail.com> wrote:
"self" is almost never referred to by name (it's an indexed local in the bytecode)
> builtin functions names, etc.
Strings encountered in CPython compilation are already interned. Fortunately, this should make the experiment easier--you could just plug into that framework to decide which strings to represent with tagged pointers, rather than having to come up with your own heuristic.
Given that branching to avoid the refcounting is more expensive than doing the refcounting itself, this might be hard, but maybe there's a slight win if we're already doing this branching for ints, etc. Or if "normal" refcounting becomes more expensive than a simple increment/decrement (e.g. truly threadsafe). Unlike ints, most strings are not ephemeral, nor can their contents be accessed without a dereference, so tagged pointers are less of a win here.
> Instead of fixed set of strings,
CPython already optimizes for dicts containing only strings; given that most of the keys in this dict are interned, I'm not sure it would be a big win to disallow non-interned keys (which would primarily speed up the atypical key-not-found path).
Still, these are interesting experiments; thanks for exploring and sharing!

participants (4)
- Antoine Pitrou
- INADA Naoki
- Neil Schemenauer
- Robert Bradshaw