Hi,
It has taken a fair amount of work but I have mostly gotten tagged pointers working for small ints (currently 63 bits on 64-bit platforms). The configure option is --with-fixedint. I'm trying to release early and often.
The whole test suite runs without crashing, which feels like some kind of milestone to me. The following tests still fail:
test_ctypes test_fcntl test_fileio test_gdb test_inspect test_io
test_repl test_socket test_sqlite test_unicode test_userstring
The latest code is here:
https://github.com/nascheme/cpython/tree/tagged_int
Unfortunately, there is a net slowdown with the fixedint option enabled. Full PGO benchmarks from pyperformance are below. I'm hoping there is still some good work to be done reducing the number of times fixed ints need to be heap allocated. I suspect that is why pickle_list and pickle_dict are slower. I should also try to measure the memory usage difference; the fixedint version should use less RAM.
Here is a Linux perf report for the pickle_list benchmark:
http://python.ca/nas/python/perf-fixedint-pickle-list.txt
The addition of the extra test + jmp instructions to INCREF/DECREF is hurting a fair bit. I'm not sure there is anything to be done there. Based on the Linux perf results, I suspect that the extra instructions for INCREF/DECREF/_Py_TYPE are blowing up the size of _PyEval_EvalFrameDefault. I need to investigate that more.
BTW, Linux perf is amazing. Anyone who does low-level optimization work should study it.
I did consider trying to use a second tag for short strings. I'm not sure it would help much, as some quick analysis shows that only 25% of the strings used by PyDict_GetItem are short enough to fit.
This morning I dreamed up a new idea: analyze normal Python programs and build a list of strings commonly used for PyDict_GetItem. They will be strings like "self", builtin function names, etc. Then use a tagged pointer to hold these common strings, i.e. a tag to denote a string (or interned symbol, in Lisp speak) and an integer which is the offset into the fixed array of interned strings. The savings would have to come from avoiding the INCREF/DECREF accounting of refcounts on those strings. Instead of a fixed set of strings, perhaps we could make the intern process dynamically allocate the tag IDs. We could have a specialized lookdict that works for dicts containing only interned strings.
$ ./python -m perf compare_to -G \
    ../cpython-profile-tagged-off/base4.json fixedint5.json --min-speed 5
Slower (24):
- pickle_list: 3.06 us +- 0.04 us -> 3.74 us +- 0.03 us: 1.22x slower (+22%)
- pickle_dict: 22.2 us +- 0.1 us -> 26.2 us +- 0.2 us: 1.18x slower (+18%)
- raytrace: 501 ms +- 5 ms -> 565 ms +- 6 ms: 1.13x slower (+13%)
- crypto_pyaes: 113 ms +- 1 ms -> 126 ms +- 0 ms: 1.12x slower (+12%)
- logging_silent: 210 ns +- 4 ns -> 234 ns +- 3 ns: 1.11x slower (+11%)
- telco: 6.00 ms +- 0.09 ms -> 6.68 ms +- 0.14 ms: 1.11x slower (+11%)
- float: 111 ms +- 2 ms -> 123 ms +- 1 ms: 1.11x slower (+11%)
- nbody: 122 ms +- 1 ms -> 135 ms +- 2 ms: 1.10x slower (+10%)
- mako: 17.1 ms +- 0.1 ms -> 18.8 ms +- 0.1 ms: 1.10x slower (+10%)
- json_dumps: 12.3 ms +- 0.2 ms -> 13.5 ms +- 0.1 ms: 1.10x slower (+10%)
- scimark_monte_carlo: 103 ms +- 2 ms -> 113 ms +- 1 ms: 1.10x slower (+10%)
- pickle_pure_python: 467 us +- 3 us -> 508 us +- 6 us: 1.09x slower (+9%)
- logging_format: 10.2 us +- 0.1 us -> 11.1 us +- 2.2 us: 1.09x slower (+9%)
- chameleon: 9.27 ms +- 0.09 ms -> 10.1 ms +- 0.1 ms: 1.09x slower (+9%)
- sqlalchemy_imperative: 30.4 ms +- 0.8 ms -> 32.9 ms +- 0.9 ms: 1.08x slower (+8%)
- django_template: 122 ms +- 2 ms -> 131 ms +- 2 ms: 1.08x slower (+8%)
- sympy_str: 184 ms +- 2 ms -> 198 ms +- 5 ms: 1.07x slower (+7%)
- unpickle_pure_python: 368 us +- 5 us -> 394 us +- 9 us: 1.07x slower (+7%)
- sympy_expand: 426 ms +- 10 ms -> 452 ms +- 12 ms: 1.06x slower (+6%)
- sympy_sum: 90.4 ms +- 0.6 ms -> 96.0 ms +- 1.0 ms: 1.06x slower (+6%)
- regex_compile: 181 ms +- 7 ms -> 192 ms +- 7 ms: 1.06x slower (+6%)
- scimark_lu: 173 ms +- 6 ms -> 182 ms +- 5 ms: 1.05x slower (+5%)
- genshi_xml: 62.7 ms +- 0.8 ms -> 66.1 ms +- 0.8 ms: 1.05x slower (+5%)
- pickle: 9.11 us +- 0.13 us -> 9.59 us +- 0.06 us: 1.05x slower (+5%)
Faster (2):
- unpack_sequence: 49.1 ns +- 0.7 ns -> 45.0 ns +- 1.3 ns: 1.09x faster (-8%)
- scimark_sparse_mat_mult: 3.75 ms +- 0.05 ms -> 3.47 ms +- 0.05 ms: 1.08x faster (-8%)
Benchmark hidden because not significant (29): 2to3, chaos,
deltablue, dulwich_log, fannkuch, genshi_text, go, hexiom, html5lib,
json_loads, logging_simple, meteor_contest, nqueens, pathlib,
pidigits, python_startup, python_startup_no_site, regex_dna,
regex_effbot, regex_v8, richards, scimark_fft, scimark_sor,
spectral_norm, sqlite_synth, sympy_integrate, tornado_http,
unpickle, unpickle_list
Ignored benchmarks (5) of ../cpython-profile-tagged-off/base4.json:
sqlalchemy_declarative, xml_etree_generate, xml_etree_iterparse,
xml_etree_parse, xml_etree_process
Ignored benchmarks (4) of fixedint5.json:
xml_etree_pure_python_generate, xml_etree_pure_python_iterparse,
xml_etree_pure_python_parse,
xml_etree_pure_python_process