Faster PyArg_ParseTupleAndKeywords kwargs

Early while working on py-lmdb I noticed that a huge proportion of runtime was being lost to PyArg_ParseTupleAndKeywords, and so I subsequently wrote a specialization for this extension module. In the current code [0], parse_args() is much faster than ParseTupleAndKeywords, responsible for a doubling of performance in several of the library's faster code paths (e.g. Cursor.put(append=True)). Ever since adding the rewrite I've wanted to go back and either remove it or at least reduce the amount of custom code, but it seems there really isn't a better approach to fast argument parsing using the bare Python C API at the moment.

[0] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L833

In the append=True path, parse_args() yields a method that can complete 1.1m insertions/sec on my crappy Core 2 laptop, compared to 592k/sec using the same method rewritten with PyArg_ParseTupleAndKeywords.

Looking to other 'fast' projects for precedent, and studying Cython's output in particular, it seems that Cython completely ignores the standard APIs and expends a huge amount of .text on almost every imaginable C performance trick to speed up parsing (Cython's output is a sheer marvel of trickery; it's worth study). So it's clear the standard APIs are somewhat non-ideal, and those concerned with performance are taking other approaches.

ParseTupleAndKeywords is competitive for positional arguments (1.2m/sec vs. 1.5m/sec for "Cursor.put(k, v)"), but things go south when a kwarg dict is provided. The primary goal of parse_args() was to avoid the continuous temporary allocations and hashing done by PyArg_ParseTupleAndKeywords via PyDict_GetItemString(), which invokes PyString_FromString() internally, in turn causing an allocation, a strlen(), and a memcpy(), one for each possible kwarg, on every function call.

The rewrite has been hacked over time, and honestly I'm not sure which bits are responsible for the speed improvement and which are totally redundant.
The tricks are:

* Intern keyword arg strings once at startup, avoiding the temporary PyString creation and also causing their hash() to be cached across calls. This uses an incredibly ugly pair of enum / const char *[] static globals. [3]

  [3] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L79

* Use a per-function 'static const' array of structures to describe the expected set of arguments. Since these arrays are built at compile time, they cannot directly reference the runtime-generated interned PyStrings, thus the use of an enum. A nice side effect of the array's contents being purely small integers is that each array element is small and thus quite cache-efficient; in the current code array elements are 4 bytes each.

* Avoid use of variable-length argument lists. I'm not sure if this helps at all, but it certainly simplifies the parsing code and makes the call sites much more compact. Instead of a va_arg list of destination pointers, parsed output is represented as a per-function structure definition [1][2], whose offsets are encoded into the above argspec array at build time. This might hurt the compiler's ability to optimize the placement of what were previously small stack variables (e.g. I'm not sure if it prevents the compiler making more use of registers). In any case the overall result is much faster than before.

  [1] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L1265
  [2] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L704

And most recently, giving a further 20% boost to append=True:

* Cache a dict that maps interned kwarg -> argspec array offset, allowing the per-call kwarg dict to be iterated, and causing only one hash lookup per supplied kwarg. Prior to the cache, the presence of kwargs would cause one hash lookup per argspec entry (e.g. potentially 15 lookups instead of 1 or 2).
It's obvious this approach isn't generally useful, and looking at the CPython source we can see the interning trick is already known, and presumably not exposed in the CPython API because the method is quite ugly. Still it seems there is room to improve the public API to include something like this interning trick, and that's what this mail is about.

My initial thought is for a horribly macro-heavy API like:

    PyObject *
    my_func(PyObject *self, PyObject *args, PyObject *kwargs)
    {
        Py_ssize_t foo;
        const char *some_buf;
        PyObject *list;

        Py_BEGIN_ARGS
            PY_ARG("foo", PY_ARG_SSIZE_T, NULL, PY_ARG_REQUIRED),
            PY_ARG("some_buf", PY_ARG_BUFFER, NULL, PY_ARG_REQUIRED),
            PY_ARG("list", PY_ARG_OBJECT, &PyList_Type, 0)
        Py_END_ARGS

        if(Py_PARSE_ARGS(args, kwargs, &foo, &some_buf, &list)) {
            return NULL;
        }
        /* do stuff */
    }

Where:

    struct py_arg_info;           /* Opaque */

    struct py_arg_spec {
        const char *name;
        enum { ... } type;
        PyTypeObject *type_obj;
        int options;
    };

    #define Py_BEGIN_ARGS \
        static struct py_arg_info *_py_arg_info; \
        if(! _py_arg_info) { \
            static const struct py_arg_spec _py_args[] = {

    #define Py_END_ARGS \
            }; \
            _Py_InitArgInfo(&_py_arg_info, _py_args, \
                            sizeof _py_args / sizeof _py_args[0]); \
        }

    #define PY_ARG(name, type, type_obj, opts) {name, type, type_obj, opts}

    #define Py_PARSE_ARGS(a, k, ...) \
        _Py_ParseArgsFromInfo(&_py_arg_info, a, k, __VA_ARGS__)

Here some implementation-internal py_arg_info structure is built up on first function invocation, producing the cached mapping of argument keywords to array index, and storing a reference to the py_arg_spec array, or some version of it that has been internally transformed to a more useful format.

You may notice this depends on variadic macros, which break at least some versions of Visual Studio, so at the very least that part is problematic. The above also doesn't deal with all the cases supported by the existing PyArg_ routines, such as setting the function name and custom error message, or unpacking tuples (is that even still supported in Python 3?).
Another approach might be to use a PyArg_ParseTupleAndKeywords-alike API, so that something like this was possible:

    static PyObject *
    my_method(PyObject *self, PyObject *args, PyObject *kwds)
    {
        Py_ssize_t foo;
        const char *some_buf;
        Py_ssize_t some_buf_size;
        PyObject *list;

        static PyArgInfo arg_info;
        static char *keywords[] = { "foo", "some_buf", "list", NULL };

        if(! PyArg_FastParse(&arg_info, args, kwds, "ns#|O!", keywords,
                             &foo, &some_buf, &some_buf_size,
                             &PyList_Type, &list)) {
            return NULL;
        }
        /* do stuff */
    }

In that case the API is very familiar, and PyArg_FastParse() builds the cache on first invocation itself, but the supplied va_list is full of noise that needs to be carefully skipped somehow. The work involved in doing the skipping might introduce complexity that slows things down all over again.

Any thoughts on a better API? Is there a need here? I'm obviously not the first to notice PyArg_ParseTupleAndKeywords is slow, and so I wonder how many people have sighed and brushed off the fact their module is slower than it could be.

David

On Fri, May 23, 2014 at 07:22:20PM +0000, dw+python-ideas@hmmz.org wrote:
Perhaps the most off-the-wall approach would be to completely preserve the existing interface, by using a dollop of assembly to fetch the return address, and use that to maintain some internal hash table. That's incredibly nasty, but as a systemic speedup it might be worth it? David

On Fri, May 23, 2014 at 8:22 PM, <dw+python-ideas@hmmz.org> wrote:
As another data point about PyArg_ParseTupleAndKeywords slowness, Numpy has tons of barely-maintainable hand-written argument parsing code. I haven't read the proposal below in detail, but anything that helps us clean that up is OK with me... You should check out Argument Clinic (PEP 436) if you haven't seen it.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

On Fri, May 23, 2014 at 09:08:28PM +0100, Nathaniel Smith wrote:
You should check out Argument Clinic (PEP 436) if you haven't seen it.
Thanks! I'd seen this but forgotten about it. The use of a preprocessor seems excessive, and a potential PITA when combined with other preprocessors (e.g. Qt's moc), but the language is a very cool idea. If the DSL definition was expressed as a string constant, that pointer could key an internal hash table. Still not as fast as specialized code, but perhaps an interesting middle ground.

David

On Fri, May 23, 2014 at 9:22 PM, <dw+python-ideas@hmmz.org> wrote:
Yes, but OTOH it's working and shipping code with a substantial user base (lots of the CPython implementation), so making it fast and usable in third-party libraries might still be the most efficient approach. And IIRC it's not (necessarily) a build-time thing; the usual mode is for it to update your checked-in source directly, so integration with other preprocessors might be a non-issue.

A preprocessor approach might make it easier to support older Pythons in the generated code, compared to a library approach. (It's easier to say "developers/the person making the source release must have Python 3 installed, but the generated code works everywhere" than to say "this library only works on Python 3.5+ because that's the first version that ships the new argument parsing API".)

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

On 24 May 2014 06:39, "Nathaniel Smith" <njs@pobox.com> wrote:
Note there are two key goals behind Argument Clinic:

1. Add introspection metadata to functions implemented in C without further reducing maintainability (adding an arg to a C function already touched 6 places; signature metadata would have been a 7th).

2. Eventually switch the generated code to something faster than PyArg_ParseTupleAndKeywords.

What phase 2 actually looks like hasn't been defined yet (enabling phase 1 ended up being a big enough challenge for 3.4), but the ideas in this thread would definitely be worth exploring further in that context.

As Nathaniel noted, once checked in, Argument Clinic code is just ordinary C code with some funny comments, so it introduces no additional build time dependencies.

Cheers,
Nick.

On Sat, May 24, 2014 at 11:36:29AM +1000, Nick Coghlan wrote:
Hadn't realized it was already in use! It isn't nearly as intrusive as I might have expected; it seems 'preprocessor' is just a scary word. :)
2. Eventually switch the generated code to something faster than PyArg_ParseTupleAndKeywords.
The previous mail's hint led to thinking about how to actually implement a no-API-changes internal cache for PyArg_ParseTupleAndKeywords. While not a perfect solution, that approach has the tremendous benefit of backwards compatibility with every existing extension. It seems that after 20 years' evolution, getargs.c is quite resistant to change (read: it inflicts headaches and angst on the unwary), so instead I've spent my Friday evening exploring a rewrite.

David

Participants (3):
- dw+python-ideas@hmmz.org
- Nathaniel Smith
- Nick Coghlan