Re: [Python-ideas] INSANE FLOAT PERFORMANCE!!!

To answer your question: I special-case unicode (strings), ints, and floats. I am working on special-casing tuples (they can even be of different types, as long as they're homogeneous column-wise). The best speedups will be tuples of floats: it'll bypass three layers of useless checks. If I run it without special-casing floats (just using tp_richcompare) I only get like a 5-10% speedup. I'm working on rigorous benchmarks for all this stuff, and will post a PDF along with the patch once it's all done. But it's certainly <10%. However, part of this is because my special case is really low-level; for strings I've actually found the opposite: using tp_richcompare gives me almost the same results as my special-case compare, since it still has to PyUnicode_READY the strings (or whatever it's called). Regarding generalization: the general technique for special-casing is that you just substitute all type checks with 1 or 0 by applying the type assumption you're making. That's the only way to guarantee it's safe and compliant. Elliot

On Tue, Oct 11, 2016, 5:19 PM Jim J. Jewett <jimjjewett@gmail.com> wrote:
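
For concreteness, here is a minimal sketch of what "substituting in the type assumption" can look like for floats. It is not the actual patch; the name unsafe_float_compare and the exact signature are illustrative, and it assumes the pre-scan has already proven every key is an exact float, so no error path is needed.

    #include <Python.h>

    /* Hypothetical float fast path: with every key known to be an exact
     * PyFloat, all the PyFloat_Check-style tests in float_richcompare are
     * known to be true, and "<" collapses to a C double comparison.
     * Mirrors PyObject_RichCompareBool(v, w, Py_LT): returns 1 if v < w,
     * else 0. */
    static int
    unsafe_float_compare(PyObject *v, PyObject *w)
    {
        return PyFloat_AS_DOUBLE(v) < PyFloat_AS_DOUBLE(w);
    }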

On Wed, Oct 12, 2016 at 12:25:16AM +0000, Elliot Gorokhovsky wrote:
I'm confused -- I don't understand how *removing* type checks can possibly guarantee the code is safe and compliant. It's all very well and good when you are running tests that meet your type assumption, but what happens if they don't? If I sort a list made up of (say) mixed int and float (possibly including subclasses), does your "all type checks are 1 or 0" sort segfault? If not, why not? Where's the safety coming from? By the way, your emails in this thread have reminded me of a quote from the late Sir Terry Pratchett's novel "Maskerade" (the odd spelling is intentional): "What sort of person," said Salzella patiently, "sits down and *writes* a maniacal laugh? And all those exclamation marks, you notice? Five? A sure sign of someone who wears his underpants on his head." :-) -- Steve

On 12 October 2016 at 11:16, Steven D'Aprano <steve@pearwood.info> wrote:
My understanding is that the code does a pre-check that all the elements of the list are the same type (float, for example). This is a relatively quick test (O(n) pointer comparisons). If everything *is* a float, then an optimised comparison routine can be used that skips all the type checks and goes straight to a float comparison (a single machine op). Because there are more than O(n) comparisons done in a typical sort, this is a win. And because the type checks needed in rich comparison are much more expensive than a pointer check, it's a *big* win. What I'm *not* quite clear on is why Python 3's change to reject comparisons between unrelated types makes this optimisation possible. Surely you have to check either way? It's not that it's a particularly important question - if the optimisation works, it's not that big a deal what triggered the insight. It's just that I'm not sure if there's some other point that I've not properly understood. Paul

On 12 October 2016 at 21:35, Paul Moore <p.f.moore@gmail.com> wrote:
It's probably more relevant that cmp() went away, since that simplified the comparison logic to just PyObject_RichCompareBool, without the custom comparison function path. It *might* have still been possible to do something like this in the Py2 code (since the main requirement is to do the pre-check for consistent types if the first object is of a known type with an optimised fast path), but I don't know anyone that actually *likes* adding new special cases to already complex code and trying to figure out how to test whether or not they've broken anything :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

[Nick Coghlan]
Well, the current sort is old by now, and was written for Python 2. But it did anticipate that rich comparisons were the future, and deliberately restricted itself to using only "<" (Py_LT) comparisons. So, same as now, only the "<" path needed to be examined.
It shouldn't really matter whether it's a known type. For any type, if it's known that all the objects are of that type, that type's tp_richcompare slot can be read up once, and if non-NULL used throughout. That would save several levels of function call per comparison during the sort; although that's not factor-of-3-speedup potential, it should still be a significant win.
A nice thing about this one is that special cases are a one-time thing at the start, and don't change anything in the vast bulk of the current sorting code. So when it breaks, it should be pretty easy to figure out why ;-)
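
A rough sketch of the read-the-slot-once variant Tim describes, with illustrative names (cached_lt and cached_less_than are not from any patch): the tp_richcompare slot is read a single time before sorting and called directly for every "<" test, skipping the dispatch layers inside PyObject_RichCompareBool. A real version would also need to handle a Py_NotImplemented result.

    #include <Python.h>

    /* Set once, before the sort, to key_type->tp_richcompare
     * (only if that slot is non-NULL). */
    static richcmpfunc cached_lt;

    /* Per-comparison: one direct slot call instead of the generic
     * rich-comparison machinery.  Returns 1 if v < w, 0 if not,
     * -1 on error. */
    static int
    cached_less_than(PyObject *v, PyObject *w)
    {
        PyObject *res = cached_lt(v, w, Py_LT);
        int lt;
        if (res == NULL)
            return -1;
        lt = PyObject_IsTrue(res);
        Py_DECREF(res);
        return lt;
    }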

[Paul Moore]
That matches my understanding.
Because there are more than O(n) comparisons done in a typical sort, this is a win.
If the types are in fact all the same, it should be a win even for n==2 (at n < 2 no comparisons are done; at n==2 exactly 1 comparison is done): one pointer compare + go-straight-to-C-float-"x<y"-later (proposed) should be cheaper than the multiple layers of conditionals and function calls currently required to dynamically deduce that C-float-"x<y" is all that's really needed. So "more than O(n)" comparisons sweetens the deal, when it applies, but even in the case of a sorted or monotonically decreasing list (which require exactly n-1 comparisons now) it should win. Where it should predictably lose is when the types aren't all the same. Then between 1 and n-2 pointer compares are wasted to determine that. We still need timings to quantify the damage in such cases, but I don't _expect_ it to be significant.
And because the type checks needed in rich comparison
And layers of function calls.
are much more expensive than a pointer check, it's a *big* win.
Bingo :-)
What I'm *not* quite clear on is why Python 3's change to reject comparisons between unrelated types makes this optimisation possible.
It doesn't. It would also apply in Python 2. I simply expect the optimization will pay off more frequently in Python 3 code. For example, in Python 2 I used to create lists with objects of wildly mixed types, and sort them merely to bring objects of the same type next to each other. Things "like that" don't work at all in Python 3.
Well, either your understanding is fine, or we're both confused :-)

On Wed, Oct 12, 2016 at 9:20 AM Tim Peters <tim.peters@gmail.com> wrote:
Yup. Actually, the initial version of this work was with Python 2. What happened was this: I had posted earlier something along the lines of "hey everybody let's radix sort strings instead of merge sort because that will be more fun ok". And everyone wrote me back "no please don't are you kidding". Tim Peters wrote back "try it but just fyi it's not gonna work". So I set off to try it. I had never used the C API before, but luckily I found some Python 2 documentation that gives an example of subclassing list, so I was able to mostly just copy-paste to get a working list extension module. I then copied over the implementation of listsort. My first question was how expensive python compares are vs C compares. And since python 2 has PyString_AS_STRING, which just gives you a char* pointer to a C string, I went in and replaced PyObject_RichCompareBool with strcmp and did a simple benchmark. And I was just totally blown away; it turns out you get something like a 40-50% improvement (at least on my simple benchmark). So that was the motivation for all this. Actually, if I wrote this for python 2, I might be able to get even better numbers (at least for strings), since we can't use strcmp in python 3. (Actually, I've heard UTF-8 strings are strcmp-able, so maybe if we go through and verify all the strings are UTF-8 we can strcmp them? I don't know enough about how PyUnicode stuff works to do this safely). My string special case currently just bypasses the typechecks and goes to unicode_compare(), which is still wayyy overkill for the common case of ASCII or Latin-1 strings, since it uses a for loop to go through and check characters, and strcmp uses compiler magic to do it in like, negative time or something. I even PyUnicode_READY the strings before comparing; I'm not sure if that's really necessary, but that's how PyUnicode_Compare does it.
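
The exact code from that Python 2 experiment isn't in the thread; a plausible reconstruction of the swap described above (hypothetical name, Python 2 API only) is simply:

    #include <Python.h>
    #include <string.h>

    /* Benchmark hack: replace the PyObject_RichCompareBool(Py_LT) call in
     * the copied listsort with a raw strcmp on the 8-bit string buffers.
     * Only valid when every element is a PyString, and strcmp stops at an
     * embedded NUL byte, so this is a quick-and-dirty measurement, not a
     * correct general compare. */
    static int
    unsafe_string_less_than(PyObject *v, PyObject *w)
    {
        return strcmp(PyString_AS_STRING(v), PyString_AS_STRING(w)) < 0;
    }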

On Wed, Oct 12, 2016 at 2:19 PM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote: [...]
It looks like PyUnicode_Compare already has a special case to use memcmp when both of the strings fit into latin1: https://github.com/python/cpython/blob/cfc517e6eba37f1bd61d57bf0dbece9843bff... I suppose the for loops that are used for multibyte strings could potentially be sped up with SIMD or something, but that gets complicated fast, and modern compilers might even be doing it already. -n -- Nathaniel J. Smith -- https://vorpus.org

On Wed, Oct 12, 2016 at 3:39 PM Nathaniel Smith <njs@pobox.com> wrote:
It looks like PyUnicode_Compare already has a special case to use memcmp when both of the strings fit into latin1:
Wow! That's great! I didn't even try reading through unicode_compare, because I felt I might miss some subtle detail that would break everything. But ya, that's great! Since surely latin1 is the most common use case. So I'll just add a latin1 check in the check-loop, and then I'll have two unsafe_unicode_compare functions. I felt bad about not being able to get the same kind of string performance I had gotten with python2, so this is nice.
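
A sketch of what the second (latin-1-only) compare function might look like; the name is illustrative, and it assumes the pre-scan has verified that every key is a str whose PyUnicode_KIND() is PyUnicode_1BYTE_KIND and that the strings are already "ready" in PEP 393 terms, so a bytewise memcmp gives the same order as a codepoint compare:

    #include <Python.h>
    #include <string.h>

    /* Hypothetical latin-1 fast path: compare the shared prefix bytewise,
     * then break ties by length (the shorter string sorts first). */
    static int
    unsafe_latin1_compare(PyObject *v, PyObject *w)
    {
        Py_ssize_t lv = PyUnicode_GET_LENGTH(v);
        Py_ssize_t lw = PyUnicode_GET_LENGTH(w);
        int cmp = memcmp(PyUnicode_1BYTE_DATA(v), PyUnicode_1BYTE_DATA(w),
                         lv < lw ? lv : lw);
        if (cmp != 0)
            return cmp < 0;
        return lv < lw;
    }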

On Thu, Oct 13, 2016 at 8:19 AM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
I'm not sure what you mean by "strcmp-able"; do you mean that the lexical ordering of two Unicode strings is guaranteed to be the same as the byte-wise ordering of their UTF-8 encodings? I don't think that's true, but then, I'm not entirely sure how Python currently sorts strings. Without knowing which language the text represents, it's not possible to sort perfectly. https://en.wikipedia.org/wiki/Collation#Automated_collation

"""
Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch, while Turkish dictionaries treat o and ö as different letters, placing oyun before öbür.
"""

Which means these lists would already be considered sorted, in their respective languages:

rosuav@sikorsky:~$ python3
Python 3.7.0a0 (default:a78446a65b1d+, Sep 29 2016, 02:01:55)
[GCC 6.1.1 20160802] on linux
Type "help", "copyright", "credits" or "license" for more information.
So what's Python doing? Is it a codepoint ordering? ChrisA

So what's Python doing? Is it a codepoint ordering?
...ya...how is the python interpreter supposed to know what language strings are in? There is a unique ordering of unicode strings defined by the unicode standard, AFAIK. If you want to sort by natural language ordering, see here: https://pypi.python.org/pypi/natsort

The comparison methods on Python's str are codepoint-by-codepoint. A neat fact about UTF-8 is that bytewise comparisons on UTF-8 are equivalent to doing codepoint comparisons. But this isn't relevant to Python's str, because Python's str never uses UTF-8. -n On Wed, Oct 12, 2016 at 2:45 PM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
-- Nathaniel J. Smith -- https://vorpus.org
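
That "neat fact" is easy to check directly; a tiny standalone illustration (the byte values are just the standard UTF-8 encodings of the two code points):

    #include <stdio.h>
    #include <string.h>

    /* U+00E9 (é) encodes as C3 A9, U+0100 (Ā) as C4 80.  UTF-8 was designed
     * so that bytewise order matches code point order, so memcmp agrees
     * with the code point comparison 0x00E9 < 0x0100. */
    int main(void)
    {
        const unsigned char e_acute[2]  = {0xC3, 0xA9};
        const unsigned char a_macron[2] = {0xC4, 0x80};
        printf("memcmp sign: %s\n",
               memcmp(e_acute, a_macron, 2) < 0 ? "negative (U+00E9 < U+0100)"
                                                : "non-negative");
        return 0;
    }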

On Wed, Oct 12, 2016 at 3:51 PM Nathaniel Smith <njs@pobox.com> wrote:
But this isn't relevant to Python's str, because Python's str never uses UTF-8.
Really? I thought in python 3, strings are all unicode... so what encoding do they use, then?

On 12 October 2016 at 22:57, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
They are stored internally as arrays of code points: 1-byte (0-255) if all code points fit in that range, otherwise 2-byte or, if needed, 4-byte. See PEP 393 (https://www.python.org/dev/peps/pep-0393/) for details. Paul

On Wed, Oct 12, 2016 at 5:57 PM, Elliot Gorokhovsky < elliot.gorokhovsky@gmail.com> wrote:
No encoding is used. The actual code points are stored as integers of the same size. If all code points are less than 256, they are stored as 8-bit integers (bytes). If some code points are greater than or equal to 256 but less than 65536, they are stored as 16-bit integers, and so on.

Ah. That makes a lot of sense, actually. Anyway, so then Latin1 strings are memcmp-able, and others are not. That's fine; I'll just add a check for that (I think there are already helper functions for this) and then have two special-case string functions. Thanks! On Wed, Oct 12, 2016 at 4:08 PM Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:

On Wed, Oct 12, 2016 at 3:34 PM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
I don't think this is true on little-endian systems. -n -- Nathaniel J. Smith -- https://vorpus.org
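
A short standalone illustration of that caveat (nothing CPython-specific; it just shows why a raw memcmp on native 2-byte code units breaks on little-endian machines):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    /* U+00FF vs U+0100: the code point order is a < b, but on a
     * little-endian machine the stored bytes are FF 00 vs 00 01, so
     * memcmp sees 0xFF first and reports a > b. */
    int main(void)
    {
        uint16_t a = 0x00FF;
        uint16_t b = 0x0100;
        printf("code points: a < b\n");
        printf("memcmp on native byte order says: %s\n",
               memcmp(&a, &b, sizeof a) < 0 ? "a < b" : "a > b");
        return 0;
    }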

On 10/12/2016 5:57 PM, Elliot Gorokhovsky wrote:
They are ...
so what encoding do they use, then?
Since 3.3, essentially ascii, latin1, utf-16 without surrogates (ucs2), or utf-32, depending on the highest codepoint. This is the 'kind' field. If we go this route, I suspect that optimizing string sorting will take some experimentation. If the initial item is str, it might be worthwhile to record the highest 'kind' during the type scan, so that strncmp can be used if all are ascii or latin-1. -- Terry Jan Reedy
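
One way Terry's suggestion could be folded into the pre-scan, sketched as a hypothetical helper (the name and signature are illustrative, and it assumes the strings are already "ready" in PEP 393 terms):

    #include <Python.h>

    /* Returns 1 only if every key is exactly a str whose code points all
     * fit in one byte (ascii/latin-1), i.e. the memcmp-based compare is
     * safe; otherwise the general unicode path must be used. */
    static int
    keys_are_all_latin1_str(PyObject **keys, Py_ssize_t n)
    {
        Py_ssize_t i;
        for (i = 0; i < n; i++) {
            if (!PyUnicode_CheckExact(keys[i]))
                return 0;
            if (PyUnicode_KIND(keys[i]) != PyUnicode_1BYTE_KIND)
                return 0;
        }
        return 1;
    }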

On Wed, Oct 12, 2016 at 4:26 PM Terry Reedy <tjreedy@udel.edu> wrote:
My thoughts exactly. One other optimization along these lines: the reason ints don't give quite as shocking results as floats is that comparisons are a bit more expensive: one first has to check that the int would fit in a C long before comparing; if not, then a custom procedure has to be used. However, in practice the ints being sorted are almost always smaller in absolute value than 2**32 or whatever. So I think, just as it might pay off to check for latin-1 and use strcmp, it may also pay off to check for fits-in-a-C-long and use a custom function for that case as well, since the performance would be precisely as awesome as the float performance that started this thread: comparisons would just be the cost of a pointer dereference plus the cost of a C long comparison, i.e. the minimum possible cost.
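
One possible shape for that int fast path (hypothetical code, not the patch; it relies on the PyLongObject layout from longintrepr.h and assumes the pre-scan has verified that every key is an exact int stored in at most one internal digit, i.e. Py_SIZE() is -1, 0, or 1):

    #include <Python.h>
    #include <longintrepr.h>

    /* With every key known to fit in a single digit, "<" reduces to
     * comparing two C longs with the sign folded back in. */
    static int
    unsafe_long_compare(PyObject *v, PyObject *w)
    {
        long lv = Py_SIZE(v) == 0 ? 0 : (long)((PyLongObject *)v)->ob_digit[0];
        long lw = Py_SIZE(w) == 0 ? 0 : (long)((PyLongObject *)w)->ob_digit[0];
        if (Py_SIZE(v) < 0) lv = -lv;
        if (Py_SIZE(w) < 0) lw = -lw;
        return lv < lw;
    }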

On Wed, Oct 12, 2016 at 5:36 AM Paul Moore <p.f.moore@gmail.com> wrote:
Yes, that's correct. I'd like to emphasize that I'm not "*removing* type checks" -- I'm checking them in advance, and then substituting in the values I already know are correct. To put it rigorously: there are expressions of the form PyWhatever_Check. I can be eager or lazy about how I calculate these. The current implementation is lazy: it waits until the values are actually called for before calculating them. This is expensive, because they are called for many, many times. My implementation is eager: I calculate all the values in advance, and then if they all happen to be the same, I plug in that value (1 or 0 as the case may be) wherever it appears in the code. If they don't happen to all be the same, like for "mixed int and float", then I just don't change anything and use the default implementation. The code for this is really very simple:

    int keys_are_all_same_type = 1;
    PyTypeObject *key_type = lo.keys[0]->ob_type;

    /* pre-scan: one pointer comparison per element */
    for (i = 0; i < saved_ob_size; i++) {
        if (lo.keys[i]->ob_type != key_type) {
            keys_are_all_same_type = 0;
            break;
        }
    }

    /* pick the comparison function once, up front */
    if (keys_are_all_same_type) {
        if (key_type == &PyUnicode_Type)
            compare_function = unsafe_unicode_compare;
        else if (key_type == &PyLong_Type)
            compare_function = unsafe_long_compare;
        else if (key_type == &PyFloat_Type)
            compare_function = unsafe_float_compare;
        else
            compare_function = key_type->tp_richcompare;
    }
    else {
        compare_function = PyObject_RichCompare;
    }

Those unsafe_whatever* functions are derived by substituting in, like I said, the known values for the typechecks (known since keys_are_all_same_type=1 and key_type = whatever) in the existing implementations of the compare functions. Hope everything is clear now! Elliot

Paul Moore wrote:
What I'm *not* quite clear on is why Python 3's change to reject comparisons between unrelated types makes this optimisation possible.
I think the idea was that it's likely to be *useful* a higher proportion of the time, because Python 3 programmers have to be careful that the types they're sorting are compatible. I'm not sure how true that is -- just because you *could* sort lists containing a random selection of types in Python 2 doesn't necessarily mean it was done often. -- Greg

participants (11)
- Alexander Belopolsky
- Chris Angelico
- Elliot Gorokhovsky
- Greg Ewing
- MRAB
- Nathaniel Smith
- Nick Coghlan
- Paul Moore
- Steven D'Aprano
- Terry Reedy
- Tim Peters