
On Thu, Oct 13, 2016 at 8:19 AM, Elliot Gorokhovsky <elliot.gorokhovsky@gmail.com> wrote:
> My first question was how expensive python compares are vs C compares. And since python 2 has PyString_AS_STRING, which just gives you a char* pointer to a C string, I went in and replaced PyObject_RichCompareBool with strcmp and did a simple benchmark. And I was just totally blown away; it turns out you get something like a 40-50% improvement (at least on my simple benchmark).
>
> So that was the motivation for all this. Actually, if I wrote this for python 2, I might be able to get even better numbers (at least for strings), since we can't use strcmp in python 3. (Actually, I've heard UTF-8 strings are strcmp-able, so maybe if we go through and verify all the strings are UTF-8 we can strcmp them? I don't know enough about how PyUnicode stuff works to do this safely).
I'm not sure what you mean by "strcmp-able"; do you mean that the lexical ordering of two Unicode strings is guaranteed to be the same as the byte-wise ordering of their UTF-8 encodings? I don't think that's true, but then, I'm not entirely sure how Python currently sorts strings. Without knowing which language the text represents, it's not possible to sort perfectly.
https://en.wikipedia.org/wiki/Collation#Automated_collation
"""
Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in German dictionaries the word ökonomisch comes between offenbar and olfaktorisch, while Turkish dictionaries treat o and ö as different letters, placing oyun before öbür.
"""
Which means these lists would already be considered sorted, in their respective languages:
rosuav@sikorsky:~$ python3
Python 3.7.0a0 (default:a78446a65b1d+, Sep 29 2016, 02:01:55)
[GCC 6.1.1 20160802] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> sorted(["offenbar", "ökonomisch", "olfaktorisch"])
['offenbar', 'olfaktorisch', 'ökonomisch']
>>> sorted(["oyun", "öbür", "parıldıyor"])
['oyun', 'parıldıyor', 'öbür']
So what's Python doing? Is it a codepoint ordering?
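One rough way to probe both questions is to sort a few sample strings three ways and compare the results. This is only a sketch over a handful of samples, not a proof; the variable names are made up for illustration, and the strings are assumed to use precomposed characters (so "ö" is the single code point U+00F6).

```python
# Sample words from the two lists above (illustrative only).
words = ["offenbar", "ökonomisch", "olfaktorisch", "oyun", "öbür", "parıldıyor"]

# 1. Python's default str ordering.
default_order = sorted(words)

# 2. Explicit code point ordering: key each string by its list of ord() values.
codepoint_order = sorted(words, key=lambda s: [ord(c) for c in s])

# 3. Byte-wise ordering of the UTF-8 encodings (bytes compare byte by byte).
#    Note: this mimics memcmp-style comparison; C's strcmp would additionally
#    stop at the first NUL byte, which none of these samples contain.
utf8_order = sorted(words, key=lambda s: s.encode("utf-8"))

same_as_codepoints = default_order == codepoint_order
same_as_utf8 = default_order == utf8_order

print(default_order)
print("matches code point order:", same_as_codepoints)
print("matches UTF-8 byte order:", same_as_utf8)
```

If all three orderings agree on these samples, that is consistent with a plain code point ordering, which would also explain why 'ö' (U+00F6) lands after both 'o' and 'p' regardless of language.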
ChrisA