Thanks for looking at this!
> That's why I spent months of my life (overall) devising a sequence of
> sorting algorithms for Python that reduced the number of comparisons needed.
Yes, that's why I think this is so cool: for a couple dozen lines of code,
we can get (at least for some cases, according to my questionable
benchmarks) the kinds of massive improvements you had to use actual
computer science to achieve (as opposed to mere hackery).
> Note that when Python's current sort was adopted in Java, they still kept a
> quicksort variant for "unboxed" builtin types. The adaptive merge sort
> incurs many overheads that often cost more than they save unless
> comparisons are in fact very expensive compared to the cost of pointer
> copying (and in Java comparison of unboxed types is cheap). Indeed, for
> native numeric types, where comparison is dirt cheap, quicksort generally
> runs faster than mergesort despite that the former does _more_ comparisons
> (because mergesort does so much more pointer-copying).
Ya, I think this may be a good approach for floats: if the list is all
floats, just copy all the floats into a separate array, use the standard
library quicksort, and then construct a sorted PyObject* array. Like maybe
set up a struct { PyObject* payload, float key } type of deal. This
wouldn't work for strings (unicode is scary), and probably not for ints
(one would have to check that all the ints are within C long bounds).
Though on the other hand perhaps this would be too expensive?
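A pure-Python sketch of that decorate/sort/undecorate idea, just to show its
shape (the real win would come from sorting a C array of { payload pointer,
double key } pairs with a native sort, which this can't capture):

    def sort_by_unboxed_key(objs):
        pairs = [(float(o), o) for o in objs]   # "decorate" each object with its raw key
        pairs.sort(key=lambda p: p[0])          # C code would sort on the double directly
        objs[:] = [o for _, o in pairs]         # "undecorate"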
> I had considered something "like this" for Python 2, but didn't pursue it
> because comparison was defined between virtually any two types (34 < [1],
> etc), and people were careless about that (both by design and by
> accident). In Python 3, comparison "blows up" for absurdly mixed types, so
> specializing for homogeneously-typed lists is a more promising idea on the
> face of it.
> The comparisons needed to determine _whether_ a list's objects have a
> common type are just len(list)-1 C-level pointer comparisons, and so go
> fast. So I expect that, when it applies, this would speed even sorting an
> already-ordered list with at least 2 elements.
That's what my crude benchmarks indicate... when I applied my sort to a
list of 1e7 ints with a float tacked on the end, it actually ended up
being a bit faster over several trials (which I attribute to
PyObject_RichCompare == Py_True being faster than PyObject_RichCompareBool
== 1, apologies for any typos in that code).
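A rough pure-Python analogue of that pre-scan, for reference (the real check
compares Py_TYPE pointers at the C level, so it is far cheaper than this):

    def have_common_type(xs):
        if not xs:
            return True
        # len(xs) - 1 type checks; in C these are plain pointer comparisons
        t = type(xs[0])
        return all(type(x) is t for x in xs[1:])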
> For a mixed-type list with at least 2 elements, it will always be pure
> loss. But (a) I expect such lists are uncommon (and especially uncommon in
> Python 3); and (b) a one-time scan doing C-level pointer comparisons until
> finding a mismatched type is bound to be a relatively tiny cost compared to
> the expense of all the "rich comparisons" that follow.
> So +1 from me on pursuing this.
> Elliot, please:
> - Keep this on python-ideas. python-dev is for current issues in Python
> development, not for speculating about changes.
> - Open an issue on the tracker: https://bugs.python.org/
OK
> - At least browse the info for developers:
> https://docs.python.org/devguide/
Ya, I'm working on setting this up as a patch in the hg repo as opposed to
an extension module to make benchmarking cleaner/more sane.
> - Don't overlook Lib/test/sortperf.py. As is, it should be a good test of
> what your approach so far _doesn't_ help, since it sorts only lists of
> floats (& I don't think you're special-casing them). If the timing results
> it reports aren't significantly hurt (and I expect they won't be), then add
> specialization for floats too and gloat about the speedup :-)
Ya, I mean they aren't special-cased, but homogeneous lists of floats still
fit in the tp_richcompare case, which still bypasses the expensive
PyObject_RichCompare. I guess I'll see when I implement this as a patch
and can run it on sortperf.py.
> - I expect tuples will also be worth specializing (complex sort keys are
> often implemented as tuples).
I'm not sure what you mean here... I'm looking at the types of lo.keys, not
of saved_ob_item (I think I said that earlier in this thread by mistake
actually). So if someone is passing tuples and using itemgetter to extract
ints or strings or whatever, the current code will work fine; lo.keys will
be scalar types. Unless I misunderstand you here. I mean, when would
lo.keys actually be tuples?
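For what it's worth, a quick illustration (with made-up data) of when the keys
really are tuples:

    from operator import itemgetter

    rows = [("b", 2), ("a", 3), ("a", 1)]
    rows.sort()                             # no key function: the keys *are* the tuples
    rows.sort(key=itemgetter(0, 1))         # itemgetter with several indices -> tuple keys
    rows.sort(key=lambda r: (r[0], -r[1]))  # composite keys are usually built as tuples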
> Nice start! :-)
Thanks!
Sorry if I missed the boat, but only just now saw this PEP.
Glancing through the PEP, I don't see mentioned anywhere the SQL
alternative of having a coalesce() function:
https://www.postgresql.org/docs/9.6/static/functions-conditional.html#FUNCT…
In Python, something like this:
    def coalesce(*args):
        for arg in args:
            if arg is not None:
                return arg
        return None
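A quick usage illustration:

    print(repr(coalesce(None, '', 42)))   # '' -- first non-None value, even if falsy
    print(coalesce(None, None, 42))       # 42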
Just drop it into builtins, and voila. No need for lengthy discussions
about which operator to use because IMHO it needs no operator.
Sure, it's not as sexy as a fancy new operator, nor as headline grabbing,
but it is pretty useful.
Just my 2p.
--
Gustavo J. A. M. Carneiro
Gambit Research
"The universe is always one step beyond logic." -- Frank Herbert
On Oct 14, 2016 9:14 AM, "Gustavo Carneiro" <gjcarneiro(a)gmail.com> wrote:
>
> Sorry if I missed the boat, but only just now saw this PEP.
>
> Glancing through the PEP, I don't see mentioned anywhere the SQL
alternative of having a coalesce() function:
https://www.postgresql.org/docs/9.6/static/functions-conditional.html#FUNCT…
>
> In Python, something like this:
>
> def coalesce(*args):
>     for arg in args:
>         if arg is not None:
>             return arg
>     return None
>
> Just drop it into builtins, and voila. No need for lengthy discussions
about which operator to use because IMHO it needs no operator.
>
> Sure, it's not as sexy as a fancy new operator, nor as headline grabbing,
but it is pretty useful.
That function is for a different purpose. It selects the first non-null
value from a flat collection. The coalesce operators are for traveling down
a reference tree, and shortcutting out without an exception if the path
ends.
For example:

    return x?.a?.b?.c

instead of:

    if x is None: return None
    if x.a is None: return None
    if x.a.b is None: return None
    return x.a.b.c
You can use try-catch, but you might catch an irrelevant exception.
    try:
        return x.a.b.c
    except AttributeError:
        return None
If `x` is an int, `x.a` will throw an AttributeError even though `x` is not
None.
A function for the above case is:
    def coalesce(obj, *names):
        for name in names:
            if obj is None:
                return None
            obj = getattr(obj, name)
        return obj

    return coalesce(x, 'a', 'b', 'c')
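(An equivalent one-liner, just as a sketch of another shape for the same helper:)

    from functools import reduce

    def coalesce(obj, *names):
        return reduce(lambda o, name: None if o is None else getattr(o, name),
                      names, obj)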
See this section for some examples:
https://www.python.org/dev/peps/pep-0505/#behavior-in-other-languages
(The PEP might need more simple examples. The Motivating Examples are full
chunks of code from real libraries, so they're full of distractions.)
On Thu, Oct 13, 2016 at 1:36 PM, Марк Коренберг <socketpair(a)gmail.com> wrote:
>
> I mean mutable containers that are always sorted when iterating over them.
>
> See http://bugs.python.org/issue28433
>
> for example:
>
> * SortedSet (sorted unique elements, implemented using (rb?)tree instead of hash)
> * SortedList (sorted elements, the same as SortedSet, but without a uniqueness constraint) - actually a (rb?)tree, not a list (i.e. not an array)
> * SortedDict (sorted by key when iterating) - like C++'s ordered_map
Can you share more about your use cases for these containers? What are
you making?
Nick Coghlan gave an answer to this question on StackOverflow at
http://stackoverflow.com/a/5958960/232571. The answer kind of boils
down to "there should be one obvious way to do it" and the fact that
existing Python features like lists, sorted, bisect, and heapq cover
many use cases.
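For instance, a small sketch of the bisect-based pattern those features support:

    import bisect

    scores = []
    for s in (31, 7, 19, 3):
        bisect.insort(scores, s)            # keeps the plain list sorted on insert
    print(scores)                           # [3, 7, 19, 31]
    print(bisect.bisect_left(scores, 19))   # O(log n) position lookup -> 2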
I wrote the answer that is now the second highest rated for that
question. I've noticed that the upvotes have been accumulating at a
slightly higher rate than on Nick's answer. I think that reflects an
increase in interest and maybe a gradual change in the tide of opinion.
> There are many implementations in the net, like:
>
> http://www.grantjenks.com/docs/sortedcontainers
That's my project; I'm the primary developer of SortedContainers.
You may also be interested in the
[SortedCollections](http://www.grantjenks.com/docs/sortedcollections/)
module which builds atop SortedContainers with data types like
ValueSortedDict and ItemSortedDict. Because it's pure-Python,
SortedContainers offers a lot of opportunity for
extension/customization. That's also made it easier for the API to
adapt/expand over time.
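For anyone who hasn't used it, a minimal usage sketch (assuming
`pip install sortedcontainers`):

    from sortedcontainers import SortedList, SortedDict

    sl = SortedList([5, 1, 3])
    sl.add(2)                    # stays sorted: SortedList([1, 2, 3, 5])
    print(sl[0], sl[-1])         # 1 5

    sd = SortedDict({'b': 2, 'a': 1})
    print(list(sd))              # ['a', 'b'] -- iteration is in key order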
> I think there should be one standardized implementation of such containers in CPython.
>
> Instead of trees, the implementation may use a skip list structure, but that is just an implementation detail.
>
> Such structures imply fast insertion and deletion, the ability to iterate, and also memory efficiency.
I gave a talk at PyCon 2016 about Python Sorted Collections[1] that's
worth watching. The first third discusses about six different
implementations with different strengths and weaknesses. The choice of
data type is more than an implementation detail. One of the biggest
issues is the various tradeoffs of data types like blists, rbtrees,
etc.
I have been meaning to discuss sorted collections further with Raymond
Hettinger (author of the Python collections module). We spoke after
the PyCon talk and wanted to continue the conversation. But I had a
busy summer and just a few weeks ago welcomed my first son into the
world. So realistically that conversation won't happen until 2017.
[1]: http://www.grantjenks.com/docs/sortedcontainers/pycon-2016-talk.html
> I recommend to read thorough review articles written by Andrew Barnert:
>
> http://stupidpythonideas.blogspot.com/2013/07/sorted-collections-in-stdlib.…
>
> http://stupidpythonideas.blogspot.com/2014/04/sortedcontainers.html
One of Andrew Barnert's conclusions is that SortedContainers could not
scale. I did a pretty rigorous performance analysis and benchmarking
at http://www.grantjenks.com/docs/sortedcontainers/performance-scale.html
Short answer: I scaled SortedContainers up through ten billion
elements, well past the memory limits of most machines.
Background: I asked a stackoverflow question here
<http://stackoverflow.com/questions/40029807/why-does-the-class-definitions-…>
.
The Python documentation
<https://docs.python.org/3.6/reference/datamodel.html#determining-the-approp…>
is very confusing to me. It says that:
if an explicit metaclass is given and it is not an instance of type, then
it is used directly as the metaclass
This seems to suggest that in this case, the "explicit metaclass" does not
need to be a "subtype of all of these candidate metaclasses" as it would in
the third case. (This is not true.)
Also, providing a callable as a metaclass doesn't seem to be any more
flexible, readable, or powerful than providing an instance of type.
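A minimal sketch contrasting the two kinds of explicit metaclass (the names
here are made up for illustration):

    class Meta(type):                       # an instance of type: the usual case
        pass

    def func_meta(name, bases, namespace):  # not an instance of type: used directly
        cls = type(name, bases, namespace)
        cls.built_by = 'func_meta'
        return cls

    class A(metaclass=Meta): pass
    class B(metaclass=func_meta): pass

    print(type(A))               # <class '__main__.Meta'>
    print(type(B), B.built_by)   # <class 'type'> func_meta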
Therefore, I suggest that we deprecate the second case and replace the
entire section (3.3.3.2) of the documentation to say:
"The metaclass of a class definition is selected from the explicitly
specified metaclass (if any) and the metaclasses (i.e. type(cls)) of all
specified base classes. The most derived metaclass is one which is a
subtype of all of these candidate metaclasses. If none of the candidate
metaclasses meets that criterion, then the class definition will fail with
TypeError. If provided, the explicit metaclass must be an instance of
type()."
On Thu, Oct 13, 2016 at 5:17 PM, Stephen J. Turnbull
<turnbull.stephen.fw(a)u.tsukuba.ac.jp> wrote:
> Chris Angelico writes:
>
> > I'm not sure what you mean by "strcmp-able"; do you mean that the
> > lexical ordering of two Unicode strings is guaranteed to be the same
> > as the byte-wise ordering of their UTF-8 encodings?
>
> This is definitely not true for the Han characters. In Japanese, the
> most commonly used lexical ordering is based on the pronunciation,
> meaning that there are few characters (perhaps none) in common use
> that have a unique place in lexical ordering (most individual
> characters have multiple pronunciations, and even many whole personal
> names do).
Yeah, and even just with Latin-1 characters, you have (a) non-ASCII
characters that sort between ASCII characters, and (b) characters that
have different meanings in different languages, and should be sorted
differently. So lexicographical ordering is impossible in a generic
string sort.
ChrisA
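A couple of quick illustrations that code-point order is not linguistic order
(the locale call assumes a suitable locale is configured on the system):

    import locale

    print(sorted(["apple", "Zebra"]))   # ['Zebra', 'apple'] -- 'Z' (U+005A) < 'a' (U+0061)
    print("\u00e9" > "z")               # True -- 'é' sorts after every ASCII letter

    locale.setlocale(locale.LC_COLLATE, "")
    print(sorted(["cote", "côté", "côte", "coté"], key=locale.strxfrm))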
To answer your question: I special-case unicode (strings), ints, and
floats. I am working on special-casing tuples (can even be different types,
just need homogeneity column-wise). The best speedups will be tuples of
floats: it'll bypass three layers of useless checks.
If I run it without special-casing floats (just using tp_richcompare) I
only get like a 5-10% speedup. I'm working on rigorous benchmarks for all
this stuff, will post a pdf along with the patch once it's all done. But
it's certainly <10%. However, part of this is because my special case is
really low-level; for strings I've actually found the opposite, using
tp_richcompare gives me almost the same results as my special case
compare, since it still has to PyUnicode_READY the strings (or whatever
it's called).
Regarding generalization: the general technique for special-casing is that you
just replace each type check with 1 or 0 by applying the type assumption
you're making. That's the only way to guarantee it's safe and compliant.
Elliot
On Tue, Oct 11, 2016, 5:19 PM Jim J. Jewett <jimjjewett(a)gmail.com> wrote:
> Excellent.
> I'm surprised cache didn't save more, but less surprised than I was ... I
> hadn't realized that you were skipping the verifications in
> PyFloat_RichCompare as well. Does that generalize to other element types
> without exposing too much of the per-type internals to list.sort?
>
> Oh ... and I appreciate your not quoting private email as a general
> courtesy, but I hereby give you permission if it was mine that was private.
> [Though I think your summary was better than a quote anyhow.]
>
> -jJ
>
> On Oct 11, 2016 4:58 PM, "Elliot Gorokhovsky" <
> elliot.gorokhovsky(a)gmail.com> wrote:
>
> So I got excited here. And the reason why is that I got those numbers *on
> Tim's benchmark*. When I got these kinds of numbers on my benchmarks, I
> figured there was probably a problem with they way I was timing, and
> certainly the gains couldn't be as extreme as they suggested. But this is
> on a benchmark that's already in the codebase!
>
>
> Here is a detailed explanation of how to reproduce my results, and the
> circumstances that would lead them to be invalid:
>
> ******************************************
>
> To reproduce, just activate a virtualenv, and then clone
> https://github.com/embg/python-fast-listsort.git. Then python setup.py
> install and python sortperf.py.
>
>
> Now let's look at what sortperf.py does and how it relates to Tim's
> benchmark at Lib/test/sortperf.py. If you diff the two, you'll find I made
> three changes:
>
>
> 1. I added an import, "import fastlist". This obviously would not make
> sorting twice faster.
>
>
> 2. I changed the way it formats the output: I changed "fmt = ("%2s %7s" +
> " %7s"*len(cases))" to "fmt = ("%2s %7s" + " %6s"*len(cases))". Again
> irrelevant.
>
>
> 3. I changed the timing function
>
> #from this
>
>
> def doit_fast(L):
>     t0 = time.perf_counter()
>     L.fastsort()
>     t1 = time.perf_counter()
>     print("%6.2f" % (t1-t0), end=' ')
>     flush()
>
>
>
> #to this
>
>
> def doit(L):
>     F = FastList(L)
>     f0 = time.perf_counter()
>     F.fastsort()
>     f1 = time.perf_counter()
>     F = FastList(L)
>     t0 = time.perf_counter()
>     F.sort()
>     t1 = time.perf_counter()
>     print("%6.2f%%" % (100*(1-(f1-f0)/(t1-t0))), end=' ')
>     flush()
>
>
> *******************************************
>
> So what we've shown is that (1) if you trust the existing sorting
> benchmark and (2) if my modification to doit() doesn't mess anything up (I
> leave this up to you to judge), then the measurements are valid. Which
> is a pretty big deal (50% !!!!!!!), hence my overexcitement.
>
> ****************************************
>
>
> Now I'd like to respond to the responses I've gotten (the one I'm thinking
> of was off-list so I don't want to quote it) questioning how it could be
> possible for such a small optimization, bypassing the typechecks, to have
> such a large effect, even in theory. Here's my answer:
>
> Let's ignore branch prediction and cache for now and just look at a high
> level. The cost of sorting is related to the cost of a single comparison,
> because the vast majority of our time (let's say certainly at least 90%,
> depending on the list) is spent in comparisons. So let's look at the cost
> of a comparison.
>
> Without my optimization, comparisons for floats (that's what this
> benchmark looks at) go roughly like this:
>
> 1. Test type of left and right for PyObject_RichCompare (which costs two
> pointer dereferences) and compare them. "3 ops" (quotes because counting
> ops like this is pretty hand-wavy). "2 memory accesses".
>
> 2. Get the address of the float compare method from
> PyFloat_Type->tp_richcompare. "1 op". "1 memory access".
>
> 3. Call the function whose address we just got. "1 op". "Basically 0
> memory accesses because we count the stack stuff in that 1 op".
>
> 4. Test type of left and right again in PyFloat_RichCompare and compare
> both of them to PyFloat_Type. "4 ops". "2 memory accesses".
>
> 5. Get floats from the PyObject* by calling PyFloat_AS_DOUBLE or whatever.
> "2 ops". "2 memory accesses".
>
> 6. Compare the floats and return. "2 ops".
>
> Now let's tally the "cost" (sorry for use of quotes here, just trying to
> emphasize this is an intuitive, theoretical explanation for the numbers
> which doesn't take into account the hardware):
> "13 ops, 7 memory accesses".
>
> Here's what it looks like in my code:
>
> 1. Call PyFloat_AS_DOUBLE on left and right. "2 ops". "2 memory accesses".
>
> 2. Compare the floats and return. "2 ops".
>
> Tally: "4 ops, 2 memory accesses".
>
> Now you can argue branch prediction alleviates a lot of this cost, since
> we're taking the same branches every time. But note that, branch prediction
> or not, we still have to do all of those memory accesses, and since they're
> pointers to places all over memory, they miss the cache basically every
> time (correct me if I'm wrong). So memory-wise, we really are doing
> something like a 7:2 ratio, and op-wise, perhaps not as bad because of
> branch prediction, but still, 13:4 is probably bad no matter what's going
> on in the hardware.
>
> Now consider that something like 90% of our time is spent in those steps.
> Are my numbers really that unbelievable?
>
> Thanks for everything, looking forward to writing this up as a nice latex
> doc with graphs and perf benchmarks and all the other rigorous goodies, as
> well as a special case cmp func for homogeneous tuples and a simple patch
> file,
>
> Elliot
>
>
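As a rough sanity check on those numbers, an Amdahl's-law style calculation
using only the estimates quoted above (the 90% figure and the 13-vs-4 op
counts):

    compare_fraction = 0.90        # share of sort time spent in comparisons (estimate above)
    per_compare_speedup = 13 / 4   # "13 ops" down to "4 ops"
    overall = 1 / ((1 - compare_fraction) + compare_fraction / per_compare_speedup)
    print(overall)                 # ~2.65x, i.e. comfortably more than a 50% time reduction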
[please restrict follow-ups to python-ideas]
Let's not get hung up on meta-discussion here - I always thought "massive
clusterf**k" was a precise technical term anyway ;-)
While timing certainly needs to be done more carefully, it's obvious to me
that this approach _should_ pay off significantly when it applies.
Comparisons are extraordinarily expensive in Python, precisely because of
the maze of test-and-branch code it requires just to figure out which
bottom-level comparison function to invoke each time. That's why I spent
months of my life (overall) devising a sequence of sorting algorithms for
Python that reduced the number of comparisons needed.
Note that when Python's current sort was adopted in Java, they still kept a
quicksort variant for "unboxed" builtin types. The adaptive merge sort
incurs many overheads that often cost more than they save unless
comparisons are in fact very expensive compared to the cost of pointer
copying (and in Java comparison of unboxed types is cheap). Indeed, for
native numeric types, where comparison is dirt cheap, quicksort generally
runs faster than mergesort despite that the former does _more_ comparisons
(because mergesort does so much more pointer-copying).
I had considered something "like this" for Python 2, but didn't pursue it
because comparison was defined between virtually any two types (34 < [1],
etc), and people were careless about that (both by design and by
accident). In Python 3, comparison "blows up" for absurdly mixed types, so
specializing for homogeneously-typed lists is a more promising idea on the
face of it.
The comparisons needed to determine _whether_ a list's objects have a
common type are just len(list)-1 C-level pointer comparisons, and so go
fast. So I expect that, when it applies, this would speed even sorting an
already-ordered list with at least 2 elements.
For a mixed-type list with at least 2 elements, it will always be pure
loss. But (a) I expect such lists are uncommon (and especially uncommon in
Python 3); and (b) a one-time scan doing C-level pointer comparisons until
finding a mismatched type is bound to be a relatively tiny cost compared to
the expense of all the "rich comparisons" that follow.
So +1 from me on pursuing this.
Elliot, please:
- Keep this on python-ideas. python-dev is for current issues in Python
development, not for speculating about changes.
- Open an issue on the tracker: https://bugs.python.org/
- At least browse the info for developers:
https://docs.python.org/devguide/
- Don't overlook Lib/test/sortperf.py. As is, it should be a good test of
what your approach so far _doesn't_ help, since it sorts only lists of
floats (& I don't think you're special-casing them). If the timing results
it reports aren't significantly hurt (and I expect they won't be), then add
specialization for floats too and gloat about the speedup :-)
- I expect tuples will also be worth specializing (complex sort keys are
often implemented as tuples).
Nice start! :-)
Hi guys,
I'm writing some code that uses `threading.Condition` and I found that I
want to access condition._waiters. I want to do it in two different parts
of my code for two different reasons:
1. When shutting down the thread that manages the condition, I want to be
sure that there are no waiters on the condition, so I check whether
`condition._waiters` is empty before exiting; otherwise I'll let them
finish and only then exit. (One workaround is sketched after this list.)
2. When I do notify_all, I actually want to do as many notify actions as
needed until there's a full round of notify_all in which none of the
conditions for any of the waiters have been met. Only then do I want my
code to continue. (It's because these waiters are waiting for resources
that I'm giving them, each wanting a different number of resources, and I
want to be sure that all of them are starved before I get more resources
for them.)
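For the first case, a sketch of one way to avoid touching the private
`_waiters` list: keep an explicit count next to the condition (the class name
here is made up, and this is only a minimal illustration):

    import threading

    class CountedCondition:
        def __init__(self):
            self._cond = threading.Condition()
            self._count = 0

        def wait_for(self, predicate):
            with self._cond:
                self._count += 1
                try:
                    return self._cond.wait_for(predicate)
                finally:
                    self._count -= 1

        def notify_all(self):
            with self._cond:
                self._cond.notify_all()

        def waiters(self):
            with self._cond:
                return self._count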
Do you think it'll be a good idea to add non-private functionality like
that to threading.Condition?
Thanks,
Ram.