[Python-ideas] Exploiting type-homogeneity in list.sort() (again!)

Wed Mar 8 06:07:45 EST 2017

On Wed, Mar 08, 2017 at 01:20:19AM +0000, Erik wrote:

> >Part of the complexity here is that I'd like this flag to be available
> >to Python code, not just a hidden internal state of the list.
> 
> Out of interest, for what purpose? Generally, I thought Python code 
> should not need to worry about low-level optimisations such as this 
> (which are C-Python specific AIUI). 

I mentioned earlier that I have code which has to track the type of list 
items, and swaps to a different algorithm when the types are not all the 
same. In practice, I just run the "mixed types" algorithm regardless, 
because the cost of doing a scan of the items in pure Python is too 
expensive relative to the time saved. Moving some of the work into the C 
infrastructure might change that.

I'm not completely sure that this would, in fact, be useful to me, but 
I'd like the opportunity to experiment. I could try using a list 
subclass, but again, the cost of doing the type-checking in Python 
instead of C is (I think) prohibitive. Nevertheless, I understand that 
the burden of proving the usefulness of this is on me. (Or anyone else 
that wants to argue the case.)

> A list.is_heterogeneous() method 
> could be implemented if it was necessary, but how would that be used?

I would prefer to get the list item's type:

if mylist.__type_hint__ is float:
    # run optimized float version
    ...
elif mylist.__type_hint__ is int:
    ...
else:
    # unoptimized version

> >But also avoids bothering with an O(N) scan in some situations where
> >the list really is hetrogeneous. So there's both an opportunity cost and
> >a benefit.
> 
> O(N) is worst case.

It is also the best and average case.

Given a homogenous list of N items, for some arbitrary N between 0 and 
∞, you have to look at all N items before knowing that they're all the 
same type. So for the homogenous case, the best, worst and average are 
identically O(N).

Given a hetrogeneous list of N items where the first difference is 
found at index K, K can range from 1 through N-1. (By definition, it 
cannot be at index 0: you can only detect a difference in types after 
checking *two* items.) The worst case is that you don't find the 
difference until the last item, which is O(N). The best case is that you 
find the difference in position 1 and bail out early. On average, you 
will find the first difference halfway through the list, which makes it 
O(N/2) but that's just O(N). (Constant factors don't matter.)

If you want to call that O(1) for the best hetrogeneous case, I won't 
argue except to say that's rather against the spirit of Big Oh analysis 
in my opinion. I think it's more realistic to say its O(N) across all 
combinations of best/worst/average and homogeneous/hetrogeneous. But of 
course if you bail out early, the constant multiplier may differ.

> Most of the anecdotal evidence in this thread so far seems to suggest 
> that heterogeneous lists are not common. May or may not be true. 
> Empirically, for me, it is true. Who knows? (and there is the question).

That will strongly depend on where the data is coming from, but when it 
comes to sorting, randomly mixed types will usually fail since 
comparisons between different types are generally illegal:

py> [1, "a"].sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: str() < int()

-- 
Steve