[Python-ideas] Exploiting type-homogeneity in list.sort() (again!)

Tue Mar 7 19:18:15 EST 2017

On Tue, Mar 07, 2017 at 08:46:45PM +0000, Erik wrote:

> I don't think anyone has mentioned this yet, but FWIW I think the 'type 
> hint' may need to be tri-state: heterogeneous (NULL), homogeneous (the 
> pointer to the type structure) and also "unknown" (a sentinel value - 
> the address of a static char or something).

I thought about that and rejected it as an unnecessary complication. 
Hetrogeneous and unknown might as well be the same state: either way, 
you cannot use the homogeneous-type optimization.

Part of the complexity here is that I'd like this flag to be available 
to Python code, not just a hidden internal state of the list.

But my instinct for this could be wrong -- if anyone is interested to do 
some experiments on this, it might be that a three-state flag works out 
better in practice than two-states.

> Otherwise, a delete operation on the list would need to scan the list to 
> work out if it had changed from heterogeneous to homogeneous (unless it 
> was acceptable that once heterogeneous, a list is always considered 
> heterogeneous - i.e., delete always sets the hint to NULL).

In a later email, you corrected this: a delete operation need not touch 
the type-hint (except when the last item is deleted, at which point it 
resets to None/unknown.

With a three-state flag, you can make a three-way decision:

If the flag is Unknown:
    - do a O(N) scan to determine whether the list is homogeneous 
      or hetrogeneous, then choose the optimized or unoptimized
      routine;

if the flag is Hetrogeneous:
    - there is no point doing a scan, so always choose the 
      unoptimized routine;

if the flag is a type:
    - the list is definitely homogeneous, so depending on the
      type, you may be able to choose an optimized rountine.

Compared to that, a two-state flag misses some opportunities to run 
the optimized routine:

- list contains [1, 2, 3, 4, "foo"] so the hint is set to None;
- delete the last item;
- list is now homogeneous but the hint is still None.

But also avoids bothering with an O(N) scan in some situations where 
the list really is hetrogeneous. So there's both an opportunity cost and 
a benefit.

There may be other opportunities to do the scan, such as when the 
underlying pre-allocated array resizes, so even with the two-state flag, 
"unknown" need not stay unknown forever.

> I'd prefer the sort optimization to be based on what my list contains 
> NOW, not on what it may have contained some time in the past, 

Remember, we're talking about opportunities for applying an optimization 
here, nothing more. You're not giving up anything: at worst, the 
ordinary, unoptimized routine will run and you're no worse off than you 
are today.

> so I'm not 
> a fan of the "once heterogeneous, always considered heterogeneous" 
> behaviour if it's cheap enough to avoid it.

It is not just a matter of the cost of tracking three states versus two. 
It is a matter of the complexity of the interface.

I suppose this could be reported to Python code as None, False or the 
type. Although, I don't know about you, but I know I'll never be able to 
remember whether None means Unknown and False means Hetrogeneous, or the 
other way around.

Ultimately, this is all very pie-in-the-sky unless somebody tests just 
how expensive this is and whether the benefit is worthwhile. That's not 
going to be me: I'm not able to hack on the list C code.

-- 
Steve