[Python-ideas] Exploiting type-homogeneity in list.sort() (again!)

Elliot Gorokhovsky elliot.gorokhovsky at gmail.com
Sun Mar 5 21:57:44 EST 2017


This is exactly what I would have done if I didn't think the changes would
be too large-scale to ever be accepted (at least if proposed by someone
without any experience). This is also what my friend suggested when I
explained my project to him. In fact, this is exactly what dictionaries do:
at each insert, they check whether a string is being inserted. If yes, then
stay in a "string" state, if no, then switch to your None state.

Again, the only reason I did it differently is because I wanted to be
realistic about what changes I could get accepted. My patch does everything
inside the list sort function, and doesn't modify anything outside of that.
So it's presumably much easier to review. I mean, I submitted my patch in
November, and it still hasn't been accepted :) so presumably something as
large-scale as this would have very little chance.

One more thing: changing mutating methods to check type would make them
non-trivially slower; if you aren't resizing the list, appending should
really just be as simple as storing a pointer and incrementing a counter.
If we have to check type, that turns two simple things into three simple
things, which could have significant performance implications. Most
applications of lists *don't* care about type homogeneity (i.e. iterating
through the list or whatever), so one would have to justify imposing this
cost on all list use just to optimize for the special cases where you *do*
care. I'm not sure how one would go about checking if that trade-off is a
good one or not: what is the distribution of list use across all Python
code? It's not something you can really check statically (it's undecidable
if you're being pedantic). With my patch, it's easy: if you aren't sorting,
you don't have to pay anything. If you are sorting, you either win or lose,
based on whether or not you're sorting type-homogeneous data or not (which
you basically always are).

If anything, I think my patch could be a starting point for later
optimization along these lines, depending on whether the problem raised in
the previous paragraph can be addressed. What do you think?

On Sun, Mar 5, 2017 at 7:47 PM Steven D'Aprano <steve at pearwood.info> wrote:

On Sun, Mar 05, 2017 at 07:19:43AM +0000, Elliot Gorokhovsky wrote:

> You may remember seeing some messages on here about optimizing list.sort()
> by exploiting type-homogeneity: since comparing apples and oranges is
> uncommon (though possible, i.e. float to int), it pays off to check if the
> list is type-homogeneous

I sometimes need to know if a list is homogenous, but unfortunately
checking large lists for a common type in pure Python is quote slow.

Here is a radical thought... why don't lists track their common type
themselves? There's only a few methods which can add items:

- append
- extend
- insert
- __setitem__


Suppose we gave lists a read-only attrribute, __type_hint__, which
returns None for hetrogeneous lists and the type for homogeneous lists.
Adding
an item to the list does as follows:

- if the list is empty, adding an item sets __type_hint__ to type(item);
- if the list is not empty, adding an item tests whether type(item)
  is identical to (not a subclass) of __type_hint__, and if not, sets
  __type_hint__ to None;
- removing an item doesn't change the __type_hint__ unless the list
  becomes empty, in which case it is reset to None;
- if the internal allocated space of the list shrinks, that triggers
  a recalculation of the __type_hint__ if it is currently None.

(There's no need to recalculate the hint if it is not None.)

Optional: doing a list.sort() could also recalculate the hint.


The effect will be:

- if __type_hint__ is a type object, then you can be sure that
  the list is homogeneous;

- if the __type_hint__ is None, then it might still be
  homogeneous, but it isn't safe to assume so.

Not only could sorting take advantage of the type hint without needing
to do a separate O(N) scan of the list, but so could other code. I know
I would be interested in using this. I have a fair amount of code that
has to track the type of any items seen in a list, and swap to a "type
agnostic but slow" version if the list is not homogeneous. I could
probably replace that with some variation of:

if thelist.__type_hint__ is None:
    process_slow(thelist)
else:
    process_fast(thelist)


At the very least, I'd be interested in experimenting with this.

Thoughts?



--
Steve
_______________________________________________
Python-ideas mailing list
Python-ideas at python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20170306/04856f5f/attachment-0001.html>


More information about the Python-ideas mailing list