[Python-ideas] Uniquify attribute for lists

Joshua Landau joshua.landau.ws at gmail.com
Sat Nov 17 20:37:38 CET 2012


Surely the best choice is to have *two* caches: one for hashables and
another for the rest.

This might be improvable with a *third* cache for non-hashables that have a
total ordering, but that could also introduce bugs, I think. It would also be
a lot harder to write and would likely be slower anyway.
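The two-cache idea can be sketched roughly like this (a hypothetical sketch, not the attached iteruniq.py — hashable items go through a set for O(1) membership tests, and anything unhashable falls back to a linear scan of a list):

```python
def iteruniq(iterable, key=None):
    """Yield items from iterable, skipping duplicates.

    Hashable keys are tracked in a set (fast path); keys that raise
    TypeError on hashing are tracked in a list (slow fallback).
    """
    seen_hashable = set()
    seen_unhashable = []
    for item in iterable:
        k = item if key is None else key(item)
        try:
            if k in seen_hashable:      # hashes k; raises if unhashable
                continue
            seen_hashable.add(k)
        except TypeError:               # unhashable: linear scan fallback
            if k in seen_unhashable:
                continue
            seen_unhashable.append(k)
        yield item
```

A key that maps unhashable items to hashable ones (e.g. `key=frozenset` for a list of sets) keeps everything on the fast path.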

Timings (approximate and rough; relative, lower is better):
"""
Using 16 elements

% Hashable:    100     90     66     33      0
Original:     1.47   1.00   1.06   1.09   1.01
w/ key=str:   3.73   1.91   2.12   3.15   3.00
New:          1.20   1.46   1.81   2.13   3.38
w/ key=str:   1.72   2.00   2.48   2.76   3.01

Using 64 elements

% Hashable:    100     90     66     33      0
Original:     1.15   1.29   1.61   1.64   1.43
w/ key=str:   1.98   2.50   3.09   3.55   3.99
New:          1.00   1.47   2.18   3.01   3.60
w/ key=str:   1.87   2.30   2.79   3.41   3.84

Using 256 elements

% Hashable:    100     90     66     33      0
Original:     2.70   3.66   5.19   5.34   4.41
w/ key=str:   4.06   5.07   6.26   6.93   6.98
New:          1.00   1.65   2.92   5.28   7.62
w/ key=str:   2.28   2.71   3.76   4.36   4.93

Using 1024 elements

% Hashable:    100     90     66     33      0
Original:     9.30   12.4   18.8   21.4   16.9
w/ key=str:   11.1   13.1   16.3   17.5   13.9
New:          1.00   1.84   6.20   13.1   19.8
w/ key=str:   2.31   2.79   3.59   4.50   5.16

Using 4096 elements

% Hashable:    100     90     66     33      0
Original:     33.7   44.3   69.1   79.4   60.5
w/ key=str:   36.7   44.2   59.3   60.1   40.4
New:          1.00   3.73   18.1   42.2   63.7
w/ key=str:   2.23   2.56   3.33   4.19   4.93

Using 16384 elements

% Hashable:    100     90     66     33      0
Original:     126.   173.   265.   313.   243.
w/ key=str:   136.   164.   215.   213.   147.
New:          1.00   12.5   68.6   173.   263.
w/ key=str:   2.24   2.60   3.28   4.14   4.80
"""

--------------

Code attached, unless I forget ;).
No guarantees that it still works the same way, or works at all, or is the
right file.

Every item is repeated 5 times on average for any length of the list being
unique-ified. I'll try it with this changed later.

Basically, the new version is faster on all but ~100%-non-hashable lists
once there are more than ~1000 elements, and on mostly-hashable lists it's
quadratically faster. When slower, it's only by about 10% to 20%. Of
course, specifying up front whether your list is fully hashable would be
more efficient (probably ~10%), but that's not as nice to use.

Additionally, if you use a key that maps items to hashable values, you can
get a good speed-up with the new version.

Example of use:
iteruniq(<list of set>, key=frozenset)

With really small non-hashable lists, the original is significantly (about
3x) better.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: iteruniq.py
Type: application/octet-stream
Size: 6268 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-ideas/attachments/20121117/569601d9/attachment.obj>

