Cache memory and its effect on list searching
sohcahtoa82 at gmail.com
sohcahtoa82 at gmail.com
Fri Dec 16 23:44:12 EST 2016
On Friday, December 16, 2016 at 6:27:24 PM UTC-8, Chris Angelico wrote:
> On Sat, Dec 17, 2016 at 1:20 PM, <sohcahtoa82 at gmail.com> wrote:
> > I thought this was curious behavior. I created a list of random-looking strings, then made a sorted copy. I then found that using "in" to see if a string exists in the sorted list took over 9 times as long!
> >
>
> My numbers replicate yours (though my computer's faster). But my
> conclusion is different:
>
> Python 3.7.0a0 (default:ecd218c41cd4, Dec 16 2016, 03:08:47)
> [GCC 6.2.0 20161027] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import hashlib
> >>> import timeit
> >>> hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
> >>> sortedhashes = sorted(hashes)
> >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
> 0.8167107938788831
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
> 5.029693723190576
> >>> timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10)
> 0.855183657258749
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10)
> 5.0585526106879115
> >>> sethashes = set(hashes)
> >>> timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10)
> 6.13601878285408e-06
>
> You want containment checks? Use a set :)
>
> ChrisA
Ah, I forgot about the optimizations of a set. The only time I ever find myself using set is when I want to remove all the duplicates in a list. I convert it to a set and then back.
For fun, I ran an experiment:
### BEGIN listsearch.py
import hashlib
import timeit
import sys
from bisect import bisect_left
def bisect_search(a, x, lo=0, hi=None): # can't use a to specify default for hi
# From http://stackoverflow.com/questions/212358/binary-search-bisection-in-python
hi = hi if hi is not None else len(a) # hi defaults to len(a)
pos = bisect_left(a,x,lo,hi) # find insertion position
return (pos if pos != hi and a[pos] == x else -1) # don't walk off the end
def bin_search(haystack, needle):
start = 0
end = len(haystack) - 1
while True:
if start > end:
return False
middle = (start + end) // 2
if needle < haystack[middle]:
end = middle - 1
continue
elif needle > haystack[middle]:
start = middle + 1
continue
elif needle == haystack[middle]:
return True
print('Python version: ', sys.version_info)
print('building hashes...')
hashes = [hashlib.md5(bytes(str(i), "utf-8")).hexdigest() for i in range(10000000)]
print('sorting...')
sortedhashes = sorted(hashes)
print('creating set...')
sethashes = set(hashes)
print('Unsorted list:', timeit.timeit('"x" in hashes', globals={'hashes': hashes}, number=10))
print('Sorted:', timeit.timeit('"x" in hashes', globals={'hashes': sortedhashes}, number=10))
print('set:', timeit.timeit('"x" in hashes', globals={'hashes': sethashes}, number=10))
print('binary search:', timeit.timeit('binsearch(hashes, "x")', globals={'hashes': sortedhashes, 'binsearch': bin_search}, number=10))
print('binary search with bisect:', timeit.timeit('binsearch(hashes, "x")', globals={'hashes': sortedhashes, 'binsearch': bisect_search}, number=10))
### END listsearch.py
> python3 listsearch.py
Python version: sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
building hashes...
sorting...
creating set...
Unsorted list: 1.7384763684627569
Sorted: 9.248799958145042
set: 1.4614161294446149e-06
binary search: 0.00010902164328108199
binary search with bisect: 1.7829276782066472e-05
Yup. set is definitely the way to go!
More information about the Python-list
mailing list