Mapping with continguous ranges of keys

Steve D'Aprano steve+python at pearwood.info
Sat Dec 17 06:24:25 EST 2016


On Sat, 17 Dec 2016 08:31 pm, Peter Otten wrote:

> Steve D'Aprano wrote:
> 
>> I have experimented with two solutions, and hope somebody might be able
>> to suggest some improvements. Attached is the test code I ran,
>> suggestions for improving the performance will be appreciated.
> 
> If there was an attachment to this text -- that didn't make it to the
> mailing list or the news group.

I see it on comp.lang.python but not the mailing list or gmane. That's
annoying because its a *text* attachment and should be allowed. I'll attach
it again, inline this time, at the end of this post.


[...]
>> I tested the memory consumption and speed of these two solutions with
>> (approximately) one million keys. I found:
>> 
>> - the dict solution uses a lot more memory, about 24 bytes per key,
>> compared to the pair of lists solution, which is about 0.3 bytes per key;
>> 
>> - but on my computer, dict lookups are about 3-4 times faster.
> 
> Only three to four times? You basically get that from a no-op function
> call:
[...]
> Even adding a dummy value to V to go from
> 
>>     V[bisect.bisect_right(L, i) - 1]
> 
> to
> 
> V[bisect.bisect_right(L, i)]
> 
> might be visible in your benchmark.

Good thinking! I'll try that.

And... it doesn't appear to make any real difference.


>> Any suggestions for improving the speed of the binary search version, or
>> the memory consumption of the dict?
> 
> Depending on the usage pattern you might try bisect combined with a LRU
> cache.

Adding a cache will probably eat up the memory savings, but I may play
around with that.

And here's the code, this time inline. I've added the dummy value to the
values list V as suggested.

# --- cut ---

import bisect
import random
import sys

from timeit import default_timer as timer

def make_keys():
    """Yield (start, end) values suitable for passing to range().

    Distance between start and end increase from 1 to 51, then
    repeat, for a total of 800*25*51 = 1020000 values all together.
    """
    n = 1
    for j in range(800):
        for i in range(50):
            yield (n, n+i+1)
            n += i+1

def populate():
    D = {}
    L = []
    V = [None]
    value = 1
    last_b = 1
    for a, b in make_keys():
        assert a == last_b
        for i in range(a, b):
            D[i] = value
        L.append(a)
        V.append(value)
        last_b = b
    L.append(last_b)
    return (D, L, V)


class Stopwatch:
    """Context manager for timing long running chunks of code."""

    def __enter__(self):
        self._start = timer()
        return self

    def __exit__(self, *args):
        elapsed = timer() - self._start
        del self._start
        if elapsed < 0.01:
            print("Elapsed time is very small, consider using timeit
instead.")
        print('time taken: %f seconds' % elapsed)



D, L, V = populate()
assert len(D) == 800*25*51
assert len(L) == len(V)
print("size of dict", sys.getsizeof(D))
print("size of two lists", sys.getsizeof(L) + sys.getsizeof(V))


# Confirm that values are the same whether using dict lookup
# or binary search.
for i in range(1, 800*25*51 + 1):
    index = bisect.bisect_right(L, i)
    assert D[i] == V[index]


# Simulate a large number of lookups in random order.
nums = list(range(1, 800*25*51 + 1))
random.shuffle(nums)

with Stopwatch():
    for i in nums:
        x = D[i]

with Stopwatch():
    for i in nums:
        x = V[bisect.bisect_right(L, i)]

# --- cut ---



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



More information about the Python-list mailing list