[Tutor] memoize, lookup, or KIS?

Steven D'Aprano steve at pearwood.info
Mon Nov 19 13:05:27 CET 2012


On 19/11/12 22:02, Albert-Jan Roskam wrote:
> Hi,
>
> I have a function that converts a date value, expressed as the number of
>seconds since the start of the Gregorian calendar, into a human-readable
>format (typically an ISO date). So if a record contains x date values, and
>a data set contains y records, the number of function calls is x * y.
>Imagine a data set with 1M records with dob and enrollment_date in it: the
>number of function calls is huge (well, 2M).
>
>
> I was reading about memoize decorators the other day and I realized that
>this function might benefit from memoizing, or a lookup table.


Emphasis on "might". Unless you have timed the code with and without a lookup
table, you're just guessing whether it is an optimization or a pessimization.

On the other hand, my intuition is that it *will* benefit from memoization,
so my guess is the same as your guess :)
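
To turn that guess into a measurement, a timeit sketch along these lines can
compare the two. The conversion function here is just a stand-in for the real
one, and the 1582-10-14 epoch is an assumption (the SPSS convention); adjust
both to match your actual data:

```python
import datetime
import functools
import timeit

def convert(seconds):
    # Uncached: seconds since the Gregorian epoch -> ISO date string.
    # The 1582-10-14 epoch is an assumption (the SPSS convention);
    # substitute whatever epoch your data actually uses.
    epoch = datetime.datetime(1582, 10, 14)
    return (epoch + datetime.timedelta(seconds=seconds)).date().isoformat()

cached_convert = functools.lru_cache(maxsize=None)(convert)

# 100,000 calls drawn from only 1,000 distinct values, mimicking a data
# set in which the same dates recur across many records
values = [86400 * (i % 1000) for i in range(100000)]

t_plain = timeit.timeit(lambda: [convert(v) for v in values], number=1)
t_cached = timeit.timeit(lambda: [cached_convert(v) for v in values], number=1)
print("uncached: %.3fs  cached: %.3fs" % (t_plain, t_cached))
```

If the cached run isn't clearly faster on your data, the cache isn't worth
the extra complexity.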


> On the other hand, it might complicate the code too much, so it might
>be better to Keep It Simple (KIS). Is the code below a sound approach?


No. You should separate the cache code from the non-cache code. Decorators
are ideal for that, but even without decorators you should split the code.
See below.

Also, you should have some way to stop the lookup table from growing forever.
If you are running Python 3.2 or better, you can use functools.lru_cache,
which implements a Least Recently Used cache: once the cache reaches a certain
size, the least recently used entry is flushed.

Usage is trivially simple, at least for functions; I've never used it with
methods, and there may be some complications. But the usage is:

import functools

@functools.lru_cache(maxsize=1000)
def myfunction(a, b, c):
    result = some_calculation(a, b) + c  # whatever
    return result

and the decorator will automatically look after storing results in the lookup
table, retrieving them afterwards, and ensuring it never gets too big.
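
As a concrete sketch (with a trivial stand-in for the real conversion), the
decorated function also exposes cache_info(), so you can see the hits and
misses for yourself:

```python
import functools

@functools.lru_cache(maxsize=1000)
def double(n):
    # stands in for the real (expensive) date conversion
    return 2 * n

for n in [1, 2, 1, 1, 2, 3]:
    double(n)

# Three distinct arguments were seen (misses), three calls repeated an
# earlier argument (hits).
print(double.cache_info())
# CacheInfo(hits=3, misses=3, maxsize=1000, currsize=3)
```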

Writing completely general purpose memoize decorators can be a bit tricky, but
you can find a whole lot of recipes here:

http://code.activestate.com/search/recipes/#q=memoize

Here's a trivial approach which doesn't use a decorator at all. The important
feature is that the caching logic isn't mixed up with the uncached logic.


class Test(object):

    def __init__(self):
        self._cache = {}

    def clear_cache(self):
        self._cache.clear()

    def spss2strDate(self, gregorianDate, fmt="%Y-%m-%d", recodeSysmisTo=""):
        # cached wrapper
        t = (gregorianDate, fmt, recodeSysmisTo)
        if t in self._cache:
            # cache hit
            return self._cache[t]
        # otherwise cache miss: calculate, evict if full, then store
        result = self._calculate_spss2strDate(*t)  # this does the real work
        if len(self._cache) >= 1000:
            self._cache.popitem()  # discard an arbitrary (key, value) pair
        self._cache[t] = result
        return result

    def _calculate_spss2strDate(self, gregorianDate, fmt, recodeSysmisTo):
        # uncached version
        return some_calculation(...)


You can fill in the details for the uncached version :)

One good improvement would be to add instrumentation[1] to the cache so you
can tell whether or not the cache is worthwhile. E.g. record how many times
each set of arguments is used. If you find that hardly any triple of arguments
is used more than once (i.e. nearly every call is unique), then the cache is
a waste of time.
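
A minimal sketch of such instrumentation, using a hypothetical
InstrumentedCache helper (not the Test class above) so the counters live in
one place:

```python
class InstrumentedCache(object):
    """Dict-based cache with a size limit that counts hits and misses."""

    def __init__(self, maxsize=1000):
        self._cache = {}
        self.maxsize = maxsize
        self.hits = 0
        self.misses = 0

    def get(self, key, calculate):
        # `calculate` is a zero-argument callable doing the real work
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = calculate()
        if len(self._cache) >= self.maxsize:
            self._cache.popitem()  # discard an arbitrary entry
        self._cache[key] = result
        return result


cache = InstrumentedCache()
for n in [1, 2, 1, 1]:
    cache.get(n, lambda: n * 2)
# 1 and 2 were calculated once each (misses); 1 was then reused twice (hits)
print(cache.hits, cache.misses)
```

If misses dominate hits after a realistic run, drop the cache and keep the
code simple.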



[1] A fancy term for any sort of extra data associated with the cache, such as
a count of how many cache hits and misses there are.


-- 
Steven

