Optimizing size of very large dictionaries
Terry Reedy
tjreedy at udel.edu
Fri Aug 1 00:57:33 EDT 2008
Raymond Hettinger wrote:
>>> Background: I'm trying to identify duplicate records in very
>>> large text based transaction logs. I'm detecting duplicate
>>> records by creating a SHA1 checksum of each record and using this
>>> checksum as a dictionary key. This works great except for several
>>> files whose size is such that their associated checksum
>>> dictionaries are too big for my workstation's 2G of RAM.
>
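For context, a minimal sketch of the approach described above, assuming one
record per line and an illustrative filename (the names below are not the
original poster's code):

from hashlib import sha1

# One digest-keyed dict entry per distinct record, so memory grows with
# the number of unique records; this is what outgrows 2G of RAM.
logs = open('transactions.log')   # hypothetical log file, one record per line
seen = {}
for record in logs:
    digest = sha1(record).digest()
    if digest in seen:
        print 'Duplicate:', record
    else:
        seen[digest] = True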
> Tons of memory can be saved by not storing the contents of the
> record. Just make an initial pass to identify the digest values of
> possible duplicates. Then use a set to identify actual dups but only
> store the records for those whose digest is a possible duplicate:
>
> import collections
> from hashlib import sha1
>
> bag = collections.defaultdict(int)
> for record in logs:
>     bag[sha1(record).digest()] += 1
> possible_dups = set()
> while bag:
>     hashval, cnt = bag.popitem()
>     if cnt > 1:
>         possible_dups.add(hashval)
Since actual counts above 1 are not needed, I believe a bit more memory
could be saved by computing possible_dups incrementally. The logic is a
bit trickier, but having seen the version above first makes it easier to follow.
once, dups = set(), set()   # digests seen exactly once / more than once
for record in logs:
    d = sha1(record).digest()
    if d in once:
        # second sighting: promote the digest from 'once' to 'dups'
        once.remove(d)
        dups.add(d)
    elif d not in dups:
        # first sighting
        once.add(d)
> seen = set()
> for record in logs:
>     if record in seen:
>         print 'Duplicate:', record
>     elif sha1(record).digest() in possible_dups:
>         seen.add(record)
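With the incremental version above, this second pass would test membership
in dups instead of possible_dups; a sketch, again assuming logs can be
iterated a second time:

seen = set()
for record in logs:
    if record in seen:
        print 'Duplicate:', record
    elif sha1(record).digest() in dups:
        # only records whose digest occurred more than once get stored
        seen.add(record)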
>
> Raymond
>
> P.S. If the log entries are one liners, maybe it would be better to
> use the operating system's sort/uniq filters.
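For the one-record-per-line case, a sketch of driving those filters from
Python (the filename and pipeline are illustrative; 'uniq -d' prints one
copy of each line that occurs more than once in sorted input, and sort can
spill to temporary files instead of holding everything in RAM):

import subprocess

# Let the OS do the work: sort groups identical lines together and
# 'uniq -d' keeps only the lines that appear more than once.
pipeline = "sort transactions.log | uniq -d"
dup_lines = subprocess.Popen(pipeline, shell=True,
                             stdout=subprocess.PIPE).communicate()[0]
print dup_lines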