Advice on optimum data structure for a billion-entry list?
Mark blobby Robinson
m.1.robinson at herts.ac.uk
Sat May 12 12:12:22 EDT 2001
I'd just like to pick the world's best Python brains, if that's ok.
I am building a program that pattern-searches DNA sequences and generates a list of combinations of 3 patterns that meet certain criteria. My problem is that this list could potentially grow as large as ~1.4 billion entries.

Originally I used a dictionary whose key was a 15-character string (the three patterns concatenated) and whose value was simply a count of the number of hits for that pattern combination. That obviously ran into memory problems pretty soon, so I started storing the information in shelves, which worked fine except that as the shelf file grew it started taking forever to update (each update involved a check for the key and then either incrementing the count or creating a new entry). This slowed things down to the extent that a run would have taken literally weeks.

So next I wrote an extension in C which basically creates a huge C array on disk, with the offset position for each pattern combination given by a hash of its unique string. This is still way too slow, and I am wondering whether I am ignorant of some far easier or more effective way of tackling this problem. Maybe I should be using some kind of database, but I thought that was basically what the shelve operation did ... I was using it with gdbm.
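For reference, the original in-memory approach described above can be sketched roughly as follows (a minimal sketch; the example pattern strings are made up, and with ~1.4 billion distinct keys this is exactly where the memory problem comes from):

```python
# Sketch of the dict-based counting approach:
# key = the three patterns concatenated (15 chars), value = hit count.
counts = {}

def record_hit(p1, p2, p3):
    """Increment the hit count for one 3-pattern combination."""
    key = p1 + p2 + p3            # 15-character combined key
    counts[key] = counts.get(key, 0) + 1

record_hit("ATGCA", "GGTAC", "TTACG")
record_hit("ATGCA", "GGTAC", "TTACG")
# counts["ATGCAGGTACTTACG"] is now 2
```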
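One possible alternative worth noting (an assumption on my part, not what the C extension above does): since the patterns are DNA over the 4-letter alphabet {A,C,G,T}, a 15-character combination can be encoded as a unique base-4 integer in [0, 4**15), i.e. about 1.07 billion exact array slots, so the on-disk offset needs no hashing and no collision handling:

```python
from array import array

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pattern_index(patt):
    """Treat a DNA string as a base-4 number: a unique index in [0, 4**len(patt))."""
    idx = 0
    for ch in patt:
        idx = idx * 4 + BASE[ch]
    return idx

# A full length-15 table would need 4**15 (~1.07e9) counters; a length-3
# demo (64 slots) keeps this sketch small.
counts = array("I", [0] * (4 ** 3))
for combo in ("ACG", "ACG", "TTT"):
    counts[pattern_index(combo)] += 1
# counts[pattern_index("ACG")] is now 2
```

The same index could serve as the fixed-size record offset in the on-disk C array, replacing the string hash.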