Advice on optimum data structure for a billion-entry list?

Mark blobby Robinson m.1.robinson at herts.ac.uk
Sat May 12 12:12:22 EDT 2001


Hey guys,

I'd just like to pick the best of the world's Python brains, if that's OK. 
I am building a program that searches for patterns in DNA sequences and 
generates a list of combinations of 3 patterns that meet certain 
criteria. My problem is that this list could potentially grow to 
~1.4 billion entries. Originally I used a dictionary with a 15-character 
string as the key (the three patterns concatenated) and the value simply a 
count of the number of hits for that pattern combination.
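In rough outline, the in-memory version looked like this (a simplified 
sketch; generate_triples and sequence here are just stand-ins for my 
actual pattern search):

    # simplified sketch of the in-memory approach; generate_triples is a
    # stand-in for the real pattern search, yielding triples of patterns
    counts = {}
    for p1, p2, p3 in generate_triples(sequence):
        key = p1 + p2 + p3                    # 15-character key
        counts[key] = counts.get(key, 0) + 1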
Obviously that hit memory problems pretty soon, so I started storing the 
information in shelves, which worked fine except that as the size of the 
shelf increased it started taking forever to update the shelf file (each 
update involved a check for the key and then either incrementing the 
count or creating a new entry). This was slowing things down to the 
extent that it would have taken literally weeks to run.
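The shelve version was essentially the following (same stand-in 
generate_triples); the per-key check-then-write is what seems to crawl 
as the file grows:

    import shelve

    # shelve version (gdbm-backed); every key costs a lookup plus a
    # write, which got slower and slower as the shelf file grew
    db = shelve.open('pattern_counts')
    for p1, p2, p3 in generate_triples(sequence):
        key = p1 + p2 + p3
        if db.has_key(key):
            db[key] = db[key] + 1
        else:
            db[key] = 1
    db.close()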
So next I wrote an extension in C which basically creates a huge C array 
on disk, where the offset position for each pattern combination is a 
hash generated from the unique string. This is still way too slow, and I 
am just wondering whether I am ignorant of some far easier or more 
effective way of tackling this problem. Maybe I should be using some 
kind of database, but I thought that was basically what shelve was doing 
for me... I was using it with gdbm.
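In Python terms, the idea behind the C extension is roughly this (the 
table size and hash below are placeholders; the real hashing and 
collision handling live in the C code):

    import mmap, struct

    # rough picture of the C approach: a flat on-disk table of 4-byte
    # counters, indexed by hashing the 15-character key; TABLE_SIZE is
    # tiny here for illustration, the real table needs far more slots
    TABLE_SIZE = 1000000
    SLOT = 4

    f = open('counts.bin', 'w+b')
    f.truncate(TABLE_SIZE * SLOT)             # zero-filled table on disk
    table = mmap.mmap(f.fileno(), TABLE_SIZE * SLOT)

    def bump(key):
        offset = (hash(key) % TABLE_SIZE) * SLOT
        count, = struct.unpack('I', table[offset:offset + SLOT])
        table[offset:offset + SLOT] = struct.pack('I', count + 1)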

any ideas?

Blobby



