Advice on optimum data structure for billion long list?

Andrew Dalke dalke at acm.org
Sun May 13 14:17:50 EDT 2001


[CC'ed response to Blobby as well]

Mark blobby Robinson:
> [asked about storing a large number of 15-mer substrings for
> some pattern match work he's doing against DNA data]

Here are a few ideas.

Don't store your keys as full strings.  Encode them.
You're already assuming ATCG are enough, so you can use
2 bits per base instead of 8.  This can be done with a C
extension.  If all strings are 15 bases then you only need
30 bits.
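
For instance, here's a rough pure-Python sketch of the
2-bits-per-base packing (the function names are just
illustrative; in practice you'd do this in the C extension
for speed):

    _CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def encode_bases(s):
        # Pack a string of A/C/G/T into an integer, 2 bits per base.
        value = 0
        for base in s:
            value = (value << 2) | _CODE[base]
        return value

    def decode_bases(value, n):
        # Unpack n bases from an integer packed by encode_bases.
        return ''.join('ACGT'[(value >> (2 * (n - 1 - i))) & 3]
                       for i in range(n))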

You say there are ~1.4 billion possible entries, which is
4**15 + 4**14 + 4**13 + 4**12 + ... + 4**1 + 4**0
and implies you also have strings shorter than 15 bases.  This
means you also need to encode the length, since the 2 bit base
code covers only the four bases and leaves no room for a
termination character.  A simple length encoding needs at least
4 bits (2**4 = 16), which puts each key at 34 bits.

HOWEVER, suppose you encode the first few bits to mean
  00 - 15 bases encoded (plus 30 for the bases = 32 bits total)
  01 - 14 bases encoded (plus 28 for the bases = 30 bits total)
  11101 - 13 bases encoded (plus 26 for the bases = 31 bits total)
  11100 - 12 bases encoded (plus 24 for the bases = 29 bits total)
  11011 - 11 bases encoded (plus 22 for the bases = 27 bits total)
  11010 -  ...

so this is one way (among many) to fit all of your fragments
into a 32 bit word.  This encoding can be done with a simple C
extension.  The 32 bit value works nicely as the hash for the entry.
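
Continuing the sketch above, packing a whole fragment might look
something like this (only the prefix rows spelled out above are
filled in; the shorter lengths would follow the same pattern):

    _PREFIX = {          # length -> (prefix value, prefix width in bits)
        15: (0b00, 2),
        14: (0b01, 2),
        13: (0b11101, 5),
        12: (0b11100, 5),
        11: (0b11011, 5),
        # ... remaining lengths get the other 5 bit prefixes
    }

    def pack_key(s):
        # Length prefix in the high bits, then 2 bits per base.
        prefix, width = _PREFIX[len(s)]
        assert width + 2 * len(s) <= 32   # everything fits in one word
        return (prefix << (2 * len(s))) | encode_bases(s)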

You haven't said how you store the values for each entry.  Most
likely it will be a list of record identifiers.  I know there
are fewer than 2**32 sequence records - GenBank has O(11M)
records - so each identifier fits in 24 bits.  You can likely do
even better, since there are good ways to compress sorted lists
of integers like these.  Get a copy of "Managing Gigabytes",
which describes several such encodings.
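
As a taste of what such an encoding can look like, here is a
simple gap-plus-variable-byte scheme in that general spirit (a
sketch, not the book's exact codes): store the sorted record ids
as differences, each difference in as few 7 bit chunks as needed.

    def encode_id_list(ids):
        # ids must be sorted; store each gap 7 bits at a time,
        # low chunks first, high bit set on the final chunk.
        out = bytearray()
        prev = 0
        for i in ids:
            gap = i - prev
            prev = i
            while gap >= 128:
                out.append(gap & 0x7f)
                gap >>= 7
            out.append(gap | 0x80)
        return out

    def decode_id_list(data):
        # Inverse of encode_id_list.
        ids = []
        prev = gap = shift = 0
        for b in data:
            if b & 0x80:
                prev += gap | ((b & 0x7f) << shift)
                ids.append(prev)
                gap = shift = 0
            else:
                gap |= b << shift
                shift += 7
        return ids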

There are 12G bases in GenBank.  If you index every substring of
length 1 through 15 at every position, that's roughly 180G
substrings.  Only 1.4G keys are possible, so in the worst case
there are about 130 entries per key on average, or roughly
0.5KBytes of 32 bit record identifiers per value.
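
Spelling out that arithmetic:

    positions = 12e9                     # ~12G bases, one start per base
    substrings = positions * 15          # all lengths 1..15: ~180G substrings
    possible_keys = (4**16 - 1) // 3     # 4**0 + ... + 4**15, ~1.43G keys
    entries_per_key = substrings / possible_keys   # ~126, call it 130
    bytes_per_value = entries_per_key * 4          # ~0.5 KBytes of 32 bit ids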

You haven't mentioned the memory constraints on your machine.
You really, really want everything to be in memory.  One way
to get there is to read as much of the input into memory as
possible.  When memory fills, write the intermediate data
structure to disk, reset the data structure and process the
next input block.  Repeat until done.  After all the data has
been read, sort and merge the written blocks to produce an
on-disk data structure for the whole input.  Again, Managing
Gigabytes describes some ways to do this.  You'll likely want
all of this as a C extension in order to manipulate the bits
directly.
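
A rough sketch of that spill-and-merge loop (plain Python with
text-format runs just to keep it readable; the real thing would
use the packed keys and value encodings above):

    import heapq, itertools, tempfile

    def spill(table, run_files):
        # Dump one in-memory block as a sorted run of "key record_id"
        # lines, remember the file name, and reset the table.
        f = tempfile.NamedTemporaryFile('w', suffix='.run', delete=False)
        for key in sorted(table):
            for rec in sorted(table[key]):
                f.write('%d %d\n' % (key, rec))
        f.close()
        run_files.append(f.name)
        table.clear()

    def merge_runs(run_files, out_path):
        # Merge the sorted runs into one on-disk file: one line per
        # key followed by all of its record ids.
        def parse(f):
            for line in f:
                key, rec = line.split()
                yield int(key), int(rec)
        files = [open(name) for name in run_files]
        with open(out_path, 'w') as out:
            merged = heapq.merge(*[parse(f) for f in files])
            for key, group in itertools.groupby(merged, lambda kr: kr[0]):
                out.write('%d %s\n' %
                          (key, ' '.join(str(r) for _, r in group)))
        for f in files:
            f.close()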

This option (multiple writes followed by a merge) is also easy
to parallelize, as Alexandre Fayolle and others suggested.

Most likely you will not have all 1.4G possible combinations.
There are lots of tradeoffs you can make depending on how
much memory you have, the amount of data and what you want
to do with the result.  Still, this should be enough to point
out a few possibilities, or at least help you evaluate other
approaches.

Eg, consider Dave LeBlanc's suggestion to use bsddb (I assume
bsddb3).  It implements a key-value data structure which is
persistent on disk and, as I recall, uses an in-memory cache.
If the distribution of 15mers is skewed or your data size is
small then you should get good cache behaviour, and you can
use it to store the 32 bit encoded keys and the <~0.5KByte
encoded values.  It's only with really large data sets that
this might begin to cause problems.
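
If you go that route, usage might look roughly like this (a
sketch assuming the bsddb3 package and its dict-like btopen
interface, with keys and values stored as packed byte strings):

    import struct
    import bsddb3

    db = bsddb3.btopen('fragments.db', 'c')   # B-tree file, created if missing

    def add_entry(packed_key, record_id):
        # Append a 32 bit record id to the value stored under the
        # 32 bit packed fragment key.
        key = struct.pack('>I', packed_key)
        rec = struct.pack('>I', record_id)
        db[key] = db[key] + rec if key in db else rec

    def lookup(packed_key):
        # Return the list of record ids stored under a packed key.
        key = struct.pack('>I', packed_key)
        data = db[key] if key in db else b''
        return [struct.unpack('>I', data[i:i+4])[0]
                for i in range(0, len(data), 4)]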

It all depends on what resources and limitations you have.

                    Andrew
                    dalke at acm.org
P.S.
  You could also try this question on one of the biopython.org
mailing lists.
