Well, maybe something like a simple class emulating a dictionary that stores key-value pairs on disk would be more than enough. Then you can use whatever persistence layer you want (even HDF5, but not necessarily).
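For instance, here is a minimal sketch of such a dict-like class, storing each key as a compressed .npz file (this is just illustrative; the class and method names are made up, and it is not the implementation I link below):

```python
import os
import numpy as np

class NpzStore:
    """Dict-like store: each key maps to one compressed .npz file on disk.

    Illustrative sketch only; a real implementation would want key
    sanitization, error handling, deletion, etc.
    """
    def __init__(self, rootdir):
        self.rootdir = rootdir
        os.makedirs(rootdir, exist_ok=True)

    def __setitem__(self, key, arrays):
        # `arrays` is a dict of name -> NumPy array; savez_compressed
        # trades some CPU time for a smaller on-disk footprint.
        np.savez_compressed(os.path.join(self.rootdir, key + ".npz"), **arrays)

    def __getitem__(self, key):
        with np.load(os.path.join(self.rootdir, key + ".npz")) as f:
            return {name: f[name] for name in f.files}

store = NpzStore("/tmp/__npz_demo")
store["key1"] = {"x": np.arange(10), "y": np.zeros(5)}
print(store["key1"]["x"].sum())  # -> 45
```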
As a demonstration, I did a quick and dirty implementation of such a persistent key-value store (https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897). In it, the KeyStore class (less than 40 lines long) is responsible for storing a value (2 arrays) under a key (a directory). As I am quite a big fan of compression, I implemented a couple of serialization flavors: one using the .npz format (so no dependencies other than NumPy are needed) and the other using the ctable object from the bcolz package (bcolz.blosc.org). Here are some performance numbers:
$ python key-store.py -f numpy -d __test -l 0
########## Checking method: numpy (via .npz files) ############
Building database. Wait please...
Time ( creation) --> 1.906
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.191
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
75M __test
So, with the .npz format we can deal with the 75 MB dataset quite easily. But .npz can compress data as well, so let's see how that goes:
$ python key-store.py -f numpy -d __test -l 9
########## Checking method: numpy (via .npz files) ############
Building database. Wait please...
Time ( creation) --> 6.636
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.384
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
28M __test
OK, in this case we get almost a 3x compression ratio, which is not bad. However, performance has degraded quite a bit (creation is ~3.5x slower and queries are ~2x slower). Now let's try bcolz, first in non-compressed mode:
$ python key-store.py -f bcolz -d __test -l 0
########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 0.479
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.103
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
82M __test
Without compression, bcolz takes a bit more space (~10%) than .npz. However, bcolz is actually meant to be used with compression on by default:
$ python key-store.py -f bcolz -d __test -l 9
########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 0.487
Retrieving 100 keys in arbitrary order...
Time ( query) --> 0.098
Number of elements out of getitem: 10518976
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
29M __test
So, the final disk usage is quite similar to .npz, but bcolz can store and retrieve data much faster. Also, the decompression speed is on par with using no compression at all. This is because bcolz uses Blosc behind the scenes, which is much faster than zlib (used by .npz), and sometimes even faster than a memcpy(). However, even though we are doing I/O against the disk, this dataset is small enough to fit in the OS filesystem cache, so the benchmark is actually measuring I/O at memory speeds, not disk speeds.
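As a quick standalone sanity check of what zlib-style compression does on array data (zlib from the Python standard library is the same codec that .npz relies on; the data here is synthetic, so the exact ratio will differ from the benchmark above):

```python
import zlib
import numpy as np

# Moderately repetitive synthetic data: a repeating int32 ramp.
# Real datasets compress differently; this only illustrates the mechanics.
a = (np.arange(1_000_000) % 1000).astype(np.int32)
raw = a.tobytes()
packed = zlib.compress(raw, 9)
print(len(raw) / len(packed))  # compression ratio, > 1 for this data

# The round trip is lossless: we get the exact same array back.
back = np.frombuffer(zlib.decompress(packed), dtype=np.int32)
assert np.array_equal(a, back)
```

Blosc itself is a separate package (python-blosc) with a similar compress/decompress interface, but tuned for speed on binary array data.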
In order to do a more real-life comparison, let's use a dataset that is much larger than the amount of memory in my laptop (8 GB):
$ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 0
########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 133.650
Retrieving 100 keys in arbitrary order...
Time ( query) --> 2.881
Number of elements out of getitem: 91907396
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
39G /media/faltet/docker/__test
$ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 9
########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz') ############
Building database. Wait please...
Time ( creation) --> 145.633
Retrieving 100 keys in arbitrary order...
Time ( query) --> 1.339
Number of elements out of getitem: 91907396
faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
12G /media/faltet/docker/__test
So, we are still seeing the 3x compression ratio. But the interesting thing here is that the compressed version queries more than twice as fast as the uncompressed one (13 ms/query vs. 29 ms/query). In this case I was using an SSD (hence the low query times), so the advantage of compression is even more noticeable than in the in-memory case above (as expected).
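For reference, the per-query figures come straight from dividing the total query times above by the 100 retrieved keys:

```python
# Total times (in seconds) for retrieving 100 keys, from the runs above.
t_uncompressed, t_compressed = 2.881, 1.339
print(round(t_uncompressed / 100 * 1000, 1))    # ms/query at clevel=0 -> 28.8
print(round(t_compressed / 100 * 1000, 1))      # ms/query at clevel=9 -> 13.4
print(round(t_uncompressed / t_compressed, 2))  # speedup factor -> 2.15
```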
But anyway, this is just a demonstration that you don't need heavy tools to achieve what you want. And as a corollary, (fast) compressors can save you not only storage space, but processing time too.