key/value store optimized for disk storage

Steve Howell showell30 at yahoo.com
Thu May 3 23:12:02 EDT 2012


On May 3, 1:42 am, Steve Howell <showel... at yahoo.com> wrote:
> On May 2, 11:48 pm, Paul Rubin <no.em... at nospam.invalid> wrote:
>
> > Paul Rubin <no.em... at nospam.invalid> writes:
> > >looking at the spec more closely, there are 256 hash tables.. ...
>
> > You know, there is a much simpler way to do this, if you can afford to
> > use a few hundred MB of memory and you don't mind some load time when
> > the program first starts.  Just dump all the data sequentially into a
> > file.  Then scan through the file, building up a Python dictionary
> > mapping data keys to byte offsets in the file (this is a few hundred MB
> > if you have 3M keys).  Then dump the dictionary as a Python pickle and
> > read it back in when you start the program.
>
> > You may want to turn off the cyclic garbage collector when building or
> > loading the dictionary, as it can badly slow down the construction of
> > big lists and maybe dicts (I'm not sure of the latter).
>
> I'm starting to lean toward the file-offset/seek approach.  I am
> writing some benchmarks on it, comparing it to a more file-system
> based approach like I mentioned in my original post.  I'll report back
> when I get results, but it's already way past my bedtime for tonight.
>
> Thanks for all your help and suggestions.


I ended up going with the approach that Paul suggested (except I used
JSON instead of pickle for persisting the hash).  I like it for its
simplicity and ease of troubleshooting.
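
For anyone following along, the idea can be sketched roughly as
follows. This is a minimal illustration and not the test code linked
in this message: the function names (write_store, load_index,
read_value) and the choice to store an [offset, length] pair per key
are assumptions made for the example. Note that JSON object keys are
always strings, so the keys here are strings too.

```python
import gc
import json


def write_store(records, values_path, index_path):
    """Write all values into one big file and save a JSON index.

    records is an iterable of (key, value) pairs, where key is a str
    and value is a bytes object.  The index maps each key to the
    [offset, length] of its value inside the values file.
    """
    index = {}
    with open(values_path, "wb") as values_file:
        for key, value in records:
            # tell() is the byte offset where this value will start.
            index[key] = [values_file.tell(), len(value)]
            values_file.write(value)
    with open(index_path, "w") as index_file:
        json.dump(index, index_file)


def load_index(index_path):
    """Load the JSON index back into a dict.

    Optionally pauses the cyclic garbage collector while the big dict
    is built, per the suggestion quoted above; whether this helps for
    dicts is not established here.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        with open(index_path) as index_file:
            return json.load(index_file)
    finally:
        if was_enabled:
            gc.enable()


def read_value(values_file, index, key):
    """Seek to the stored offset and read exactly one value."""
    offset, length = index[key]
    values_file.seek(offset)
    return values_file.read(length)
```

A lookup is then one dict access, one seek, and one read against a
values file opened once in "rb" mode.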

My test was to write roughly 4GB of data: 2 million keys, each with a
value of about 2k bytes.

The nicest thing was how quickly I was able to write the file.
Writing tons of small files bogs down the file system, whereas the
one-big-file approach finishes in under three minutes.

Here's the code I used for testing:

https://github.com/showell/KeyValue/blob/master/test_key_value.py

Here are the results:

~/WORKSPACE/KeyValue > ls -l values.txt hash.txt
-rw-r--r--  1 steve  staff    44334161 May  3 18:53 hash.txt
-rw-r--r--  1 steve  staff  4006000000 May  3 18:53 values.txt

2000000 out of 2000000 records yielded (2k each)
Begin READING test
num trials 100000
time spent 39.8048191071
avg delay 0.000398048191071

real	2m46.887s
user	1m35.232s
sys	0m19.723s
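
The "avg delay" figure above is just the total elapsed time divided by
the number of trials (39.8 seconds / 100000 reads, or about 0.4 ms per
read).  A sketch of how such a measurement might be taken is below.
It assumes an index that maps each key to an [offset, length] pair;
the function name time_random_reads and its parameters are
hypothetical, and the linked test script may well do this differently.

```python
import random
import time


def time_random_reads(values_path, index, num_trials, seed=None):
    """Time num_trials random lookups against the values file.

    index maps key -> [offset, length].  Returns a tuple of
    (total_seconds, average_seconds_per_read).
    """
    rng = random.Random(seed)
    keys = list(index)
    with open(values_path, "rb") as values_file:
        start = time.perf_counter()
        for _ in range(num_trials):
            offset, length = index[rng.choice(keys)]
            values_file.seek(offset)
            data = values_file.read(length)
            # Sanity check: a short read would mean a bad offset.
            if len(data) != length:
                raise IOError("short read at offset %d" % offset)
        elapsed = time.perf_counter() - start
    return elapsed, elapsed / num_trials
```

Timings from a run like this depend heavily on how much of the values
file is already in the operating system's cache.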


