Fast Access to Container of Numpy Arrays on Disk?

Hi,

I have a very large dictionary that must be shared across processes and does not fit in RAM. I need access to this object to be fast. The key is an integer ID and the value is a list containing two elements, both of them numpy arrays (one holds ints, the other floats). The keys are sequential, start at 0, and have no gaps, so the “outer” layer of this data structure could really just be a list, with the key serving as the index. The lengths of each pair of arrays may differ across keys.

For a visual:

{ key=0: [ numpy.array([1,8,15,…, 16000]), numpy.array([0.1,0.1,0.1,…,0.1]) ],
  key=1: [ numpy.array([5,6]), numpy.array([0.5,0.5]) ],
  … }

I’ve tried:
- manager proxy objects, but the object was so big that low-level code threw an exception due to its format, and monkey-patching wasn’t successful.
- Redis, which was far too slow due to setting up connections, data conversion, etc.
- Numpy rec arrays + memory mapping, but there is a restriction that the numpy arrays in each “column” must be of fixed and identical size.
- PyTables, which may be a solution, but seems to have a very steep learning curve.
- I haven’t tried SQLite3, but I am worried about the time it takes to query the DB for a sequential ID and then translate the byte arrays back.

Any ideas? I greatly appreciate any guidance you can provide.

Thanks,
Ryan

I'd try storing the data in hdf5 (probably via h5py, which is a more basic interface without all the bells-and-whistles that pytables adds), though any method you use is going to be limited by the need to do a seek before each read. Storing the data on SSD will probably help a lot if you can afford it for your data size.
-- Nathaniel J. Smith -- http://vorpus.org
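For concreteness, a minimal sketch of the layout Nathaniel suggests (not taken from any existing code) could use one HDF5 group per key with two datasets inside it. The file name, dataset names, and the values sequence below are placeholders:

import numpy as np
import h5py

values = [  # placeholder for the real data: one [ints, floats] pair per key
    [np.array([1, 8, 15, 16000]), np.array([0.1, 0.1, 0.1, 0.1])],
    [np.array([5, 6]), np.array([0.5, 0.5])],
]

# Write: one group per integer key, two datasets per group.
with h5py.File('store.h5', 'w') as f:
    for key, (ints, floats) in enumerate(values):
        g = f.create_group(str(key))
        g.create_dataset('ints', data=ints)
        g.create_dataset('floats', data=floats)

# Read back a single key; only that pair is pulled off disk.
with h5py.File('store.h5', 'r') as f:
    g = f['1']
    ints, floats = g['ints'][:], g['floats'][:]

Multiple reader processes can each open the file in 'r' mode independently, as long as nothing is writing to it at the same time (see the HDF5 caveat later in the thread).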

Well, maybe something like a simple class emulating a dictionary that stores key-values on disk would be more than enough. Then you can use whatever persistence layer you want (even HDF5, but not necessarily).

As a demonstration I did a quick and dirty implementation of such a persistent key-store thing (https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897). In it, the KeyStore class (less than 40 lines long) is responsible for storing the value (2 arrays) under a key (a directory). As I am quite a big fan of compression, I implemented a couple of serialization flavors: one using the .npz format (so no dependencies other than NumPy are needed) and the other using the ctable object from the bcolz package (bcolz.blosc.org). Here are some performance numbers:

$ python key-store.py -f numpy -d __test -l 0
########## Checking method: numpy (via .npz files) ############
Building database.  Wait please...
Time (creation) --> 1.906
Retrieving 100 keys in arbitrary order...
Time (query) --> 0.191
Number of elements out of getitem: 10518976

faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
75M     __test

So, with the NPZ format we can deal with the 75 MB quite easily. But NPZ can compress data as well, so let's see how it goes:

$ python key-store.py -f numpy -d __test -l 9
########## Checking method: numpy (via .npz files) ############
Building database.  Wait please...
Time (creation) --> 6.636
Retrieving 100 keys in arbitrary order...
Time (query) --> 0.384
Number of elements out of getitem: 10518976

faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
28M     __test

OK, in this case we got almost a 3x compression ratio, which is not bad. However, the performance has degraded a lot. Let's now use bcolz, first in non-compressed mode:

$ python key-store.py -f bcolz -d __test -l 0
########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz') ############
Building database.  Wait please...
Time (creation) --> 0.479
Retrieving 100 keys in arbitrary order...
Time (query) --> 0.103
Number of elements out of getitem: 10518976

faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
82M     __test

Without compression, bcolz takes a bit more (~10%) space than NPZ. However, bcolz is actually meant to be used with compression on by default:

$ python key-store.py -f bcolz -d __test -l 9
########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz') ############
Building database.  Wait please...
Time (creation) --> 0.487
Retrieving 100 keys in arbitrary order...
Time (query) --> 0.98
Number of elements out of getitem: 10518976

faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
29M     __test

So, the final disk usage is quite similar to NPZ, but bcolz can store and retrieve data much faster. Also, the decompression speed is on par with using no compression. This is because bcolz uses Blosc behind the scenes, which is much faster than zlib (used by NPZ) -- and sometimes faster than a memcpy(). However, even though we are doing I/O against the disk, this dataset is so small that it fits in the OS filesystem cache, so the benchmark is actually measuring I/O at memory speeds, not disk speeds.

In order to do a more real-life comparison, let's use a dataset that is much larger than the amount of memory in my laptop (8 GB):

$ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 0
########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz') ############
Building database.  Wait please...
Time (creation) --> 133.650
Retrieving 100 keys in arbitrary order...
Time (query) --> 2.881
Number of elements out of getitem: 91907396

faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
39G     /media/faltet/docker/__test

And now, with compression on:

$ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 9
########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz') ############
Building database.  Wait please...
Time (creation) --> 145.633
Retrieving 100 keys in arbitrary order...
Time (query) --> 1.339
Number of elements out of getitem: 91907396

faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
12G     /media/faltet/docker/__test

So, we are still seeing the 3x compression ratio. But the interesting thing here is that the compressed version runs about twice as fast as the uncompressed one (roughly 13 ms/query vs 29 ms/query). In this case I was using an SSD (hence the low query times), so the compression advantage is even more noticeable than when working at memory speeds as above (as expected).

Anyway, this is just a demonstration that you don't need heavy tools to achieve what you want. And as a corollary, (fast) compressors can save you not only storage, but processing time too.

Francesc
-- Francesc Alted
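The gist linked above contains the actual implementation; purely as an illustration of the idea (not the gist's code -- the class name, file layout, and methods here are invented), a dictionary-like .npz-backed store can be sketched in a few lines:

import os
import numpy as np

class NPZKeyStore:
    """Toy persistent key-value store: one .npz file per integer key."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def _fname(self, key):
        return os.path.join(self.path, '%d.npz' % key)

    def __setitem__(self, key, value):
        ints, floats = value
        np.savez(self._fname(key), ints=ints, floats=floats)

    def __getitem__(self, key):
        # NpzFile reads each member lazily; the arrays are materialized here.
        with np.load(self._fname(key)) as f:
            return [f['ints'], f['floats']]

# Usage:
# store = NPZKeyStore('__test')
# store[0] = [np.array([1, 8, 15]), np.array([0.1, 0.1, 0.1])]
# ints, floats = store[0]

Swapping np.savez for np.savez_compressed roughly corresponds to the -l 9 runs above; a bcolz-backed flavor would store each pair as an on-disk ctable instead.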

From what I know, this would be the use case that Dask seems to solve.

I think this blog post can help: https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python

Note that I haven't used any of these projects myself.

A warning about HDF5: it is not a database format, so you have to be extremely careful if the data is being updated while it is open for reading by anybody else. If it is strictly read-only and nobody else is updating it, then have at it!

Cheers!
Ben Root

I don't know enough about xray to know whether it supports this kind of general labeling, so that you could build your entire data structure as an xray object. Dask could definitely be used to process your data in an easy-to-describe manner (creating a dask.bag of dask.arrays would work, though I'm not sure there are any methods that would buy you much over just having a standard dictionary of dask.arrays). You can definitely use dask imperative to parallelize your data-manipulation algorithms.

But dask doesn't take a strong opinion as to how you store your data --- it can use anything Python can read. I believe your question was "how do I store this?" If you think of the file system as a simple key-value store, then you could easily construct this kind of scenario on disk, with directory names for your keys and two files in each directory for your arrays. Then you could mmap the individual arrays directly for processing. Those individual arrays could be stored as bcolz, .npy files, or anything else.

Will your multiple processes need to write to these files, or will they be read-only?

-Travis
--
Travis Oliphant
Co-founder and CEO
@teoliphant
512-222-5440
http://www.continuum.io
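A rough sketch of the directory-per-key layout Travis describes, assuming plain .npy files and memory-mapped reads (the root path and file names are invented for illustration):

import os
import numpy as np

ROOT = 'store'  # one subdirectory per integer key

def put(key, ints, floats):
    d = os.path.join(ROOT, str(key))
    os.makedirs(d, exist_ok=True)
    np.save(os.path.join(d, 'ints.npy'), ints)
    np.save(os.path.join(d, 'floats.npy'), floats)

def get(key):
    # mmap_mode='r' maps the files instead of reading them into RAM,
    # so many reader processes can share the OS page cache.
    d = os.path.join(ROOT, str(key))
    return [np.load(os.path.join(d, 'ints.npy'), mmap_mode='r'),
            np.load(os.path.join(d, 'floats.npy'), mmap_mode='r')]

Travis's read-only question matters here too: memory-mapped readers combined with concurrent writers to the same files need the same care as the HDF5 case above.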

Indeed, xray's data model is not flexible enough to represent this sort of data -- it's designed around cases where multiple arrays share common axes.

However, I would indeed recommend dask.array (coupled with some sort of on-disk storage) as a possible solution for this problem, if you need to be able to manipulate these arrays with an API that looks like NumPy. That said, the fact that your data consists of ragged arrays suggests that the dask.array API may be less useful for you.

Tools like dask.imperative, coupled with HDF5 for storage, could still be very useful, though.

The reason I didn't suggest dask is that I had the impression that dask's model is better suited to bulk/streaming computations with vectorized semantics ("do the same thing to lots of data" kinds of problems, basically), whereas it sounded like the OP's algorithm needed lots of one-off, unpredictable random access.

Obviously, even if this is true, it's still worth pointing both out, because the OP's problem might turn out to be a better fit for dask's model than they indicated -- the post is somewhat vague :-).

But I just wanted to check: is the above a good characterization of dask's strengths/applicability?

-n

-- Nathaniel J. Smith -- http://vorpus.org

Yes, dask is definitely designed around setting up a large streaming computation and then executing it all at once. But it is pretty flexible in terms of what those specific computations are, and it can also work for non-vectorized computation (especially via dask imperative). It's worth taking a look at dask's collections for a sense of what it can do here. The recently refreshed docs provide a nice overview: http://dask.pydata.org/

Cheers,
Stephan
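As a flavor of the non-vectorized, per-key style of work being discussed, here is a small sketch using dask's delayed interface (the modern spelling of what the thread calls dask imperative); load_pair and pair_total are hypothetical stand-ins for whatever storage layer and per-key computation are actually used:

import numpy as np
import dask
from dask import delayed

@delayed
def load_pair(key):
    # Hypothetical loader: in practice, read the two arrays for `key`
    # from whichever on-disk store was chosen (HDF5, .npy, bcolz, ...).
    n = key + 1
    return np.arange(n, dtype=np.int32), np.full(n, 0.1)

@delayed
def pair_total(pair):
    ints, floats = pair
    return float((ints * floats).sum())

# Build the task graph lazily, then execute the per-key work in parallel.
tasks = [pair_total(load_pair(k)) for k in range(100)]
totals = dask.compute(*tasks)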

Hi Ryan,

Did you consider packing the arrays into one (or two) giant arrays stored with mmap? That way you only need to store the start and end offsets, and there is no need to use a dictionary. It may allow you to simplify some numerical operations as well.

To be more specific, you would keep four arrays:

start : numpy.intp
end   : numpy.intp
data1 : numpy.int32
data2 : numpy.float64

Then your original access to the dictionary can be rewritten as:

data1[start[key]:end[key]]
data2[start[key]:end[key]]

Whether to wrap this as a dictionary-like object is just a matter of taste -- depending on how raw you like it. If you need to apply some global transformation to the data, then something like data2[...] *= 10 would work, and ufunc.reduceat(data1, ...) can be very useful as well (with some tricks on start/end).

I was facing a similar issue a few years ago, and you may want to look at this code (it wasn't very well written, I have to admit): https://github.com/rainwoodman/gaepsi/blob/master/gaepsi/tools/__init__.py#L...

Best,
- Yu
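As an illustration of the packed layout Yu describes (the array names follow his note; the toy input dictionary and the choice of np.save/np.load for persistence are assumptions for the sketch):

import numpy as np

# Tiny stand-in for the real data: key -> [int array, float array] of varying length.
d = {0: [np.array([1, 8, 15]), np.array([0.1, 0.1, 0.1])],
     1: [np.array([5, 6]),     np.array([0.5, 0.5])]}

# Pack everything into two flat arrays plus start/end offsets.
lengths = np.array([len(d[k][0]) for k in range(len(d))], dtype=np.intp)
end = np.cumsum(lengths)
start = end - lengths
data1 = np.concatenate([d[k][0] for k in range(len(d))]).astype(np.int32)
data2 = np.concatenate([d[k][1] for k in range(len(d))]).astype(np.float64)

# These four arrays can be dumped with np.save and reopened in every process
# with np.load(..., mmap_mode='r'), so nothing is copied into RAM up front.

def get(key):
    # Equivalent of d[key]: two zero-copy views into the big arrays.
    return [data1[start[key]:end[key]], data2[start[key]:end[key]]]

# Global transforms and segmented reductions work directly on the flat arrays:
data2[...] *= 10
per_key_sums = np.add.reduceat(data1, start)  # one sum per key (assumes no empty segments)

Because the memory-mapped files are read-only views shared through the OS page cache, this gives fast shared access across processes to data larger than RAM, with no per-key file overhead.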
participants (8)
- Benjamin Root
- Edison Gustavo Muenz
- Feng Yu
- Francesc Alted
- Nathaniel Smith
- Ryan R. Rosario
- Stephan Hoyer
- Travis Oliphant