[Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

Travis Oliphant travis at continuum.io
Thu Jan 14 11:26:48 EST 2016


On Thu, Jan 14, 2016 at 8:16 AM, Edison Gustavo Muenz <
edisongustavo at gmail.com> wrote:

> From what I know, this would be the use case that Dask seems to solve.
>
> I think this blog post can help:
> https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python
>
> Notice that I haven't used any of these projects myself.
>

I don't know enough about xray to know whether it supports this kind of
general labeling, i.e. whether you could build your entire data-structure
as an xray object.   Dask could definitely be used to process your data in
an easy-to-describe manner (creating a dask.bag of dask.arrays would work,
though I'm not sure its methods would buy you much over a standard
dictionary of dask.arrays).   You can definitely use dask.imperative to
parallelize your data-manipulation algorithms.
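
For instance, a minimal sketch of that imperative style, assuming the
delayed interface (the current name for what was dask.imperative;
load_pair and process are hypothetical placeholders):

import numpy as np
from dask import delayed

def load_pair(key):
    # hypothetical loader for the two arrays stored under `key`
    ints = np.load('data/%d/ints.npy' % key)
    floats = np.load('data/%d/floats.npy' % key)
    return ints, floats

def process(pair):
    # placeholder computation on one key's pair of arrays
    ints, floats = pair
    return floats[ints > 0].sum()

# build a lazy task graph over all keys, then compute it in parallel
results = [delayed(process)(delayed(load_pair)(k)) for k in range(1000)]
total = delayed(sum)(results).compute()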

But, dask doesn't take a strong opinion as to how you store your data ---
it can use anything Python can read.    I believe your question was "how do
I store this?"

If you think of the file-system as a simple key-value store, then you could
easily construct this kind of scenario on disk with directory names for
your keys and two files in each directory for your arrays.      Then, you
could mmap the individual arrays directly for processing.    Those
individual arrays could be stored as bcolz, npy files, or anything else.
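
A minimal sketch of that layout, with illustrative paths and helper names:

import os
import numpy as np

def store(root, key, ints, floats):
    # one directory per key, one .npy file per array
    d = os.path.join(root, str(key))
    if not os.path.exists(d):
        os.makedirs(d)
    np.save(os.path.join(d, 'ints.npy'), ints)
    np.save(os.path.join(d, 'floats.npy'), floats)

def load(root, key):
    # mmap_mode='r' maps the files instead of reading them into memory
    d = os.path.join(root, str(key))
    return (np.load(os.path.join(d, 'ints.npy'), mmap_mode='r'),
            np.load(os.path.join(d, 'floats.npy'), mmap_mode='r'))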

Will your multiple processes need to write to these files or will they be
read-only?

-Travis




>
> On Thu, Jan 14, 2016 at 11:48 AM, Francesc Alted <faltet at gmail.com> wrote:
>
>> Well, maybe something like a simple class emulating a dictionary that
>> stores each key-value pair on disk would be more than enough.  Then you
>> can use whatever persistence layer you want (even HDF5, but not
>> necessarily).
>>
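>> For illustration, a stripped-down sketch of the idea (not the exact code
>> from the gist mentioned below) could look like this:
>>
>> import os
>> import numpy as np
>>
>> class KeyStore(object):
>>     """Dict-like store: each key is a directory holding one .npz file."""
>>     def __init__(self, root):
>>         self.root = root
>>
>>     def __setitem__(self, key, arrays):
>>         d = os.path.join(self.root, str(key))
>>         if not os.path.exists(d):
>>             os.makedirs(d)
>>         ints, floats = arrays
>>         np.savez(os.path.join(d, 'data.npz'), ints=ints, floats=floats)
>>
>>     def __getitem__(self, key):
>>         f = np.load(os.path.join(self.root, str(key), 'data.npz'))
>>         return [f['ints'], f['floats']]
>>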
>> As a demonstration I did a quick-and-dirty implementation of such a
>> persistent key-store (
>> https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897).  In it, the
>> KeyStore class (less than 40 lines long) is responsible for storing a
>> value (the two arrays) under a key (a directory).  As I am quite a big fan
>> of compression, I implemented a couple of serialization flavors: one using
>> the .npz format (so no dependencies other than NumPy are needed) and the
>> other using the ctable object from the bcolz package (bcolz.blosc.org).
>> Here are some performance numbers:
>>
>> $ python key-store.py -f numpy -d __test -l 0
>> ########## Checking method: numpy (via .npz files) ############
>> Building database.  Wait please...
>> Time (            creation) --> 1.906
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 0.191
>> Number of elements out of getitem: 10518976
>> faltet at faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>>
>> 75M     __test
>>
>> So, with the NPZ format we can deal with the 75 MB quite easily.  But NPZ
>> can compress data as well, so let's see how it goes:
>>
>> $ python key-store.py -f numpy -d __test -l 9
>> ########## Checking method: numpy (via .npz files) ############
>> Building database.  Wait please...
>> Time (            creation) --> 6.636
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 0.384
>> Number of elements out of getitem: 10518976
>> faltet at faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 28M     __test
>>
>> OK, in this case we get almost a 3x compression ratio, which is not
>> bad.  However, the performance has degraded a lot.  Let's now try bcolz,
>> first in non-compressed mode:
>>
>> $ python key-store.py -f bcolz -d __test -l 0
>> ########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
>> ############
>> Building database.  Wait please...
>> Time (            creation) --> 0.479
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 0.103
>> Number of elements out of getitem: 10518976
>> faltet at faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 82M     __test
>>
>> Without compression, bcolz takes a bit more (~10%) space than NPZ.
>> However, bcolz is actually meant to be used with compression on by default:
>>
>> $ python key-store.py -f bcolz -d __test -l 9
>> ########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
>> ############
>> Building database.  Wait please...
>> Time (            creation) --> 0.487
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 0.098
>> Number of elements out of getitem: 10518976
>> faltet at faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 29M     __test
>>
>> So, the final disk usage is quite similar to NPZ, but bcolz can store and
>> retrieve much faster.  Also, the data decompression speed is on par with
>> using no compression.  This is because bcolz uses Blosc behind the scenes,
>> which is much faster than zlib (used by NPZ) --and sometimes faster than a
>> memcpy().  However, even though we are doing I/O against the disk, this
>> dataset is so small that it fits in the OS filesystem cache, so the
>> benchmark is actually measuring I/O at memory speeds, not disk speeds.
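>>
>> For reference, the bcolz flavor stores each value roughly along these
>> lines (the exact details live in the gist):
>>
>> import bcolz
>> import numpy as np
>>
>> ints = np.array([1, 8, 15])
>> floats = np.array([0.1, 0.1, 0.1])
>> # one on-disk ctable per key, compressed with Blosc
>> ct = bcolz.ctable(columns=[ints, floats], names=['ints', 'floats'],
>>                   rootdir='__test/0', mode='w',
>>                   cparams=bcolz.cparams(clevel=9, cname='blosclz'))
>> ct.flush()
>> # reading it back
>> ct2 = bcolz.open('__test/0')
>> ints2, floats2 = ct2['ints'][:], ct2['floats'][:]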
>>
>> In order to do a more real-life comparison, let's use a dataset that is
>> much larger than the amount of memory in my laptop (8 GB):
>>
>> $ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d
>> /media/faltet/docker/__test -l 0
>> ########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
>> ############
>> Building database.  Wait please...
>> Time (            creation) --> 133.650
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 2.881
>> Number of elements out of getitem: 91907396
>> faltet at faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
>> /media/faltet/docker/__test
>>
>> 39G     /media/faltet/docker/__test
>>
>> and now, with compression on:
>>
>> $ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d
>> /media/faltet/docker/__test -l 9
>> ########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
>> ############
>> Building database.  Wait please...
>> Time (            creation) --> 145.633
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 1.339
>> Number of elements out of getitem: 91907396
>> faltet at faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
>> /media/faltet/docker/__test
>>
>> 12G     /media/faltet/docker/__test
>>
>> So, we are still seeing the 3x compression ratio.  But the interesting
>> thing here is that the compressed version runs more than twice as fast as
>> the uncompressed one (13 ms/query vs 29 ms/query).  In this case I was
>> using an SSD (hence the low query times), so the compression advantage is
>> even more noticeable than when working from memory as above (as expected).
>>
>> But anyway, this is just a demonstration that you don't need heavy tools
>> to achieve what you want.  And as a corollary, (fast) compressors can save
>> you not only storage, but processing time too.
>>
>> Francesc
>>
>>
>> 2016-01-14 11:19 GMT+01:00 Nathaniel Smith <njs at pobox.com>:
>>
>>> I'd try storing the data in HDF5 (probably via h5py, which is a more
>>> basic interface without all the bells and whistles that PyTables
>>> adds), though any method you use is going to be limited by the need to
>>> do a seek before each read. Storing the data on SSD will probably help
>>> a lot if you can afford it for your data size.
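>>>
>>> Roughly, something like this (the group-per-key layout is just one
>>> possibility):
>>>
>>> import h5py
>>> import numpy as np
>>>
>>> # one group per key; datasets can have a different length for each key
>>> with h5py.File('store.h5', 'w') as f:
>>>     g = f.create_group('0')
>>>     g.create_dataset('ints', data=np.array([1, 8, 15]))
>>>     g.create_dataset('floats', data=np.array([0.1, 0.1, 0.1]))
>>>
>>> with h5py.File('store.h5', 'r') as f:
>>>     ints = f['0/ints'][:]
>>>     floats = f['0/floats'][:]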
>>>
>>> On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <ryan at bytemining.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > I have a very large dictionary that must be shared across processes
>>> and does not fit in RAM. I need access to this object to be fast. The key
>>> is an integer ID and the value is a list containing two elements, both of
>>> them numpy arrays (one has ints, the other has floats). The key is
>>> sequential, starts at 0, and there are no gaps, so the “outer” layer of
>>> this data structure could really just be a list with the key actually being
>>> the index. The lengths of each pair of arrays may differ across keys.
>>> >
>>> > For a visual:
>>> >
>>> > {
>>> > key=0:
>>> >         [
>>> >                 numpy.array([1,8,15,…, 16000]),
>>> >                 numpy.array([0.1,0.1,0.1,…,0.1])
>>> >         ],
>>> > key=1:
>>> >         [
>>> >                 numpy.array([5,6]),
>>> >                 numpy.array([0.5,0.5])
>>> >         ],
>>> > …
>>> > }
>>> >
>>> > I’ve tried:
>>> > -       manager proxy objects, but the object was so big that
>>> low-level code threw an exception due to format, and monkey-patching
>>> wasn't successful.
>>> > -       Redis, which was far too slow due to setting up connections
>>> and data conversion etc.
>>> > -       Numpy rec arrays + memory mapping, but there is a restriction
>>> that the numpy arrays in each “column” must all have the same fixed size.
>>> > -       I looked at PyTables, which may be a solution, but seems to
>>> have a very steep learning curve.
>>> > -       I haven’t tried SQLite3, but I am worried about the time it
>>> takes to query the DB for a sequential ID, and then translate byte arrays.
>>> >
>>> > Any ideas? I greatly appreciate any guidance you can provide.
>>> >
>>> > Thanks,
>>> > Ryan
>>>
>>>
>>>
>>> --
>>> Nathaniel J. Smith -- http://vorpus.org
>>>
>>
>>
>>
>> --
>> Francesc Alted
>>
>>
>>
>
>
>


-- 

*Travis Oliphant*
*Co-founder and CEO*


@teoliphant
512-222-5440
http://www.continuum.io

