Hi Ryan,

Did you consider packing the arrays into one (or two) giant arrays stored with mmap? That way you only need to store the start and end offsets, and there is no need for a dictionary. It may let you simplify some numerical operations as well.

To be more specific:

    start : numpy.intp
    end   : numpy.intp
    data1 : numpy.int32
    data2 : numpy.float64

Then your original dictionary access can be rewritten as

    data1[start[key]:end[key]]
    data2[start[key]:end[key]]

Whether to wrap this in a dictionary-like object is just a matter of taste -- depending on whether you like it raw or refined. If you need to apply some global transformation to the data, then something like

    data2[...] *= 10

would work. ufunc.reduceat(data1, ...) can be very useful as well (with some tricks on start/end).

I was facing a similar issue a few years ago, and you may want to look at this code (it wasn't very well written, I have to admit):

https://github.com/rainwoodman/gaepsi/blob/master/gaepsi/tools/__init__.py#L...

Best,

- Yu

On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <ryan@bytemining.com> wrote:
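[A minimal sketch of the packed layout Yu suggests above. The sample arrays are stand-ins for the dictionary values from the original question; in practice data1/data2 would be created with numpy.memmap rather than numpy.concatenate, as noted in the comments.]

```python
import numpy as np

# Hypothetical ragged data: per-key int and float arrays of varying
# length, standing in for the dictionary values in Ryan's question.
arrays1 = [np.array([1, 8, 15, 16000], dtype=np.int32),
           np.array([5, 6], dtype=np.int32)]
arrays2 = [np.array([0.1, 0.1, 0.1, 0.1]),
           np.array([0.5, 0.5])]

# Pack everything into two flat arrays plus start/end offset arrays.
lengths = np.array([len(a) for a in arrays1], dtype=np.intp)
end = np.cumsum(lengths)
start = end - lengths
data1 = np.concatenate(arrays1).astype(np.int32)
data2 = np.concatenate(arrays2).astype(np.float64)

# On disk, the flat arrays would instead be memory-mapped so they never
# need to fit in RAM, e.g.:
#   data1 = np.memmap('data1.bin', dtype=np.int32, mode='w+',
#                     shape=(end[-1],))

def get(key):
    """Dictionary-style access: the pair of array slices for one key."""
    return data1[start[key]:end[key]], data2[start[key]:end[key]]

ints, floats = get(0)        # views into the flat arrays, no copying

# A global transformation operates on the whole flat array in one shot:
data2[...] *= 10

# Per-key reductions without a Python loop, via reduceat on the start
# offsets (this is the "trick on start/end" mentioned above):
sums = np.add.reduceat(data2, start)
```

Because the slices are views, the global `data2[...] *= 10` is visible through `floats` as well; `np.add.reduceat(data2, start)` then yields one sum per key in a single vectorized call.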
Hi,
I have a very large dictionary that must be shared across processes and does not fit in RAM. I need access to this object to be fast. The key is an integer ID and the value is a list containing two elements, both of them numpy arrays (one has ints, the other has floats). The key is sequential, starts at 0, and there are no gaps, so the “outer” layer of this data structure could really just be a list with the key actually being the index. The lengths of each pair of arrays may differ across keys.
For a visual:
{
  key=0: [ numpy.array([1, 8, 15, …, 16000]), numpy.array([0.1, 0.1, 0.1, …, 0.1]) ],
  key=1: [ numpy.array([5, 6]), numpy.array([0.5, 0.5]) ],
  …
}
I’ve tried:

- manager proxy objects, but the object was so big that low-level code threw an exception due to format, and monkey-patching wasn’t successful.
- Redis, which was far too slow due to setting up connections, data conversion, etc.
- NumPy rec arrays + memory mapping, but there is a restriction that the numpy arrays in each “column” must be of a fixed, identical size.
- PyTables, which may be a solution, but seems to have a very steep learning curve.
- I haven’t tried SQLite3, but I am worried about the time it takes to query the DB for a sequential ID, and then translate byte arrays.
Any ideas? I greatly appreciate any guidance you can provide.
Thanks,
Ryan

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion