On Thu, Jan 14, 2016 at 2:13 PM, Stephan Hoyer <shoyer@gmail.com> wrote:
On Thu, Jan 14, 2016 at 8:26 AM, Travis Oliphant <travis@continuum.io> wrote:
I don't know enough about xray to know whether it supports this kind of general labeling to be able to build your entire data-structure as an xray object. Dask could definitely be used to process your data in an easy-to-describe manner (creating a dask.bag of dask.arrays would work, though I'm not sure there are any methods that would buy you much over just having a standard dictionary of dask.arrays). You can definitely use dask imperative to parallelize your data-manipulation algorithms.
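A minimal sketch of the "standard dictionary of dask.arrays" idea, using dask.delayed (the interface that grew out of dask imperative) to parallelize a one-off manipulation over ragged per-record arrays. The record names and the summarize function are hypothetical, just for illustration:

```python
import numpy as np
import dask

@dask.delayed
def summarize(x):
    # Any ad-hoc NumPy manipulation can go here; dask schedules
    # the calls in parallel rather than vectorizing them.
    return float(x.sum())

# Ragged data: arrays of different lengths, keyed in a plain dict.
records = {"a": np.arange(10), "b": np.arange(25)}

# Build lazy tasks, then compute them all at once.
lazy = {k: summarize(v) for k, v in records.items()}
(totals,) = dask.compute(lazy)
```

dask.compute traverses the dictionary and replaces each delayed value with its result, so `totals` is an ordinary dict of floats.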
Indeed, xray's data model is not flexible enough to represent this sort of data -- it's designed around cases where multiple arrays use shared axes.
However, I would indeed recommend dask.array (coupled with some sort of on-disk storage) as a possible solution for this problem, if you need to be able to manipulate these arrays with an API that looks like NumPy. That said, the fact that your data consists of ragged arrays suggests that the dask.array API may be less useful for you.
Tools like dask.imperative, coupled with HDF5 for storage, could still be very useful, though.
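A hedged sketch of the dask.array-over-HDF5 combination (assumes h5py and dask are installed; the file name and dataset path are made up for the example). The HDF5 dataset stays on disk, and dask.array streams through it chunk by chunk:

```python
import h5py
import numpy as np
import dask.array as da

# Create some demo on-disk data (in practice the file already exists).
with h5py.File("observations.h5", "w") as f:
    f.create_dataset("signal", data=np.arange(1_000_000, dtype="f8"))

f = h5py.File("observations.h5", "r")
x = da.from_array(f["signal"], chunks=100_000)  # lazy, chunked view
mean = x.mean().compute()                       # reduces chunk by chunk
f.close()
```

Each 100,000-element chunk is read and reduced independently, so the full array never has to fit in memory at once.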
The reason I didn't suggest dask is that I had the impression that dask's model is better suited to bulk/streaming computations with vectorized semantics ("do the same thing to lots of data" kinds of problems, basically), whereas it sounded like the OP's algorithm needed lots of one-off, unpredictable random access.

Obviously, even if this is true, it's still useful to point out both options, since the OP's problem might turn out to be a better fit for dask's model than they indicated -- the post is somewhat vague :-). But I just wanted to check: is the above a good characterization of dask's strengths/applicability?

-n

--
Nathaniel J. Smith -- http://vorpus.org