
On Thu, Apr 14, 2016 at 1:14 PM, Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
Hi Nathaniel, [...] I really like the idea of the DOK buffer of pending updates. I'll note that the COO structure you describe looks similar to CJ's SpArray, which he mentioned in a parallel thread, and my MapArray can serve as a buffer. Tacking them together seems easy, so one can fairly easily test-drive what you describe.
Cool :-)
Looking a bit further though, I doubt that a one-size-fits-all data format is realistic. For one, there's a balance to strike between an optimized 2D structure (e.g., PySparse's ll_mat) and a general n-D structure, and there will always be at least one application which requires one or the other. (My crude experiments so far give about a factor of 2--5 in CPU and/or memory footprint, depending on the tradeoffs.)
Ultimately then, I think it makes sense to consider three separate layers: a Python frontend, a low-level data structure, and a middle layer which transforms, e.g., the logic required by Python's __setitem__ into iterators over the low-level data structure. For shaping up this middle layer, an std::map is fine, especially because it's already implemented :-).
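For concreteness, here is a rough Python-level sketch of the DOK-buffer-over-COO idea under discussion. The class and method names (SparseND, _flush) are made up for illustration, not sparr's actual API, and the real middle layer would live in C++ (e.g. on top of std::map) rather than in a Python dict:

    import numpy as np

    class SparseND:
        # n-D COO-style storage plus a dict-of-keys buffer of pending updates.
        def __init__(self, shape, dtype=np.float64):
            self.shape = shape
            self.dtype = np.dtype(dtype)
            # COO storage: an (nnz, ndim) index array and a matching values array.
            self.coords = np.empty((0, len(shape)), dtype=np.intp)
            self.values = np.empty(0, dtype=self.dtype)
            self._buffer = {}   # pending updates: {full index tuple: value}

        def __setitem__(self, index, value):
            # The "middle layer": turn a full multi-index assignment into a
            # buffered update; the sorted COO arrays are not touched yet.
            self._buffer[tuple(index)] = self.dtype.type(value)

        def _flush(self):
            # Merge buffered updates into the COO arrays in one sorted pass,
            # keeping the newest value for any duplicated index.
            if not self._buffer:
                return
            new_coords = np.array(list(self._buffer), dtype=np.intp)
            new_values = np.array(list(self._buffer.values()), dtype=self.dtype)
            coords = np.concatenate([self.coords, new_coords])
            values = np.concatenate([self.values, new_values])
            order = np.lexsort(coords.T[::-1])   # stable sort: old entries before new
            coords, values = coords[order], values[order]
            last = np.ones(len(values), dtype=bool)
            last[:-1] = np.any(np.diff(coords, axis=0) != 0, axis=1)
            self.coords, self.values = coords[last], values[last]
            self._buffer.clear()

Reads and arithmetic would then call _flush() first, so a write-heavy construction loop never touches the sorted arrays.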
I guess my tactical preference would be to start by implementing a one-size-fits-all data format, make it as awesome as one can, and then only add extra special cases once (if) it becomes clear that there is really a need. ll_mat is just a list-of-lists format, right? Why would we need that? And CSR/CSC are nice, but even for 2-d matrices the savings compared to COO seem underwhelming. I guess CSR/CSC are at most 30% smaller than COO, and if you need random access to rows/columns then you can do that in O(1) instead of O(log log n)? Obviously there will always be some specialized applications where this matters, but...
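A quick back-of-envelope check of that size guess, using scipy.sparse (the exact saving depends on whether 32- or 64-bit index arrays get picked: roughly 25% vs 33% with float64 data):

    import numpy as np
    import scipy.sparse as sp

    a = sp.random(10000, 10000, density=1e-3, format="coo", random_state=1234)
    coo_bytes = a.data.nbytes + a.row.nbytes + a.col.nbytes          # data + row + col
    b = a.tocsr()
    csr_bytes = b.data.nbytes + b.indices.nbytes + b.indptr.nbytes   # data + indices + indptr

    print(coo_bytes, csr_bytes, 1 - csr_bytes / coo_bytes)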
For interface with external C/C++/Fortran code that wants CSR or CSC, I'd just have get_csr() and get_csc() methods that return a tuple of (indptr, indices, data), and not even try to implement any actual operations on this representation.
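A sketch of what that export could look like for a 2-D array held as COO triplets. Only the (indptr, indices, data) return value is what is proposed above; the function itself and the assumed (rows, cols, data) layout are made up for illustration:

    import numpy as np

    def get_csr(rows, cols, data, n_rows):
        # Sort by (row, col) so each row's column indices end up contiguous.
        order = np.lexsort((cols, rows))
        indices = cols[order]
        csr_data = data[order]
        # indptr[i]:indptr[i+1] delimits row i, built from per-row counts.
        counts = np.bincount(rows, minlength=n_rows)
        indptr = np.concatenate(([0], np.cumsum(counts)))
        return indptr, indices, csr_data

get_csc() would be the same thing with rows and cols swapped (and n_cols in place of n_rows), since CSC of a matrix is just CSR of its transpose.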
* Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea.
It would be really nice if you could just use numpy's own dtype objects and ufuncs instead of implementing your own. I don't know how viable that is right now, but I think it's worth trying, and if you can't then I'd like to know what part broke :-). Happy to help if you need some tips / further explanation.
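At the Python level, "use numpy's own dtype objects" can be as simple as storing an np.dtype on the container and letting numpy answer the casting questions; a small sketch (the helper names here are made up, only np.dtype, np.result_type and np.can_cast are numpy's):

    import numpy as np

    def result_dtype(a_dtype, b_dtype):
        # Let numpy's promotion rules decide what a binary op should produce.
        return np.result_type(np.dtype(a_dtype), np.dtype(b_dtype))

    def coerce_value(value, dtype):
        # Convert a Python object to the target dtype, refusing unsafe casts
        # under numpy's own "same_kind" rule instead of reinventing one.
        value = np.asarray(value)
        if not np.can_cast(value.dtype, dtype, casting="same_kind"):
            raise TypeError("cannot store %s in a %s array" % (value.dtype, dtype))
        return value.astype(dtype)

    print(result_dtype(np.int32, np.float64))      # float64
    print(coerce_value(3, np.dtype("float64")))    # 3.0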
For ufuncs, it's simple: it does not work ATM :-(. I tried simple combinations of __array__, __array_prepare__ and __array_finalize__, and they fail for one reason or another. Some details are available in https://github.com/ev-br/sparr/issues/35. In a nutshell, these can probably be made to work with some rather simple modifications to __array_prepare__/__array_wrap__, but I guess it's not very likely given that __numpy_ufunc__ is supposed to come to town, which I'm looking forward to :-).
Yeah, you probably do need __numpy_ufunc__. But in the meantime you could see about something like:

    def generic_ufunc_adapter(ufunc, *args, **kwargs):
        ... do something clever ...

    def add(*args, **kwargs):
        return generic_ufunc_adapter(np.add, *args, **kwargs)

and then be prepared to switch over to __numpy_ufunc__ when available. The big advantage of this approach is that you can hopefully handle all ufuncs, including user-defined ufuncs (e.g. scipy.special, numba @vectorize functions), without having to special-case all of them.
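Fleshing that adapter out a little for the easy case (a single sparse operand and an element-wise ufunc) -- the .dtype/.values/.with_values attributes are hypothetical here, only the ufuncs are numpy's:

    import numpy as np

    def generic_ufunc_adapter(ufunc, *args, **kwargs):
        # Only the one-sparse-operand, element-wise case is handled here.
        sparse, = args
        fill = ufunc(sparse.dtype.type(0), **kwargs)
        if fill != 0:
            # e.g. np.exp maps 0 -> 1, which would densify the result; a real
            # implementation needs a fill-value-aware format or a dense fallback.
            raise NotImplementedError("%s does not map zero to zero" % ufunc.__name__)
        # Apply the ufunc to the stored nonzero values only (hypothetical
        # .values attribute and .with_values() constructor).
        return sparse.with_values(ufunc(sparse.values, **kwargs))

    def sin(*args, **kwargs):
        return generic_ufunc_adapter(np.sin, *args, **kwargs)

    def expm1(*args, **kwargs):
        return generic_ufunc_adapter(np.expm1, *args, **kwargs)

The same wrapper works unchanged for scipy.special ufuncs or numba @vectorize functions, which is the point: only one place has to know how to walk the sparse structure.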
For dtypes, I currently just dispatch on numpy typecodes. Do you mean that, or do you mean actual PyArray_Descr objects? Those I'm not using, and I do not see how to use them, likely because I simply do not understand what they actually are :-). In that department, I'd really really appreciate an explanation!
I meant PyArray_Descr objects :-). They're Python objects whose C layout looks like this: https://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#c...

The crucial bit is that PyArray_ArrFuncs pointer at the end. That points to a table of function pointers: https://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#c... and those function pointers implement generic operations like "copy elements of this dtype from this strided array to that strided array", "convert a Python object into a buffer using this dtype" (used by e.g. arr[0] = some_object), etc. They're not terribly well documented, and we need to rework that API (but will have to preserve those entrypoints for backcompat purposes), but this is how numpy.ndarray is able to work with arbitrary dtypes without dispatching on typecodes.

-n

--
Nathaniel J. Smith -- https://vorpus.org