[AstroPy] Projects involving irregularly shaped data

Thu Oct 8 10:02:21 EDT 2020

Hi Jim,

I actually do have a good application for this: Very-High-Energy (VHE) gamma-ray astronomy using Imaging Atmospheric Cherenkov Telescopes (IACTs). I'm the data processing coordinator for CTA (the next-generation IACT) as well as lead developer of a Python-based low-level IACT reconstruction package (ctapipe), though this response is not made in any official capacity. In our field, we have nearly the same problem as in particle physics: we measure gamma rays by looking at air showers produced in the atmosphere using an array of highly sensitive optical telescopes, whose cameras are very similar to particle physics detectors. So for each detected "event" (which could be a gamma ray or a cosmic ray), we read out a sparse data block consisting of N sequences of images of the shower taken at nanosecond time resolution (where N is around 1-100), and for each image sequence we only store the readout of pixels that have a signal or are close to those that do (so the pixel array, of length M_pixels, is also variable-length). So everything is variable-length record arrays of variable-length arrays. Once data are fully processed, the final result looks more similar to "traditional" astronomy: gamma-ray sky images, spectra, light curves, etc., so most of this is hidden from the end user.

Due to this complexity, we've so far had to process raw data "event-by-event" and "telescope-by-telescope", at least in the first stages of analysis, and have had to build a series of complex data structures and loops to handle it all. We use NumPy and Numba heavily at the lowest levels (avoiding loops over pixels and time slices), but not for the event or telescope loops. Storage of the data is also somewhat complex, as we have to break it into flat tables to avoid the slowness introduced by variable-length arrays in most storage formats like HDF5 or FITS, and to support HPC optimization. Also, this is "big" data, meaning that we will generate and process about 10 PB of real data per year, and a similar volume of simulated data, so use of parallel processing is critical, machine learning is necessary, and even GPUs and other HPC methods are interesting.
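
To make the "flat tables" idea concrete, here is a minimal sketch in plain NumPy. It is not ctapipe's actual storage layout: the toy data, the column names (event_id, tel_index, offsets) and the helper flatten_events are all hypothetical and chosen only for illustration. It assumes every image keeps at least one pixel, because np.add.reduceat does not return zero for empty segments.

```python
import numpy as np

# Hypothetical toy data: two events, each with a variable number of
# telescope images, each image with a variable number of stored pixels.
events = [
    [[1.0, 2.0, 3.0], [4.0, 5.0]],            # event 0: 2 images
    [[6.0], [7.0, 8.0, 9.0, 10.0], [11.0]],   # event 1: 3 images
]


def flatten_events(events):
    """Flatten nested lists into one pixel column plus one index row per image."""
    pixels, event_id, tel_index, counts = [], [], [], []
    for i_event, images in enumerate(events):
        for i_tel, image in enumerate(images):
            pixels.extend(image)
            event_id.append(i_event)
            tel_index.append(i_tel)
            counts.append(len(image))
    counts = np.asarray(counts, dtype=np.int64)
    # offsets[k] is where image k starts in the flat pixel column;
    # offsets[k + 1] is where it ends.
    offsets = np.concatenate(([0], np.cumsum(counts)))
    return (
        np.asarray(pixels, dtype=np.float64),
        np.asarray(event_id, dtype=np.int64),
        np.asarray(tel_index, dtype=np.int64),
        offsets,
    )


pixels, event_id, tel_index, offsets = flatten_events(events)

# Per-image sums without a Python loop over pixels
# (valid only because no image is empty, see the assumption above).
image_sums = np.add.reduceat(pixels, offsets[:-1])

# Per-event sums, grouping the image rows by their event_id.
event_sums = np.bincount(event_id, weights=image_sums)
```

The flat columns are fixed-width and easy to write to a table, but the bookkeeping (offsets, index columns, regrouping) has to be carried by hand through every processing step.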

So the point of all this is: in an ideal world, we could easily apply algorithms to all events and telescopes at once (or at least as many as can fit into memory), and that requires something like Awkward Array. I've followed the evolution of Awkward Array with interest, but so far we have not used it for two reasons: (1) it didn't exist when we started development, and (2) it wasn't yet stable enough to be the core data structure of our whole framework. However, I think it's a really interesting technology to consider for a future refactoring. I would be happy to discuss more offline.

Karl

-- 
Karl Kosack
CEA Saclay / CTA Observatory
https://www.cta-observatory.org/

> On Oct 7, 2020, at 21:59, Jim Pivarski <jpivarski at gmail.com> wrote:
> 
> Hi everyone,
> 
> Adrian Price-Whelan recommended that I ask my question here, since it would reach a greater number of people involved in astronomical software.
> 
> I'm a developer of Awkward Array, a Python package for manipulating large, irregularly shaped datasets: arrays with variable-length lists, nested records, missing values, or mixed data types. The interface is a strict generalization of NumPy: you can slice jagged arrays as though they were ordinary multidimensional arrays, and there are new functions that only make sense in the context of irregular data. Like NumPy, the actual calculations are precompiled loops on internally homogeneous arrays, and we're expanding it to include GPUs transparently (irregular data on GPUs in a NumPy-like syntax).
> 
> This package was developed for particle physics (variable numbers of particles emerging from an array of collision events), but it seems like these problems would exist in other fields as well. Right now, we're working on a proposal to find data analysis projects that need to deal with large, irregularly structured data to see if Awkward Array is applicable and if it can be made more useful for them. Ideally, this would motivate more interoperability with other scientific Python libraries. (We can already use Awkward Arrays in Numba; we're working on cuDF, Dask, and Zarr. Adrian also recommended ASDF, which I'm looking into now.)
> 
> Does anyone have or know about a data analysis project that is currently limited by this combination of large + irregular data? Is anyone interested in collaborating?
> 
> Thank you!
> -- Jim
> 
> _______________________________________________
> AstroPy mailing list
> AstroPy at python.org
> https://mail.python.org/mailman/listinfo/astropy