Hi Stephan, 
thanks for the note. Progress over the last two years hasn't been impressive, IMO, but I hope you'll manage.

As you suggest, I'll take a look at xarray too, now that I see there is xarray.Dataset.
I was sure that it didn't work with non-homogeneous data at all; clearly I need to update my opinion.



On 22 Feb 2017, at 20:55, Stephan Hoyer <shoyer@gmail.com> wrote:

On Wed, Feb 22, 2017 at 8:57 AM, Alex Rogozhnikov <alex.rogozhnikov@yandex.ru> wrote:
Pandas may be nice if you need a report and you need to get it done tomorrow; then you'll throw away the code. When we initially used pandas as the main data storage in yandex/rep, it looked like a good idea, but a year later it was obvious that this was the wrong decision. When you build a data pipeline or research code that should still work several years later (on some other installation, run by someone else), usage of pandas should be minimal.

The pandas development team (myself included) is well aware of these issues. There are long term plans/hopes to fix this, but there's a lot of work to be done and some hard choices to make:

 That's why I am looking for a reliable pandas substitute, which should be: 
- completely consistent with numpy, and should fail when something isn't implemented or is impossible
- fewer new abstractions; nobody wants to learn one-more-way-to-manipulate-the-data, especially other researchers
- it may be less convenient for interactive data munging
  - in particular, fewer methods is ok
- written code should be easy to interpret and hard to misinterpret
- not super slow; 1-10 gigabyte datasets are a normal situation

This has some overlap with our motivations for writing Xarray (http://xarray.pydata.org), so I encourage you to take a look. It still might be more complex than you're looking for, but we did try to clean up the really ambiguous APIs from pandas like indexing.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion