Hi,
a question about numpy.recarray: the constructor has an order parameter (https://docs.scipy.org/doc/numpy1.10.1/reference/generated/numpy.recarray.html), but it seems to have no effect:
    import numpy
    x = numpy.recarray(dtype=[('a', int), ('b', float)], shape=[1000], order='C')
    y = numpy.recarray(dtype=[('a', int), ('b', float)], shape=[1000], order='F')
    print(numpy.array(x.ctypes.get_strides()))  # [16]
    print(numpy.array(y.ctypes.get_strides()))  # [16]
Is this intended behavior or a bug?
Thanks, Alex.
On Tue, Feb 21, 2017 at 3:05 PM, Alex Rogozhnikov <alex.rogozhnikov@yandex.ru> wrote:
> a question about numpy.recarray: There is a parameter order in constructor https://docs.scipy.org/doc/numpy1.10.1/reference/generated/numpy.recarray.html, but it seems to have no effect:
> x = numpy.recarray(dtype=[('a', int), ('b', float)], shape=[1000], order='C')
You are creating a 1D array here; there is no difference between Fortran and C order for a 1D array. For 2D:
    In [2]: x = numpy.recarray(dtype=[('a', int), ('b', float)], shape=[10,10], order='C')
    In [3]: x.strides
    Out[3]: (160, 16)
    In [4]: y = numpy.recarray(dtype=[('a', int), ('b', float)], shape=[10,10], order='F')
    In [5]: y.strides
    Out[5]: (16, 160)
Note the easier way to get the strides, too :)
CHB
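The point about 1D versus 2D layouts can be checked with a short script. This is only a sketch: it swaps the thread's int and float fields for the fixed-width codes 'i8' and 'f8' (an assumption made here so that each record is 16 bytes on every platform).

```python
import numpy as np

# Fixed-width fields (assumption: 'i8'/'f8' instead of int/float)
# so that one record is exactly 16 bytes everywhere.
dtype = [('a', 'i8'), ('b', 'f8')]

# 1D: C and Fortran order coincide, so the strides are identical.
x1 = np.recarray(shape=(1000,), dtype=dtype, order='C')
y1 = np.recarray(shape=(1000,), dtype=dtype, order='F')
print(x1.strides, y1.strides)  # (16,) (16,)

# 2D: the order parameter decides which axis varies fastest in memory.
x2 = np.recarray(shape=(10, 10), dtype=dtype, order='C')
y2 = np.recarray(shape=(10, 10), dtype=dtype, order='F')
print(x2.strides)  # (160, 16)
print(y2.strides)  # (16, 160)
```

In both layouts the two fields of one record stay next to each other; order only changes how whole records are arranged along the axes.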
Ah, got it. Thanks, Chris! I thought a recarray could only be one-dimensional (like a table with named columns).
Maybe it's better to ask directly what I was looking for: something that works like a table with named columns (but no labelling for rows), and keeps data (of different dtypes) in a column-by-column way (and this is numpy, not pandas).
Is there such a magic thing?
Alex.
On 22 Feb 2017, at 2:10, Chris Barker <chris.barker@noaa.gov> wrote:
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception
Chris.Barker@noaa.gov
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion
On Feb 21, 2017 3:24 PM, "Alex Rogozhnikov" <alex.rogozhnikov@yandex.ru> wrote:
> Ah, got it. Thanks, Chris! I thought recarray can be only one-dimensional (like tables with named columns).
> Maybe it's better to ask directly what I was looking for: something that works like a table with named columns (but no labelling for rows), and keeps data (of different dtypes) in a column-by-column way (and this is numpy, not pandas).
> Is there such a magic thing?
Well, that's what pandas is for...
A dict of arrays?
n
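A minimal sketch of the "dict of arrays" idea, next to a structured array for comparison. The names table and take_rows are hypothetical helpers invented for this illustration; they are not part of numpy.

```python
import numpy as np

# Hypothetical column store: one contiguous array per named column.
table = {
    'a': np.arange(5, dtype=np.int64),
    'b': np.linspace(0.0, 1.0, 5),
}

def take_rows(columns, index):
    """Apply the same NumPy index to every column and return a new dict."""
    return {name: col[index] for name, col in columns.items()}

subset = take_rows(table, table['a'] > 2)
print(subset['a'])  # [3 4]

# A structured array interleaves the fields row by row, so a single
# field is a strided view rather than a contiguous column.
rec = np.zeros(5, dtype=[('a', 'i8'), ('b', 'f8')])
print(rec['a'].strides)    # (16,)
print(table['a'].strides)  # (8,)
```

The stride of 16 bytes for rec['a'] shows that a structured array (and therefore a recarray) is stored record by record, which is the opposite of the column-by-column layout asked for above.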
Hi Nathaniel,
> pandas
Yup, the idea was to have a minimal pandas.DataFrame-like storage (I used DataFrame for a long time), but without the irritating problems with its row indexing and some other problems, such as its interaction with matplotlib.
> A dict of arrays?
That's what I started from and implemented, but at some point I decided that I was reinventing the wheel and that numpy must already have something. In principle, I can ignore this 'column-oriented' storage requirement, but it may turn out to be quite slow if the dtype's size is large.
Suggestions are welcome.
Another strange question: in general, it is assumed that once a numpy.array is created, its shape does not change. But if I want to keep the same recarray and change its dtype and/or shape, is there a way to do this?
Thanks, Alex.
Hi Alex,
2017-02-22 12:45 GMT+01:00 Alex Rogozhnikov <alex.rogozhnikov@yandex.ru>:
> Hi Nathaniel,
>> pandas
> yup, the idea was to have minimal pandas.DataFrame-like storage (which I was using for a long time), but without irritating problems with its row indexing and some other problems like interaction with matplotlib.
>> A dict of arrays?
> that's what I've started from and implemented, but at some point I decided that I'm reinventing the wheel and numpy has something already. In principle, I can ignore this 'column-oriented' storage requirement, but potentially it may turn out to be quite slowish if dtype's size is large.
> Suggestions are welcome.
You may want to try bcolz:
https://github.com/Blosc/bcolz
bcolz is a columnar store, basically what you require, but data is compressed by default even when stored in memory (although you can disable compression if you want to).
> Another strange question: in general, it is considered that once numpy.array is created, it's shape not changed. But if i want to keep the same recarray and change it's dtype and/or shape, is there a way to do this?
You can change the shapes of numpy arrays, but that usually involves copying the whole container. With bcolz you can change the length and add/delete columns without copies. If your containers are large, it is better to tell bcolz their estimated final size in advance. See:
http://bcolz.blosc.org/en/latest/opttips.html
Francesc
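A small numpy-only sketch of when a shape change needs a copy. It uses np.shares_memory to tell views from copies; the array contents are arbitrary and chosen only for illustration.

```python
import numpy as np

a = np.arange(12)

# Reshaping a contiguous array returns a view on the same buffer.
b = a.reshape(3, 4)
print(np.shares_memory(a, b))  # True

# Growing the array allocates a new buffer and copies the old data.
c = np.concatenate([a, np.arange(12, 16)])
print(np.shares_memory(a, c))  # False
print(c.shape)                 # (16,)
```

So rearranging the same elements is cheap, while changing the number of elements means a new allocation, which is the cost the reply above refers to.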
Hi Francesc, thanks a lot for your reply and for your impressive work on bcolz!
bcolz seems to put the emphasis on compression, which is not of much interest to me, but the ctable and the chunked operations look very appropriate to me now. (Of course, I'll need to test it a lot before I can say this for sure; that's my current impression.)
My strongest concern with bcolz so far is that it seems to be far from trivial to install on Windows systems, while pip provides numpy binaries for most (or all?) OSes. I haven't built pip binary wheels myself, but is it hard or impossible to produce pip-installable binaries?
> You can change shapes of numpy arrays, but that usually involves copies of the whole container.
Sure, but this is OK for me, as I plan to organize column editing in 'batches', so copying should be needed only seldom. It would be nice to see an example to understand how deep I need to go inside numpy.
Cheers, Alex.
2017-02-22 16:23 GMT+01:00 Alex Rogozhnikov <alex.rogozhnikov@yandex.ru>:
> Hi Francesc, thanks a lot for your reply and for your impressive job on bcolz!
> Bcolz seems to make stress on compression, which is not of much interest for me, but the *ctable*, and chunked operations look very appropriate to me now. (Of course, I'll need to test it much before I can say this for sure, that's current impression).
> The strongest concern with bcolz so far is that it seems to be completely non-trivial to install on windows systems, while pip provides binaries for most (or all?) OS for numpy. I didn't build pip binary wheels myself, but is it hard / impossible to cook pip-installable binaries?
http://www.lfd.uci.edu/~gohlke/pythonlibs/#bcolz
Check if the link solves the issue with installing.
2017-02-22 16:30 GMT+01:00 Kiko <kikocorreoso@gmail.com>:
> 2017-02-22 16:23 GMT+01:00 Alex Rogozhnikov <alex.rogozhnikov@yandex.ru>:
>> Hi Francesc, thanks a lot for your reply and for your impressive job on bcolz!
>> Bcolz seems to make stress on compression, which is not of much interest for me, but the *ctable*, and chunked operations look very appropriate to me now. (Of course, I'll need to test it much before I can say this for sure, that's current impression).
You can disable compression for bcolz by default too:
http://bcolz.blosc.org/en/latest/defaults.html#listofdefaultvalues%E2%80%...
>> The strongest concern with bcolz so far is that it seems to be completely non-trivial to install on windows systems, while pip provides binaries for most (or all?) OS for numpy. I didn't build pip binary wheels myself, but is it hard / impossible to cook pip-installable binaries?
> http://www.lfd.uci.edu/~gohlke/pythonlibs/#bcolz
> Check if the link solves the issue with installing.
Yeah. Also, there are binaries for conda:
http://bcolz.blosc.org/en/latest/install.html#installingfromcondaforge%E2...
>>> You can change shapes of numpy arrays, but that usually involves copies of the whole container.
>> sure, but this is ok for me, as I plan to organize column editing in 'batches', so this should require seldom copying. It would be nice to see an example to understand how deep I need to go inside numpy.
Well, if copying is not a problem for you, then you can just create a new numpy container and do the copy yourself.
Francesc
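As a rough illustration of "create a new container and copy", here is one way to add a column to a structured array. The function add_field is a hypothetical helper written for this note, not a numpy API.

```python
import numpy as np

def add_field(arr, name, values):
    """Return a copy of the structured array `arr` with one extra field."""
    values = np.asarray(values)
    if values.shape != arr.shape:
        raise ValueError('new column must match the shape of the array')
    # New dtype: all existing fields followed by the new one.
    fields = [(n, arr.dtype[n]) for n in arr.dtype.names]
    out = np.empty(arr.shape, dtype=fields + [(name, values.dtype)])
    for n in arr.dtype.names:
        out[n] = arr[n]   # copy each existing column
    out[name] = values    # fill the new column
    return out

old = np.zeros(3, dtype=[('a', 'i8'), ('b', 'f8')])
old['a'] = [1, 2, 3]
new = add_field(old, 'c', np.array([True, False, True]))
print(new.dtype.names)  # ('a', 'b', 'c')
print(new['a'])         # [1 2 3]
```

Every call copies the whole table, so batching several column additions into one new dtype keeps the number of copies small. numpy also ships numpy.lib.recfunctions.append_fields for the same purpose.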
Just as a note, Appveyor supports uploading modules to "public websites":
https://packaging.python.org/appveyor/
The main issue I see with this is that my PyPI password is stored on my machine in a plain-text file. I'm not sure whether there's a way to provide Appveyor with an SSH key instead.
Alex,
Can you please post some code showing exactly what you are trying to do and any issues you are having, particularly the "irritating problems with its row indexing and some other problems" you quote above?
On Wed, Feb 22, 2017 at 10:34 AM, Robert McLeod robbmcleod@gmail.com wrote:
Robert McLeod, Ph.D.
Center for Cellular Imaging and Nano Analytics (CCINA)
Biozentrum der Universität Basel
Mattenstrasse 26, 4058 Basel
Work: +41.061.387.3225
robert.mcleod@unibas.ch
robert.mcleod@bsse.ethz.ch
robert.mcleod@ethz.ch
robbmcleod@gmail.com
Hi Matthew, maybe this is not the best place to discuss pandas problems, but to show that I am not missing something, let's consider a simple example.
    import numpy
    import pandas
    # simplest DataFrame
    x = pandas.DataFrame(dict(a=numpy.arange(10), b=numpy.arange(10, 20)))
    # simplest indexing. Can you predict the results without looking at the comments?
    x[:2]              # returns the first two rows, as expected
    x[[0, 1]]          # returns a copy of x, the whole dataframe
    x[numpy.array(2)]  # fails with IndexError: indices are out-of-bounds (can you guess why?)
    x[[0, 1], :]       # unhashable type: list
Just in case: I know about .loc and .iloc, but when you write code with many subroutines, you concentrate on numpy inputs, and at some point you simply forget to convert some of the data to numpy. The code continues to work, but it yields wrong results (you tested everything, but only with numpy inputs). Checking all the inputs in each small subroutine is strange.
OK, a bit more:
    x[x['a'] > 5]             # works as expected
    x[x['a'] > 5, :]          # 'Series' objects are mutable, thus they cannot be hashed
    lookup = numpy.arange(10)
    x[lookup[x['a']] > 5]     # works as expected
    x[lookup[x['a']] > 5, :]  # TypeError: unhashable type: 'numpy.ndarray'
    x[lookup]['a']            # IndexError
    x['a'][lookup]            # works as expected
Now let's go a bit further and split the data into train and test sets for machine learning (again, one of the most frequent operations):
    from sklearn.model_selection import train_test_split
    x1, x2 = train_test_split(x, random_state=42)
    # compare the following operations on a column of the resulting pandas.DataFrame
    col = x1['a']
    print(col[:2])                # first two elements
    print(col[[0, 1]])            # doesn't fail (although there is no row with index 0), fills it with NaN
    print(col[numpy.arange(2)])   # same as previous
    print(col[col > 4])           # as expected
    print(col[col.values > 4])    # as expected
    print(col.values[col > 4])    # converts boolean to int, uses int indexing, but at least raises a warning
Mistakes caused by such silent misbehavior are not easy to detect (when your data pipeline consists of several steps); it is quite hard to locate the source of the problem and almost impossible to be sure that you have indeed avoided all such caveats. Code review turns into a paranoid process (if you care about the result, of course).
Things are even worse, because I've demonstrated this for my installation, and if you run this with some other pandas installation, you will probably get different results (and these were really basic operations). So things that worked fine in one version may work differently in another, and this becomes completely intractable.
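For contrast, a small sketch of the explicit accessors mentioned earlier (.iloc, .loc, and .values). The Series col and its index labels are made up for this illustration; they stand in for a column whose labels no longer start at 0, as happens after a shuffle or split.

```python
import numpy as np
import pandas as pd

# A column whose index labels are not 0..n-1 (made-up labels).
col = pd.Series(np.array([10, 11, 12, 13]), index=[7, 2, 9, 4])

by_position = col.iloc[[0, 1]]  # first two rows, whatever their labels are
by_label = col.loc[[7, 2]]      # rows whose labels are 7 and 2
as_numpy = col.values[[0, 1]]   # plain NumPy indexing on the raw data

print(by_position.tolist())  # [10, 11]
print(by_label.tolist())     # [10, 11]
print(as_numpy.tolist())     # [10, 11]
```

Each accessor states whether positions or labels are meant, so none of them has to guess. This does not remove the concern raised above: it only helps if every subroutine remembers to use them.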
Pandas may be nice if you need a report and you need to get it done by tomorrow, after which you'll throw the code away. When we initially used pandas as the main data storage in yandex/rep, it looked like a good idea, but a year later it was obvious that this was the wrong decision. When you build a data pipeline or research code that should still work several years later (on some other installation, run by someone else), the use of pandas should be minimal.
That's why I am looking for a reliable pandas substitute, which should be:
- completely consistent with numpy, and should fail when something is not implemented or is impossible
- fewer new abstractions: nobody wants to learn one-more-way-to-manipulate-the-data, especially other researchers
- it may be less convenient for interactive data munging
  - in particular, fewer methods is OK
- written code should be interpretable and hard to misinterpret
- not super slow; 1-10 gigabyte datasets are a normal situation
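To make the first requirements concrete, here is a purely illustrative sketch of such a container. ColumnTable and its rows method are hypothetical names invented for this note; this is not an existing numpy or pandas API.

```python
import numpy as np

class ColumnTable:
    """Hypothetical minimal table: named, equal-length 1D NumPy columns."""

    def __init__(self, **columns):
        self._columns = {}
        self._length = None
        for name, values in columns.items():
            self[name] = values

    def __len__(self):
        return 0 if self._length is None else self._length

    def __setitem__(self, name, values):
        values = np.asarray(values)
        if values.ndim != 1:
            raise ValueError('columns must be one-dimensional')
        if self._length is not None and len(values) != self._length:
            raise ValueError('column length does not match the table')
        self._length = len(values)
        self._columns[name] = values

    def __getitem__(self, name):
        # Only column names are accepted here; anything else fails loudly.
        if not isinstance(name, str):
            raise TypeError('use .rows(index) for row selection')
        return self._columns[name]

    def rows(self, index):
        """Select rows with ordinary NumPy indexing applied to each column."""
        return ColumnTable(**{k: v[index] for k, v in self._columns.items()})

t = ColumnTable(a=np.arange(10), b=np.arange(10, 20))
print(t.rows(t['a'] > 5)['b'])  # [16 17 18 19]
print(len(t.rows([0, 1])))      # 2
```

Row selection is delegated to numpy, so slices, integer arrays, and boolean masks behave exactly as they do on a plain array, and t[[0, 1]] raises TypeError instead of guessing what was meant.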
Well, that's it. Sorry for the long letter.
Alex.
22 февр. 2017 г., в 18:38, Matthew Harrigan harrigan.matthew@gmail.com написал(а):
Alex,
Can you please post some code showing exactly what you are trying to do and any issues you are having, particularly the "irritating problems with its row indexing and some other problems" you quote above?
On Wed, Feb 22, 2017 at 10:34 AM, Robert McLeod <robbmcleod@gmail.com> wrote:
Just as a note, Appveyor supports uploading modules to "public websites":
https://packaging.python.org/appveyor/
The main issue I would see from this is that PyPI has my password stored on my machine in a plain text file. I'm not sure whether there's a way to provide Appveyor with an SSH key instead.
On Wed, Feb 22, 2017 at 4:23 PM, Alex Rogozhnikov <alex.rogozhnikov@yandex.ru> wrote:
Hi Francesc, thanks a lot for your reply and for your impressive job on bcolz!
Bcolz seems to put its emphasis on compression, which is not of much interest to me, but the ctable and the chunked operations look very appropriate to me now. (Of course, I'll need to test it a lot before I can say this for sure; that's my current impression.)
The strongest concern with bcolz so far is that it seems to be completely non-trivial to install on Windows systems, while pip provides numpy binaries for most (or all?) OSes. I haven't built pip binary wheels myself, but is it hard / impossible to cook pip-installable binaries?
You can change shapes of numpy arrays, but that usually involves copies of the whole container.
Sure, but this is OK for me: I plan to organize column editing in 'batches', so copying should rarely be needed. It would be nice to see an example to understand how deep I need to go inside numpy.
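As an illustration of the point about copies, here is a minimal sketch of what adding a column or adding rows to a numpy structured array looks like. The array and field names (`table`, `a`, `b`, `c`) are made up for the example; `numpy.lib.recfunctions.append_fields` and `numpy.concatenate` are existing numpy functions, and both build a new array rather than modifying the old one.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# A structured array with two named fields (hypothetical example data).
table = np.zeros(5, dtype=[("a", np.int64), ("b", np.float64)])
table["a"] = np.arange(5)

# Adding a field cannot be done in place: append_fields builds a new array.
new_col = np.linspace(0.0, 1.0, 5)
wider = rfn.append_fields(table, "c", new_col, usemask=False)

print(wider.dtype.names)               # field names of the new array
print(np.shares_memory(table, wider))  # whether the old buffer was reused

# Growing the number of rows also allocates a new buffer.
longer = np.concatenate([table, table])
print(longer.shape, np.shares_memory(table, longer))
```

If edits are grouped into batches as described above, each batch costs one such copy rather than one copy per added column.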
Cheers, Alex.
On 22 Feb 2017, at 17:03, Francesc Alted <faltet@gmail.com> wrote:
Hi Alex,
2017-02-22 12:45 GMT+01:00 Alex Rogozhnikov <alex.rogozhnikov@yandex.ru>:
Hi Nathaniel,
pandas
yup, the idea was to have minimal pandas.DataFrame-like storage (which I was using for a long time), but without the irritating problems with its row indexing and some other problems like interaction with matplotlib.
A dict of arrays?
that's what I started from and implemented, but at some point I decided that I was reinventing the wheel and numpy already has something. In principle, I can ignore this 'column-oriented' storage requirement, but potentially it may turn out to be quite slow if the dtype's size is large.
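For readers wondering what the "dict of arrays" option amounts to, here is a minimal sketch. The class name `ColumnTable` and its `rows` method are hypothetical, invented for this illustration; they are not an existing numpy API and not the implementation mentioned above. Each column is kept as its own 1-D numpy array, row selection is handed straight to numpy, and anything else raises an error.

```python
import numpy as np


class ColumnTable:
    """Hypothetical minimal table: named 1-D numpy columns of equal length."""

    def __init__(self, **columns):
        self.columns = {name: np.asarray(col) for name, col in columns.items()}
        shapes = {col.shape for col in self.columns.values()}
        if any(col.ndim != 1 for col in self.columns.values()) or len(shapes) > 1:
            raise ValueError("all columns must be 1-D arrays of the same length")

    def __len__(self):
        return len(next(iter(self.columns.values()))) if self.columns else 0

    def __getitem__(self, name):
        # Only column lookup by name is supported; anything else is an error.
        if not isinstance(name, str):
            raise TypeError("use table['name'] for columns and table.rows(idx) for rows")
        return self.columns[name]

    def rows(self, index):
        # Row selection is delegated to numpy, so numpy's indexing rules apply.
        return ColumnTable(**{k: v[index] for k, v in self.columns.items()})


t = ColumnTable(a=np.arange(10), b=np.arange(10, 20).astype(float))
subset = t.rows(t["a"] > 5)
print(len(subset), subset["b"])
```

The design choice here is to keep column access and row access as two separate, explicit operations, so an expression like `t[[0, 1]]` fails instead of being guessed at.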
Suggestions are welcome.
You may want to try bcolz:
https://github.com/Blosc/bcolz
bcolz is a columnar storage, basically as you require, but data is compressed by default even when stored in-memory (although you can disable compression if you want to).
Another strange question: in general, it is considered that once a numpy.array is created, its shape does not change. But if I want to keep the same recarray and change its dtype and/or shape, is there a way to do this?
You can change shapes of numpy arrays, but that usually involves copies of the whole container. With bcolz you can change length and add/del columns without copies. If your containers are large, it is better to inform bcolz of their final estimated size. See:
http://bcolz.blosc.org/en/latest/opttips.html
Francesc
Thanks, Alex.
On 22 Feb 2017, at 3:53, Nathaniel Smith <njs@pobox.com> wrote:
On Feb 21, 2017 3:24 PM, "Alex Rogozhnikov" <alex.rogozhnikov@yandex.ru> wrote:
Ah, got it. Thanks, Chris! I thought a recarray could only be one-dimensional (like tables with named columns).
Maybe it's better to ask directly what I was looking for: something that works like a table with named columns (but no labelling for rows), and keeps data (of different dtypes) in a column-by-column way (and this is numpy, not pandas).
Is there such a magic thing?
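To make the column-by-column requirement concrete, here is a small sketch of how a recarray-style structured array is laid out in memory. It reuses the `('a', int), ('b', float)` dtype from the start of the thread, written with explicit 8-byte types as an assumption so the sizes are the same on every platform. Records are stored one after another, so a single field is a strided view over interleaved records rather than a packed column.

```python
import numpy as np

rec = np.zeros(1000, dtype=[("a", np.int64), ("b", np.float64)])

# Each record holds 'a' followed by 'b'.
print(rec.dtype.itemsize)

# A field is a strided view into the interleaved records, not a packed column.
col = rec["a"]
print(col.strides, col.flags["C_CONTIGUOUS"])

# Getting a packed column requires an explicit copy.
packed = np.ascontiguousarray(col)
print(packed.strides, np.shares_memory(packed, rec))
```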
Well, that's what pandas is for...
A dict of arrays?
-n
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion
-- Francesc Alted
-- Robert McLeod, Ph.D. Center for Cellular Imaging and Nano Analytics (CCINA) Biozentrum der Universität Basel Mattenstrasse 26, 4058 Basel Work: +41.061.387.3225 robert.mcleod@unibas.ch robert.mcleod@bsse.ethz.ch robbmcleod@gmail.com
On Wed, Feb 22, 2017 at 11:57 AM, Alex Rogozhnikov < alex.rogozhnikov@yandex.ru> wrote:
Hi Matthew, maybe it is not the best place to discuss problems of pandas, but to show that I am not missing something, let's consider a simple example.
    # simplest DataFrame
    x = pandas.DataFrame(dict(a=numpy.arange(10), b=numpy.arange(10, 20)))
    # simplest indexing. Can you predict results without looking at comments?
    x[:2]              # returns two first rows, as expected
    x[[0, 1]]          # returns copy of x, whole dataframe
    x[numpy.array(2)]  # fails with IndexError: indices are out-of-bounds (can you guess why?)
    x[[0, 1], :]       # unhashable type: list
Just in case: I know about .loc and .iloc, but when you write code with many subroutines, you concentrate on numpy inputs, and at some point you simply *forget* to convert some of the data you operated with to numpy, and it *continues* to work but yields wrong results (you tested everything, but you tested it with numpy inputs). Checking all the inputs in each small subroutine is strange.
Ok, a bit more:
    x[x['a'] > 5]             # works as expected
    x[x['a'] > 5, :]          # 'Series' objects are mutable, thus they cannot be hashed
    lookup = numpy.arange(10)
    x[lookup[x['a']] > 5]     # works as expected
    x[lookup[x['a']] > 5, :]  # TypeError: unhashable type: 'numpy.ndarray'
    x[lookup]['a']            # IndexError
    x['a'][lookup]            # works as expected
Now let's go a bit further: split the data into train/test for machine learning (again, a most frequent operation).
    from sklearn.model_selection import train_test_split
    x1, x2 = train_test_split(x, random_state=42)
    # compare the next operations with pandas.DataFrame
    col = x1['a']
    print(col[:2])               # first two elements
    print(col[[0, 1]])           # doesn't fail (while there is no row with index 0), fills it with NaN
    print(col[numpy.arange(2)])  # same as previous
    print(col[col > 4])          # as expected
    print(col[col.values > 4])   # as expected
    print(col.values[col > 4])   # converts boolean to int, uses int indexing, but at least raises a warning
Mistakes caused by such silent misbehavior are not easy to detect (when your data pipeline consists of several steps), the source of the problem is quite hard to locate, and it is almost impossible to be sure that you have indeed avoided all such caveats. Code review turns into a paranoid process (if you care about the result, of course).
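For comparison, here is a minimal sketch of the explicit accessors mentioned earlier (`.iloc` for positions, `.loc` for labels) on a frame whose row labels no longer match row positions. The particular row order `[7, 3, 9, 0]` is arbitrary and chosen only for illustration; `to_numpy()` assumes a reasonably recent pandas (older versions would use `.values`).

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({"a": np.arange(10), "b": np.arange(10, 20)})

# Pick rows out of order so that labels and positions no longer coincide.
shuffled = x.iloc[[7, 3, 9, 0]]
col = shuffled["a"]

first_two_by_position = col.iloc[[0, 1]]  # rows at positions 0 and 1
row_with_label_0 = col.loc[0]             # the row whose index label is 0

# Dropping to numpy removes the labels, so plain numpy indexing rules apply.
values = col.to_numpy()

print(first_two_by_position.tolist())
print(row_with_label_0)
print(values[[0, 1]])
```

This does not remove the concern raised above, since the plain `col[...]` form still exists and still guesses; it only shows what the unambiguous spellings look like.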
Things are even worse: I've demonstrated this for my installation, and if you run it with some other pandas installation you will probably get different results (and these were really basic operations). So things that worked OK in one version may work differently in another, and this becomes completely intractable.
Pandas may be nice if you need a report and you need to get it done tomorrow; then you'll throw away the code. When we initially used pandas as the main data storage in yandex/rep, it looked like a good idea, but a year later it was obvious this was the wrong decision. When you build a data pipeline / research that should still work several years later (using some other installation, by someone else), usage of pandas should be *minimal*.
That's why I am looking for a reliable pandas substitute, which should be:
- completely consistent with numpy, and should fail when something isn't implemented / is impossible
- fewer new abstractions: nobody wants to learn one-more-way-to-manipulate-the-data, specifically other researchers
- it may be less convenient for interactive data munging
- in particular, fewer methods is OK
- written code should be interpretable and hard to misinterpret
- not super slow: 1-10 gigabyte datasets are a normal situation
Just to the pandas part
statsmodels supported pandas almost from the very beginning (or maybe after 1.5 years) when the new pandas was still very young.
However, what I insisted on is that pandas is in the wrapper/interface code, and internally only numpy arrays are used. Besides the confusing "magic" indexing of early pandas, there were a lot of details that silently produced different results, e.g. default iteration over axis=1, and ddof=1 in std and var instead of numpy's ddof=0.
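The ddof difference is easy to reproduce. A minimal sketch with made-up data: `numpy.std` defaults to `ddof=0` (population formula) while `pandas.Series.std` defaults to `ddof=1` (sample formula), so the same numbers give two different standard deviations unless `ddof` is passed explicitly.

```python
import numpy as np
import pandas as pd

data = np.array([1.0, 2.0, 3.0, 4.0])
series = pd.Series(data)

numpy_std = np.std(data)    # ddof=0 by default
pandas_std = series.std()   # ddof=1 by default

print(numpy_std, pandas_std)

# Passing ddof explicitly makes the two agree.
print(np.isclose(np.std(data, ddof=1), pandas_std))
print(np.isclose(series.std(ddof=0), numpy_std))
```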
Essentially, every interface corresponds to np.asarray, but we store the DataFrame information, mainly the index and column names, so we can return the appropriate pandas object if a pandas object was used for the input.
This has worked pretty well. Users can have their dataframes, and we have pure numpy algorithms.
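The wrapper pattern described here can be sketched roughly as follows. This is not statsmodels code; the function name `center_columns` and the example data are made up, and it only illustrates the idea of converting at the boundary, computing on a plain ndarray, and re-attaching the index and column names on the way out.

```python
import numpy as np
import pandas as pd


def center_columns(data):
    """Subtract each column's mean; hypothetical example of the wrapper pattern."""
    # Remember the pandas metadata, if there is any.
    is_frame = isinstance(data, pd.DataFrame)
    index = data.index if is_frame else None
    columns = data.columns if is_frame else None

    # The algorithm itself only ever sees a plain float ndarray.
    values = np.asarray(data, dtype=float)
    result = values - values.mean(axis=0)

    # Give pandas users a pandas object back, numpy users an ndarray.
    if is_frame:
        return pd.DataFrame(result, index=index, columns=columns)
    return result


frame = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]}, index=["x", "y", "z"]
)
centered_frame = center_columns(frame)
centered_array = center_columns(frame.to_numpy())
print(centered_frame)
print(centered_array)
```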
Recently we have started to use pandas inside a few functions or classes that are less tightly integrated into the overall setup. We also use pandas for some things that are not convenient or not available in numpy. Our internal use of pandas groupby and similar will most likely increase over time. (One of the main issues we had was date and time index because that was a moving target in both numpy and pandas.)
One issue for computational efficiency that we do not control is whether `asarray` creates a view or needs to make a copy because that depends on whether the dtype and memory layout that the user has in the data frame corresponds to what we need in the algorithms. If it matches, then no copies should be made except where explicitly needed.
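The view-versus-copy behaviour can be checked on plain numpy arrays with `numpy.shares_memory`. A minimal sketch with made-up data follows; how a particular DataFrame maps onto this depends on its dtypes and on the pandas version, which the example does not try to cover.

```python
import numpy as np

user_data = np.arange(6, dtype=np.float64).reshape(3, 2)

# dtype and layout already match: asarray hands back the same array, no copy.
same = np.asarray(user_data, dtype=np.float64)
print(same is user_data, np.shares_memory(same, user_data))

# A different dtype forces a conversion, which allocates a new buffer.
ints = np.arange(6, dtype=np.int64).reshape(3, 2)
converted = np.asarray(ints, dtype=np.float64)
print(np.shares_memory(converted, ints))

# Requesting Fortran order for a C-ordered 2-D array also forces a copy.
fortran = np.asarray(user_data, order="F")
print(fortran.flags["F_CONTIGUOUS"], np.shares_memory(fortran, user_data))
```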
The intention is to extend this over time to other array structures like xarray and likely dask arrays.
Josef
On 22 Feb 2017, at 20:39, josef.pktd@gmail.com wrote:
Essentially, every interface corresponds to np.asarray, but we store the DataFrame information, mainly the index and column names, so we can return the appropriate pandas object if a pandas object was used for the input.
Yes, it seems to be the best practice.
But apart from libraries, there is lots of code for my research / research in my team, and we don't make such checks all the time; moreover, many functions are intended to operate on DataFrames (and use particular feature names). So the approach is not completely applicable to research code, which is very diverse and has many functions that are used 2-3 times. It is irrational to make all the code more complex to protect yourself from one library, because all the benefit is lost (and as for the user of a package, you'll need checks anyway to protect them from passing something inappropriate).
On Wed, Feb 22, 2017 at 8:57 AM, Alex Rogozhnikov < alex.rogozhnikov@yandex.ru> wrote:
Pandas may be nice if you need a report and you need to get it done tomorrow; then you'll throw away the code. When we initially used pandas as the main data storage in yandex/rep, it looked like a good idea, but a year later it was obvious this was the wrong decision. When you build a data pipeline / research that should still work several years later (using some other installation, by someone else), usage of pandas should be *minimal*.
The pandas development team (myself included) is well aware of these issues. There are long term plans/hopes to fix this, but there's a lot of work to be done and some hard choices to make:
https://github.com/pandas-dev/pandas/issues/10000
https://github.com/pandas-dev/pandas/issues/13862
That's why I am looking for a reliable pandas substitute, which should be:
- completely consistent with numpy, and should fail when something isn't implemented / is impossible
- fewer new abstractions: nobody wants to learn one-more-way-to-manipulate-the-data, specifically other researchers
- it may be less convenient for interactive data munging
- in particular, fewer methods is OK
- written code should be interpretable and hard to misinterpret
- not super slow: 1-10 gigabyte datasets are a normal situation
This has some overlap with our motivations for writing Xarray ( http://xarray.pydata.org), so I encourage you to take a look. It still might be more complex than you're looking for, but we did try to clean up the really ambiguous APIs from pandas like indexing.
Hi Stephan, thanks for the note. The progress over the last two years wasn't impressive IMO, but I hope you'll manage.
As you suggest, I'll have a look at xarray too, as I see xarray.Dataset. I was sure that it doesn't work with non-homogeneous data at all; clearly I need to refresh my opinion.
participants (9)
Alex Rogozhnikov
Chris Barker
Francesc Alted
josef.pktd＠gmail.com
Kiko
Matthew Harrigan
Nathaniel Smith
Robert McLeod
Stephan Hoyer