Hey guys,
A few days with folks at my first PyCon have made me wonder how many cool things I have been missing.
I am looking to do some quick catching up on numpy and am wondering whether there is a set of videos I can refer to. I learn more quickly from videos, so I would appreciate it if you could point me to anything available. It would be a great help.
Thanks!
-Abhi
On Mon, Mar 12, 2012 at 6:04 PM, Abhishek Pratap <apratap@lbl.gov> wrote:
Hey Guys
Few days with folks at my first pycon has made me wonder how much of cool things I was missing ..
I am looking to do some quick catch up on numpy and wondering if there are any set of videos that I can refer to. I learn quicker seeing videos and would appreciate if you guys can point me to anything available it will be of great help.
You'll find a lot of videos here. The tutorials in particular may interest you from past conferences.
http://conference.scipy.org/index.html
Oddly though it doesn't look like there's a straight link to the 2011 conference there.
http://conference.scipy.org/scipy2011/
Skipper
Abhi,
One thing I would suggest is to tackle numpy with a particular focus. Once you've gotten the basics down through tutorials and videos, do you have a research project in mind to use with numpy?
On Mon, Mar 12, 2012 at 6:08 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
You'll find a lot of videos here. The tutorials in particular may interest you from past conferences.
http://conference.scipy.org/index.html
Oddly though it doesn't look like there's a straight link to the 2011 conference there.
http://conference.scipy.org/scipy2011/
Skipper _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Super awesome. I love how the Python community in general keeps the recordings available for free.
@Adam: I do have some problems that I can hit with numpy, mainly big-data based. In summary, I have millions to billions of rows of biological data on which I want to run some computation while also being able to do quick lookups. I am not sure numpy is applicable for quick lookups by a string-based key, is it?
-Abhi
On Mon, Mar 12, 2012 at 3:18 PM, Adam Hughes <hugadams@gwmail.gwu.edu> wrote:
Abhi,
One thing I would suggest is to tackle numpy with a particular focus. Once you've gotten the basics down through tutorials and videos, do you have a research project in mind to use with numpy?
This is probably quite a common problem, so I'd be interested to hear others chime in. I am referring to lookup and storage of numpy data. Your implementation will of course be unique, but there are several avenues that you can consider. Here is how I handle a similar problem.
Imagine I have data, probably similar to yours, that mixes qualitative data (maybe biological or experimental parameters and other things) with numerical data. I would define a dictionary that stores both under a unique key. In my work, I use the original file that all the information was taken from as my key. So for example:
{key: (file_info, data_array)}, where data_array is a numpy array with dtype='float'
The value of each item in the dictionary is split so that the information and the actual data arrays are kept separate. Notice my use of dtype: it is also possible to build your own numpy data type, which gives you a bit more flexibility for storing your data. This is very useful if your data is not all that standardized, or if you want to quickly look up data by name. For example, if you have a column in your file called "counts" and you later want to access it, a custom data type will let you do this with ease. Anyway, you can read into that later.
This storage layout is also highly useful if you need to build new data structures later. For example, if you want to plot all of your data in a multiplot, you can write a method that takes this object and returns the formatted multi-array data, as well as any axis arrays that can be extracted from it. Generally, once you have this object built, any other representation of the data that you need can be derived from it.
This approach is useful to me, but may not be ideal if your dataset is so large that you cannot afford to have several data structures holding it simultaneously in your code.
On Mon, Mar 12, 2012 at 6:23 PM, Abhishek Pratap <apratap@lbl.gov> wrote:
Super awesome. I love how the python community in general keeps the recordings available for free.
@Adam : I do have some problems that I can hit numpy with, mainly bigData based. So in summary I have millions/billions of rows of biological data on which I want to run some computation but at the same time have a capability to do quick lookup. I am not sure if numpy will be applicable for quick lookups by a string based key right ??
-Abhi
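The dictionary layout and custom data type that Adam describes could look roughly like the sketch below. It is only an illustration: the file names, the metadata, and the field names "position" and "counts" are all made up here, and the real layout would depend on the data.

```python
import numpy as np

# Hypothetical record layout: the field names are illustrative only.
record_dtype = np.dtype([("position", "f8"), ("counts", "i8")])


def make_entry(file_info, rows):
    """Pair qualitative metadata with a structured numeric array."""
    return (file_info, np.array(rows, dtype=record_dtype))


# The dictionary is keyed by the source file name (an assumed choice of key).
store = {
    "sample_a.txt": make_entry({"organism": "E. coli"}, [(0.5, 10), (1.5, 12)]),
    "sample_b.txt": make_entry({"organism": "yeast"}, [(0.5, 7)]),
}

info, data = store["sample_a.txt"]         # lookup by string key
total_counts = int(data["counts"].sum())   # column access by field name
print(info["organism"], total_counts)
```

The dictionary gives the string-keyed lookup, while the structured array keeps the numeric columns addressable by name. Everything still lives in memory, which is the limitation Adam mentions for very large datasets.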
On Mar 12, 2012, at 5:23 PM, Abhishek Pratap wrote:
Super awesome. I love how the python community in general keeps the recordings available for free.
@Adam : I do have some problems that I can hit numpy with, mainly bigData based. So in summary I have millions/billions of rows of biological data on which I want to run some computation but at the same time have a capability to do quick lookup. I am not sure if numpy will be applicable for quick lookups by a string based key right ??
PyTables does precisely that. It allows you to do out-of-core operations with large arrays, store tables with an unlimited number of rows on disk and, by using its integrated indexing engine (OPSI), perform quick lookups based on strings (or any other type). Look into these examples:
http://www.pytables.org/moin/HowToUse#Selectingvalues
HTH,
-- Francesc Alted
On 12.03.2012 23:23, Abhishek Pratap wrote:
Super awesome. I love how the python community in general keeps the recordings available for free.
@Adam : I do have some problems that I can hit numpy with, mainly bigData based. So in summary I have millions/billions of rows of biological data on which I want to run some computation but at the same time have a capability to do quick lookup. I am not sure if numpy will be applicable for quick lookups by a string based key right ??
Jason Kinser's book on Python for bioinformatics might be of interest. Though I don't always agree with his NumPy coding style.
As for "big data", it is a problem regardless of language. The HDF5 library might be of help (cf. PyTables or h5py, I actually prefer the latter).
With a 64-bit system it is also possible to memory map a temporary file, and tell the OS to keep as much of it in memory as possible. That way we can "fake" more RAM than we actually have. (The Linux equivalent of the code in bigmem.c would be to mmap from tmpfs.)
A use case for bigmem.c is e.g. if you need to use 10 tables that are each 1-2 GB in size, but only have 4 GB of RAM on the desktop computer.
Sturla
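The memory-mapping idea can be tried from NumPy itself with numpy.memmap. This is a minimal sketch and not the bigmem.c code referred to above, which is not shown in the thread. The file name and the tiny array shape are placeholders, and nothing here asks the OS to pin pages in memory; it only shows an array backed by a file rather than by RAM.

```python
import os
import tempfile

import numpy as np

# Back the array with a temporary file instead of RAM. The shape is tiny for
# illustration; the thread is about tables of 1-2 GB each.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "table.dat")   # hypothetical file name

table = np.memmap(path, dtype="f8", mode="w+", shape=(1000, 4))
table[10, 2] = 3.5      # the write lands in the mapped pages
table.flush()           # push modified pages out to the file

# Reopen read-only: the OS loads only the pages that are actually touched.
view = np.memmap(path, dtype="f8", mode="r", shape=(1000, 4))
value = float(view[10, 2])
print(value)
```

On Linux, pointing the temporary directory at a tmpfs mount would presumably be the closest match to what Sturla describes, but that choice is an assumption and depends on the system.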
On Mar 13, 2012, at 7:31 AM, Sturla Molden wrote:
Jason Kinser's book on Python for bioinformatics might be of interest. Though I don't always agree with his NumPy coding style.
As for "big data", it is a problem regardless of language. The HDF5 library might be of help (cf. PyTables or h5py, I actually prefer the latter).
Yes, however IMO PyTables adapts better to the OP's lookup use case. For example, let's suppose a very simple key-value problem, where we need to locate a certain value by using a key. Using h5py I get:
In [1]: import numpy as np
In [2]: N = 100*1000
In [3]: sa = np.fromiter((('key'+str(i), i) for i in xrange(N)), dtype="S8,i4")
In [4]: import h5py
In [5]: f = h5py.File('h5py.h5', 'w')
In [6]: d = f.create_dataset('sa', data=sa)
In [7]: time [val for val in d if val[0] == 'key500']
CPU times: user 28.34 s, sys: 0.06 s, total: 28.40 s
Wall time: 29.25 s
Out[7]: [('key500', 500)]
Another option is to use fancy selection:
In [8]: time d[d['f0']=='key500']
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
Out[8]: array([('key500', 500)], dtype=[('f0', 'S8'), ('f1', '<i4')])
Hmm, time resolution is too poor here. Let's use the %timeit magic:
In [9]: timeit d[d['f0']=='key500']
100 loops, best of 3: 9.3 ms per loop
which is much better. But, in this case you need to load the column d['f0'] completely in-memory, and this is *not* what you want when you have large tables that do not fit in-memory.
Using PyTables:
In [10]: import tables
In [11]: ft = tables.openFile('pytables.h5', 'w')
In [12]: dt = ft.createTable(ft.root, 'sa', sa)
In [13]: time [val[:] for val in dt if val[0] == 'key500']
CPU times: user 0.04 s, sys: 0.01 s, total: 0.05 s
Wall time: 0.04 s
Out[13]: [('key500', 500)]
That's a speed-up of several hundred times compared with iterating over the h5py dataset (0.04 s versus 29.25 s wall time). But, in addition, PyTables has specific machinery to optimize these queries by using numexpr behind the scenes:
In [14]: time [val[:] for val in dt.where("f0=='key500'")]
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.00 s
Out[14]: [('key500', 500)]
Again, time resolution is too poor here. Let's use the timeit magic:
In [15]: timeit [val[:] for val in dt.where("f0=='key500'")]
100 loops, best of 3: 2.36 ms per loop
This is an additional 10x speed-up. In fact, this is almost as fast as performing the query using NumPy directly:
In [16]: timeit sa[sa['f0']=='key500']
100 loops, best of 3: 2.14 ms per loop
with the difference that PyTables uses an out-of-core paradigm (i.e. it does not need to load the datasets completely in-memory). And finally, PyTables does support true indexing capabilities, so that you do not have to read the complete dataset to get results:
In [17]: dt.cols.f0.createIndex()
Out[17]: 100000
In [18]: timeit [val[:] for val in dt.where("f0=='key500'")]
1000 loops, best of 3: 213 us per loop
which accounts for another 10x speed-up. Of course, this speed-up can be *much* larger for bigger datasets, and especially for those that do not fit in-memory. See:
http://pytables.github.com/usersguide/optimization.html#accelerating-your-se...
for a more detailed rationale and benchmarks on big datasets.
-- Francesc Alted
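Why an index helps can be illustrated in plain NumPy. The sketch below is not how the PyTables indexing engine works internally; it only contrasts a full scan with a lookup on keys that were sorted once in advance. The array mirrors `sa` from the session above but is built for Python 3 (the session uses Python 2's `xrange`), and the helper name `lookup` is made up for this example.

```python
import numpy as np

N = 100 * 1000
# Same layout as `sa` in the session: a string key and an integer value.
sa = np.zeros(N, dtype=[("f0", "S8"), ("f1", "i4")])
sa["f0"] = [("key%d" % i).encode() for i in range(N)]
sa["f1"] = np.arange(N)

# Full scan: every key is compared on every query.
scan_hit = sa[sa["f0"] == b"key500"]

# Index: sort the keys once, then binary-search them on each query.
order = np.argsort(sa["f0"], kind="stable")
sorted_keys = sa["f0"][order]


def lookup(key):
    """Return all rows whose f0 equals `key`, using the sorted index."""
    probe = np.asarray(key, dtype=sorted_keys.dtype)
    lo = np.searchsorted(sorted_keys, probe, side="left")
    hi = np.searchsorted(sorted_keys, probe, side="right")
    return sa[order[lo:hi]]


index_hit = lookup(b"key500")
print(index_hit)
```

The scan touches all N keys per query, while the binary search needs on the order of log2(N) comparisons. Unlike PyTables, this sketch keeps both the data and the index in memory, so it does not address the out-of-core part of the problem.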
Thanks guys, very handy examples by Francesc. I need to bookmark them until I reach this point.
best,
-Abhi
On Tue, Mar 13, 2012 at 9:24 AM, Francesc Alted <francesc@continuum.io> wrote:
Yes, however IMO PyTables adapts better to the OP's lookup use case. For example, let's suppose a very simple key-value problem, where we need to locate a certain value by using a key. Using h5py I get:
In [1]: import numpy as np
In [2]: N = 100*1000
In [3]: sa = np.fromiter((('key'+str(i), i) for i in xrange(N)), dtype="S8,i4")
In [4]: import h5py
In [5]: f = h5py.File('h5py.h5', 'w')
In [6]: d = f.create_dataset('sa', data=sa)
In [7]: time [val for val in d if val[0] == 'key500']
CPU times: user 28.34 s, sys: 0.06 s, total: 28.40 s
Wall time: 29.25 s
Out[7]: [('key500', 500)]
Another option is to use fancy selection:
In [8]: time d[d['f0']=='key500']
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
Out[8]: array([('key500', 500)], dtype=[('f0', 'S8'), ('f1', '<i4')])
Hmm, time resolution is too poor here. Let's use the %timeit magic:
In [9]: timeit d[d['f0']=='key500']
100 loops, best of 3: 9.3 ms per loop
which is much better. But, in this case you need to load the column d['f0'] completely in-memory, and this is *not* what you want when you have large tables that do not fit in-memory.
Using PyTables:
In [10]: import tables
In [11]: ft = tables.openFile('pytables.h5', 'w')
In [12]: dt = ft.createTable(ft.root, 'sa', sa)
In [13]: time [val[:] for val in dt if val[0] == 'key500']
CPU times: user 0.04 s, sys: 0.01 s, total: 0.05 s
Wall time: 0.04 s
Out[13]: [('key500', 500)]
That's a speed-up of several hundred times compared with iterating over the h5py dataset (0.04 s versus 29.25 s wall time). But, in addition, PyTables has specific machinery to optimize these queries by using numexpr behind the scenes:
In [14]: time [val[:] for val in dt.where("f0=='key500'")]
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.00 s
Out[14]: [('key500', 500)]
Again, time resolution is too poor here. Let's use the timeit magic:
In [15]: timeit [val[:] for val in dt.where("f0=='key500'")]
100 loops, best of 3: 2.36 ms per loop
This is an additional 10x speed-up. In fact, this is almost as fast as performing the query using NumPy directly:
In [16]: timeit sa[sa['f0']=='key500']
100 loops, best of 3: 2.14 ms per loop
with the difference that PyTables uses an out-of-core paradigm (i.e. it does not need to load the datasets completely in-memory). And finally, PyTables does support true indexing capabilities, so that you do not have to read the complete dataset to get results:
In [17]: dt.cols.f0.createIndex()
Out[17]: 100000
In [18]: timeit [val[:] for val in dt.where("f0=='key500'")]
1000 loops, best of 3: 213 us per loop
which accounts for another 10x speed-up. Of course, this speed-up can be *much* larger for bigger datasets, and especially for those that do not fit in-memory. See:
http://pytables.github.com/usersguide/optimization.html#accelerating-your-se...
for a more detailed rationale and benchmarks on big datasets.
-- Francesc Alted
participants (5)
- Abhishek Pratap
- Adam Hughes
- Francesc Alted
- Skipper Seabold
- Sturla Molden