Easier way to do this?
Thomas Jollans
tjol at tjol.eu
Wed Oct 4 18:00:32 EDT 2017
On 04/10/17 22:47, Fabien wrote:
> On 10/04/2017 10:11 PM, Thomas Jollans wrote:
>> Be warned, pandas is part of the scientific python stack, which is
>> immensely powerful and popular, but it does have a distinctive style
>> that may appear cryptic if you're used to the way the rest of the world
>> writes Python.
>
> Can you elaborate on this one? As a scientist, I am curious ;-)
Sure.
Python is GREAT at iterating. Generators are everywhere. Everyone loves
for loops. List comprehensions and generator expressions are star
features. filter and map are builtins. reduce used to be a builtin, even
though almost nobody really understood what it did.
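For anyone unsure what those tools look like side by side, here is a minimal sketch. The numbers are made up for illustration and are not from the thread. In Python 3, reduce lives in functools.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]  # illustrative data, not from the thread

# List comprehension: build a new list by iterating explicitly.
squares = [n * n for n in numbers]

# Generator expression: same idea, but values are produced lazily.
lazy_squares = (n * n for n in numbers)

# filter and map are builtins; in Python 3 both return lazy iterators.
evens = list(filter(lambda n: n % 2 == 0, numbers))
doubled = list(map(lambda n: n * 2, numbers))

# reduce folds a sequence into one value: ((((1 + 2) + 3) + 4) + 5)
total = reduce(lambda acc, n: acc + n, numbers)
```

In every case the iteration is spelled out in the code, which is the point of contrast with what follows.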
In [1]: import numpy as np
In the world of numpy (and the greater scientific stack), you don't
iterate. You don't write for loops. You have a million floats in memory
that you want to do math on - you don't want to wait for ten million
calls to __class__.__dict__['__getattr__']('__add__').__call__() or
whatever to run. In numpy land, numpy writes your loops for you. In
FORTRAN. (well ... probably C)
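A rough sketch of the difference, assuming nothing beyond numpy itself. The function names add_one_loop and add_one_vectorized are hypothetical, chosen only for this illustration. Both compute the same thing; only the second lets numpy run the loop in compiled code.

```python
import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# Plain-Python style: one interpreted iteration per element.
def add_one_loop(arr):
    out = np.empty_like(arr)
    for i in range(len(arr)):
        out[i] = arr[i] + 1.0
    return out

# numpy style: the loop over all elements runs inside compiled code.
def add_one_vectorized(arr):
    return arr + 1.0

slow = add_one_loop(values[:1000])  # small slice to keep the demo quick
fast = add_one_vectorized(values)   # all one million elements at once
```

Timing the two on the full array (for example with %timeit in IPython) should make the gap obvious, though the exact numbers depend on the machine.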
As I see it the main cultural difference between "traditional" Python
and numpy-Python is that numpy implicitly iterates over arrays all the
time. Python never implicitly iterates. Python is not MATLAB.
In [2]: np.array([1, 2, 3]) + np.array([-3, -2, -1])
Out[2]: array([-2, 0, 2])
In [3]: [1, 2, 3] + [-3, -2, -1]
Out[3]: [1, 2, 3, -3, -2, -1]
In numpy, operators don't mean what you think they mean.
In [4]: a = (np.random.rand(30) * 10).astype(np.int64)
In [5]: a
Out[5]:
array([6, 1, 6, 9, 1, 0, 3, 5, 8, 5, 2, 6, 1, 1, 2, 2, 4, 2, 4, 2, 5, 3, 7,
       8, 2, 5, 8, 1, 0, 8])
In [6]: a > 5
Out[6]:
array([ True, False,  True,  True, False, False, False, False,  True,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False,  True,  True, False, False,  True,
       False, False,  True], dtype=bool)
In [7]: list(a) > 5
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-0c10c9961870> in <module>()
----> 1 list(a) > 5
TypeError: unorderable types: list() > int()
Suddenly, you can even compare sequences and scalars! And > no longer
gives you a bool! Madness!
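One practical consequence, sketched here with a shortened copy of the array above: because the comparison returns an array, it cannot be used directly as an if condition. numpy raises ValueError and asks you to say which reduction you mean.

```python
import numpy as np

a = np.array([6, 1, 6, 9, 1, 0, 3, 5, 8, 5])
mask = a > 5  # elementwise comparison: an array of bools, not one bool

# Using the array as a condition raises: its truth value is ambiguous.
try:
    if mask:
        pass
    raised = False
except ValueError:
    raised = True

# Say explicitly which reduction you mean.
any_big = bool(mask.any())   # is at least one element > 5?
all_big = bool(mask.all())   # are all elements > 5?
count_big = int(mask.sum())  # True counts as 1
```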
Now, none of this, so far, has been ALL THAT cryptic as far as I can
tell. It's when you do more complicated things, and start combining
different parts of the numpy toolbox, that it becomes clear that
numpy-Python is kind of a different language.
In [8]: a[(np.sqrt(a).astype(int)**2 == a) & (a < 5)]
Out[8]: array([1, 1, 0, 1, 1, 4, 4, 1, 0])
In [9]: import math
In [10]: [i for i in a if int(math.sqrt(i))**2 == i and i < 5]
Out[10]: [1, 1, 0, 1, 1, 4, 4, 1, 0]
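To unpack the one-liner in In [8], here is the same computation split into named steps, using the first fifteen values of the array above. The intermediate names (roots, is_square, is_small, mask) are mine, added only for readability.

```python
import numpy as np

a = np.array([6, 1, 6, 9, 1, 0, 3, 5, 8, 5, 2, 6, 1, 1, 2])

roots = np.sqrt(a).astype(int)  # integer part of each square root
is_square = roots**2 == a       # True where the element is a perfect square
is_small = a < 5                # True where the element is below 5

# & combines the two masks elementwise; `and` would raise ValueError here.
mask = is_square & is_small

# Indexing with a boolean array keeps only the elements where mask is True.
result = a[mask]
```

The parentheses in the original expression matter: & binds more tightly than == and <, so each comparison has to be wrapped before the masks are combined.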
Look at my pandas example from my previous post. If you're a
Python-using scientist, even if you're not very familiar with pandas,
you'll probably be able to see more or less how it works. I imagine that
there are plenty of experienced Pythonistas on this list who never need
to deal with large amounts of numeric data and who are completely
nonplussed by it, and I wouldn't blame them. The style and the
idiosyncrasies of array-heavy scientific Python and of stream- or
iterator-heavy scripting and networking Python are sometimes just rather
different.
Cheers
Thomas