<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Sat, Feb 14, 2015 at 5:21 PM, <span dir="ltr"><<a href="mailto:josef.pktd@gmail.com" target="_blank">josef.pktd@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div class=""><div class="h5">On Sat, Feb 14, 2015 at 4:27 PM, Charles R Harris<br>
<<a href="mailto:charlesr.harris@gmail.com">charlesr.harris@gmail.com</a>> wrote:<br>
>>
>>
>> On Sat, Feb 14, 2015 at 12:36 PM, <josef.pktd@gmail.com> wrote:
>>>
>>> On Sat, Feb 14, 2015 at 12:05 PM, cjw <cjw@ncf.ca> wrote:
>>>>
>>>> On 14-Feb-15 11:35 AM, josef.pktd@gmail.com wrote:
>>>>>
>>>>> On Wed, Feb 11, 2015 at 4:18 PM, Ryan Nelson <rnelsonchem@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Colin,
>>>>>>
>>>>>> I currently use Py3.4 and Numpy 1.9.1. However, I built a quick test
>>>>>> conda environment with Python 2.7 and Numpy 1.7.0, and I get the same:
>>>>>>
>>>>>> ############
>>>>>> Python 2.7.9 |Continuum Analytics, Inc.| (default, Dec 18 2014, 16:57:52)
>>>>>> [MSC v.1500 64 bit (AMD64)]
>>>>>>
>>>>>> IPython 2.3.1 -- An enhanced Interactive Python.
>>>>>>
>>>>>> In [1]: import numpy as np
>>>>>>
>>>>>> In [2]: np.__version__
>>>>>> Out[2]: '1.7.0'
>>>>>>
>>>>>> In [3]: np.mat([4,'5',6])
>>>>>> Out[3]:
>>>>>> matrix([['4', '5', '6']],
>>>>>>        dtype='|S1')
>>>>>>
>>>>>> In [4]: np.mat([4,'5',6], dtype=int)
>>>>>> Out[4]: matrix([[4, 5, 6]])
>>>>>> ###############
>>>>>>
>>>>>> As to your comment about coordinating with Statsmodels, you should see
>>>>>> the links in the thread that Alan posted:
>>>>>> http://permalink.gmane.org/gmane.comp.python.numeric.general/56516
>>>>>> http://permalink.gmane.org/gmane.comp.python.numeric.general/56517
>>>>>> Josef's comments at the time seem to echo the issues the devs (and
>>>>>> others) have with the matrix class. Maybe things have changed with
>>>>>> Statsmodels.
>>>>>
>>>>> Not changed, we have a strict policy against using np.matrix.
>>>>>
>>>>> Generic efficient versions for linear operators, Kronecker products, or
>>>>> sparse block matrix style operations would be useful, but I would use
>>>>> array semantics, similar to using dot or linalg functions on ndarrays.
>>>>>
>>>>> Josef
>>>>> (long reply canceled because I'm writing too much that might only be
>>>>> of tangential interest or has been covered in some of the matrix
>>>>> discussions before.)
>>>>
>>>> Josef,
>>>>
>>>> Many thanks. I have gained the impression that there is some antipathy
>>>> to np.matrix; perhaps this is because, as others have suggested, the
>>>> array doesn't provide an appropriate framework.
>>>
>>> It's not directly antipathy; it's cost-benefit analysis.
>>>
>>> np.matrix has few advantages, but it makes reading and maintaining code
>>> much more difficult. Having to watch out for multiplication `*` is a
>>> lot of extra work.
>>>
>>> Checking shapes and fixing bugs with unexpected dtypes is also a lot
>>> of work, but there we get large benefits.
>>> For a long time the policy in statsmodels was to keep pandas out of
>>> the core of functions (i.e. out of the actual calculations) and
>>> restrict it to inputs and returns. However, pandas is becoming more
>>> popular and can do some things much better than plain numpy, so it is
>>> slowly moving inside some of our core calculations.
>>> It's still an easy source of bugs, but we do gain something.
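
A quick aside to make the `*` pitfall above concrete (a minimal demo,
not code from the thread):

import numpy as np

a = np.array([[1., 2.], [3., 4.]])
m = np.matrix(a)

print(a * a)        # ndarray: elementwise product
print(m * m)        # np.matrix: matrix product -- a different result

# The same `*` silently means two different things, so any function that
# may receive either type has to check or convert first, e.g.:
b = np.asarray(m)   # converts back to plain array semantics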
>>
>>
>> Any bits of Pandas that might be good for numpy/scipy to steal?
>
> I'm not a Pandas expert.
> Some of it comes into statsmodels because we need the data handling
> also inside a function, e.g. keeping track of labels, indices, and so
> on. Another reason is that contributors are more familiar with
> pandas's way of solving a problem, even if I suspect numpy would be
> more efficient.
>
> However, a recent change replaces a case where I would have used
> np.unique with pandas.factorize, which is supposed to be faster.
<a href="https://github.com/statsmodels/statsmodels/pull/2213" target="_blank">https://github.com/statsmodels/statsmodels/pull/2213</a></blockquote><div><br></div><div>Numpy could use some form of hash table for its arraysetops, which is where pandas is getting its advantage from. It is a tricky thing though, see e.g. these timings:</div><div><br></div><div><font face="monospace, monospace">a = np.ranomdom.randint(10, size=1000)</font></div><div><font face="monospace, monospace">srs = pd.Series(a)</font></div><div><font face="monospace, monospace"><br></font></div><div>
<p class=""><span class=""><font face="monospace, monospace">%timeit np.unique(a)</font></span></p>
<p class=""><span class=""><font face="monospace, monospace">100000 loops, best of 3: 13.2 µs per loop</font></span></p>
<p class=""><span class=""><font face="monospace, monospace">%timeit srs.unique()</font></span></p>
<p class=""><span class=""><font face="monospace, monospace">100000 loops, best of 3: 15.6 µs per loop</font></span></p>
<p class=""><font face="monospace, monospace"><span class=""></span><br></font></p>
<p class=""><span class=""><font face="monospace, monospace">%timeit pd.factorize(a)</font></span></p>
<p class=""><span class=""><font face="monospace, monospace">10000 loops, best of 3: 25.6 µs per loop</font></span></p>
<p class=""><span class=""><font face="monospace, monospace">%timeit np.unique(a, return_inverse=True)</font></span></p>
<p class=""><span class=""><font face="monospace, monospace">10000 loops, best of 3: 82.5 µs per loop</font></span></p></div><div><br></div><div>This last timings are with 1.9.0 an 0.14.0, so numpy doesn't have <a href="https://github.com/numpy/numpy/pull/5012">https://github.com/numpy/numpy/pull/5012</a> yet, which makes the operation in which numpy is slower about 2x faster. And if you need your unique values sorted, then things are more even, especially if numpy runs 2x faster:</div><div><br></div><div>
<p class=""><span class=""><font face="monospace, monospace">%timeit pd.factorize(a, sort=True)</font></span></p>
<p class=""><span class=""><font face="monospace, monospace">10000 loops, best of 3: 36.4 µs per loop</font></span></p></div><div><br></div><div>The algorithms scale differently though, so for sufficiently large data Pandas is going to win almost certainly. Not sure if they support all dtypes, nor how efficient their use of memory is.</div><div><br></div><div>I did a toy implementation of a hash table, mimicking Python's dictionary, for numpy some time ago, see here:</div><div><br></div><div><a href="https://github.com/jaimefrio/numpy/commit/50b951289dfe9e2c3ef8950184090742ff2ac896">https://github.com/jaimefrio/numpy/commit/50b951289dfe9e2c3ef8950184090742ff2ac896</a><br></div><div><br></div><div>and if I remember correctly for the basic unique operations it was generally faster, both than numpy and pandas, but only by a factor of about 2x, which didn't seem to justify the effort. More complicated operations can probably benefit more, as the pd.factorize example shows.</div><div><br></div><div>It still seems like an awful lot of work for an operation that isn't obviously needed. If Numpy attempted to have some form of groupby functionality it could make more sense. As is, not really sure.</div><div><br></div><div>Jaime</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br>

> Two or three years ago my numpy way of group handling (using
> np.unique, bincount, and similar) was still faster than the pandas
> `apply` version; I'm not sure that's still true.
>
>
> And to emphasize: all our heavy stuff, especially the big models, still
> has only numpy and scipy inside (with the exception of one model
> waiting in a PR).
>
> Josef
>
>
>>
>> <snip>
>>
>> Chuck
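
For reference, the numpy-only group handling Josef describes above
(np.unique plus bincount) can be sketched like this; `group_means` is a
name made up for the illustration, not actual statsmodels code:

import numpy as np
import pandas as pd

def group_means(values, groups):
    # Group means via np.unique + np.bincount, with no Python loop.
    # Matches pd.Series(values).groupby(groups).mean(), whose result is
    # likewise indexed by the sorted group keys.
    keys, inverse = np.unique(groups, return_inverse=True)
    sums = np.bincount(inverse, weights=values)
    counts = np.bincount(inverse)
    return keys, sums / counts

rng = np.random.RandomState(0)
g = rng.randint(5, size=1000)    # group labels
x = rng.normal(size=1000)        # values to aggregate

keys, means = group_means(x, g)
expected = pd.Series(x).groupby(g).mean().values
assert np.allclose(means, expected)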

--
(\__/)
( O.o)
( > <) This is Conejo. Copy Conejo into your signature and help him in his plans for world domination.