[Numpy-discussion] Matrix Class

Sat Feb 14 21:21:18 EST 2015

On Sat, Feb 14, 2015 at 5:21 PM, <josef.pktd at gmail.com> wrote:

> On Sat, Feb 14, 2015 at 4:27 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> >
> >
> > On Sat, Feb 14, 2015 at 12:36 PM, <josef.pktd at gmail.com> wrote:
> >>
> >> On Sat, Feb 14, 2015 at 12:05 PM, cjw <cjw at ncf.ca> wrote:
> >> >
> >> > On 14-Feb-15 11:35 AM, josef.pktd at gmail.com wrote:
> >> >>
> >> >> On Wed, Feb 11, 2015 at 4:18 PM, Ryan Nelson <rnelsonchem at gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> Colin,
> >> >>>
> >> >>> I currently use Py3.4 and Numpy 1.9.1. However, I built a quick test
> >> >>> conda
> >> >>> environment with Python2.7 and Numpy 1.7.0, and I get the same:
> >> >>>
> >> >>> ############
> >> >>> Python 2.7.9 |Continuum Analytics, Inc.| (default, Dec 18 2014, 16:57:52)
> >> >>> [MSC v.1500 64 bit (AMD64)]
> >> >>> Type "copyright", "credits" or "license" for more information.
> >> >>>
> >> >>> IPython 2.3.1 -- An enhanced Interactive Python.
> >> >>> Anaconda is brought to you by Continuum Analytics.
> >> >>> Please check out: http://continuum.io/thanks and https://binstar.org
> >> >>> ?         -> Introduction and overview of IPython's features.
> >> >>> %quickref -> Quick reference.
> >> >>> help      -> Python's own help system.
> >> >>> object?   -> Details about 'object', use 'object??' for extra details.
> >> >>>
> >> >>> In [1]: import numpy as np
> >> >>>
> >> >>> In [2]: np.__version__
> >> >>> Out[2]: '1.7.0'
> >> >>>
> >> >>> In [3]: np.mat([4,'5',6])
> >> >>> Out[3]:
> >> >>> matrix([['4', '5', '6']],
> >> >>>         dtype='|S1')
> >> >>>
> >> >>> In [4]: np.mat([4,'5',6], dtype=int)
> >> >>> Out[4]: matrix([[4, 5, 6]])
> >> >>> ###############
> >> >>>
> >> >>> As to your comment about coordinating with Statsmodels, you should see
> >> >>> the links in the thread that Alan posted:
> >> >>> http://permalink.gmane.org/gmane.comp.python.numeric.general/56516
> >> >>> http://permalink.gmane.org/gmane.comp.python.numeric.general/56517
> >> >>> Josef's comments at the time seem to echo the issues the devs (and
> >> >>> others) have with the matrix class. Maybe things have changed with
> >> >>> Statsmodels.
> >> >>
> >> >> Not changed, we have a strict policy against using np.matrix.
> >> >>
> >> >> Generic efficient versions of linear operators, Kronecker, or sparse
> >> >> block-matrix-style operations would be useful, but I would use array
> >> >> semantics, similar to using dot or linalg functions on ndarrays.
> >> >>
> >> >> Josef
> >> >> (long reply canceled because I'm writing too much that might only be
> >> >> of tangential interest or has been in some of the matrix discussion
> >> >> before.)
> >> >
> >> > Josef,
> >> >
> >> > Many thanks. I have gained the impression that there is some antipathy
> >> > to np.matrix; perhaps this is because, as others have suggested, the
> >> > array doesn't provide an appropriate framework.
> >>
> >> It's not directly antipathy, it's cost-benefit analysis.
> >>
> >> np.matrix has few advantages, but makes reading and maintaining code
> >> much more difficult.
> >> Having to watch out for multiplication `*` is a lot of extra work.
> >>
> >> Checking shapes and fixing bugs with unexpected dtypes is also a lot
> >> of work, but we have large benefits.
> >> For a long time the policy in statsmodels was to keep pandas out of
> >> the core of functions (i.e. out of the actual calculations) and
> >> restrict it to inputs and returns. However, pandas is becoming more
> >> popular and can do some things much better than plain numpy, so it is
> >> slowly moving inside some of our core calculations.
> >> It's still an easy source of bugs, but we do gain something.
> >
> >
> > Any bits of Pandas that might be good for numpy/scipy to steal?
>
> I'm not a Pandas expert.
> Some of it comes into statsmodels because we need the data handling
> also inside a function, e.g. keeping track of labels, indices, and so
> on. Another reason is that contributors are more familiar with
> pandas's way of solving a problem, even if I suspect numpy would be
> more efficient.
>
> However, a recent change replaces what would have been my np.unique call
> with pandas.factorize, which is supposed to be faster.
> https://github.com/statsmodels/statsmodels/pull/2213

Numpy could use some form of hash table for its arraysetops, which is where
pandas is getting its advantage from. It is a tricky thing though, see e.g.
these timings:

import numpy as np
import pandas as pd

a = np.random.randint(10, size=1000)
srs = pd.Series(a)

%timeit np.unique(a)

100000 loops, best of 3: 13.2 µs per loop

%timeit srs.unique()

100000 loops, best of 3: 15.6 µs per loop

%timeit pd.factorize(a)

10000 loops, best of 3: 25.6 µs per loop

%timeit np.unique(a, return_inverse=True)

10000 loops, best of 3: 82.5 µs per loop

These timings are with numpy 1.9.0 and pandas 0.14.0, so numpy doesn't yet
have https://github.com/numpy/numpy/pull/5012, which makes the one operation
where numpy is slower (np.unique with return_inverse=True) about 2x faster.
And if you need your unique values sorted, then things are more even,
especially once numpy runs 2x faster:

%timeit pd.factorize(a, sort=True)

10000 loops, best of 3: 36.4 µs per loop

The algorithms scale differently though, so for sufficiently large data
pandas is almost certainly going to win. I'm not sure whether they support
all dtypes, nor how efficient their use of memory is.
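
For anyone comparing the calls timed above, here is a minimal sketch (not
part of the original benchmarks) of how their return values relate. It
assumes a reasonably recent numpy and pandas; the fixed seed is only there
to make the run reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # arbitrary seed, for reproducibility only
a = rng.randint(10, size=1000)

# np.unique always sorts; return_inverse gives, for each element of `a`,
# its index into the sorted unique values.
uniques_np, inverse_np = np.unique(a, return_inverse=True)

# pd.factorize returns (codes, uniques). Without sort=True the uniques
# come back in order of first appearance; with sort=True they are sorted,
# which makes the result directly comparable to np.unique.
codes_pd, uniques_pd = pd.factorize(a, sort=True)

same_uniques = np.array_equal(uniques_np, uniques_pd)
same_codes = np.array_equal(inverse_np, codes_pd)

# Either pair of outputs reconstructs the original array.
roundtrip_ok = np.array_equal(uniques_np[inverse_np], a)
```

So np.unique(a, return_inverse=True) and pd.factorize(a, sort=True) carry
the same information, just with the two return values in opposite order.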

I did a toy implementation of a hash table, mimicking Python's dictionary,
for numpy some time ago, see here:

https://github.com/jaimefrio/numpy/commit/50b951289dfe9e2c3ef8950184090742ff2ac896

and if I remember correctly, for the basic unique operations it was
generally faster than both numpy and pandas, but only by a factor of
about 2x, which didn't seem to justify the effort. More complicated
operations can probably benefit more, as the pd.factorize example shows.
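
To illustrate the general idea of a hash-table-based factorize, here is a
pure-Python sketch that uses a dict as the hash table. It is NOT the code
in the commit linked above, and it will be far slower than either library;
the function name hash_factorize is made up for this example.

```python
import numpy as np

def hash_factorize(values):
    """Return (codes, uniques), with uniques in order of first appearance.

    Hypothetical helper for illustration only; a dict stands in for the
    hash table, so no sorting is involved.
    """
    table = {}  # maps value -> integer code
    codes = np.empty(len(values), dtype=np.intp)
    for i, v in enumerate(values.tolist()):
        # The first time v is seen it gets the next free code, len(table);
        # afterwards setdefault just returns the stored code.
        codes[i] = table.setdefault(v, len(table))
    # dicts preserve insertion order, so the keys are the uniques in
    # order of first appearance.
    uniques = np.array(list(table), dtype=values.dtype)
    return codes, uniques

a = np.array([3, 1, 3, 2, 1, 3])
codes, uniques = hash_factorize(a)
# codes   -> [0, 1, 0, 2, 1, 0]
# uniques -> [3, 1, 2]
```

Each element costs one hash lookup, so the work grows linearly with the
input, whereas a sort-based unique pays for the sort.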

It still seems like an awful lot of work for an operation that isn't
obviously needed. If numpy attempted to have some form of groupby
functionality it could make more sense. As is, I'm not really sure.
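
For context, a groupby-style sum can already be put together from existing
numpy pieces, np.unique with return_inverse plus np.bincount. The sketch
below is only an illustration under that assumption; group_sum is a
hypothetical name, not a numpy function.

```python
import numpy as np

def group_sum(keys, values):
    """Sum `values` within each distinct key (hypothetical helper)."""
    # inverse[i] is the group index of keys[i] among the sorted uniques.
    uniques, inverse = np.unique(keys, return_inverse=True)
    # bincount with weights adds up the values that share a group index.
    sums = np.bincount(inverse, weights=values, minlength=len(uniques))
    return uniques, sums

keys = np.array([2, 0, 2, 1, 0])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
groups, sums = group_sum(keys, values)
# groups -> [0, 1, 2]
# sums   -> [7.0, 4.0, 4.0]
```

The np.unique call is the step a hash table would speed up; the bincount
reduction itself is already a single linear pass.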

Jaime

>
> Two or three years ago my numpy way of group handling (using
> np.unique, bincount and similar) was still faster than the pandas
> `apply` version, I'm not sure that's still true.
>
>
> And to emphasize: all our heavy stuff especially the big models still
> only have numpy and scipy inside (with the exception of one model
> waiting in a PR).
>
> Josef
>
>
> >
> > <snip>
> >
> > Chuck
> >
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at scipy.org
> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
>

-- 
(\__/)
( O.o)
( > <) This is Bunny. Copy Bunny into your signature and help him with his
plans for world domination.