svd error checking vs. speed
Hello list,

Here's another idea resurrected from numpy github comments that I've been advised could be posted here for re-discussion.

The proposal would be to make np.linalg.svd more like scipy.linalg.svd with respect to input checking. The argument against the change is raw speed: if you know that you will never feed non-finite input to svd, then np.linalg.svd is a bit faster than scipy.linalg.svd. An argument for the change is to avoid the issues reported on github (crashes, hangs, spurious non-convergence exceptions, etc.) that arise from the undefined behavior of svd on non-finite input.

"""
[...] the following numpy code hangs until I `kill -9` it.

```
$ python runtests.py --shell
$ python
Python 2.7.5+ [GCC 4.8.1] on linux2
>>> import numpy as np
>>> np.__version__
'1.9.0.dev-e3f0f53'
>>> A = np.array([[1e3, 0], [0, 1]])
>>> B = np.array([[1e300, 0], [0, 1]])
>>> C = np.array([[1e3000, 0], [0, 1]])
>>> np.linalg.svd(A)
(array([[ 1.,  0.],
        [ 0.,  1.]]),
 array([ 1000.,     1.]),
 array([[ 1.,  0.],
        [ 0.,  1.]]))
>>> np.linalg.svd(B)
(array([[ 1.,  0.],
        [ 0.,  1.]]),
 array([  1.00000000e+300,   1.00000000e+000]),
 array([[ 1.,  0.],
        [ 0.,  1.]]))
>>> np.linalg.svd(C)
[hangs forever]
```
"""

Alex
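[For concreteness, the kind of up-front check being proposed can be sketched as a thin wrapper; `checked_svd` here is a hypothetical helper, not an existing numpy API, and it uses `np.asarray_chkfinite` the way scipy's default `check_finite=True` path validates input:]

```python
import numpy as np

def checked_svd(a):
    """Hypothetical wrapper: validate input before handing it to LAPACK.

    np.asarray_chkfinite raises ValueError on NaN/inf, so LAPACK never
    sees non-finite input (the source of the undefined behavior).
    """
    a = np.asarray_chkfinite(a, dtype=float)
    return np.linalg.svd(a)

A = np.array([[1e3, 0], [0, 1]])
U, s, Vt = checked_svd(A)  # finite input: behaves exactly like np.linalg.svd

try:
    checked_svd(np.array([[np.inf, 0], [0, 1]]))
except ValueError as e:
    print("rejected:", e)
```

The point of the proposal is that the rejection above happens deterministically, instead of the library-dependent NaN results, non-convergence errors, or hangs shown in the transcript.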
On Sa, 2014-02-15 at 16:37 -0500, alex wrote:
<snip>
+1. Unless this is a huge speed penalty, correctness (and decent error messages) should come first in my opinion; this is python after all. If there is a noticeable speed difference, a kwarg may be an option (but I would think about that some more).

- Sebastian
""" [...] the following numpy code hangs until I `kill -9` it.
``` $ python runtests.py --shell $ python Python 2.7.5+ [GCC 4.8.1] on linux2
import numpy as np np.__version__ '1.9.0.dev-e3f0f53' A = np.array([[1e3, 0], [0, 1]]) B = np.array([[1e300, 0], [0, 1]]) C = np.array([[1e3000, 0], [0, 1]]) np.linalg.svd(A) (array([[ 1., 0.], [ 0., 1.]]), array([ 1000., 1.]), array([[ 1., 0.], [ 0., 1.]])) np.linalg.svd(B) (array([[ 1., 0.], [ 0., 1.]]), array([ 1.00000000e+300, 1.00000000e+000]), array([[ 1., 0.], [ 0., 1.]])) np.linalg.svd(C) [hangs forever]
""" Alex _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Sat, Feb 15, 2014 at 4:56 PM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
<snip>
maybe -1

statsmodels is using np.linalg.pinv, which uses svd. I never heard of any crash (*), and the only time I compared with scipy I didn't like the slowdown. I didn't do any serious timings, just a few examples.

(*) not converged, ...

pinv(x.T).dot(x) -> pinv(x.T, please_don_t_check=True).dot(y)

numbers ?

grep: we also use scipy.linalg.pinv in some cases

Josef
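[One quick way to put numbers on the overhead question is to time the finiteness scan against the SVD it would guard. A rough sketch, with the array size chosen arbitrarily:]

```python
import timeit
import numpy as np

a = np.random.rand(200, 200)

# Time the SVD itself vs. the isfinite scan that a checked variant
# would add on top of it.
t_svd = timeit.timeit(lambda: np.linalg.svd(a), number=20)
t_check = timeit.timeit(lambda: np.isfinite(a).all(), number=20)

print("svd:   %.6fs" % t_svd)
print("check: %.6fs" % t_check)
print("relative overhead: %.2f%%" % (100.0 * t_check / t_svd))
```

The scan touches each of the 40,000 elements once while the SVD does O(n^3) work, so the relative overhead shrinks as matrices grow; the picture is least favorable for very small matrices, where Python-level call overhead dominates anyway.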
On Sat, Feb 15, 2014 at 5:08 PM, <josef.pktd@gmail.com> wrote:
<snip>
FWIW, I see this spurious "SVD did not converge" warning very frequently with ARMA when there is a nan that has crept in. I usually know where to find the problem, but I think it'd be nice if this error message were a little better.

Skipper
On Sat, Feb 15, 2014 at 5:12 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
<snip>
maybe I'm +1

While we don't see crashes, when I run Alex's example I see 13% cpu usage for a hanging process, which looks very familiar to me; I see it reasonably often when I'm debugging code. I never tried to track down where it hangs.

Josef
On Sa, 2014-02-15 at 17:35 -0500, josef.pktd@gmail.com wrote:
<snip>
If this does not cause big hangs/crashes (just "not converged" after a long time or so), then maybe we should just check afterwards, to give the user a better idea of where to look for the error. I think I remember people running into this and being confused (but without a crash/hang).

- Sebastian
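[This check-on-failure idea can be sketched as follows; `svd_with_diagnosis` is a hypothetical helper, not an existing API, and the pattern only helps when the bad input produces an exception rather than a hang:]

```python
import numpy as np

def svd_with_diagnosis(a):
    # Check-on-failure: the happy path pays no checking cost; the
    # isfinite scan runs only after LAPACK has already raised.
    try:
        return np.linalg.svd(a)
    except np.linalg.LinAlgError:
        if not np.isfinite(a).all():
            raise ValueError("SVD input contains NaN or inf; "
                             "clean the data before decomposing")
        raise  # genuinely failed to converge on finite input

U, s, Vt = svd_with_diagnosis(np.array([[1e3, 0], [0, 1]]))
```

The attraction is that well-behaved callers pay nothing, and the confusing "SVD did not converge" message gets upgraded to one that points at the actual cause when the input was non-finite.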
On Sat, Feb 15, 2014 at 6:06 PM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
<snip>
If this should not cause big hangs/crashes (just "not converged" after a long time or so), then maybe we should just check afterwards to give the user a better idea of where to look for the error. I think I remember people running into this and being confused (but without crash/hang).
I'm not sure exactly what you mean by this. Are you suggesting that if the svd fails with some kind of exception (possibly poorly or misleadingly worded), it could be cleaned up after the fact by checking the input, and that this would not incur the speed penalty because no check is done when the svd succeeds?

This would not work on my system, because that svd call really does hang, as in some non-ctrl-c-interruptible spin lock inside fortran code or something. I think the behavior is undefined and it can crash, although I do not personally have an example of this. These modes of failure cannot be recovered from as easily as recovering from an exception.

Alex
On Sa, 2014-02-15 at 18:20 -0500, alex wrote: <snip>
Yeah, I meant that. But it comes with a big "if": that the failure is basically a bug in the library you happen to be using, and extremely uncommon.

- Sebastian
On Sat, Feb 15, 2014 at 6:34 PM, Sebastian Berg <sebastian@sipsolutions.net> wrote:
On Sa, 2014-02-15 at 18:20 -0500, alex wrote: <snip>
Yeah, I meant that. But it has a big "if", that the failure is basically a bug in the library you happen to be using and extremely uncommon.
On my system the lapack bundled with numpy hangs.
On Sat, Feb 15, 2014 at 5:08 PM, <josef.pktd@gmail.com> wrote:
<snip>
statsmodels is using np.linalg.pinv which uses svd I never ran heard of any crash (*), and the only time I compared with scipy I didn't like the slowdown. I didn't do any serious timings just a few examples.
According to
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tools/lin...
statsmodels pinv(A) checks isfinite(A) at least twice, and also checks for finiteness of the identity matrix. Or maybe this is not the pinv that you meant.

Alex
On Sat, Feb 15, 2014 at 5:18 PM, alex <argriffi@ncsu.edu> wrote:
<snip>
According to https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tools/lin... statsmodels pinv(A) checks isfinite(A) at least twice and also checks for finiteness of the identity matrix. Or maybe this is not the pinv that you meant.
that's dead code

copy of np.pinv used in linear regression
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tools/too...
(it's a recent change to streamline some of the linalg in regression, and master only)

outside of linear regression we still use almost only np.linalg.pinv directly

Josef
<josef.pktd@gmail.com> wrote:
copy of np.pinv used in linear regression https://github.com/statsmodels/statsmodels/blob/master/statsmodels/tools/too... (it's a recent change to streamline some of the linalg in regression, and master only)
Why not call the LAPACK routine DGELSS instead? It does exactly this, only faster. (And DGELS for fitting with QR?)

Sturla
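[numpy already routes least squares through a LAPACK driver of this family: np.linalg.lstsq calls *GELSD internally. So the two routes being contrasted, explicit pseudoinverse vs. a least-squares driver, can be sketched side by side; the regression data below are made up for illustration:]

```python
import numpy as np

# Synthetic regression problem (illustrative data, exact fit).
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = x @ beta_true

# Route 1: explicit pseudoinverse (SVD under the hood), as in pinv(x).dot(y).
beta_pinv = np.linalg.pinv(x) @ y

# Route 2: LAPACK least-squares driver via lstsq (*GELSD internally).
beta_lstsq, residuals, rank, sv = np.linalg.lstsq(x, y, rcond=None)
```

Both recover the same coefficients here; the driver route skips materializing the pseudoinverse, which is where the speed advantage Sturla mentions comes from.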
On Sat, Feb 15, 2014 at 5:08 PM, <josef.pktd@gmail.com> wrote:
<snip>
statsmodels is using np.linalg.pinv which uses svd I never ran heard of any crash (*), and the only time I compared with scipy I didn't like the slowdown.
Although numpy.linalg.pinv uses svd, scipy.linalg.pinv uses least squares with an identity matrix as the rhs. The scipy.linalg function that uses svd for the pseudoinverse is pinv2. These connect to different LAPACK functions. Also, I noticed that these scipy functions all have redundant finiteness checking, which I've fixed in a PR.

Alex
<josef.pktd@gmail.com> wrote:

maybe -1
statsmodels is using np.linalg.pinv which uses svd I never ran heard of any crash (*), and the only time I compared with scipy I didn't like the slowdown.
If you did care about speed in least-squares fitting you would not call QR or SVD directly, but use the built-in LAPACK least-squares drivers (*GELSS, *GELS, *GGGLM), which are much faster (I have checked), as well as use an optimized multi-core LAPACK (e.g. Intel MKL). Any overhead from finiteness checking will be tiny compared to the Python/NumPy overhead your statsmodels code incurs, not to mention the overhead you get from using f2c lapack_lite.

Sturla
Sturla Molden <sturla.molden@gmail.com> wrote:
<snip>
By the way: I am not saying you should call these methods. Keeping most of the QR and SVD least-squares solvers in Python has its merits as well, e.g. for clarity. But if you do, it defeats any argument that finiteness checking before calling LAPACK will be too slow.

Sturla
alex <argriffi <at> ncsu.edu> writes:
<snip>
I'm -1 on checking finiteness - if there's one place you usually want maximum performance, it's linear algebra operations. It certainly shouldn't crash or hang though, and for me at least it doesn't - it returns NaN, which immediately suggests to me that I've got bad input (maybe just because I've seen it before). I'm not sure adding an extra kwarg is worth cluttering up the api when a simple call to isfinite beforehand will do the job if you think you may potentially have non-finite input.

```
Python 2.7.5 |Anaconda 1.8.0 (64-bit)| (default, Jul  1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]

In [1]: import numpy as np

In [2]: >>> A = np.array([[1e3, 0], [0, 1]])
   ...: >>> B = np.array([[1e300, 0], [0, 1]])
   ...: >>> C = np.array([[1e3000, 0], [0, 1]])
   ...: >>> np.linalg.svd(A)
Out[2]:
(array([[ 1.,  0.],
        [ 0.,  1.]]),
 array([ 1000.,     1.]),
 array([[ 1.,  0.],
        [ 0.,  1.]]))

In [3]: np.linalg.svd(B)
Out[3]:
(array([[ 1.,  0.],
        [ 0.,  1.]]),
 array([  1.0000e+300,   1.0000e+000]),
 array([[ 1.,  0.],
        [ 0.,  1.]]))

In [4]: C
Out[4]:
array([[ inf,   0.],
       [  0.,   1.]])

In [5]: np.linalg.svd(C)
Out[5]:
(array([[ 0.,  1.],
        [ 1.,  0.]]),
 array([ nan,  nan]),
 array([[ 0.,  1.],
        [ 1.,  0.]]))

In [6]: np.__version__
Out[6]: '1.7.1'
```

Regards,
Dave
Dave Hirschfeld <novin01@gmail.com> wrote:
It certainly shouldn't crash or hang though and for me at least it doesn't - it returns NaN which immediately suggests to me that I've got bad input (maybe just because I've seen it before).
It might be dependent on the BLAS or LAPACK version. Since you are on Anaconda, I assume you are on MKL. But can we expect f2c lapack-lite and blas-lite to be equally well behaved? Sturla
On Mon, Feb 17, 2014 at 4:49 AM, Dave Hirschfeld <novin01@gmail.com> wrote:
<snip>
I'm -1 on checking finiteness - if there's one place you usually want maximum performance it's linear algebra operations.
It certainly shouldn't crash or hang though and for me at least it doesn't - it returns NaN
btw, when I use the python/numpy/openblas packaged for ubuntu, I also get NaN. The infinite loop appears when I build numpy letting it use its bundled lapack lite. I don't know which LAPACK Josef uses to get the weird behavior he observes ("13% cpu usage for a hanging process").

This is consistent with the scipy svd docstring describing its check_finite flag, where it warns "Disabling may give a performance gain, but may result in problems (crashes, non-termination) if the inputs do contain infinities or NaNs." I think this caveat also applies to most numpy linalg functions that connect more or less directly to lapack.
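[The flag quoted above is part of scipy's public API, so the trade-off under discussion is already expressible there today; a minimal illustration (finite input shown; passing non-finite input with the check disabled is exactly the undefined-behavior case described in this thread):]

```python
import numpy as np
from scipy import linalg

a = np.array([[1e3, 0.0], [0.0, 1.0]])

# Default: validate the input first; NaN/inf raises ValueError.
u, s, vt = linalg.svd(a, check_finite=True)

# Opt out of the check for speed -- safe only if you can guarantee
# finite input, per the docstring caveat quoted above.
u2, s2, vt2 = linalg.svd(a, check_finite=False)
```

A per-call kwarg like this is one of the options floated earlier in the thread for numpy, as opposed to always checking or never checking.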
On Mon, Feb 17, 2014 at 10:03 AM, alex <argriffi@ncsu.edu> wrote:
On Mon, Feb 17, 2014 at 4:49 AM, Dave Hirschfeld <novin01@gmail.com> wrote:
alex <argriffi <at> ncsu.edu> writes:
Hello list,
Here's another idea resurrection from numpy github comments that I've been advised could be posted here for re-discussion.
The proposal would be to make np.linalg.svd more like scipy.linalg.svd with respect to input checking. The argument against the change is raw speed; if you know that you will never feed non-finite input to svd, then np.linalg.svd is a bit faster than scipy.linalg.svd. An argument for the change could be to avoid issues reported on github like crashes, hangs, spurious non-convergence exceptions, etc. from the undefined behavior of svd of non-finite input.
""" [...] the following numpy code hangs until I `kill -9` it.
``` $ python runtests.py --shell $ python Python 2.7.5+ [GCC 4.8.1] on linux2
import numpy as np np.__version__ '1.9.0.dev-e3f0f53' A = np.array([[1e3, 0], [0, 1]]) B = np.array([[1e300, 0], [0, 1]]) C = np.array([[1e3000, 0], [0, 1]]) np.linalg.svd(A) (array([[ 1., 0.], [ 0., 1.]]), array([ 1000., 1.]), array([[ 1., 0.], [ 0., 1.]])) np.linalg.svd(B) (array([[ 1., 0.], [ 0., 1.]]), array([ 1.00000000e+300, 1.00000000e+000]), array([[ 1., 0.], [ 0., 1.]])) np.linalg.svd(C) [hangs forever]
""" Alex
I'm -1 on checking finiteness - if there's one place you usually want maximum performance, it's linear algebra operations.
It certainly shouldn't crash or hang though, and for me at least it doesn't - it returns NaN.
btw when I use the python/numpy/openblas packaged for ubuntu, I also get NaN. The infinite loop appears when I build numpy letting it use its bundled lapack_lite. I don't know which LAPACK Josef uses to get the weird behavior he observes: "13% cpu usage for a hanging process".
I use the official numpy release for development: Windows, 32-bit Python, i.e. MinGW 3.5 and whatever old ATLAS the release includes.

A constant 13% cpu usage is 1/8th of my 8 virtual cores. If it were in a loop doing some work, the cpu usage would fluctuate (between 12 and 13% in a busy loop).

+/- 1

Josef
This is consistent with the scipy svd docstring describing its check_finite flag, where it warns: "Disabling may give a performance gain, but may result in problems (crashes, non-termination) if the inputs do contain infinities or NaNs." I think this caveat also applies to most numpy linalg functions that connect more or less directly to LAPACK.
<josef.pktd@gmail.com> wrote:
I use official numpy release for development, Windows, 32bit python, i.e. MingW 3.5 and whatever old ATLAS the release includes.
a constant 13% cpu usage is 1/8 th of my 8 virtual cores.
Based on this and Alex's message it seems the offender is the f2c-generated lapack_lite library. So what do we do with lapack_lite? Should we patch it?

Sturla
Sturla Molden <sturla.molden <at> gmail.com> writes:
<josef.pktd <at> gmail.com> wrote:
I use official numpy release for development, Windows, 32bit python, i.e. MingW 3.5 and whatever old ATLAS the release includes.
a constant 13% cpu usage is 1/8 th of my 8 virtual cores.
Based on this and Alex's message it seems the offender is the f2c-generated lapack_lite library.
So what do we do with lapack_lite? Should we patch it?
Sturla
Even if lapack_lite always performed the isfinite check and raised a Python error when the check failed, it would be much better than either hanging or segfaulting, and people who care about the cost of the isfinite check would probably be linking to a fast LAPACK anyway.

-Dave
Dave Hirschfeld <novin01@gmail.com> wrote:
Even if lapack_lite always performed the isfinite check and raised a Python error when the check failed, it would be much better than either hanging or segfaulting, and people who care about the cost of the isfinite check would probably be linking to a fast LAPACK anyway.
+1 (if I have a vote)

Correctness is always more important than speed. Segfaulting or hanging while burning the CPU is not something we should allow "by design". And those who need speed should in any case use a different LAPACK library instead.

The easiest place to put a finiteness test is the check_object function here:

https://github.com/numpy/numpy/blob/master/numpy/linalg/lapack_litemodule.c

But in that case we should probably use a macro guard to leave it out if any other LAPACK than the builtin f2c version is used.

Sturla
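The shape of Sturla's suggestion is a guard at the single chokepoint through which all lapack_lite calls flow, so the check lives in one place (and could be compiled out when a real LAPACK is linked). A hypothetical Python analogue of that pattern, with made-up names standing in for the C-level check_object and for numpy.linalg.LinAlgError:

```python
import math

class LinAlgError(ValueError):
    """Stand-in for numpy.linalg.LinAlgError in this sketch."""

def guarded_solver(matrix, solver, check_finite=True):
    """Run `solver` on a nested-list `matrix`, refusing non-finite input.

    Mirrors the idea of putting the test in lapack_lite's check_object:
    one check at the entry point instead of one per routine.
    """
    if check_finite:
        for row in matrix:
            for x in row:
                if not math.isfinite(x):
                    raise LinAlgError("array must not contain infs or NaNs")
    return solver(matrix)

# A trivial stand-in "solver" so the guard can be exercised without LAPACK:
trace = lambda m: sum(m[i][i] for i in range(len(m)))

print(guarded_solver([[1e3, 0.0], [0.0, 1.0]], trace))  # 1001.0
```

The check_finite keyword mirrors scipy's design: callers who can guarantee finite input keep the fast path, everyone else gets a clean exception instead of a hang.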
Sturla Molden <sturla.molden@gmail.com> wrote:
Dave Hirschfeld <novin01@gmail.com> wrote:
Even if lapack_lite always performed the isfinite check and raised a Python error when the check failed, it would be much better than either hanging or segfaulting, and people who care about the cost of the isfinite check would probably be linking to a fast LAPACK anyway.
+1 (if I have a vote)
Correctness is always more important than speed. Segfaulting or hanging while burning the CPU is not something we should allow "by design". And those who need speed should in any case use a different LAPACK library instead. The easiest place to put a finiteness test is the check_object function here:
https://github.com/numpy/numpy/blob/master/numpy/linalg/lapack_litemodule.c
But in that case we should probably use a macro guard to leave it out if any other LAPACK than the builtin f2c version is used.
It seems even the more recent (3.4.x) versions of LAPACK have places where NaNs can cause infinite loops. As long as this is an issue it might be worth checking everywhere.

http://www.netlib.org/lapack/bug_list.html

The semi-official C interface LAPACKE implements NaN checking as well:

http://www.netlib.org/lapack/lapacke.html#_nan_checking

If Intel's engineers put NaN checking inside LAPACKE, it was probably for a good reason.

Sturla
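One distinction worth keeping in mind here: LAPACKE's feature is, as the name suggests, NaN checking, while scipy's check_finite rejects both NaN and Inf. The hang in the original report was triggered by an Inf (1e3000 overflows to inf), which a NaN-only check would wave through. A small pure-Python illustration of the difference (helper names are mine, not from either library):

```python
import math

def has_nan(matrix):
    # NaN-only check, in the spirit of LAPACKE's NaN checking
    return any(math.isnan(x) for row in matrix for x in row)

def has_nonfinite(matrix):
    # Full check, in the spirit of scipy's check_finite: NaN or +/-Inf
    return any(not math.isfinite(x) for row in matrix for x in row)

C = [[1e3000, 0.0], [0.0, 1.0]]  # 1e3000 overflows to inf

print(has_nan(C))        # False: a NaN-only check passes this matrix
print(has_nonfinite(C))  # True: a full finiteness check rejects it
```

So if numpy ever adds a guard, it would presumably want the full isfinite form rather than the NaN-only form to cover the reported hang.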
On Mon, Feb 17, 2014 at 1:24 PM, Sturla Molden wrote:
Sturla Molden wrote:
Dave Hirschfeld wrote:
Even if lapack_lite always performed the isfinite check and threw a python error if False, it would be much better than either hanging or segfaulting and people who care about the isfinite cost probably would be linking to a fast lapack anyway.
+1 (if I have a vote)
Correctness is always more important than speed. Segfaulting or hanging while burning the CPU is not something we should allow "by design". And those who need speed should in any case use a different LAPACK library instead. The easiest place to put a finiteness test is the check_object function here:
https://github.com/numpy/numpy/blob/master/numpy/linalg/lapack_litemodule.c
But in that case we should probably use a macro guard to leave it out if any other LAPACK than the builtin f2c version is used.
It seems even the more recent (3.4.x) versions of LAPACK have places where NaNs can cause infinite loops. As long as this is an issue it might be worth checking everywhere.
http://www.netlib.org/lapack/bug_list.html
The semi-official C interface LAPACKE implements NAN checking as well:
http://www.netlib.org/lapack/lapacke.html#_nan_checking
If Intel's engineers put NaN checking inside LAPACKE, it was probably for a good reason.
As more evidence that checking isfinite could be important for stability even for non-lapack_lite LAPACKs, the MKL docs currently include the following warning:

"WARNING: LAPACK routines assume that input matrices do not contain IEEE 754 special values such as INF or NaN values. Using these special values may cause LAPACK to return unexpected results or become unstable."
On 2/15/14 3:37 PM, alex wrote:
The proposal would be to make np.linalg.svd more like scipy.linalg.svd with respect to input checking. The argument against the change is raw speed; if you know that you will never feed non-finite input to svd, then np.linalg.svd is a bit faster than scipy.linalg.svd. An argument for the change could be to avoid issues reported on github like crashes, hangs, spurious non-convergence exceptions, etc. from the undefined behavior of svd of non-finite input.
For what my vote is worth, -1. I thought this was pretty much the designed difference between the scipy and numpy linalg routines: scipy does the checking, and numpy provides the raw speed. Maybe this is better resolved as a note in the numpy documentation about the assumptions on the input, with a reference to the scipy implementation?

That said, I don't extensively use the linalg.svd routine in practice, so I defer to those that use it.

Thanks,

Jason
Jason Grout <jason-sage@creativetrax.com> wrote:
For what my vote is worth, -1. I thought this was pretty much the designed difference between the scipy and numpy linalg routines. Scipy does the checking, and numpy provides the raw speed. Maybe this is better resolved as a note in the documentation for numpy about the assumptions for the input and a reference to the scipy implementation?
I think if there is a stability issue, we should find out which LAPACK or BLAS versions are affected, and then decide what to do with it. No NumPy function should arbitrarily hang forever. I would consider that a bug.

Sturla
participants (7)
- alex
- Dave Hirschfeld
- Jason Grout
- josef.pktd@gmail.com
- Sebastian Berg
- Skipper Seabold
- Sturla Molden