Improving spatial.distance code and performance
Hi all, TLDR; I'm wondering about a) porting the spatial.distance code to Cython, and b) adding some performance optimizations, and I'd like the dev community's input/feedback. For context (though I don't really need you to read these to give the feedback I'm looking for), - original issue/proposal: https://github.com/scipy/scipy/issues/9205 - PR: https://github.com/scipy/scipy/pull/9218 Before submitting the PR, I naively thought it was going to be nice Cython or similar. Turns out it's some pretty old code, that I found pretty hard to wrap my head around and understand. I eventually figured it out after spending ages debugging a nastily hidden 'bug', and demonstrated the performance optimizations, but it prompted the discussion about whether it was best to port everything to Cython first. *Existing stuff to Cython* Doing so shouldn't be too hard, and it shouldn't change any functionality, except to replace the distance functions with their Cython ones (instead of the current approach, where the distance functions are actually numpy things, and there's not supported access to the underlying C stuff). A few possible 'bugs' (as above) should hopefully become non-issues too. So, it'd be a win for performance (e.g. all the distance functions will be much faster), and code quality, and future maintainability and development. However, things to think about: - should I just replace like-for-like, or consider some nicer OOP stuff like e.g. sklearn's approach (which is already Cython)? https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/d... (I'm guessing the reason they rolled their own was because they found scipy lacking, as above.) In fact, we could largely just copy that file over. Not sure about the interplay between scipy and scikit learn though. - what's the best way to structure a PR? One metric at a time, or the whole caboodle? *Adding performance optimizations and future functionality* As per the links, this was initially about providing a nifty performance hack. It should still be pretty easy to implement. Personally, I think it makes sense to implement after the Cythonization - unless the community are against that. However, there are other possibilities: - using 'indices' within calculations. E.g. when doing pdist, it might pay to use a tree of some description. I also proposed another 'index' to optimize the 'bail early' approach further (which would, I think, actually work well with trees too). This would involve more API changes, possibly significant. - using approximate searches (e.g. Faiss). My understanding is that depending on other libraries probably isn't really an option, so I'm not sure what means. - maybe other cool approaches like https://github.com/droyed/eucl_dist - providing a way to 'tune' distance computations to be optimal to your particular dataset and constraints (e.g. my 'bail early' optimization might be a lot faster or a bit slower, depending on your data ... or you might be OK with approximate matching with a 'low' error rate, etc.) I guess what I'd like to see is a single place where users can get access to everything related to distance metrics and their uses, including all sorts of optimizations etc. (possibly for different hardware, and possibly distributed). To do that well is a pretty big undertaking, and I don't know whether it's suited to scipy - e.g. maybe scipy doesn't really care about distance stuff, or only wants to stick with 'simple' distance metric cases (e.g. a few thousand vectors, etc.). So, maybe it'd be better to start a completely new python package - which would probably be a lot easier to develop as I'd have a lot more flexibility (e.g. to depend on other packages, and not have to worry about breaking the scipy API etc.). On the other hand (as discussed in the latest comment on the PR), that might not be best - it might never get used/maintained etc. So, what does the community think is the best approach? I've got too little context of what scipy is and what it's aiming for, and I don't want to head off on the wrong tack. Comments on any of the other implied questions would also be appreciated. Thanks, kodonnell
Good to see some activity / interest in spatial. Definitely agree with Ralf's github comments re: using smaller / more tractable PRs -- it really is tough to sit down at night with 30 minutes of free time or whatever and look at a massive diff & not want to give up. I like the idea of making small / targeted / clearly demonstrated performance improvements without overhauling the entire infrastructure first, but maybe that's a controversial view if it just causes too much heterogeneity. Presumably the affected code is all thoroughly covered by unit tests? That's an important pre-requisite to have the confidence to really start making changes, esp. with older low-level code like that. On Thu, 6 Sep 2018 at 02:29, Kane & Anna O'Donnell < email.the.odonnells@gmail.com> wrote:
Hi all,
TLDR; I'm wondering about a) porting the spatial.distance code to Cython, and b) adding some performance optimizations, and I'd like the dev community's input/feedback.
For context (though I don't really need you to read these to give the feedback I'm looking for),
- original issue/proposal: https://github.com/scipy/scipy/issues/9205 - PR: https://github.com/scipy/scipy/pull/9218
Before submitting the PR, I naively thought it was going to be nice Cython or similar. Turns out it's some pretty old code, that I found pretty hard to wrap my head around and understand. I eventually figured it out after spending ages debugging a nastily hidden 'bug', and demonstrated the performance optimizations, but it prompted the discussion about whether it was best to port everything to Cython first.
*Existing stuff to Cython*
Doing so shouldn't be too hard, and it shouldn't change any functionality, except to replace the distance functions with their Cython ones (instead of the current approach, where the distance functions are actually numpy things, and there's not supported access to the underlying C stuff). A few possible 'bugs' (as above) should hopefully become non-issues too. So, it'd be a win for performance (e.g. all the distance functions will be much faster), and code quality, and future maintainability and development. However, things to think about:
- should I just replace like-for-like, or consider some nicer OOP stuff like e.g. sklearn's approach (which is already Cython)? https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/d... (I'm guessing the reason they rolled their own was because they found scipy lacking, as above.) In fact, we could largely just copy that file over. Not sure about the interplay between scipy and scikit learn though. - what's the best way to structure a PR? One metric at a time, or the whole caboodle?
*Adding performance optimizations and future functionality*
As per the links, this was initially about providing a nifty performance hack. It should still be pretty easy to implement. Personally, I think it makes sense to implement after the Cythonization - unless the community are against that.
However, there are other possibilities:
- using 'indices' within calculations. E.g. when doing pdist, it might pay to use a tree of some description. I also proposed another 'index' to optimize the 'bail early' approach further (which would, I think, actually work well with trees too). This would involve more API changes, possibly significant. - using approximate searches (e.g. Faiss). My understanding is that depending on other libraries probably isn't really an option, so I'm not sure what means. - maybe other cool approaches like https://github.com/droyed/eucl_dist - providing a way to 'tune' distance computations to be optimal to your particular dataset and constraints (e.g. my 'bail early' optimization might be a lot faster or a bit slower, depending on your data ... or you might be OK with approximate matching with a 'low' error rate, etc.)
I guess what I'd like to see is a single place where users can get access to everything related to distance metrics and their uses, including all sorts of optimizations etc. (possibly for different hardware, and possibly distributed). To do that well is a pretty big undertaking, and I don't know whether it's suited to scipy - e.g. maybe scipy doesn't really care about distance stuff, or only wants to stick with 'simple' distance metric cases (e.g. a few thousand vectors, etc.). So, maybe it'd be better to start a completely new python package - which would probably be a lot easier to develop as I'd have a lot more flexibility (e.g. to depend on other packages, and not have to worry about breaking the scipy API etc.). On the other hand (as discussed in the latest comment on the PR), that might not be best - it might never get used/maintained etc.
So, what does the community think is the best approach? I've got too little context of what scipy is and what it's aiming for, and I don't want to head off on the wrong tack. Comments on any of the other implied questions would also be appreciated.
Thanks,
kodonnell
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
I'm very interested to see how a successful cython/performance PR progresses from a reviewers standpoint. On Sun, Sep 9, 2018, 5:10 PM Tyler Reddy <tyler.je.reddy@gmail.com> wrote:
Good to see some activity / interest in spatial.
Definitely agree with Ralf's github comments re: using smaller / more tractable PRs -- it really is tough to sit down at night with 30 minutes of free time or whatever and look at a massive diff & not want to give up.
I like the idea of making small / targeted / clearly demonstrated performance improvements without overhauling the entire infrastructure first, but maybe that's a controversial view if it just causes too much heterogeneity.
Presumably the affected code is all thoroughly covered by unit tests? That's an important pre-requisite to have the confidence to really start making changes, esp. with older low-level code like that.
On Thu, 6 Sep 2018 at 02:29, Kane & Anna O'Donnell < email.the.odonnells@gmail.com> wrote:
Hi all,
TLDR; I'm wondering about a) porting the spatial.distance code to Cython, and b) adding some performance optimizations, and I'd like the dev community's input/feedback.
For context (though I don't really need you to read these to give the feedback I'm looking for),
- original issue/proposal: https://github.com/scipy/scipy/issues/9205 - PR: https://github.com/scipy/scipy/pull/9218
Before submitting the PR, I naively thought it was going to be nice Cython or similar. Turns out it's some pretty old code, that I found pretty hard to wrap my head around and understand. I eventually figured it out after spending ages debugging a nastily hidden 'bug', and demonstrated the performance optimizations, but it prompted the discussion about whether it was best to port everything to Cython first.
*Existing stuff to Cython*
Doing so shouldn't be too hard, and it shouldn't change any functionality, except to replace the distance functions with their Cython ones (instead of the current approach, where the distance functions are actually numpy things, and there's not supported access to the underlying C stuff). A few possible 'bugs' (as above) should hopefully become non-issues too. So, it'd be a win for performance (e.g. all the distance functions will be much faster), and code quality, and future maintainability and development. However, things to think about:
- should I just replace like-for-like, or consider some nicer OOP stuff like e.g. sklearn's approach (which is already Cython)? https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/d... (I'm guessing the reason they rolled their own was because they found scipy lacking, as above.) In fact, we could largely just copy that file over. Not sure about the interplay between scipy and scikit learn though. - what's the best way to structure a PR? One metric at a time, or the whole caboodle?
*Adding performance optimizations and future functionality*
As per the links, this was initially about providing a nifty performance hack. It should still be pretty easy to implement. Personally, I think it makes sense to implement after the Cythonization - unless the community are against that.
However, there are other possibilities:
- using 'indices' within calculations. E.g. when doing pdist, it might pay to use a tree of some description. I also proposed another 'index' to optimize the 'bail early' approach further (which would, I think, actually work well with trees too). This would involve more API changes, possibly significant. - using approximate searches (e.g. Faiss). My understanding is that depending on other libraries probably isn't really an option, so I'm not sure what means. - maybe other cool approaches like https://github.com/droyed/eucl_dist - providing a way to 'tune' distance computations to be optimal to your particular dataset and constraints (e.g. my 'bail early' optimization might be a lot faster or a bit slower, depending on your data ... or you might be OK with approximate matching with a 'low' error rate, etc.)
I guess what I'd like to see is a single place where users can get access to everything related to distance metrics and their uses, including all sorts of optimizations etc. (possibly for different hardware, and possibly distributed). To do that well is a pretty big undertaking, and I don't know whether it's suited to scipy - e.g. maybe scipy doesn't really care about distance stuff, or only wants to stick with 'simple' distance metric cases (e.g. a few thousand vectors, etc.). So, maybe it'd be better to start a completely new python package - which would probably be a lot easier to develop as I'd have a lot more flexibility (e.g. to depend on other packages, and not have to worry about breaking the scipy API etc.). On the other hand (as discussed in the latest comment on the PR), that might not be best - it might never get used/maintained etc.
So, what does the community think is the best approach? I've got too little context of what scipy is and what it's aiming for, and I don't want to head off on the wrong tack. Comments on any of the other implied questions would also be appreciated.
Thanks,
kodonnell
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
Reply from Kane to the digest: " Yes, I believe there are good unit tests (Ralf actually ran them I think). From the basis of this, and the feedback in the issue/PR, it sounds like the preference would be for me to upgrade the distance metrics to cython bit-by-bit. However, I'm still unsure about scipy vs a new package, since scipy apparently won't support dependencies. For example, I could use Faiss or flann, but scipy wouldn't add them as a dependency. With that in mind, it seems the only option for offering users full-functionality (aside from rewriting those libraries from scratch etc.) is to offer a package. Unless scipy is willing to specify a run-time dependency (not sure of the correct terminology) - e.g. scipy doesn't depend on Faiss (when you install it), but if the user calls a distance metric with e.g. method='faiss', it will fail if faiss cannot be imported. Is this a possibility (and a good idea)? " This older discussion on Flann may be relevant: https://mail.python.org/pipermail/scipy-dev/2011-May/thread.html. It says Flann only does Euclidean; not sure if that has changed since then. Regarding a dependency: https://github.com/mariusmuja/flann is basically inactivate for the last years; we wouldn't depend on it but could consider vendoring it. However, probably not worth it if it's for one method inside euclidean only. Faiss is still actively developed: https://github.com/facebookresearch/faiss, and looks like a much better option than Flann. However, something fast-moving like that which itself depends on BLAS and has GPU code in it too is not something we'd like to depend on nor want to vendor. Either way, Flann/Faiss is not about a 1:1 Cython translation, but about new features. We've got the distance metrics; approximate distances are probably best left to another package it looks like to me. On Mon, Sep 10, 2018 at 10:05 AM Mark Alexander Mikofski < mikofski@berkeley.edu> wrote:
I'm very interested to see how a successful cython/performance PR progresses from a reviewers standpoint.
On Sun, Sep 9, 2018, 5:10 PM Tyler Reddy <tyler.je.reddy@gmail.com> wrote:
Good to see some activity / interest in spatial.
Definitely agree with Ralf's github comments re: using smaller / more tractable PRs -- it really is tough to sit down at night with 30 minutes of free time or whatever and look at a massive diff & not want to give up.
I like the idea of making small / targeted / clearly demonstrated performance improvements without overhauling the entire infrastructure first, but maybe that's a controversial view if it just causes too much heterogeneity.
Presumably the affected code is all thoroughly covered by unit tests? That's an important pre-requisite to have the confidence to really start making changes, esp. with older low-level code like that.
On Thu, 6 Sep 2018 at 02:29, Kane & Anna O'Donnell < email.the.odonnells@gmail.com> wrote:
Hi all,
TLDR; I'm wondering about a) porting the spatial.distance code to Cython, and b) adding some performance optimizations, and I'd like the dev community's input/feedback.
For context (though I don't really need you to read these to give the feedback I'm looking for),
- original issue/proposal: https://github.com/scipy/scipy/issues/9205 - PR: https://github.com/scipy/scipy/pull/9218
Before submitting the PR, I naively thought it was going to be nice Cython or similar. Turns out it's some pretty old code, that I found pretty hard to wrap my head around and understand. I eventually figured it out after spending ages debugging a nastily hidden 'bug', and demonstrated the performance optimizations, but it prompted the discussion about whether it was best to port everything to Cython first.
*Existing stuff to Cython*
Doing so shouldn't be too hard, and it shouldn't change any functionality, except to replace the distance functions with their Cython ones (instead of the current approach, where the distance functions are actually numpy things, and there's not supported access to the underlying C stuff). A few possible 'bugs' (as above) should hopefully become non-issues too. So, it'd be a win for performance (e.g. all the distance functions will be much faster), and code quality, and future maintainability and development. However, things to think about:
- should I just replace like-for-like, or consider some nicer OOP stuff like e.g. sklearn's approach (which is already Cython)? https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/d... (I'm guessing the reason they rolled their own was because they found scipy lacking, as above.) In fact, we could largely just copy that file over. Not sure about the interplay between scipy and scikit learn though.
It wouldn't be the first time that code is moved from scikit-learn to scipy, that could make sense. Would be good to see if that makes sense from scikit-learn's point of view.
- what's the best way to structure a PR? One metric at a time, or the
whole caboodle?
How about one metric first (easier to review), and then the rest in one go?
*Adding performance optimizations and future functionality*
As per the links, this was initially about providing a nifty performance hack. It should still be pretty easy to implement. Personally, I think it makes sense to implement after the Cythonization - unless the community are against that.
However, there are other possibilities:
- using 'indices' within calculations. E.g. when doing pdist, it might pay to use a tree of some description. I also proposed another 'index' to optimize the 'bail early' approach further (which would, I think, actually work well with trees too). This would involve more API changes, possibly significant.
- using approximate searches (e.g. Faiss). My understanding is that
depending on other libraries probably isn't really an option, so I'm not sure what means. - maybe other cool approaches like https://github.com/droyed/eucl_dist - providing a way to 'tune' distance computations to be optimal to your particular dataset and constraints (e.g. my 'bail early' optimization might be a lot faster or a bit slower, depending on your data ... or you might be OK with approximate matching with a 'low' error rate, etc.)
I think some of these are an option; they'd need to be applicable to all distance metrics though and not just euclidean or a small subset. In the mailing list thread I linked to above there was some discussion as well about using kdtree/balltree.
I guess what I'd like to see is a single place where users can get access to everything related to distance metrics and their uses, including all sorts of optimizations etc. (possibly for different hardware, and possibly distributed). To do that well is a pretty big undertaking, and I don't know whether it's suited to scipy - e.g. maybe scipy doesn't really care about distance stuff, or only wants to stick with 'simple' distance metric cases (e.g. a few thousand vectors, etc.). So, maybe it'd be better to start a completely new python package - which would probably be a lot easier to develop as I'd have a lot more flexibility (e.g. to depend on other packages, and not have to worry about breaking the scipy API etc.). On the other hand (as discussed in the latest comment on the PR), that might not be best - it might never get used/maintained etc.
If you want to get really fancy, I'd lean towards a separate package. The idea of your current PR is in scopy for scipy.spatial though. We'd also be happy to link to a separate package from the scipy docs.
Cheers, Ralf
So, what does the community think is the best approach? I've got too little context of what scipy is and what it's aiming for, and I don't want to head off on the wrong tack. Comments on any of the other implied questions would also be appreciated.
Thanks,
kodonnell
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
Sorry, flann and faiss were just examples (I haven't actually researched different libraries in depth).
... approximate distances are probably best left to another package it looks like to me ... If you want to get really fancy, I'd lean towards a separate package.
OK, I want to try things which it sounds like scipy won't support, so, decision made: I'll aim to create a new package. If I actually do it, and if it's actually 'good', then there'll be a better discussion point for integrating it (if at all).
This older discussion on Flann may be relevant: https://mail.python.org/pipermail/scipy-dev/2011-May/thread.html. It says Flann only does Euclidean; not sure if that has changed since then. Regarding a dependency: https://github.com/mariusmuja/flann is basically inactivate for the last years; we wouldn't depend on it but could consider vendoring it. However, probably not worth it if it's for one method inside euclidean only.
Faiss is still actively developed: https://github.com/facebookresearch/faiss, and looks like a much better option than Flann. However, something fast-moving like that which itself depends on BLAS and has GPU code in it too is not something we'd like to depend on nor want to vendor.
Either way, Flann/Faiss is not about a 1:1 Cython translation, but about new features. We've got the distance metrics; approximate distances are probably best left to another package it looks like to me.
On Mon, Sep 10, 2018 at 10:05 AM Mark Alexander Mikofski <mikofski@berkeley.edu <mailto:mikofski@berkeley.edu>> wrote:
I'm very interested to see how a successful cython/performance PR progresses from a reviewers standpoint.
On Sun, Sep 9, 2018, 5:10 PM Tyler Reddy <tyler.je.reddy@gmail.com <mailto:tyler.je.reddy@gmail.com>> wrote:
Good to see some activity / interest in spatial.
Definitely agree with Ralf's github comments re: using smaller / more tractable PRs -- it really is tough to sit down at night with 30 minutes of free time or whatever and look at a massive diff & not want to give up.
I like the idea of making small / targeted / clearly demonstrated performance improvements without overhauling the entire infrastructure first, but maybe that's a controversial view if it just causes too much heterogeneity.
Presumably the affected code is all thoroughly covered by unit tests? That's an important pre-requisite to have the confidence to really start making changes, esp. with older low-level code like that.
On Thu, 6 Sep 2018 at 02:29, Kane & Anna O'Donnell <email.the.odonnells@gmail.com <mailto:email.the.odonnells@gmail.com>> wrote:
Hi all,
TLDR; I'm wondering about a) porting the spatial.distance code to Cython, and b) adding some performance optimizations, and I'd like the dev community's input/feedback.
For context (though I don't really need you to read these to give the feedback I'm looking for),
- original issue/proposal: https://github.com/scipy/scipy/issues/9205 - PR: https://github.com/scipy/scipy/pull/9218
Before submitting the PR, I naively thought it was going to be nice Cython or similar. Turns out it's some pretty old code, that I found pretty hard to wrap my head around and understand. I eventually figured it out after spending ages debugging a nastily hidden 'bug', and demonstrated the performance optimizations, but it prompted the discussion about whether it was best to port everything to Cython first.
*Existing stuff to Cython*
Doing so shouldn't be too hard, and it shouldn't change any functionality, except to replace the distance functions with their Cython ones (instead of the current approach, where the distance functions are actually numpy things, and there's not supported access to the underlying C stuff). A few possible 'bugs' (as above) should hopefully become non-issues too. So, it'd be a win for performance (e.g. all the distance functions will be much faster), and code quality, and future maintainability and development. However, things to think about:
- should I just replace like-for-like, or consider some nicer OOP stuff like e.g. sklearn's approach (which is already Cython)? https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/d... (I'm guessing the reason they rolled their own was because they found scipy lacking, as above.) In fact, we could largely just copy that file over. Not sure about the interplay between scipy and scikit learn though.
It wouldn't be the first time that code is moved from scikit-learn to scipy, that could make sense. Would be good to see if that makes sense from scikit-learn's point of view.
- what's the best way to structure a PR? One metric at a time, or the whole caboodle?
How about one metric first (easier to review), and then the rest in one go?
*Adding performance optimizations and future functionality* * * As per the links, this was initially about providing a nifty performance hack. It should still be pretty easy to implement. Personally, I think it makes sense to implement after the Cythonization - unless the community are against that.
However, there are other possibilities:
- using 'indices' within calculations. E.g. when doing pdist, it might pay to use a tree of some description. I also proposed another 'index' to optimize the 'bail early' approach further (which would, I think, actually work well with trees too). This would involve more API changes, possibly significant.
- using approximate searches (e.g. Faiss). My understanding is that depending on other libraries probably isn't really an option, so I'm not sure what means. - maybe other cool approaches like https://github.com/droyed/eucl_dist - providing a way to 'tune' distance computations to be optimal to your particular dataset and constraints (e.g. my 'bail early' optimization might be a lot faster or a bit slower, depending on your data ... or you might be OK with approximate matching with a 'low' error rate, etc.)
I think some of these are an option; they'd need to be applicable to all distance metrics though and not just euclidean or a small subset. In the mailing list thread I linked to above there was some discussion as well about using kdtree/balltree.
I guess what I'd like to see is a single place where users can get access to everything related to distance metrics and their uses, including all sorts of optimizations etc. (possibly for different hardware, and possibly distributed). To do that well is a pretty big undertaking, and I don't know whether it's suited to scipy - e.g. maybe scipy doesn't really care about distance stuff, or only wants to stick with 'simple' distance metric cases (e.g. a few thousand vectors, etc.). So, maybe it'd be better to start a completely new python package - which would probably be a lot easier to develop as I'd have a lot more flexibility (e.g. to depend on other packages, and not have to worry about breaking the scipy API etc.). On the other hand (as discussed in the latest comment on the PR), that might not be best - it might never get used/maintained etc.
If you want to get really fancy, I'd lean towards a separate package. The idea of your current PR is in scopy for scipy.spatial though. We'd also be happy to link to a separate package from the scipy docs.
Cheers, Ralf
So, what does the community think is the best approach? I've got too little context of what scipy is and what it's aiming for, and I don't want to head off on the wrong tack. Comments on any of the other implied questions would also be appreciated.
Thanks,
kodonnell
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org <mailto:SciPy-Dev@python.org> https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org <mailto:SciPy-Dev@python.org> https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org <mailto:SciPy-Dev@python.org> https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
OK, after a bit of playing around, it actually looks like faiss supports much of what I was intending (including auto-tuning <https://github.com/facebookresearch/faiss/wiki/Index-IO,-index-factory,-clon...> etc.). It's only the l2 norm / dot product, but I figure those cover most use cases anyway. So maybe I'll submit some PRs to piece-by-piece migrate scipy.spatial to cython etc. On Sun, 16 Sep 2018 at 20:26, Kane & Anna O'Donnell < email.the.odonnells@gmail.com> wrote:
Sorry, flann and faiss were just examples (I haven't actually researched different libraries in depth).
... approximate distances are probably best left to another package it looks like to me ... If you want to get really fancy, I'd lean towards a separate package.
OK, I want to try things which it sounds like scipy won't support, so, decision made: I'll aim to create a new package. If I actually do it, and if it's actually 'good', then there'll be a better discussion point for integrating it (if at all).
This older discussion on Flann may be relevant: https://mail.python.org/pipermail/scipy-dev/2011-May/thread.html. It says Flann only does Euclidean; not sure if that has changed since then. Regarding a dependency: https://github.com/mariusmuja/flann is basically inactivate for the last years; we wouldn't depend on it but could consider vendoring it. However, probably not worth it if it's for one method inside euclidean only.
Faiss is still actively developed: https://github.com/facebookresearch/faiss, and looks like a much better option than Flann. However, something fast-moving like that which itself depends on BLAS and has GPU code in it too is not something we'd like to depend on nor want to vendor.
Either way, Flann/Faiss is not about a 1:1 Cython translation, but about new features. We've got the distance metrics; approximate distances are probably best left to another package it looks like to me.
On Mon, Sep 10, 2018 at 10:05 AM Mark Alexander Mikofski < mikofski@berkeley.edu> wrote:
I'm very interested to see how a successful cython/performance PR progresses from a reviewers standpoint.
On Sun, Sep 9, 2018, 5:10 PM Tyler Reddy <tyler.je.reddy@gmail.com> wrote:
Good to see some activity / interest in spatial.
Definitely agree with Ralf's github comments re: using smaller / more tractable PRs -- it really is tough to sit down at night with 30 minutes of free time or whatever and look at a massive diff & not want to give up.
I like the idea of making small / targeted / clearly demonstrated performance improvements without overhauling the entire infrastructure first, but maybe that's a controversial view if it just causes too much heterogeneity.
Presumably the affected code is all thoroughly covered by unit tests? That's an important pre-requisite to have the confidence to really start making changes, esp. with older low-level code like that.
On Thu, 6 Sep 2018 at 02:29, Kane & Anna O'Donnell < email.the.odonnells@gmail.com> wrote:
Hi all,
TLDR; I'm wondering about a) porting the spatial.distance code to Cython, and b) adding some performance optimizations, and I'd like the dev community's input/feedback.
For context (though I don't really need you to read these to give the feedback I'm looking for),
- original issue/proposal: https://github.com/scipy/scipy/issues/9205 - PR: https://github.com/scipy/scipy/pull/9218
Before submitting the PR, I naively thought it was going to be nice Cython or similar. Turns out it's some pretty old code, that I found pretty hard to wrap my head around and understand. I eventually figured it out after spending ages debugging a nastily hidden 'bug', and demonstrated the performance optimizations, but it prompted the discussion about whether it was best to port everything to Cython first.
*Existing stuff to Cython*
Doing so shouldn't be too hard, and it shouldn't change any functionality, except to replace the distance functions with their Cython ones (instead of the current approach, where the distance functions are actually numpy things, and there's not supported access to the underlying C stuff). A few possible 'bugs' (as above) should hopefully become non-issues too. So, it'd be a win for performance (e.g. all the distance functions will be much faster), and code quality, and future maintainability and development. However, things to think about:
- should I just replace like-for-like, or consider some nicer OOP stuff like e.g. sklearn's approach (which is already Cython)? https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/d... (I'm guessing the reason they rolled their own was because they found scipy lacking, as above.) In fact, we could largely just copy that file over. Not sure about the interplay between scipy and scikit learn though.
It wouldn't be the first time that code is moved from scikit-learn to scipy, that could make sense. Would be good to see if that makes sense from scikit-learn's point of view.
- what's the best way to structure a PR? One metric at a time, or the
whole caboodle?
How about one metric first (easier to review), and then the rest in one go?
*Adding performance optimizations and future functionality*
As per the links, this was initially about providing a nifty performance hack. It should still be pretty easy to implement. Personally, I think it makes sense to implement after the Cythonization - unless the community are against that.
However, there are other possibilities:
- using 'indices' within calculations. E.g. when doing pdist, it might pay to use a tree of some description. I also proposed another 'index' to optimize the 'bail early' approach further (which would, I think, actually work well with trees too). This would involve more API changes, possibly significant.
- using approximate searches (e.g. Faiss). My understanding is that
depending on other libraries probably isn't really an option, so I'm not sure what means. - maybe other cool approaches like https://github.com/droyed/eucl_dist - providing a way to 'tune' distance computations to be optimal to your particular dataset and constraints (e.g. my 'bail early' optimization might be a lot faster or a bit slower, depending on your data ... or you might be OK with approximate matching with a 'low' error rate, etc.)
I think some of these are an option; they'd need to be applicable to all distance metrics though and not just euclidean or a small subset. In the mailing list thread I linked to above there was some discussion as well about using kdtree/balltree.
I guess what I'd like to see is a single place where users can get access to everything related to distance metrics and their uses, including all sorts of optimizations etc. (possibly for different hardware, and possibly distributed). To do that well is a pretty big undertaking, and I don't know whether it's suited to scipy - e.g. maybe scipy doesn't really care about distance stuff, or only wants to stick with 'simple' distance metric cases (e.g. a few thousand vectors, etc.). So, maybe it'd be better to start a completely new python package - which would probably be a lot easier to develop as I'd have a lot more flexibility (e.g. to depend on other packages, and not have to worry about breaking the scipy API etc.). On the other hand (as discussed in the latest comment on the PR), that might not be best - it might never get used/maintained etc.
If you want to get really fancy, I'd lean towards a separate package. The idea of your current PR is in scopy for scipy.spatial though. We'd also be happy to link to a separate package from the scipy docs.
Cheers, Ralf
So, what does the community think is the best approach? I've got too little context of what scipy is and what it's aiming for, and I don't want to head off on the wrong tack. Comments on any of the other implied questions would also be appreciated.
Thanks,
kodonnell
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
_______________________________________________ SciPy-Dev mailing listSciPy-Dev@python.orghttps://mail.python.org/mailman/listinfo/scipy-dev
participants (4)
-
Kane & Anna O'Donnell -
Mark Alexander Mikofski -
Ralf Gommers -
Tyler Reddy