[scikit-learn] distances

Jeremie du Boisberranger jeremie.du-boisberranger at inria.fr
Fri Mar 6 05:00:45 EST 2020


Although pairwise distances are very good candidates for OpenMP-based 
multi-threading due to their embarrassingly parallel nature, I think 
euclidean distance (from the pairwise module) is the one that will 
benefit least from it. Its implementation, using the dot trick, relies 
on a BLAS level 3 routine (matrix-matrix multiplication), which will 
always be better optimized, better parallelized, and have runtime CPU 
detection.
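To make the "dot trick" concrete, here is a minimal NumPy sketch (names 
are mine, not scikit-learn's): the expansion ||x - y||^2 = ||x||^2 
- 2 x.y + ||y||^2 turns the bulk of the work into one matrix product, 
which BLAS handles:

```python
import numpy as np

def euclidean_distances_dot_trick(X, Y):
    # The -2 * X @ Y.T term is a single GEMM (BLAS level 3) call,
    # which dominates the cost and is already multi-threaded.
    XX = np.einsum("ij,ij->i", X, X)[:, None]  # squared row norms of X
    YY = np.einsum("ij,ij->i", Y, Y)[None, :]  # squared row norms of Y
    D2 = XX - 2.0 * (X @ Y.T) + YY
    np.maximum(D2, 0.0, out=D2)  # clip tiny negatives from rounding
    return np.sqrt(D2)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Y = rng.normal(size=(4, 3))
D = euclidean_distances_dot_trick(X, Y)

# brute-force reference for comparison
D_ref = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
```

Note the rounding clip: the expansion can produce small negative values 
for nearly identical rows, which is one reason the trick trades some 
numerical accuracy for speed.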

Side note: what really makes KMeans faster is not the fact that 
euclidean distances are computed by chunks; it's that the chunked 
pairwise distance matrix fits in cache, so it stays there for the 
following operations on that matrix (finding labels, partially updating 
centers). So this speed-up does not apply when only computing euclidean 
distances.
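A rough sketch of that chunking pattern (a simplification of what 
KMeans does, with hypothetical names): each chunk of the distance 
matrix is produced and immediately consumed while it is still in cache.

```python
import numpy as np

def assign_labels_chunked(X, centers, chunk_size=256):
    # Compute one (chunk_size, n_clusters) block of distances at a
    # time; the argmin that follows reuses the block while it is
    # still cache-resident instead of re-reading it from memory.
    n = X.shape[0]
    labels = np.empty(n, dtype=np.intp)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        chunk = X[start:stop]
        d2 = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels[start:stop] = d2.argmin(axis=1)
    return labels

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
centers = rng.normal(size=(10, 8))
labels = assign_labels_chunked(X, centers, chunk_size=100)

# unchunked reference, same result
labels_ref = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
```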

On the other hand, other metrics don't all have internal 
multi-threading, and probably none rely on level 3 BLAS routines. 
Computing pairwise distances usually does not involve a lot of 
computation and is quite fast, so parallelizing it with joblib brings 
no benefit: the joblib overhead is bigger than the computations 
themselves. Unless the data is big enough, but memory issues will 
appear before that point :) Those metrics could probably benefit from 
OpenMP-based multi-threading.

About going too low-level: we already have the DistanceMetric module 
implementing all metrics in Cython, so I'd say we're already kind of 
low-level, and in that case using OpenMP would really just be adding a 
'p' before 'range' :) I think a good first step could be to move this 
module into metrics, where it really belongs, rework it to make it 
fused-typed and sparse-friendly, and add some prange. Obviously it will 
keep most of the API flaws that @jnothman exposed, but it might set up 
cleaner ground for future API changes.
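For illustration, the "adding a 'p'" change could look something like 
this hypothetical Cython sketch (not the actual DistanceMetric code; 
names are mine):

```cython
# Hypothetical sketch of an OpenMP-parallel euclidean pairwise loop,
# in the spirit of DistanceMetric; not scikit-learn's actual code.
from cython.parallel import prange
from libc.math cimport sqrt

def pairwise_euclidean(double[:, ::1] X, double[:, ::1] Y,
                       double[:, ::1] out):
    cdef Py_ssize_t i, j, k
    cdef double d, tmp
    with nogil:
        # prange instead of range: OpenMP threads over the rows of X
        for i in prange(X.shape[0]):
            for j in range(Y.shape[0]):
                d = 0.0
                for k in range(X.shape[1]):
                    tmp = X[i, k] - Y[j, k]
                    d = d + tmp * tmp
                out[i, j] = sqrt(d)
```

The outer loop is embarrassingly parallel (each row of `out` is 
independent), so prange with nogil is all that's needed.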

In the end, whatever you choose, I'd be happy to help.

Jérémie (@jeremiedbb)



On 05/03/2020 22:12, Andreas Mueller wrote:
> Thanks for a great summary of issues!
> I agree there's lots to do, though I think most of the issues that you 
> list are quite hard and require thinking about API pretty hard.
> So they might not be super amenable to being solved by a shorter-term 
> project.
>
> I was hoping there would be some more easy wins that we could get by 
> exploiting OpenMP better (or at all) in the distances.
> Not sure if there is, though.
>
> I wonder if having a multicore implementation of euclidean_distances 
> would be useful for us, or if that's going too low-level.
>
>
>
> On 3/3/20 5:47 PM, Joel Nothman wrote:
>> I noticed a comment by @amueller on Gitter re considering a project 
>> on our distances implementations.
>>
>> I think there's a lot of work that can be done in unifying distances 
>> implementations... (though I'm not always sure the benefit.) I 
>> thought I would summarise some of the issues below, as I was unsure 
>> what Andy intended.
>>
>> As @jeremiedbb said, making n_jobs more effective would be 
>> beneficial. Reducing duplication between metrics.pairwise, 
>> neighbors._dist_metrics and kmeans would be noble (especially with 
>> regard to parameters, where scipy.spatial's mahalanobis available 
>> through sklearn.metrics does not accept V but sklearn.neighbors 
>> does), and it would perhaps offer more consistent results and 
>> efficiencies.
>>
>> We also have idioms in the code like "if the metric is euclidean, 
>> use squared=True where we only need a ranking, then take the square 
>> root", while the neighbors metrics abstract this with an API 
>> providing rdist and rdist_to_dist.
>>
>> There are issues about making sure that 
>> pairwise_distances(metric='minkowski', p=2) is using the same 
>> implementation as pairwise_distances(metric='euclidean'), etc.
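The consistency requirement here is just the mathematical identity that 
Minkowski at p=2 is euclidean; a small NumPy check (hypothetical helper 
name):

```python
import numpy as np

def minkowski(x, y, p):
    # General Minkowski distance: (sum |x_i - y_i|^p)^(1/p).
    # At p=2 this must reduce exactly to the euclidean distance.
    return float((np.abs(x - y) ** p).sum() ** (1.0 / p))

rng = np.random.default_rng(0)
x = rng.normal(size=10)
y = rng.normal(size=10)

d_mink = minkowski(x, y, p=2)
d_eucl = float(np.linalg.norm(x - y))
```

The subtlety for the library is not the math but making sure both 
spellings route to the same (fast, dot-trick) implementation.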
>>
>> We have issues with chunking and distributing computations in the 
>> case that metric params are derived from the dataset (ideally a 
>> training set).
>>
>> #16419 is a simple instance where the metric param is sample-aligned 
>> and needs to be chunked up.
>>
>> In other cases, we precompute some metric param over all the data, 
>> then pass it to each chunk worker, using _precompute_metric_params 
>> introduced in #12672. This is also relevant to #9555.
>>
>> While that initial implementation in #12672 is helpful and aims to 
>> maintain backwards compatibility, it makes some dubious choices.
>>
>> Firstly in terms of code structure it is not a very modular approach 
>> - each metric is handled with an if-then. Secondly, it *only* handles 
>> the chunking case, relying on the fact that these metrics are in 
>> scipy.spatial, and have a comparable handling of V=None and VI=None. 
>> In the Gower Distances PR (#9555) when implementing a metric locally, 
>> rather than relying on scipy.spatial, we needed to provide an 
>> implementation of these default parameters both when the data is 
>> chunked and when the metric function is called straight out.
>>
>> Thirdly, its approach to training vs test data is dubious. We don't 
>> formally label X and Y in pairwise_distances as train/test, and 
>> perhaps we should. Maintaining backwards compat with scipy's 
>> seuclidean and mahalanobis, our implementation stacks X and Y to each 
>> other if both are provided, and then calculates their variance. This 
>> means that users may be applying a different metric at train and at 
>> test time (if the variance of X as train and Y as test is 
>> substantially different), which I consider a silent error. We can 
>> either make the train/test nature of X and Y more explicit, or we can 
>> require that data-based parameters are given explicitly by the user 
>> and not implicitly computed. If I understand correctly, 
>> sklearn.neighbors will not compute V or VI for you, and it must be 
>> provided explicitly. (Requiring that the scaling of each feature be 
>> given explicitly in Gower seems like an unnecessary burden on the 
>> user, however.)
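The silent-error scenario above can be demonstrated in a few lines (a 
sketch with made-up data, not the library code): the standardized 
euclidean distance depends on which variance V is used, so computing V 
over the stacked X and Y versus over X alone yields different metrics 
for the very same pair of points.

```python
import numpy as np

def seuclidean(x, y, V):
    # Standardized euclidean: each feature is scaled by its variance.
    return float(np.sqrt(((x - y) ** 2 / V).sum()))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))        # "train" data
Y_test = 5.0 * rng.normal(size=(100, 3))   # "test" data, larger spread

V_train = X_train.var(axis=0)                          # train-only V
V_stacked = np.vstack([X_train, Y_test]).var(axis=0)   # stacked V

d_train = seuclidean(X_train[0], Y_test[0], V_train)
d_stacked = seuclidean(X_train[0], Y_test[0], V_stacked)
# The two conventions disagree on the distance between the same pair,
# so the metric silently changes between "train" and "test" usage.
```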
>>
>> Then there are issues like whether we should consistently set the 
>> diagonal to zero in all metrics where Y=None.
>>
>> In short, there are several projects in distances, and I'd support 
>> them being considered for work.... But it's a lot of engineering, if 
>> motivated by ML needs and consistency for users.
>>
>> J
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
>