[scikit-learn] distances

Andreas Mueller t3kcit at gmail.com
Thu Mar 5 16:12:43 EST 2020


Thanks for a great summary of issues!
I agree there's lots to do, though I think most of the issues you 
list are quite hard and require serious thinking about the API.
So they might not be very amenable to being solved by a shorter-term 
project.

I was hoping there would be some more easy wins that we could get by 
exploiting OpenMP better (or at all) in the distances.
Not sure if there is, though.

I wonder if having a multicore implementation of euclidean_distances 
would be useful for us, or if that's going too low-level.
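For context, pairwise_distances already offers joblib-based parallelism
through n_jobs, which chunks the computation rather than threading the
kernel itself (a minimal sketch; the chunking details vary across versions):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
Y = rng.rand(100, 10)

# n_jobs distributes chunks of the distance matrix with joblib;
# it does not use OpenMP threads inside the distance kernel itself.
d_serial = pairwise_distances(X, Y, metric="euclidean", n_jobs=1)
d_parallel = pairwise_distances(X, Y, metric="euclidean", n_jobs=2)

assert d_parallel.shape == (200, 100)
assert np.allclose(d_serial, d_parallel)
```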



On 3/3/20 5:47 PM, Joel Nothman wrote:
> I noticed a comment by @amueller on Gitter re considering a project on 
> our distances implementations.
>
> I think there's a lot of work that can be done in unifying distances 
> implementations (though I'm not always sure of the benefit). I thought 
> I would summarise some of the issues below, as I was unsure what Andy 
> intended.
>
> As @jeremiedbb said, making n_jobs more effective would be beneficial. 
> Reducing duplication between metrics.pairwise, 
> neighbors._dist_metrics and kmeans would be noble (especially with 
> regard to parameters, where scipy.spatial's mahalanobis, available 
> through sklearn.metrics, does not accept V but sklearn.neighbors does), 
> and could perhaps offer greater consistency of results and efficiency.
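As a concrete illustration of the parameter discrepancy, sklearn.neighbors'
DistanceMetric takes V (or VI) explicitly rather than computing it from the
data (a minimal sketch; the class later moved to sklearn.metrics, hence the
fallback import):

```python
import numpy as np

try:
    from sklearn.metrics import DistanceMetric   # newer sklearn versions
except ImportError:
    from sklearn.neighbors import DistanceMetric  # location at the time of this thread

rng = np.random.RandomState(0)
X = rng.rand(10, 3)

# The covariance matrix must be supplied explicitly; DistanceMetric
# will not derive it from the data for you.
V = np.cov(X.T)
dm = DistanceMetric.get_metric("mahalanobis", V=V)
d = dm.pairwise(X)

assert d.shape == (10, 10)
assert np.allclose(np.diag(d), 0)
```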
>
> We also have idioms in the code like "if the metric is euclidean, use 
> squared=True where we only need a ranking, then take the square root", 
> while the neighbors metrics abstract this behind an API by providing rdist 
> and rdist_to_dist.
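The squared-distance idiom rests on the square root being monotonic, so
rankings computed from squared distances are unchanged. A minimal sketch:

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
X = rng.rand(30, 4)
Y = rng.rand(30, 4)

# squared=True skips the final sqrt; since sqrt is strictly increasing,
# the nearest-neighbour ordering is identical either way.
d2 = euclidean_distances(X, Y, squared=True)
d = euclidean_distances(X, Y)

assert np.allclose(np.sqrt(d2), d)
assert np.array_equal(np.argsort(d2, axis=1), np.argsort(d, axis=1))
```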
>
> There are issues about making sure that 
> pairwise_distances(metric='minkowski', p=2) is using the same 
> implementation as pairwise_distances(metric='euclidean'), etc.
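A quick check of that equivalence (a sketch; the two metric strings may be
dispatched to different implementations internally, so only approximate
equality should be expected):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X = rng.rand(20, 5)

# Mathematically identical metrics, potentially different code paths.
d_mink = pairwise_distances(X, metric="minkowski", p=2)
d_eucl = pairwise_distances(X, metric="euclidean")

assert np.allclose(d_mink, d_eucl)
```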
>
> We have issues with chunking and distributing computations in the case 
> that metric params are derived from the dataset (ideally a training set).
>
> #16419 is a simple instance where the metric param is sample-aligned 
> and needs to be chunked up.
>
> In other cases, we precompute some metric param over all the data, 
> then pass it to each chunk worker, using _precompute_metric_params 
> introduced in #12672. This is also relevant to #9555.
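The precompute-then-chunk pattern can be sketched with scipy's seuclidean:
the variance vector V is computed once over all the data and reused for
every chunk, so chunked and unchunked results agree (a sketch of the idea,
not the internal _precompute_metric_params code):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
Y = rng.rand(40, 3)

# Precompute the variance once over the stacked data, then pass the
# same V to every chunk worker.
V = np.var(np.vstack([X, Y]), axis=0, ddof=1)

full = cdist(X, Y, metric="seuclidean", V=V)
chunks = [cdist(X[i:i + 25], Y, metric="seuclidean", V=V)
          for i in range(0, 100, 25)]

assert np.allclose(np.vstack(chunks), full)
```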
>
> While that initial implementation in #12672 is helpful and aims to 
> maintain backwards compatibility, it makes some dubious choices.
>
> Firstly, in terms of code structure it is not a very modular approach - 
> each metric is handled with an if-then. Secondly, it *only* handles 
> the chunking case, relying on the fact that these metrics are in 
> scipy.spatial and have comparable handling of V=None and VI=None. 
> In the Gower Distances PR (#9555), when implementing a metric locally 
> rather than relying on scipy.spatial, we needed to provide an 
> implementation of these default parameters both when the data is 
> chunked and when the metric function is called directly.
>
> Thirdly, its approach to training vs test data is dubious. We don't 
> formally label X and Y in pairwise_distances as train/test, and 
> perhaps we should. Maintaining backwards compat with scipy's 
> seuclidean and mahalanobis, our implementation stacks X and Y 
> together if both are provided, and then calculates their variance. This 
> means that users may be applying a different metric at train and at 
> test time (if the variance of X as train and Y as test is 
> substantially different), which I consider a silent error. We can 
> either make the train/test nature of X and Y more explicit, or we can 
> require that data-based parameters are given explicitly by the user 
> and not implicitly computed. If I understand correctly, 
> sklearn.neighbors will not compute V or VI for you, and it must be 
> provided explicitly. (Requiring that the scaling of each feature be 
> given explicitly in Gower seems like an unnecessary burden on the 
> user, however.)
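The silent train/test mismatch can be demonstrated directly: computing the
seuclidean variance over stacked X and Y yields a different metric than
computing it over the training data alone (illustrative data only; the
scale difference is exaggerated to make the effect obvious):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X_train = rng.rand(50, 2)        # stand-in for a training set
X_test = 10 * rng.rand(20, 2)    # test set with a very different scale

V_train = X_train.var(axis=0, ddof=1)
V_stacked = np.vstack([X_train, X_test]).var(axis=0, ddof=1)

d_train = cdist(X_test, X_train, "seuclidean", V=V_train)
d_stacked = cdist(X_test, X_train, "seuclidean", V=V_stacked)

# The nominally "same" metric gives different distances whenever the
# test distribution differs from the training distribution.
assert not np.allclose(d_train, d_stacked)
```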
>
> Then there are issues like whether we should consistently set the 
> diagonal to zero in all metrics where Y=None.
>
> In short, there are several projects in distances, and I'd support 
> them being considered for work.... But it's a lot of engineering, even if 
> motivated by ML needs and consistency for users.
>
> J
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
