[scikit-learn] Inconsistencies in clustering documentations

Wed May 23 05:50:24 EDT 2018

Dear all,

Three clustering algorithms can take as input distance or similarity
matrices instead of the observations (AgglomerativeClustering
<http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering>,
AffinityPropagation
<http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation>,
and DBSCAN
<http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN>),
but their documentation is inconsistent.

*DBSCAN:*
   The documentation explains clearly how to run DBSCAN with a
precomputed distance matrix.
   Constructor:
       metric: "If metric is 'precomputed', X is assumed to be a distance
matrix and must be square."
   fit / fit_predict:
       X: "A feature array, or array of distances between samples if
metric='precomputed'."
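
For illustration, here is a minimal sketch of the precomputed usage quoted
above. It assumes NumPy and scikit-learn are installed; the toy points and
the eps / min_samples values are made up for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (made up for this example): two tight groups and one outlier.
X = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [50, 50]], dtype=float
)

# Square matrix of pairwise Euclidean distances, shape (n_samples, n_samples).
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# With metric="precomputed", the X given to fit / fit_predict is the
# distance matrix itself, not the feature array.
labels_precomputed = DBSCAN(
    eps=1.5, min_samples=2, metric="precomputed"
).fit_predict(D)

# Same parameters on the raw features (the default metric is Euclidean).
labels_features = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)

print(labels_precomputed)  # noise points are labelled -1
print(labels_features)
```

Both calls should give the same labels here, since D holds exactly the
Euclidean distances that DBSCAN would otherwise compute from X.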

*AffinityPropagation:*
    Constructor:
        affinity: "Which affinity to use. At the moment 'precomputed'
and 'euclidean' are supported. 'euclidean' uses the negative squared
euclidean distance between points."
    fit:
        X: "Data matrix or, if affinity is 'precomputed', matrix of
similarities / affinities."
    fit_predict:
        X: "Input data."
        Can X also be a matrix of similarities here? Shouldn't fit and
fit_predict share the same documentation for the input X?
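
To make the 'euclidean' affinity quoted above concrete, here is a small
pure-Python sketch of a similarity matrix built from negative squared
Euclidean distances. The helper name negative_squared_euclidean and the
sample points are made up for the example; this is not scikit-learn's
implementation.

```python
def negative_squared_euclidean(points):
    """Return S with S[i][j] = -(squared Euclidean distance between i and j)."""
    return [
        [-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
        for p in points
    ]


# Hypothetical sample points, chosen so the distances are round numbers.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
S = negative_squared_euclidean(points)

print(S[0][1])  # -25.0
print(S[0][2])  # -100.0
```

Larger (less negative) values mean more similar points, which is the
opposite orientation from a distance matrix.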

*AgglomerativeClustering:*
    Constructor:
        affinity: "Metric used to compute the linkage. Can be
'euclidean', 'l1', 'l2', 'manhattan', 'cosine', or 'precomputed'. If
linkage is 'ward', only 'euclidean' is accepted."
        The name of the parameter 'affinity' seems misleading, since it
refers to distance functions rather than similarity functions.
    fit:
        X: "The samples a.k.a. observations."
    fit_predict:
        X: "Input data."
        The documentation of fit and fit_predict does not specify that
X can also be a matrix of distances.
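
A small pure-Python sketch shows why the distinction matters for a
distance-based method. It only mimics the first merge of an agglomerative
procedure (join the pair with the smallest matrix entry) and is not
scikit-learn's implementation. The helper first_merge, the toy matrix and
the 1 / (1 + d) conversion to similarities are all assumptions made for
the example.

```python
def first_merge(matrix):
    """Return the pair (i, j), i < j, with the smallest off-diagonal entry."""
    n = len(matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda p: matrix[p[0]][p[1]])


# Toy distance matrix: samples 0 and 1 are close, sample 2 is far away.
distances = [
    [0.0, 1.0, 9.0],
    [1.0, 0.0, 8.0],
    [9.0, 8.0, 0.0],
]

# Hypothetical conversion to similarities (larger means closer).
similarities = [[1.0 / (1.0 + d) for d in row] for row in distances]

print(first_merge(distances))     # (0, 1): the two closest samples
print(first_merge(similarities))  # (0, 2): the two farthest samples
```

Feeding a similarity matrix to a procedure that expects distances merges
the least similar samples first, which is why the documentation should
state which of the two is expected.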

Users may be unsure whether they should provide a distance or a
similarity matrix to AgglomerativeClustering.
The documentation of fit and fit_predict can easily be updated. Renaming
the 'affinity' parameter is more difficult, since it involves an API
change.

What do you think of these potential updates to the documentation?

Cheers,

Anaël Beaugnon