[Scipy-svn] r4670 - trunk/scipy/cluster
scipy-svn at scipy.org
scipy-svn at scipy.org
Sat Aug 23 15:04:55 EDT 2008
Author: damian.eads
Date: 2008-08-23 14:04:52 -0500 (Sat, 23 Aug 2008)
New Revision: 4670
Modified:
trunk/scipy/cluster/hierarchy.py
Log:
Finished RSTifying scipy.cluster.hierarchy.linkage function.
Modified: trunk/scipy/cluster/hierarchy.py
===================================================================
--- trunk/scipy/cluster/hierarchy.py 2008-08-23 18:44:21 UTC (rev 4669)
+++ trunk/scipy/cluster/hierarchy.py 2008-08-23 19:04:52 UTC (rev 4670)
@@ -427,114 +427,153 @@
def linkage(y, method='single', metric='euclidean'):
- """ Z = linkage(y, method)
+ """
+ Performs hierarchical/agglomerative clustering on the
+ condensed distance matrix y. y must be a {n \choose 2} sized
+ vector where n is the number of original observations paired
+ in the distance matrix. The behavior of this function is very
+ similar to the MATLAB(TM) linkage function.
- Performs hierarchical/agglomerative clustering on the
- condensed distance matrix y. y must be a {n \choose 2} sized
- vector where n is the number of original observations paired
- in the distance matrix. The behavior of this function is very
- similar to the MATLAB(TM) linkage function.
+ A 4 by :math:`$(n-1)$` matrix ``Z`` is returned. At the
+ :math:`$i$`th iteration, clusters with indices ``Z[i, 0]`` and
+ ``Z[i, 1]`` are combined to form cluster :math:`$n + i$`. A
+ cluster with an index less than :math:`$n$` corresponds to one of
+ the :math:`$n$` original observations. The distance between
+ clusters ``Z[i, 0]`` and ``Z[i, 1]`` is given by ``Z[i, 2]``. The
+ fourth value ``Z[i, 3]`` represents the number of original
+ observations in the newly formed cluster.
- A (n - 1) * 4 matrix Z is returned. At the i'th iteration,
- clusters with indices Z[i, 0] and Z[i, 1] are combined to
- form cluster n + i. A cluster with an index less than n
- corresponds to one of the n original observations. The
- distance between clusters Z[i, 0] and Z[i, 1] is given by
- Z[i, 2]. The fourth value Z[i, 3] represents the number of
- original observations in the newly formed cluster.
+ The following linkage methods are used to compute the distance
+ :math:`$d(s, t)$` between two clusters :math:`$s$` and
+ :math:`$t$`. The algorithm begins with a forest of clusters that
+ have yet to be used in the hierarchy being formed. When two
+ clusters :math:`$s$` and :math:`$t$` from this forest are combined
+ into a single cluster :math:`$u$`, :math:`$s$` and :math:`$t$` are
+ removed from the forest, and :math:`$u$` is added to the
+ forest. When only one cluster remains in the forest, the algorithm
+ stops, and this cluster becomes the root.
- The following linkage methods are used to compute the
- distance dist(s, t) between two clusters s and t. The
- algorithm begins with a forest of clusters that have yet
- to be used in the hierarchy being formed. When two clusters
- s and t from this forest are combined into a single cluster u,
- s and t are removed from the forest, and u is added to
- the forest. When only one cluster remains in the forest,
- the algorithm stops, and this cluster becomes the root.
+ A distance matrix is maintained at each iteration. The ``d[i,j]``
+ entry corresponds to the distance between cluster :math:`$i$` and
+ :math:`$j$` in the original forest.
- A distance matrix is maintained at each iteration. The
- d[i,j] entry corresponds to the distance between cluster
- i and j in the original forest.
+ At each iteration, the algorithm must update the distance matrix
+ to reflect the distance of the newly formed cluster u with the
+ remaining clusters in the forest.
- At each iteration, the algorithm must update the distance
- matrix to reflect the distance of the newly formed cluster
- u with the remaining clusters in the forest.
+ Suppose there are :math:`$|u|$` original observations
+ :math:`$u[0], \ldots, u[|u|-1]$` in cluster :math:`$u$` and
+ :math:`$|v|$` original objects :math:`$v[0], \ldots, v[|v|-1]$` in
+ cluster :math:`$v$`. Recall :math:`$s$` and :math:`$t$` are
+ combined to form cluster :math:`$u$`. Let :math:`$v$` be any
+ remaining cluster in the forest that is not :math:`$u$`.
- Suppose there are |u| original observations u[0], ..., u[|u|-1]
- in cluster u and |v| original objects v[0], ..., v[|v|-1]
- in cluster v. Recall s and t are combined to form cluster
- u. Let v be any remaining cluster in the forest that is not
- u.
+ :Parameters:
+ Q : ndarray
+ A condensed or redundant distance matrix. A condensed
+ distance matrix is a flat array containing the upper
+ triangular of the distance matrix. This is the form that
+ ``pdist`` returns. Alternatively, a collection of
+ :math:`$m$` observation vectors in n dimensions may be passed as
+ a :math:`$m$` by :math:`$n$` array.
+ method : string
+ The linkage algorithm to use. See the ``Linkage Methods``
+ section below for full descriptions.
+ metric : string
+ The distance metric to use. See the ``Distance Metric``
+ section below for full descriptions.
- The following are methods for calculating the distance between
- the newly formed cluster u and each v.
+ Linkage Methods
+ ---------------
- * method='single' assigns dist(u,v) = MIN(dist(u[i],v[j])
- for all points i in cluster u and j in cluster v.
+ The following are methods for calculating the distance between the
+ newly formed cluster :math:`$u$` and each :math:`$v$`.
- (also called Nearest Point Algorithm)
+ * method=``single`` assigns
- * method='complete' assigns dist(u,v) = MAX(dist(u[i],v[j])
- for all points i in cluster u and j in cluster v.
+ .. math:
+ d(u,v) = \min(dist(u[i],v[j]))
- (also called Farthest Point Algorithm
- or the Voor Hees Algorithm)
+ for all points :math:`$i$` in cluster :math:`$u$` and
+ :math:`$j$` in cluster :math:`$v$`. This is also known as the
+ Nearest Point Algorithm.
- * method='average' assigns dist(u,v) =
- \sum_{ij} { dist(u[i], v[j]) } / (|u|*|v|)
- for all points i and j where |u| and |v| are the
- cardinalities of clusters u and v, respectively.
+ * method=``complete`` assigns
- (also called UPGMA)
+ .. math:
+ d(u, v) = \max(dist(u[i],v[j]))
- * method='weighted' assigns
+ for all points :math:`$i$` in cluster u and :math:`$j$` in
+ cluster :math:`$v$`. This is also known by the Farthest Point
+ Algorithm or Voor Hees Algorithm.
- dist(u,v) = (dist(s,v) + dist(t,v))/2
+ * method=``average`` assigns
- where cluster u was formed with cluster s and t and v
- is a remaining cluster in the forest. (also called WPGMA)
+ .. math:
+ d(u,v) = \sum_{ij} \frac{d(u[i], v[j])}
+ {(|u|*|v|)
- Z = linkage(X, method, metric='euclidean')
+ for all points :math:`$i$` and :math:`$j$` where :math:`$|u|$`
+ and :math:`$|v|$` are the cardinalities of clusters :math:`$u$`
+ and :math:`$v$`, respectively. This is also called the UPGMA
+ algorithm. This is called UPGMA.
- Performs hierarchical clustering on the objects defined by the
- n by m observation matrix X.
+ * method='weighted' assigns
- If the metric is 'euclidean' then the following methods may be
- used:
+ .. math:
+ d(u,v) = (dist(s,v) + dist(t,v))/2
- * method='centroid' assigns dist(s,t) = euclid(c_s, c_t) where
- c_s and c_t are the centroids of clusters s and t,
- respectively. When two clusters s and t are combined into a new
- cluster u, the new centroid is computed over all the original
- objects in clusters s and t. The distance then becomes
- the Euclidean distance between the centroid of u and the
- centroid of a remaining cluster v in the forest.
- (also called UPGMC)
+ where cluster u was formed with cluster s and t and v
+ is a remaining cluster in the forest. (also called WPGMA)
- * method='median' assigns dist(s,t) as above. When two clusters
- s and t are combined into a new cluster u, the average of
- centroids s and t give the new centroid u. (also called WPGMC)
- * method='ward' uses the Ward variance minimization algorithm.
- The new entry dist(u, v) is computed as follows,
+ * method='centroid' assigns
- dist(u,v) =
+ .. math:
+ dist(s,t) = euclid(c_s, c_t)
- ----------------------------------------------------
- | |v|+|s| |v|+|t| |v|
- | ------- d(v,s)^2 + ------- d(v,t)^2 - --- d(s,t)^2
- \| T T T
+ where :math:`$c_s$` and :math:`$c_t$` are the centroids of
+ clusters :math:`$s$` and :math:`$t$`, respectively. When two
+ clusters :math:`$s$` and :math:`$t$` are combined into a new
+ cluster :math:`$u$`, the new centroid is computed over all the
+ original objects in clusters :math:`$s$` and :math:`$t$`. The
+ distance then becomes the Euclidean distance between the
+ centroid of :math:`$u$` and the centroid of a remaining cluster
+ :math:`$v$` in the forest. This is also known as the UPGMC
+ algorithm.
- where u is the newly joined cluster consisting of clusters
- s and t, v is an unused cluster in the forest, T=|v|+|s|+|t|,
- and |*| is the cardinality of its argument.
- (also called incremental)
+ * method='median' assigns math:`$d(s,t)$` like the ``centroid``
+ method. When two clusters s and t are combined into a new
+ cluster :math:`$u$`, the average of centroids s and t give the
+ new centroid :math:`$u$`. This is also known as the WPGMC
+ algorithm.
- Warning to MATLAB(TM) users: when the minimum distance pair in
- the forest is chosen, there may be two or more pairs with the
- same minimum distance. This implementation may chose a
- different minimum than the MATLAB(TM) version.
- """
+ * method='ward' uses the Ward variance minimization algorithm.
+ The new entry :math:`$d(u,v)$` is computed as follows,
+
+ .. math:
+
+ d(u,v) = \sqrt{\frac{|v|+|s|}
+ {T}d(v,s)^2
+ + \frac{|v|+|t|}
+ {T}d(v,t)^2
+ + \frac{|v|}
+ {T}d(s,t)^2}
+
+ where :math:`$u$` is the newly joined cluster consisting of
+ clusters :math:`$s$` and :math:`$t$`, :math:`$v$` is an unused
+ cluster in the forest, :math:`$T=|v|+|s|+|t|$`, and
+ :math:`$|*|$` is the cardinality of its argument. This is also
+ known as the incremental algorithm.
+
+ Warning
+ -------
+
+ When the minimum distance pair in the forest is chosen, there may
+ be two or more pairs with the same minimum distance. This
+ implementation may chose a different minimum than the MATLAB(TM)
+ version.
+ """
if not isinstance(method, str):
raise TypeError("Argument 'method' must be a string.")
More information about the Scipy-svn
mailing list