scipy.cluster.hierarchy.linkage — SciPy v1.13.1 Manual

scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean', optimal_ordering=False)[source]#

Perform hierarchical/agglomerative clustering.

The input y may be either a 1-D condensed distance matrix or a 2-D array of observation vectors.

If y is a 1-D condensed distance matrix, then y must be a \(\binom{n}{2}\) sized vector, where n is the number of original observations paired in the distance matrix. The behavior of this function is very similar to the MATLAB linkage function.

A \((n-1)\) by 4 matrix Z is returned. At the \(i\)-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster \(n + i\). A cluster with an index less than \(n\) corresponds to one of the \(n\) original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
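As a concrete illustration of this layout, the following sketch (using a small made-up dataset, not one of the manual's own examples) walks the rows of Z and prints each merge step; the dataset and variable names are illustrative only, and the order of tied merges may vary.

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage
>>> X = np.array([[0.0], [1.0], [5.0], [6.0]])  # n = 4 original observations
>>> Z = linkage(X, method='single')
>>> n = len(X)
>>> for i, (a, b, dist, size) in enumerate(Z):
...     # clusters int(a) and int(b) are merged into the new cluster n + i
...     print(f"step {i}: merge {int(a)} and {int(b)} -> cluster {n + i}, "
...           f"distance {dist:.2f}, size {int(size)}")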

The following linkage methods are used to compute the distance \(d(s, t)\) between two clusters \(s\) and \(t\). The algorithm begins with a forest of clusters that have yet to be used in the hierarchy being formed. When two clusters \(s\) and \(t\) from this forest are combined into a single cluster \(u\), \(s\) and \(t\) are removed from the forest, and \(u\) is added to the forest. When only one cluster remains in the forest, the algorithm stops, and this cluster becomes the root.

A distance matrix is maintained at each iteration. The d[i,j] entry corresponds to the distance between cluster \(i\) and \(j\) in the original forest.

At each iteration, the algorithm must update the distance matrix to reflect the distance of the newly formed cluster u with the remaining clusters in the forest.

Suppose there are \(|u|\) original observations \(u[0], \ldots, u[|u|-1]\) in cluster \(u\) and \(|v|\) original objects \(v[0], \ldots, v[|v|-1]\) in cluster \(v\). Recall, \(s\) and \(t\) are combined to form cluster \(u\). Let \(v\) be any remaining cluster in the forest that is not \(u\).

The following are methods for calculating the distance between the newly formed cluster \(u\) and each \(v\); a small numerical check of a few of these update rules is sketched after the list.

  • method=’single’ assigns

    \[d(u,v) = \min(dist(u[i],v[j]))\]

    for all points \(i\) in cluster \(u\) and \(j\) in cluster \(v\). This is also known as the Nearest Point Algorithm.

  • method=’complete’ assigns

    \[d(u, v) = \max(dist(u[i],v[j]))\]

    for all points \(i\) in cluster \(u\) and \(j\) in cluster \(v\). This is also known by the Farthest Point Algorithm or Voor Hees Algorithm.

  • method=’average’ assigns

    \[d(u,v) = \sum_{ij} \frac{d(u[i], v[j])} {(|u|*|v|)}\]

    for all points \(i\) and \(j\) where \(|u|\) and \(|v|\) are the cardinalities of clusters \(u\) and \(v\), respectively. This is also called the UPGMA algorithm.

  • method=’weighted’ assigns

    \[d(u,v) = (dist(s,v) + dist(t,v))/2\]

    where cluster \(u\) was formed from clusters \(s\) and \(t\), and \(v\) is a remaining cluster in the forest (also called WPGMA).

  • method=’centroid’ assigns

    \[dist(s,t) = ||c_s-c_t||_2\]

    where \(c_s\) and \(c_t\) are the centroids of clusters \(s\) and \(t\), respectively. When two clusters \(s\) and \(t\) are combined into a new cluster \(u\), the new centroid is computed over all the original objects in clusters \(s\) and \(t\). The distance then becomes the Euclidean distance between the centroid of \(u\) and the centroid of a remaining cluster \(v\) in the forest. This is also known as the UPGMC algorithm.

  • method=’median’ assigns \(d(s,t)\) like the centroid method. When two clusters \(s\) and \(t\) are combined into a new cluster \(u\), the average of the centroids of \(s\) and \(t\) gives the new centroid \(u\). This is also known as the WPGMC algorithm.

  • method=’ward’ uses the Ward variance minimization algorithm. The new entry \(d(u,v)\) is computed as follows,

    \[d(u,v) = \sqrt{\frac{|v|+|s|} {T}d(v,s)^2 + \frac{|v|+|t|} {T}d(v,t)^2 - \frac{|v|} {T}d(s,t)^2}\]

    where \(u\) is the newly joined cluster consisting of clusters \(s\) and \(t\), \(v\) is an unused cluster in the forest, \(T=|v|+|s|+|t|\), and \(|*|\) is the cardinality of its argument. This is also known as the incremental algorithm.
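As promised above, here is a small brute-force check (illustrative only, not part of the manual's own examples) that the merge distances reported for the ‘single’, ‘complete’, and ‘average’ rules agree with the minimum, maximum, and mean pairwise distance between the members of the two clusters being merged.

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage
>>> from scipy.spatial.distance import cdist
>>> rng = np.random.default_rng(0)
>>> X = rng.random((8, 3))
>>> for method, reducer in [('single', np.min), ('complete', np.max),
...                         ('average', np.mean)]:
...     Z = linkage(X, method=method)
...     members = {i: [i] for i in range(len(X))}  # cluster id -> original points
...     for i, (a, b, dist, size) in enumerate(Z):
...         pts_a, pts_b = members[int(a)], members[int(b)]
...         # pairwise distances between the members of the two merged clusters
...         assert np.isclose(dist, reducer(cdist(X[pts_a], X[pts_b])))
...         members[len(X) + i] = pts_a + pts_b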

Warning: When the minimum distance pair in the forest is chosen, there may be two or more pairs with the same minimum distance. This implementation may choose a different minimum than the MATLAB version.

Parameters:
y : ndarray

A condensed distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of \(m\) observation vectors in \(n\) dimensions may be passed as an \(m\) by \(n\) array. All elements of the condensed distance matrix must be finite, i.e., no NaNs or infs. A short sketch comparing the two input forms appears after the Returns section below.

method : str, optional

The linkage algorithm to use. See the linkage methods described above for full descriptions.

metric : str or function, optional

The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.

optimal_ordering : bool, optional

If True, the linkage matrix will be reordered so that the distance between successive leaves is minimal. This results in a more intuitive tree structure when the data are visualized. Defaults to False, because this algorithm can be slow, particularly on large datasets [2]. See also the optimal_leaf_ordering function.

Added in version 1.0.0.
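The following sketch (illustrative data; not from the manual) compares the leaf order produced with and without this flag, and shows the post-processing route via optimal_leaf_ordering mentioned above, which is expected to give an equivalent ordering.

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
>>> rng = np.random.default_rng(1)
>>> X = rng.random((10, 2))
>>> Z_plain = linkage(X, method='average')
>>> Z_opt = linkage(X, method='average', optimal_ordering=True)
>>> order_plain = leaves_list(Z_plain)  # leaf order without reordering
>>> order_opt = leaves_list(Z_opt)      # leaf order with optimal ordering
>>> # the same reordering can also be applied after the fact:
>>> order_post = leaves_list(optimal_leaf_ordering(Z_plain, X))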

Returns:
Z : ndarray

The hierarchical clustering encoded as a linkage matrix.
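As referenced under the y parameter, here is a short sketch (illustrative data) showing that passing the observation vectors directly and passing the condensed distances computed by pdist yield the same linkage matrix Z.

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage
>>> from scipy.spatial.distance import pdist
>>> rng = np.random.default_rng(2)
>>> X = rng.random((5, 3))  # 5 observations in 3 dimensions
>>> Z_from_obs = linkage(X, method='complete', metric='euclidean')
>>> Z_from_cond = linkage(pdist(X), method='complete')
>>> bool(np.allclose(Z_from_obs, Z_from_cond))
True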

See also

scipy.spatial.distance.pdist

pairwise distance metrics

Notes

  1. For method ‘single’, an optimized algorithm based on minimum spanning tree is implemented. It has time complexity \(O(n^2)\). For methods ‘complete’, ‘average’, ‘weighted’ and ‘ward’, an algorithm called nearest-neighbors chain is implemented. It also has time complexity \(O(n^2)\). For other methods, a naive algorithm is implemented with \(O(n^3)\) time complexity. All algorithms use \(O(n^2)\) memory. Refer to [1] for details about the algorithms.

  2. Methods ‘centroid’, ‘median’, and ‘ward’ are correctly defined only if a Euclidean pairwise metric is used. If y is passed as precomputed pairwise distances, then it is the user’s responsibility to assure that these distances are in fact Euclidean, otherwise the produced result will be incorrect (see the sketch below).
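A hedged sketch of this note (illustrative data): ‘ward’ applied to Euclidean condensed distances matches ‘ward’ on the raw observations, whereas a condensed matrix of, say, city-block distances is accepted without error but does not yield a valid Ward clustering.

>>> import numpy as np
>>> from scipy.cluster.hierarchy import linkage
>>> from scipy.spatial.distance import pdist
>>> rng = np.random.default_rng(3)
>>> X = rng.random((6, 4))
>>> Z_ok = linkage(pdist(X, metric='euclidean'), method='ward')
>>> bool(np.allclose(Z_ok, linkage(X, method='ward')))
True
>>> Z_bad = linkage(pdist(X, metric='cityblock'), method='ward')  # runs, but the result is not meaningful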

References

[1]

Daniel Mullner, “Modern hierarchical, agglomerative clustering algorithms”, arXiv:1109.2378v1.

[2]

Ziv Bar-Joseph, David K. Gifford, Tommi S. Jaakkola, “Fast optimal leaf ordering for hierarchical clustering”, 2001. Bioinformatics. DOI:10.1093/bioinformatics/17.suppl_1.S22

Examples

>>> from scipy.cluster.hierarchy import dendrogram, linkage
>>> from matplotlib import pyplot as plt
>>> X = [[i] for i in [2, 8, 0, 4, 1, 9, 9, 0]]

>>> Z = linkage(X, 'ward')
>>> fig = plt.figure(figsize=(25, 10))
>>> dn = dendrogram(Z)

>>> Z = linkage(X, 'single')
>>> fig = plt.figure(figsize=(25, 10))
>>> dn = dendrogram(Z)
>>> plt.show()
[Figures: dendrograms produced by the ‘ward’ and ‘single’ linkage examples above.]