- scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean', optimal_ordering=False)
Perform hierarchical/agglomerative clustering.
The input y may be either a 1-D condensed distance matrix or a 2-D array of observation vectors.
If y is a 1-D condensed distance matrix, then y must be a \(\binom{n}{2}\) sized vector, where n is the number of original observations paired in the distance matrix. The behavior of this function is very similar to the MATLAB linkage function.
A \((n-1)\) by 4 matrix Z is returned. At the \(i\)-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster \(n + i\). A cluster with an index less than \(n\) corresponds to one of the \(n\) original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
The following linkage methods are used to compute the distance \(d(s, t)\) between two clusters \(s\) and \(t\). The algorithm begins with a forest of clusters that have yet to be used in the hierarchy being formed. When two clusters \(s\) and \(t\) from this forest are combined into a single cluster \(u\), \(s\) and \(t\) are removed from the forest, and \(u\) is added to the forest. When only one cluster remains in the forest, the algorithm stops, and this cluster becomes the root.
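The structure of the linkage matrix Z described above can be seen on a small example. The four 1-D observations below are illustrative, not from the original docs:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four 1-D observations, so n = 4 and Z has n - 1 = 3 rows.
X = [[0.0], [1.0], [5.0], [6.0]]
Z = linkage(X, method='single')

print(Z.shape)  # (3, 4)
# Each row is [idx0, idx1, distance, count]. Indices >= 4 refer to
# clusters formed in earlier rows (cluster n + i is formed at row i),
# so the last row joins the two 2-point clusters into the root.
print(Z)
```

The last row has count 4 (all original observations) and, for single linkage, distance 4 (the gap between the nearest members of the two groups, 1 and 5).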
A distance matrix is maintained at each iteration. The d[i,j] entry corresponds to the distance between cluster \(i\) and \(j\) in the original forest. At each iteration, the algorithm must update the distance matrix to reflect the distance of the newly formed cluster u with the remaining clusters in the forest.
Suppose there are \(|u|\) original observations \(u[0], \ldots, u[|u|-1]\) in cluster \(u\) and \(|v|\) original objects \(v[0], \ldots, v[|v|-1]\) in cluster \(v\). Recall, \(s\) and \(t\) are combined to form cluster \(u\). Let \(v\) be any remaining cluster in the forest that is not \(u\).
The following are methods for calculating the distance between the newly formed cluster \(u\) and each \(v\).
method=’single’ assigns
\[d(u,v) = \min(dist(u[i],v[j]))\]
for all points \(i\) in cluster \(u\) and \(j\) in cluster \(v\). This is also known as the Nearest Point Algorithm.
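The single-linkage distance is just the smallest pairwise distance between the two groups, which can be checked directly with cdist. The two point groups below are illustrative assumptions:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative clusters of 2-D points (assumed data).
u = np.array([[0.0, 0.0], [1.0, 0.0]])
v = np.array([[4.0, 0.0], [5.0, 0.0]])

# Nearest-point (single linkage) distance between the groups:
d_single = cdist(u, v).min()
print(d_single)  # 3.0, from the pair (1, 0) and (4, 0)
```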
method=’complete’ assigns
\[d(u, v) = \max(dist(u[i],v[j]))\]
for all points \(i\) in cluster \(u\) and \(j\) in cluster \(v\). This is also known as the Farthest Point Algorithm or Voor Hees Algorithm.
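Analogously, complete linkage is the largest pairwise distance between the two groups. Using the same illustrative point groups as above:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative clusters of 2-D points (assumed data).
u = np.array([[0.0, 0.0], [1.0, 0.0]])
v = np.array([[4.0, 0.0], [5.0, 0.0]])

# Farthest-point (complete linkage) distance between the groups:
d_complete = cdist(u, v).max()
print(d_complete)  # 5.0, from the pair (0, 0) and (5, 0)
```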
method=’average’ assigns
\[d(u,v) = \sum_{ij} \frac{d(u[i], v[j])} {(|u|*|v|)}\]
for all points \(i\) and \(j\) where \(|u|\) and \(|v|\) are the cardinalities of clusters \(u\) and \(v\), respectively. This is also called the UPGMA algorithm.
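Average linkage is the mean over all \(|u| \cdot |v|\) pairwise distances, which again can be checked with cdist on illustrative data:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative clusters of 2-D points (assumed data).
u = np.array([[0.0, 0.0], [1.0, 0.0]])
v = np.array([[4.0, 0.0], [5.0, 0.0]])

# UPGMA (average linkage): mean of the 4 pairwise distances 4, 5, 3, 4.
d_average = cdist(u, v).mean()
print(d_average)  # 4.0
```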
method=’weighted’ assigns
\[d(u,v) = (dist(s,v) + dist(t,v))/2\]
where cluster \(u\) was formed with clusters \(s\) and \(t\) and \(v\) is a remaining cluster in the forest (also called WPGMA).
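The WPGMA recurrence can be verified on three 1-D singletons (illustrative data): after \(s\) and \(t\) merge, the distance to \(v\) is the plain average of the two old distances, regardless of cluster sizes:

```python
from scipy.cluster.hierarchy import linkage

# Three 1-D points (assumed data): s = {0}, t = {2}, v = {5}.
X = [[0.0], [2.0], [5.0]]
Z = linkage(X, method='weighted')

# s and t merge first (distance 2). For the new cluster u and v:
# d(u, v) = (dist(s, v) + dist(t, v)) / 2 = (5 + 3) / 2 = 4
print(Z[1, 2])  # 4.0
```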
method=’centroid’ assigns
\[dist(s,t) = ||c_s-c_t||_2\]
where \(c_s\) and \(c_t\) are the centroids of clusters \(s\) and \(t\), respectively. When two clusters \(s\) and \(t\) are combined into a new cluster \(u\), the new centroid is computed over all the original objects in clusters \(s\) and \(t\). The distance then becomes the Euclidean distance between the centroid of \(u\) and the centroid of a remaining cluster \(v\) in the forest. This is also known as the UPGMC algorithm.
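The centroid distance \(||c_s - c_t||_2\) can be computed by hand for two small illustrative clusters:

```python
import numpy as np

# Two illustrative clusters of 2-D points (assumed data).
s = np.array([[0.0, 0.0], [2.0, 0.0]])
t = np.array([[0.0, 4.0], [2.0, 4.0]])

# Centroids: c_s = (1, 0), c_t = (1, 4).
c_s, c_t = s.mean(axis=0), t.mean(axis=0)

# UPGMC distance: Euclidean distance between the centroids.
d = np.linalg.norm(c_s - c_t)
print(d)  # 4.0
```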
method=’median’ assigns \(d(s,t)\) like the centroid method. When two clusters \(s\) and \(t\) are combined into a new cluster \(u\), the average of centroids s and t gives the new centroid of \(u\). This is also known as the WPGMC algorithm.
method=’ward’ uses the Ward variance minimization algorithm. The new entry \(d(u,v)\) is computed as follows,
\[d(u,v) = \sqrt{\frac{|v|+|s|} {T}d(v,s)^2 + \frac{|v|+|t|} {T}d(v,t)^2 - \frac{|v|} {T}d(s,t)^2}\]
where \(u\) is the newly joined cluster consisting of clusters \(s\) and \(t\), \(v\) is an unused cluster in the forest, \(T=|v|+|s|+|t|\), and \(|*|\) is the cardinality of its argument. This is also known as the incremental algorithm.
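The Ward recurrence can be checked numerically on three 1-D singletons (illustrative data), where \(|s| = |t| = |v| = 1\) and \(T = 3\):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Three 1-D points (assumed data): s = {0}, t = {2}, v = {6}.
X = [[0.0], [2.0], [6.0]]
Z = linkage(X, method='ward')

# s and t merge first (distance 2). Applying the recurrence for the
# second merge with d(s,t) = 2, d(v,s) = 6, d(v,t) = 4, T = 3:
d_st, d_vs, d_vt = 2.0, 6.0, 4.0
expected = np.sqrt((2/3) * d_vs**2 + (2/3) * d_vt**2 - (1/3) * d_st**2)
print(np.isclose(Z[1, 2], expected))  # True
```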
Warning: When the minimum distance pair in the forest is chosen, there may be two or more pairs with the same minimum distance. This implementation may choose a different minimum than the MATLAB version.
- Parameters:
- y : ndarray
A condensed distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of \(m\) observation vectors in \(n\) dimensions may be passed as an \(m\) by \(n\) array. All elements of the condensed distance matrix must be finite, i.e., no NaNs or infs.
- method : str, optional
The linkage algorithm to use. See the Linkage Methods descriptions above.
- metric : str or function, optional
The distance metric to use in the case that y is a collection of observation vectors; ignored otherwise. See the pdist function for a list of valid distance metrics. A custom distance function can also be used.
- optimal_ordering : bool, optional
If True, the linkage matrix will be reordered so that the distance between successive leaves is minimal. This results in a more intuitive tree structure when the data are visualized. Defaults to False, because this algorithm can be slow, particularly on large datasets [2]. See also the optimal_leaf_ordering function.
Added in version 1.0.0.
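The two accepted forms of y described above are interchangeable when the metric is Euclidean: passing the observation array directly or passing the condensed distances from pdist yields the same linkage matrix. A small check on illustrative data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Three 1-D observations (assumed data).
X = [[0.0], [1.0], [5.0]]

# Form 1: condensed distance matrix of length C(3, 2) = 3.
y = pdist(X)
Z1 = linkage(y, method='average')

# Form 2: the m-by-n observation array itself.
Z2 = linkage(X, method='average')

print(np.allclose(Z1, Z2))  # True
```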
- Returns:
- Z : ndarray
The hierarchical clustering encoded as a linkage matrix.
See also
- scipy.spatial.distance.pdist
pairwise distance metrics
Notes
For method ‘single’, an optimized algorithm based on minimum spanning tree is implemented. It has time complexity \(O(n^2)\). For methods ‘complete’, ‘average’, ‘weighted’ and ‘ward’, an algorithm called nearest-neighbors chain is implemented. It also has time complexity \(O(n^2)\). For other methods, a naive algorithm is implemented with \(O(n^3)\) time complexity. All algorithms use \(O(n^2)\) memory. Refer to [1] for details about the algorithms.
Methods ‘centroid’, ‘median’, and ‘ward’ are correctly defined only if Euclidean pairwise metric is used. If y is passed as precomputed pairwise distances, then it is the user’s responsibility to assure that these distances are in fact Euclidean, otherwise the produced result will be incorrect.
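The safe pattern for these three methods is to precompute distances with the Euclidean metric, in which case the result matches passing the observations directly. A sketch on randomly generated illustrative data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Illustrative random observations (assumed data).
X = np.random.default_rng(0).random((6, 2))

# 'centroid' is well defined here because the distances are Euclidean:
Z_obs = linkage(X, method='centroid')
Z_pre = linkage(pdist(X, metric='euclidean'), method='centroid')
print(np.allclose(Z_obs, Z_pre))  # True
```

Passing, say, cityblock distances to ‘centroid’, ‘median’, or ‘ward’ would run without error but silently produce an incorrect hierarchy.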
References
[1]
Daniel Mullner, “Modern hierarchical, agglomerative clustering algorithms”, arXiv:1109.2378v1.
[2]
Ziv Bar-Joseph, David K. Gifford, Tommi S. Jaakkola, “Fast optimal leaf ordering for hierarchical clustering”, 2001. Bioinformatics. DOI:10.1093/bioinformatics/17.suppl_1.S22
Examples
>>> from scipy.cluster.hierarchy import dendrogram, linkage
>>> from matplotlib import pyplot as plt
>>> X = [[i] for i in [2, 8, 0, 4, 1, 9, 9, 0]]
>>> Z = linkage(X, 'ward')
>>> fig = plt.figure(figsize=(25, 10))
>>> dn = dendrogram(Z)
>>> Z = linkage(X, 'single')
>>> fig = plt.figure(figsize=(25, 10))
>>> dn = dendrogram(Z)
>>> plt.show()
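Beyond plotting a dendrogram, a linkage matrix is typically cut into flat clusters with fcluster. A sketch using the same data as the example above; the choice of 2 clusters is an assumption for illustration:

```python
from scipy.cluster.hierarchy import fcluster, linkage

X = [[i] for i in [2, 8, 0, 4, 1, 9, 9, 0]]
Z = linkage(X, 'ward')

# Cut the tree so that at most 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # the small values and the large values get distinct labels
```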
scipy.cluster.hierarchy.linkage — SciPy v1.13.1 Manual (2024)