Custom Distance Function in K-Means Clustering with scikit-learn in Python 3

K-means clustering is a popular unsupervised machine learning algorithm used for grouping similar data points into clusters. It is widely used in various fields such as image segmentation, customer segmentation, and anomaly detection. The algorithm works by iteratively assigning data points to clusters and updating the cluster centroids until convergence.

By default, the K-means algorithm uses the Euclidean distance metric to measure the similarity between data points and cluster centroids. However, in some cases, the Euclidean distance may not be the most appropriate metric for measuring similarity. For example, in text clustering, the Euclidean distance may not capture the semantic similarity between documents.
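For reference, the default behaviour requires nothing special. A minimal sketch (the toy array X is just a stand-in for any numeric feature matrix):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2D feature matrix; any numeric array of shape (n_samples, n_features) works
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])

# Default KMeans: Euclidean distance, k-means++ initialisation
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index for each row of X
print(kmeans.cluster_centers_)  # centroid coordinates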

Fortunately, while the KMeans estimator in scikit-learn, a popular machine learning library in Python, is hard-wired to Euclidean distance, the library's pairwise-distance utilities accept arbitrary Python callables. This still lets us bring domain-specific knowledge or alternative distance metrics into a k-means workflow to improve the clustering results.

Defining a Custom Distance Function

To define a custom distance function in scikit-learn, we need to create a Python function that takes two data points as input and returns the distance between them. The function should follow the signature:

def custom_distance(x1, x2):
    # Calculate the distance between x1 and x2
    distance = ...
    return distance
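As a concrete illustration, a simple Manhattan (L1) distance could be written and sanity-checked like this (a minimal sketch using NumPy; the function name is our own choice):

import numpy as np

def manhattan_distance(x1, x2):
    # Sum of absolute coordinate differences between the two points
    return np.sum(np.abs(np.asarray(x1) - np.asarray(x2)))

print(manhattan_distance([1, 2], [4, 6]))  # 3 + 4 = 7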

Inside the function, we can implement any logic to calculate the distance between the two data points. For example, to use a cosine-based distance (one minus the cosine similarity), we can rely on the cosine_similarity function from scikit-learn:

from sklearn.metrics.pairwise import cosine_similarity

def custom_distance(x1, x2):
    # Cosine distance: one minus the cosine similarity between x1 and x2
    similarity = cosine_similarity([x1], [x2])
    distance = 1 - similarity[0][0]
    return distance
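A quick sanity check of this cosine-based distance (assuming the function above): vectors pointing in the same direction give a distance near 0, orthogonal vectors give 1.

import numpy as np

print(custom_distance(np.array([1.0, 0.0]), np.array([2.0, 0.0])))  # ~0.0 (same direction)
print(custom_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # ~1.0 (orthogonal)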

Once we have defined our custom distance function, we might expect to pass it straight to the KMeans class. However, scikit-learn's KMeans exposes no metric or distance parameter; it always minimises Euclidean distance. What scikit-learn does provide is pairwise_distances, whose metric argument accepts any callable, and that is enough to drive the assignment step of k-means ourselves:

from sklearn.metrics import pairwise_distances

# X is the data matrix and centroids the current centroid array (placeholders here);
# KMeans itself cannot take custom_distance, so the distances are computed explicitly
distances = pairwise_distances(X, centroids, metric=custom_distance)
labels = distances.argmin(axis=1)  # index of the nearest centroid for each point

The metric argument of pairwise_distances is what lets us specify the distance function: passing our callable makes scikit-learn evaluate it on every point-centroid pair instead of the default Euclidean distance. Assignment is only half of k-means, though; after each assignment the centroids must be recomputed, so a complete clustering needs a small loop around these two steps.
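Below is a minimal sketch of such a loop. The helper name kmeans_custom, the random initialisation, and the iteration cap are our own choices, not part of scikit-learn's API; note also that for non-Euclidean metrics the mean-based centroid update is a heuristic rather than the exact minimiser (exact handling of arbitrary metrics is what k-medoids-style methods, such as KMedoids in the separate scikit-learn-extra package, are designed for).

import numpy as np
from sklearn.metrics import pairwise_distances

def kmeans_custom(X, n_clusters, metric, n_iter=100, random_state=0):
    # Plain Lloyd-style k-means driven by a user-supplied distance callable
    rng = np.random.default_rng(random_state)
    # Initialise centroids as randomly chosen data points
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid under the custom metric
        distances = pairwise_distances(X, centroids, metric=metric)
        labels = distances.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids

# Example usage with the cosine-based custom_distance defined above:
# labels, centroids = kmeans_custom(X, n_clusters=3, metric=custom_distance)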

Example: Custom Distance Function for Text Clustering

Let’s consider an example where we want to cluster a collection of text documents based on their semantic similarity. In this case, the Euclidean distance may not be suitable as it does not capture the semantic meaning of the documents.

We can define a custom distance function that uses a pre-trained word embedding model, such as Word2Vec or GloVe, to calculate the semantic similarity between two documents. Here’s an example using the Word2Vec model from the gensim library:

import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Load the pre-trained Word2Vec model
model = Word2Vec.load("word2vec_model.bin")

def document_vector(doc):
    # Average the vectors of the words that are in the model's vocabulary
    words = [word for word in doc.split() if word in model.wv]
    return np.mean([model.wv[word] for word in words], axis=0)

def custom_distance(doc1, doc2):
    # Cosine distance between the averaged document vectors
    similarity = cosine_similarity([document_vector(doc1)], [document_vector(doc2)])
    return 1 - similarity[0][0]

In this example, we load a pre-trained Word2Vec model and define a custom distance function that represents each document as the average of its word vectors. The distance is then one minus the cosine similarity between these document vectors, which serves as a measure of semantic dissimilarity between the documents.
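As a quick illustration (assuming the model above is loaded and its vocabulary covers these words; the exact values depend on the embeddings), related documents should come out closer than unrelated ones:

print(custom_distance("the cat sat on the mat", "a kitten rests on the rug"))
print(custom_distance("the cat sat on the mat", "interest rates rose sharply"))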

We can now cluster the text documents. Because KMeans only optimises Euclidean distance, the usual approach is to embed each document as a fixed-length vector first (here via document_vector) and then either run a manual loop such as the one sketched earlier with the cosine-based custom_distance, or L2-normalise the vectors so that plain Euclidean KMeans behaves much like cosine-based clustering:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# documents is assumed to be a list of raw text strings
doc_vectors = np.array([document_vector(doc) for doc in documents])
# On L2-normalised vectors, Euclidean KMeans is a common stand-in for cosine clustering
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans.fit(normalize(doc_vectors))
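Once fitted, the resulting labels can be mapped back to the original documents, for example:

from collections import defaultdict

# Collect the documents belonging to each cluster
clusters = defaultdict(list)
for doc, label in zip(documents, kmeans.labels_):
    clusters[label].append(doc)
for label, docs in clusters.items():
    print(label, len(docs))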

By using our custom distance function, we can potentially achieve better clustering results compared to using the default Euclidean distance.

In this article, we explored how to define a custom distance function for K-means clustering using scikit-learn in Python 3. We learned that by defining our own distance function, we can use domain-specific knowledge or alternative distance metrics to improve the clustering results. We also saw an example of using a custom distance function for text clustering based on semantic similarity. By leveraging pre-trained word embedding models, we can capture the semantic meaning of text documents and achieve more meaningful clusters.

Example 1: Custom Distance Function

In this example, we demonstrate how a hand-written distance function fits into K-Means clustering in scikit-learn. Suppose we have a small dataset of 2D points and write out the Euclidean distance explicitly; since KMeans has no distance parameter, the callable is applied through pairwise_distances and should reproduce the estimator's own assignments.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

# Custom distance function (plain Euclidean, written out for illustration)
def euclidean_distance(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

# Create a small 2D dataset
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# KMeans has no distance parameter; fit it with its built-in Euclidean metric
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
kmeans.fit(X)

# Re-derive the labels with the custom callable via pairwise_distances
distances = pairwise_distances(X, kmeans.cluster_centers_, metric=euclidean_distance)
labels = distances.argmin(axis=1)
print(labels)
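As a quick check (assuming the snippet above has run), the labels re-derived through the custom callable should coincide with the estimator's own labels, since the callable reproduces the built-in Euclidean metric:

print(np.array_equal(labels, kmeans.labels_))  # expected: True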

Example 2: Custom Distance Function with Additional Parameters

In this example, we extend the previous one by giving the custom distance function an additional parameter. Suppose we want to scale the Euclidean distance by a factor of 2. (Scaling every distance by the same constant does not change which centroid is nearest, so the point of this example is purely to show how extra parameters are passed to the metric.)

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

# Custom distance function with an additional parameter
def weighted_euclidean_distance(x, y, weight=1.0):
    return np.sqrt(np.sum((x - y) ** 2)) * weight

# Create a small 2D dataset
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# Fit standard KMeans; the extra parameter reaches the metric through
# pairwise_distances, which forwards additional keyword arguments to the callable
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
kmeans.fit(X)

distances = pairwise_distances(X, kmeans.cluster_centers_,
                               metric=weighted_euclidean_distance, weight=2)
labels = distances.argmin(axis=1)
print(labels)
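An alternative to passing the extra keyword through pairwise_distances is to bind it up front with functools.partial, which yields a two-argument callable usable anywhere a plain metric is expected:

from functools import partial

weighted_metric = partial(weighted_euclidean_distance, weight=2)
distances = pairwise_distances(X, kmeans.cluster_centers_, metric=weighted_metric)
print(distances.argmin(axis=1))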

Reference Links:

1. scikit-learn documentation on KMeans clustering: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

2. scikit-learn documentation on custom distance functions: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

Conclusion:

In this topic, we explored how to use a custom distance function in K-Means clustering with scikit-learn in Python. Because the KMeans estimator itself only supports Euclidean distance, the custom function is applied through scikit-learn's pairwise-distance utilities or a small clustering loop of our own, which lets us incorporate domain-specific knowledge or modify the distance metric to suit our specific needs. This flexibility allows us to perform more specialized clustering tasks and obtain better results. The examples provided demonstrate how to define and use a custom distance function in K-Means clustering, and the reference links provide further information on the topic. Overall, custom distance functions enhance the versatility and applicability of K-Means clustering in scikit-learn.
