Overcoming Distance Constraints in K-Medoids Clustering: A Case Study with MATLAB (2024)

Abstract: In this article, we discuss a common challenge encountered when implementing k-medoids clustering in MATLAB: clustering data whose natural distance metric is not Euclidean. We explore potential solutions and provide a practical example.

2024-07-13 by On Exception


Clustering is a fundamental task in data analysis and machine learning. K-medoids clustering is a popular method that partitions a dataset into k clusters by selecting k representative objects, called medoids. This article explores a case study of implementing K-medoids clustering in MATLAB and overcoming distance constraints that arise in real-world applications.

Background

The basic idea of k-medoids clustering is to minimize the sum of the distances between each data point and its corresponding medoid. The algorithm starts with an initial set of k medoids and iteratively improves them: each cluster's medoid is replaced by the cluster member that minimizes the objective function. The process continues until convergence or until a maximum number of iterations is reached.
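The update step described above can be sketched in a few lines of MATLAB. This is a minimal illustration of the Voronoi-iteration variant (not the full PAM swap search), assuming a precomputed N-by-N dissimilarity matrix D and a vector of current medoid indices:

```matlab
% One k-medoids update: reassign points to their nearest medoid, then
% recompute each medoid as the cluster member with the smallest total
% dissimilarity to the rest of its cluster.
function medoids = updateMedoids(D, medoids)
    [~, assign] = min(D(:, medoids), [], 2);      % nearest-medoid assignment
    for j = 1:numel(medoids)
        members = find(assign == j);              % points in cluster j
        [~, best] = min(sum(D(members, members), 2));
        medoids(j) = members(best);               % best in-cluster representative
    end
end
```

Repeating this step until the medoid set stops changing yields the converged clustering.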

However, in many applications, the distance metric used in K-medoids clustering is not Euclidean, but rather a more complex function that takes into account domain-specific knowledge or constraints. For example, in geospatial analysis, the great-circle distance may be used instead of the Euclidean distance to account for the curvature of the Earth's surface. In text analysis, the cosine distance may be used instead of the Euclidean distance to account for the high dimensionality and sparsity of the data.
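As a small illustration of a non-Euclidean metric, MATLAB's pdist2 can compute the cosine distance between two (hypothetical) term-frequency vectors directly:

```matlab
% Cosine distance between two term-frequency vectors
a = [1 0 2 0 3];
b = [0 1 2 3 0];
d = pdist2(a, b, 'cosine');   % 1 minus the cosine similarity of a and b
```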

Case Study: Overcoming Distance Constraints in K-Medoids Clustering

In this case study, we consider a dataset of customer transactions from an e-commerce website. The dataset contains information about the products purchased, the time of purchase, and the location of the customer. The goal is to cluster the customers into groups based on their purchasing behavior and geographical location.

To account for the distance constraint, we use the Haversine formula to calculate the great-circle distance between the customers' locations. The Haversine formula takes into account the curvature of the Earth's surface and is more accurate than the Euclidean distance in this context.

Implementation in MATLAB

We implement k-medoids clustering in MATLAB using the kmedoids() function from the Statistics and Machine Learning Toolbox. Instead of the default Euclidean distance, we pass a custom distance function implementing the Haversine formula via the function's 'Distance' name-value argument.

% Define the Haversine great-circle distance in km. kmedoids expects a
% custom distance function d = f(zi, zj), where zi is a 1-by-n row and
% zj is an m-by-n matrix; coordinates here are [lat, lon] in degrees.
R = 6371; % mean Earth radius in km
haversine = @(zi, zj) 2 * R * asin(sqrt( ...
    sind((zj(:,1) - zi(1))/2).^2 + ...
    cosd(zi(1)) .* cosd(zj(:,1)) .* sind((zj(:,2) - zi(2))/2).^2));

% Cluster on the geographic coordinates. The purchasing-behavior features
% would need their own dissimilarity term and are omitted here.
X = [latitude, longitude];
k = 5; % number of clusters

% Run the k-medoids clustering algorithm with the custom distance
[idx, C] = kmedoids(X, k, 'Distance', haversine);

Results

After running the K-medoids clustering algorithm with the Haversine distance metric, we obtain k clusters of customers based on their purchasing behavior and geographical location. We can visualize the results using a scatter plot, where each point represents a customer and the color represents the cluster assignment.

% Plot the results, coloring each customer by cluster assignment
scatter(longitude, latitude, 36, idx, 'filled');
xlabel('Longitude');
ylabel('Latitude');
title('K-Medoids Clustering with Haversine Distance');

In this article, we have presented a case study of implementing K-medoids clustering in MATLAB and overcoming distance constraints that arise in real-world applications. By using the Haversine formula to calculate the great-circle distance between customer locations, we were able to cluster the customers into groups based on their purchasing behavior and geographical location. This approach can be applied to other domains where the distance metric is not Euclidean, but rather a more complex function that takes into account domain-specific knowledge or constraints.

  • K-medoids clustering is a popular method for partitioning a dataset into k clusters.
  • In many applications, the distance metric used in K-medoids clustering is not Euclidean, but rather a more complex function that takes into account domain-specific knowledge or constraints.
  • In this case study, we implemented K-medoids clustering in MATLAB and modified the distance metric parameter to use the Haversine formula for calculating the great-circle distance between customer locations.
  • The results show that K-medoids clustering with the Haversine distance metric can be used to cluster customers into groups based on their purchasing behavior and geographical location.


FAQs

What is k-medoids clustering in MATLAB?

k-medoids is a related algorithm that partitions data into k distinct clusters, by finding medoids that minimize the sum of dissimilarities between points in the data and their nearest medoid. The medoid of a set is a member of that set whose average dissimilarity with the other members of the set is the smallest.
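A minimal call, assuming the Statistics and Machine Learning Toolbox and some made-up 2-D data:

```matlab
% Cluster 100 random 2-D points into 3 groups
rng(1);                      % for reproducibility
X = rand(100, 2);
[idx, C] = kmedoids(X, 3);   % idx: cluster labels, C: the 3 medoid rows
```

Because medoids are actual observations, every row of C also appears in X.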

What are k-means and k-medoids?

k-means and k-medoids clustering partitions data into k number of mutually exclusive clusters. These techniques assign each observation to a cluster by minimizing the distance from the data point to the mean or median location of its assigned cluster, respectively.
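The two built-ins can be compared side by side on the same (synthetic) data:

```matlab
% k-means centers are cluster means; k-medoids centers are observations
rng(2);
X = [randn(50,2); randn(50,2) + 4];   % two loose blobs
idxMeans   = kmeans(X, 2);
idxMedoids = kmedoids(X, 2);
```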

Is k-medoids a distance-based algorithm?

The kmed vignette consists of four sequential parts of distance-based (k-medoids) cluster analysis. The first part defines the distance; it covers numerical, binary, categorical, and mixed distances. The next part applies a clustering algorithm to the pre-defined distance.

What is the difference between k-means and k-medoids?

K-means can only be used for numerical data, while k-medoids can be used for both numerical and categorical data. K-means focuses on reducing the sum of squared distances, also known as the sum of squared errors (SSE). K-medoids focuses on reducing the sum of dissimilarities between data points and their medoids.

How to implement k-means clustering in MATLAB?

We select the data set features as Input Data and set the number of clusters to two. Clicking Run Section makes MATLAB display the cluster data in two groups and the cluster means in a scatter plot. There are also options to select the number of clusters optimally, as well as to control how the data is clustered.

What are the advantages and disadvantages of k-medoids clustering?

Advantages:

  • Easy to understand and execute.
  • Quick and convergent in a predetermined number of steps.
  • Normally less sensitive to outliers than k-means.
  • Allows using general dissimilarities of objects.

Disadvantages:

  • Different initial sets of medoids can lead to different final clusterings.

What is the difference between k-modes and k-medoids?

In k-medoids, cluster representative (discrete median) is a vector in the cluster that minimizes the sum of distances from all other vectors to the cluster representative. In k-modes, the representative is the mode of the cluster, calculated independently for every attribute.
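The per-attribute mode is straightforward to compute in MATLAB; in this hypothetical example the rows are objects in one cluster and the columns are integer-coded categorical attributes:

```matlab
% k-modes representative: the mode of each attribute within a cluster
cluster = [1 2 1;
           1 3 1;
           2 2 1];
rep = mode(cluster, 1);   % per-column mode: [1 2 1]
```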

What is the time complexity of k-medoids clustering?

The complexity of k-medoids is O(N²KT), where N is the number of samples, T is the number of iterations, and K is the number of clusters. This makes it more suitable for smaller datasets in comparison to k-means, which is O(NKT).

What is the objective of k-medoids?

The objective of the k-medoids algorithm is to minimize the sum of the distances from the observations in each cluster to a representative center for that cluster. In contrast to k-means and k-medians, those centers do not need to be computed, since they are actual observations.

What is the distance formula for k-medoids?

A common choice is the Manhattan distance: Distance = |X1 − X2| + |Y1 − Y2|. In a typical worked example, one then randomly selects a non-medoid point and recalculates the cost. Suppose the randomly selected point is (8, 4): the dissimilarity of each non-medoid point to the medoids C1 = (4, 5) and C2 = (8, 4) is calculated and tabulated.
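The cost of a candidate medoid configuration under the Manhattan distance can be tabulated with pdist2 using its 'cityblock' option; the non-medoid points below are hypothetical:

```matlab
% Total cost: each non-medoid point pays the distance to its nearest medoid
P = [2 6; 3 8; 7 4; 9 5];            % hypothetical non-medoid points
C = [4 5; 8 4];                      % medoids C1 = (4,5), C2 = (8,4)
D = pdist2(P, C, 'cityblock');       % |x1 - x2| + |y1 - y2|
cost = sum(min(D, [], 2));           % sum over nearest-medoid distances
```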

How is k-medoids better than k-means clustering?

K-medoids is also less sensitive to outliers and reduces the effect of noise compared to k-means, because it minimizes a sum of dissimilarities rather than a sum of squared distances.

What is k-medians clustering?

In statistics, k-medians clustering is a cluster analysis algorithm. It is a variation of k-means clustering where instead of calculating the mean for each cluster to determine its centroid, one instead calculates the median.

What is k-means clustering?

K-means clustering is an unsupervised machine learning algorithm that identifies k centers and then assigns every data point to the nearest cluster, while keeping the clusters as small as possible.

What is k-modes clustering?

K-modes is a clustering algorithm used in data mining and machine learning to group categorical data into distinct clusters. Unlike K-means, which works with numerical data, K-modes focuses on finding clusters based on categorical attributes.

What is the k-means clustering algorithm?

K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into a pre-defined number of clusters. The goal is to group similar data points together and discover underlying patterns or structures within the data.
