K-means Clustering Algorithm in Higher Dimensions
Publisher
Sinhgad Technical Review
Abstract
Clustering with the classical K-Means method uses a squared error, usually the Euclidean distance, as its score function to measure the distance of each data point to the mean of its nearest cluster. Since clustering hinges on the notion of distance, we need to carefully select the most appropriate distance measure. The purpose of a score function is to rank models as a function of how useful they are to the data miner; unfortunately, in practice it can be quite difficult to measure usefulness in terms of direct practical utility, so one often uses the Euclidean distance between predictions and actual data as a score function to train a model. The utility of this choice becomes questionable when the model is deployed in a real, domain-specific environment where a host of other dimensions come into play. As the number of dimensions increases, the Euclidean norm is no longer the best choice, because in higher dimensions the Euclidean distance from a point to its nearest and farthest neighbors tends to be very similar and hence nearly meaningless. To overcome this shortcoming of the Euclidean norm, we use in this paper the inner product space for measuring the similarity or dissimilarity between points, also called the cosine distance measure.
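The concentration effect the abstract describes can be observed numerically. The sketch below (an illustration, not from the paper) draws points uniformly from the unit hypercube and compares the farthest and nearest Euclidean distances from the origin; as the dimension grows, the ratio approaches 1, so "nearest" and "farthest" become nearly indistinguishable:

```python
import numpy as np

def distance_contrast(n_points: int, dim: int, seed: int = 0) -> float:
    """Ratio of farthest to nearest Euclidean distance from the origin
    for points drawn uniformly from the unit hypercube [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    dists = np.linalg.norm(points, axis=1)
    return dists.max() / dists.min()

# In 2 dimensions the contrast is large; in 1000 dimensions the
# norms concentrate around sqrt(dim / 3) and the ratio collapses toward 1.
low_dim_contrast = distance_contrast(1000, 2)
high_dim_contrast = distance_contrast(1000, 1000)
```

This is the sense in which the Euclidean norm becomes "nearly meaningless" in high dimensions: the score function can no longer discriminate between close and distant neighbors.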
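A minimal sketch of the cosine-distance alternative, assuming the usual "spherical" variant of K-Means (the function name `cosine_kmeans` and its parameters are illustrative, not the paper's implementation): points and centroids are L2-normalized, so maximizing the inner product is equivalent to minimizing cosine distance.

```python
import numpy as np

def cosine_kmeans(X: np.ndarray, k: int, n_iter: int = 50, seed: int = 0):
    """K-Means using cosine distance: after L2 normalization,
    the most similar centroid is the one with the largest dot product."""
    rng = np.random.default_rng(seed)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)      # unit vectors
    centroids = Xn[rng.choice(len(Xn), size=k, replace=False)]
    labels = np.zeros(len(Xn), dtype=int)
    for _ in range(n_iter):
        sims = Xn @ centroids.T            # cosine similarity to each centroid
        labels = sims.argmax(axis=1)       # assign to the most similar centroid
        for j in range(k):
            members = Xn[labels == j]
            if len(members):               # skip empty clusters
                c = members.sum(axis=0)
                centroids[j] = c / np.linalg.norm(c)  # re-normalize the mean
    return labels, centroids

# Illustrative usage on synthetic data: two groups pointing in
# different directions from the origin.
rng = np.random.default_rng(1)
a = rng.normal([5.0, 0.0], 0.1, size=(20, 2))
b = rng.normal([0.0, 5.0], 0.1, size=(20, 2))
labels, centroids = cosine_kmeans(np.vstack([a, b]), k=2)
```

Because every centroid is re-normalized after each update, the algorithm clusters points by direction rather than by magnitude, which is the property the inner-product formulation is chosen for.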