K-means clustering
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can be found, for example, using k-medians and k-medoids.
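The difference between the two kinds of center is already visible in one dimension. The following is a minimal Python sketch; the data set is made up purely for illustration:

    # One-dimensional illustration: the mean minimizes the sum of squared
    # deviations, while the median minimizes the sum of absolute deviations.
    xs = [0.0, 1.0, 2.0, 10.0]  # hypothetical observations

    def sum_sq(c):
        # Within-"cluster" sum of squares around a candidate center c.
        return sum((x - c) ** 2 for x in xs)

    def sum_abs(c):
        # Sum of Euclidean (absolute) distances in one dimension.
        return sum(abs(x - c) for x in xs)

    mean = sum(xs) / len(xs)   # 3.25
    median = 1.5               # midpoint of the two middle values

    # The mean gives the smaller sum of squares; the median gives the
    # smaller sum of distances.
    assert sum_sq(mean) <= sum_sq(median)
    assert sum_abs(median) <= sum_abs(mean)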
The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. These are usually similar to the expectation–maximization algorithm for mixtures of Gaussian distributions, via an iterative refinement approach employed by both k-means and Gaussian mixture modeling. Both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the Gaussian mixture model allows clusters to have different shapes.
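As a brief sketch of this contrast, assuming scikit-learn is available (the synthetic data, cluster counts, and random seeds are assumptions of the example, not part of the text):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two synthetic clusters of very different spread: one elongated, one compact.
    a = rng.normal(loc=[0, 0], scale=[4.0, 0.5], size=(200, 2))
    b = rng.normal(loc=[8, 3], scale=[0.5, 0.5], size=(200, 2))
    X = np.vstack([a, b])

    # k-means models each cluster only by its centroid, so it tends to split
    # the space into cells of comparable extent.
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # A Gaussian mixture fits a covariance per component, so it can recover
    # clusters of different shapes and sizes.
    gm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)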
The unsupervised k-means algorithm has a loose relationship to the k-nearest neighbor classifier, a popular supervised machine learning technique for classification that is often confused with k-means due to the name. Applying the 1-nearest neighbor classifier to the cluster centers obtained by k-means classifies new data into the existing clusters. This is known as nearest centroid classifier or Rocchio algorithm.
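A minimal sketch of this nearest centroid classification, assuming NumPy and hypothetical centroids already produced by k-means:

    import numpy as np

    def assign_to_centroids(points, centroids):
        # 1-nearest-neighbour rule applied to the cluster centers: each new
        # point is labelled with the index of its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        return dists.argmin(axis=1)

    centroids = np.array([[0.0, 0.0], [5.0, 5.0]])       # e.g. output of k-means
    new_points = np.array([[0.5, -0.2], [4.7, 5.1]])
    labels = assign_to_centroids(new_points, centroids)  # -> array([0, 1])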
Description
Given a set of observations $(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the n observations into $k$ ($\le n$) sets $\mathbf{S} = \{S_1, S_2, \dots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS). Formally, the objective is to find:
$$\operatorname*{arg\,min}_{\mathbf{S}} \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \left\| \mathbf{x} - \boldsymbol{\mu}_i \right\|^2 = \operatorname*{arg\,min}_{\mathbf{S}} \sum_{i=1}^{k} |S_i| \operatorname{Var} S_i,$$
where $\boldsymbol{\mu}_i$ is the mean (also called centroid) of points in $S_i$, i.e.
$$\boldsymbol{\mu}_i = \frac{1}{|S_i|} \sum_{\mathbf{x} \in S_i} \mathbf{x},$$
$|S_i|$ is the size of $S_i$, and $\|\cdot\|$ is the usual L2 norm. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster:
$$\operatorname*{arg\,min}_{\mathbf{S}} \sum_{i=1}^{k} \frac{1}{|S_i|} \sum_{\mathbf{x}, \mathbf{y} \in S_i} \left\| \mathbf{x} - \mathbf{y} \right\|^2.$$
The equivalence can be deduced from the identity
$$|S_i| \sum_{\mathbf{x} \in S_i} \left\| \mathbf{x} - \boldsymbol{\mu}_i \right\|^2 = \frac{1}{2} \sum_{\mathbf{x}, \mathbf{y} \in S_i} \left\| \mathbf{x} - \mathbf{y} \right\|^2.$$
Since the total variance is constant, this is equivalent to maximizing the sum of squared deviations between points in different clusters. This deterministic relationship is also related to the law of total variance in probability theory.
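The definitions above can be checked numerically. The following sketch computes the within-cluster sum of squares both from the centroid and via the pairwise form, on a small made-up data set:

    import numpy as np

    # Two hypothetical clusters in the plane.
    S1 = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
    S2 = np.array([[8.0, 8.0], [9.0, 9.0]])

    def wcss(cluster):
        # Sum of squared distances to the cluster mean.
        mu = cluster.mean(axis=0)
        return np.sum(np.linalg.norm(cluster - mu, axis=1) ** 2)

    def pairwise_form(cluster):
        # Sum of squared distances over all ordered pairs, divided by 2|S|,
        # which equals the WCSS by the identity above.
        diffs = cluster[:, None, :] - cluster[None, :, :]
        return np.sum(diffs ** 2) / (2 * len(cluster))

    for S in (S1, S2):
        assert np.isclose(wcss(S), pairwise_form(S))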
History
The term "k-means" was first used by James MacQueen in 1967, though the idea goes back to Hugo Steinhaus in 1956. The standard algorithm was first proposed by Stuart Lloyd of Bell Labs in 1957 as a technique for pulse-code modulation, although it was not published as a journal article until 1982. In 1965, Edward W. Forgy published essentially the same method, which is why it is sometimes referred to as the Lloyd–Forgy algorithm.Algorithms
Standard algorithm (naive k-means)
The most common algorithm uses an iterative refinement technique. Due to its ubiquity, it is often called "the k-means algorithm"; it is also referred to as Lloyd's algorithm, particularly in the computer science community. It is sometimes also referred to as "naïve k-means", because there exist much faster alternatives.
Given an initial set of $k$ means $m_1^{(1)}, \dots, m_k^{(1)}$, the algorithm proceeds by alternating between two steps:
- Assignment step: Assign each observation to the cluster with the nearest mean: that with the least squared Euclidean distance.
$$S_i^{(t)} = \left\{ x_p : \left\| x_p - m_i^{(t)} \right\|^2 \le \left\| x_p - m_j^{(t)} \right\|^2 \ \forall j,\ 1 \le j \le k \right\},$$
where each $x_p$ is assigned to exactly one $S^{(t)}$, even if it could be assigned to two or more of them.
- Update step: Recalculate means (centroids) for observations assigned to each cluster. This is also called refitting.
$$m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j$$
The algorithm has converged when the assignments no longer change or, equivalently, when the WCSS has become stable. The algorithm is not guaranteed to find the optimal cluster assignment.
The algorithm is often presented as assigning objects to the nearest cluster by distance. Using a distance function other than (squared) Euclidean distance may prevent the algorithm from converging. Various modifications of k-means, such as spherical k-means and k-medoids, have been proposed to allow using other distance measures.
Pseudocode
The below pseudocode outlines the implementation of the standard k-means clustering algorithm. Initialization of centroids, the distance metric between points and centroids, and the calculation of new centroids are design choices and will vary with different implementations. In this example pseudocode, distance(a, b) returns the distance between the two specified points and calculateCentroid(cluster) returns the mean of the points in a cluster.

function kmeans(points, k) is
    // Initialize centroids
    centroids ← list of k starting centroids
    converged ← false
    while converged = false do
        // Create empty clusters
        clusters ← list of k empty lists
        // Assign each point to the nearest centroid
        for i ← 0 to length(points) - 1 do
            point ← points[i]
            closestIndex ← 0
            minDistance ← distance(point, centroids[0])
            for j ← 1 to k - 1 do
                d ← distance(point, centroids[j])
                if d < minDistance then
                    minDistance ← d
                    closestIndex ← j
            clusters[closestIndex].append(point)
        // Recalculate centroids as the mean of each cluster
        newCentroids ← empty list
        for i ← 0 to k - 1 do
            newCentroid ← calculateCentroid(clusters[i])
            newCentroids.append(newCentroid)
        // Check for convergence
        if newCentroids = centroids then
            converged ← true
        else
            centroids ← newCentroids
    return clusters
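For comparison with the pseudocode, a compact runnable sketch in Python with NumPy follows. The Forgy-style initialization, the handling of empty clusters, and the convergence test by centroid comparison are choices of this sketch rather than requirements of the algorithm:

    import numpy as np

    def kmeans(points, k, max_iter=100, seed=0):
        # points: (n, d) array of observations; returns (labels, centroids).
        rng = np.random.default_rng(seed)
        # Forgy-style initialization: pick k distinct observations as starting centroids.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(max_iter):
            # Assignment step: index of the nearest centroid for every point.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: mean of the points assigned to each cluster
            # (an empty cluster keeps its previous centroid in this sketch).
            new_centroids = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Converged when the centroids no longer move.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    # Example usage on a small synthetic data set.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
    labels, centroids = kmeans(X, k=2)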
Initialization methods
Commonly used initialization methods are Forgy and Random Partition. The Forgy method randomly chooses k observations from the dataset and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points. The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. According to Hamerly et al., the Random Partition method is generally preferable for algorithms such as the k-harmonic means and fuzzy k-means. For expectation maximization and standard k-means algorithms, the Forgy method of initialization is preferable. A comprehensive study by Celebi et al., however, found that popular initialization methods such as Forgy, Random Partition, and Maximin often perform poorly, whereas Bradley and Fayyad's approach performs "consistently" in "the best group" and k-means++ performs "generally well".
The algorithm does not guarantee convergence to the global optimum; the result may depend on the initial clusters. As the algorithm is usually fast, it is common to run it multiple times with different starting conditions. However, worst-case performance can be slow: in particular, certain point sets, even in two dimensions, converge in exponential time, that is $2^{\Omega(n)}$. These point sets do not seem to arise in practice: this is corroborated by the fact that the smoothed running time of k-means is polynomial.
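The Forgy and Random Partition initializations described above can be sketched as follows, assuming NumPy (the data and the value of k are illustrative):

    import numpy as np

    def forgy_init(points, k, rng):
        # Forgy: pick k observations at random and use them as the initial means.
        return points[rng.choice(len(points), size=k, replace=False)]

    def random_partition_init(points, k, rng):
        # Random Partition: randomly assign every observation to a cluster,
        # then take the centroid of each random group as the initial mean
        # (empty groups are not handled in this brief sketch).
        labels = rng.integers(0, k, size=len(points))
        return np.array([points[labels == j].mean(axis=0) for j in range(k)])

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    # Forgy tends to spread the initial means out; Random Partition places
    # them near the overall center of the data set.
    means_forgy = forgy_init(X, 3, rng)
    means_partition = random_partition_init(X, 3, rng)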
The "assignment" step is referred to as the "expectation step", while the "update step" is a maximization step, making this algorithm a variant of the generalized expectation–maximization algorithm.
Complexity
Finding the optimal solution to the k-means clustering problem for observations in d dimensions is:
- NP-hard in general Euclidean space (of d dimensions) even for two clusters,
- NP-hard for a general number of clusters k even in the plane,
- if k and d (the dimension) are fixed, the problem can be exactly solved in time $O(n^{dk+1})$, where n is the number of entities to be clustered.
The running time of Lloyd's algorithm (and most variants) is $O(nkdi)$, where:
- n is the number of d-dimensional vectors
- k the number of clusters
- i the number of iterations needed until convergence.
- In the worst case, Lloyd's algorithm needs $i = 2^{\Omega(\sqrt{n})}$ iterations, so that the worst-case complexity of Lloyd's algorithm is superpolynomial.
- Lloyd's k-means algorithm has polynomial smoothed running time. It is shown that for an arbitrary set of n points in $[0,1]^d$, if each point is independently perturbed by a normal distribution with mean $0$ and variance $\sigma^2$, then the expected running time of the k-means algorithm is bounded by $O(n^{34}k^{34}d^{8}\log^{4}(n)/\sigma^{6})$, which is a polynomial in $n$, $k$, $d$ and $1/\sigma$.
- Better bounds are proven for simple cases. For example, it is shown that the running time of the k-means algorithm is bounded by $O(dn^{4}M^{2})$ for $n$ points in an integer lattice $\{1,\dots,M\}^{d}$.