K-Means Algorithm
K-means is one of the most widely used clustering algorithms. Here k is a parameter that we must provide, and the algorithm divides a given set of n observations into k clusters. An observation is assigned to a cluster based on the distance between the observation and the center of the cluster.
This means that each observation belongs to the cluster whose center is nearest to it.
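As a rough illustration of this nearest-cluster rule, here is a minimal Python sketch. The function name `nearest_center` and the sample coordinates are made up for this example, and it assumes that "distance" means the straight-line (Euclidean) distance covered later in this article.

```python
import math

def nearest_center(observation, centers):
    """Return the index of the center closest to the observation."""
    # math.dist computes the straight-line distance between two points.
    distances = [math.dist(observation, c) for c in centers]
    return distances.index(min(distances))

# Made-up illustrative data: three cluster centers and one observation.
centers = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]
observation = (4.0, 6.0)

cluster = nearest_center(observation, centers)
print(cluster)  # index of the closest center, here the one at (5.0, 5.0)
```

If two centers are equally close, this sketch simply picks the one that comes first in the list.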
Let's look at what you get after performing the k-means algorithm
Let's assume there are 10 observations that can be visualized in 2D space like below
After running, the k-means algorithm identifies the cluster centers like below. In this case k is equal to 3. This is a very obvious situation, but the algorithm can do the same in more complicated situations.
Let's learn Euclidean distance
Before you go further, it is important to know about Euclidean distance. Euclidean distance is simply the straight-line distance between two given coordinates. It is the distance you calculated in school, using the Pythagorean theorem, for two coordinates in 2D space.
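Written out as a formula, and assuming we name the two coordinates p = (x_1, y_1) and q = (x_2, y_2) (these symbols are chosen here only for illustration), the 2D distance is:

```latex
d(p, q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
```

For example, the distance between (0, 0) and (3, 4) is the square root of 9 + 16, which is 5.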
But Euclidean distance is not limited to 2D space. It is defined for multidimensional space, which means you can use it to find the distance between two points with any number of dimensions: sum the squared differences over every coordinate and take the square root.
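A small Python sketch of that general rule might look like this. The function name `euclidean_distance` is an assumption for this example. Python 3.8 and later also ship `math.dist`, which does the same job.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared difference of every coordinate pair, then take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d2 = euclidean_distance((0, 0), (3, 4))        # two points in 2D space
d3 = euclidean_distance((1, 2, 3), (4, 6, 3))  # two points in 3D space
print(d2, d3)  # 5.0 5.0
```

The same function works unchanged for 2, 3, or any number of dimensions, because it only loops over matching coordinates.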
Here are the steps of the k-means algorithm
- Input: a set of observations x1, …, xn and a value for k
- Place k initial cluster centers randomly
- Calculate the Euclidean distance from each cluster center to every observation. Each observation (vector) is assigned to the cluster whose center is at the smallest distance from it
- Since the cluster centers were first selected randomly, we need to find new cluster centers. To do that, we take the mean of the points in each cluster and assign that value as the new cluster center
- After step 4, repeat steps 3 and 4 until the observations assigned to each cluster no longer change. At this point you have found the clusters
Note: sometimes the loop over steps 3 and 4 takes a very long time to settle, so in practice we also stop the algorithm after a defined maximum number of iterations.
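The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function name `k_means`, the parameters `max_iterations` and `seed`, and the sample points are all assumptions made for this example. It also assumes that "placing centers randomly" means picking k of the observations at random, which is one common choice, and that an empty cluster simply keeps its previous center.

```python
import math
import random

def k_means(observations, k, max_iterations=100, seed=None):
    """Cluster observations (equal-length tuples of numbers) into k groups.

    Returns (centers, assignments), where assignments[i] is the index of
    the cluster that observations[i] belongs to.
    """
    rng = random.Random(seed)
    # Step 2: pick k distinct observations as the initial cluster centers.
    centers = rng.sample(observations, k)
    assignments = None

    # The note above: never loop more than max_iterations times.
    for _ in range(max_iterations):
        # Step 3: assign each observation to its nearest center.
        new_assignments = [
            min(range(k), key=lambda j: math.dist(obs, centers[j]))
            for obs in observations
        ]
        # Step 5: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Step 4: move each center to the mean of its assigned observations.
        for j in range(k):
            members = [o for o, a in zip(observations, assignments) if a == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))

    return centers, assignments

# Made-up data: 10 observations in 2D that form three visible groups.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (8.5, 9.0), (9.0, 8.0), (8.0, 9.0),
    (1.0, 9.0), (1.5, 8.5), (2.0, 9.0),
]
centers, labels = k_means(points, k=3, seed=0)
print(centers)
print(labels)
```

Because the starting centers are random, different seeds can give different cluster numbering, and on harder data they can give different clusters altogether. A common remedy is to run the algorithm several times and keep the best result.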