# Clustering

Often the data come without labels telling us the class of each sample, so we have to analyze the data and group the samples according to a similarity criterion, where the groups (or clusters) are sets of similar samples. This kind of analysis is called unsupervised data analysis. One of the best-known clustering tools is the k-means algorithm, which we can run as follows:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, init='random') # initialization (n_clusters was called k in very old scikit-learn releases)

kmeans.fit(data) # actual execution

The snippet above runs the algorithm and groups the data into 3 clusters (as specified by the parameter n_clusters). Now we can use the model to assign each sample to one of the clusters:

c = kmeans.predict(data)
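To make clearer what the fit and predict steps do conceptually, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python. It is not scikit-learn's implementation: the helper names `kmeans_fit` and `kmeans_predict`, the toy points, and the fixed seed are assumptions made only for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_predict(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def kmeans_fit(points, n_clusters, n_iter=100, seed=0):
    """Hypothetical helper: alternate the assignment and centroid-update steps."""
    rng = random.Random(seed)
    # Random initialization: pick n_clusters distinct samples as centroids.
    centroids = [list(p) for p in rng.sample(points, n_clusters)]
    for _ in range(n_iter):
        labels = kmeans_predict(points, centroids)
        new_centroids = []
        for j in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move the centroid to the mean of the points assigned to it.
                new_centroids.append(
                    [sum(coord) / len(members) for coord in zip(*members)]
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids


# Toy data (assumed for illustration): two well-separated groups in the plane.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids = kmeans_fit(points, n_clusters=2)
labels = kmeans_predict(points, centroids)
print(labels)
```

The first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initialization, so the numbering itself carries no meaning.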

In this example the true labels t are available, so we can evaluate the clustering by comparing it with them using the completeness and homogeneity scores:

from sklearn.metrics import completeness_score, homogeneity_score

print(completeness_score(t, c))

0.7649861514489815

print(homogeneity_score(t, c))

0.7514854021988338

The completeness score approaches 1 when most of the data points that belong to a given class are assigned to the same cluster, while the homogeneity score approaches 1 when each cluster contains almost only data points that belong to a single class.
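The difference between the two scores is easier to see on a tiny example. The sketch below computes them in plain Python from their usual entropy-based definitions, homogeneity = 1 - H(class | cluster) / H(class), with completeness as the same formula with the roles swapped. The function names and toy labels are assumptions for illustration, and this is not scikit-learn's code.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), estimated from two paired label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def homogeneity(classes, clusters):
    """1 - H(class | cluster) / H(class); defined as 1.0 if H(class) is zero."""
    h = entropy(classes)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(classes, clusters) / h


def completeness(classes, clusters):
    """Same formula as homogeneity with classes and clusters swapped."""
    return homogeneity(clusters, classes)


classes = [0, 0, 1, 1]
split = [0, 0, 1, 2]   # every cluster is pure, but class 1 is spread over two clusters
merged = [0, 0, 0, 0]  # every class sits in one cluster, but that cluster mixes classes

print(homogeneity(classes, split), completeness(classes, split))
print(homogeneity(classes, merged), completeness(classes, merged))
```

The first clustering is perfectly homogeneous but not complete, while the second is perfectly complete but not homogeneous at all.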

We can also plot the result of the clustering and compare the assignments with the real labels visually:

from pylab import figure, subplot, plot, show # needed only if pylab has not been imported already

figure()

subplot(211) # top figure with the real classes

plot(data[t==1,0],data[t==1,2],'bo')

plot(data[t==2,0],data[t==2,2],'ro')

plot(data[t==3,0],data[t==3,2],'go')

subplot(212) # bottom figure with classes assigned automatically

plot(data[c==1,0],data[c==1,2],'bo',alpha=.7)

plot(data[c==2,0],data[c==2,2],'go',alpha=.7)

plot(data[c==0,0],data[c==0,2],'mo',alpha=.7)

show()

The following graph shows the result:

Looking at the graph, we see that the cluster in the bottom left corner has been completely identified by k-means, while the two clusters at the top have been identified with some errors.
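Because k-means numbers its clusters arbitrarily, the colors in the bottom plot had to be matched to the classes by hand. One possible way to automate this, and to count the errors instead of judging them by eye, is to relabel each cluster with the most common true class among its members. The helper `map_clusters_to_classes` and the toy labels below are hypothetical and only sketch the idea.

```python
from collections import Counter


def map_clusters_to_classes(true_labels, cluster_ids):
    """Hypothetical helper: relabel each cluster with its most common true class."""
    majority = {}
    for cluster in set(cluster_ids):
        members = [t for t, c in zip(true_labels, cluster_ids) if c == cluster]
        majority[cluster] = Counter(members).most_common(1)[0][0]
    return [majority[c] for c in cluster_ids]


# Toy labels (assumed): classes are numbered 1..3, clusters 0..2 in another order.
t_toy = [1, 1, 1, 2, 2, 2, 3, 3, 3]
c_toy = [1, 1, 1, 2, 2, 0, 0, 0, 0]

relabeled = map_clusters_to_classes(t_toy, c_toy)
errors = sum(a != b for a, b in zip(t_toy, relabeled))
print(relabeled, errors)
```

Here one sample of class 2 falls into the cluster dominated by class 3, so it is counted as a single error. Note that this majority mapping can send two clusters to the same class when the clustering is poor.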
