Clustering

Often the data come without labels that tell us the class of each sample, so we have to analyze the data and group the samples according to a similarity criterion; the resulting groups (or clusters) are sets of similar samples. This kind of analysis is called unsupervised data analysis. One of the best-known clustering tools is the k-means algorithm, which we can run as follows:

from sklearn.cluster import KMeans
# n_clusters is the number of clusters (early scikit-learn releases called this parameter k)
kmeans = KMeans(n_clusters=3, init='random') # initialization
kmeans.fit(data) # actual execution

The snippet above runs the algorithm and groups the data into 3 clusters (as specified by the parameter n_clusters). Now we can use the model to assign each sample to one of the clusters:

c = kmeans.predict(data)
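To make clearer what fitting and assignment involve, here is a minimal from-scratch sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the names kmeans_sketch and sq_dist, the fixed seed, and the four toy points are all made up for this example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Hypothetical helper: returns (centroids, labels) for a list of tuples."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct samples as starting centroids,
    # similar in spirit to init='random'.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster became empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Toy data (made up): two well-separated groups of two points each.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans_sketch(toy, k=2)
print(labels)
```

Which group receives id 0 and which receives id 1 depends on the random initialization, so cluster ids are arbitrary and cannot be compared directly with class labels.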

For this dataset the true classes are known (they are stored in t), so we can evaluate the clustering by comparing the cluster assignments with those labels, using the completeness and homogeneity scores:

from sklearn.metrics import completeness_score, homogeneity_score
print(completeness_score(t, c))

0.7649861514489815

print(homogeneity_score(t, c))

0.7514854021988338

The completeness score approaches 1 when most of the data points that are members of a given class are elements of the same cluster, while the homogeneity score approaches 1 when each cluster contains almost exclusively data points that are members of a single class.
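Both scores can be written in terms of entropy: homogeneity is 1 - H(C|K)/H(C) and completeness is 1 - H(K|C)/H(K), where C denotes the classes and K the clusters. The following is a small sketch of these definitions in plain Python, intended only to illustrate the idea; it is not scikit-learn's code, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a sequence of labels."""
    n = len(labels)
    return -sum((cnt / n) * log(cnt / n) for cnt in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given): uncertainty left in `target` once `given` is known."""
    n = len(target)
    joint = Counter(zip(given, target))
    marginal = Counter(given)
    return -sum((cnt / n) * log(cnt / marginal[g]) for (g, _), cnt in joint.items())

def homogeneity(classes, clusters):
    """1 when every cluster contains members of a single class only."""
    h_c = entropy(classes)
    if h_c == 0:  # a single class: trivially homogeneous
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_c

def completeness(classes, clusters):
    """1 when all members of a class end up in the same cluster."""
    # Completeness is homogeneity with the roles of classes and clusters swapped.
    return homogeneity(clusters, classes)

# Toy labels (made up): class 1 is split across clusters 1 and 2.
classes = [0, 0, 1, 1]
clusters = [0, 0, 1, 2]
print(homogeneity(classes, clusters))   # every cluster is pure
print(completeness(classes, clusters))  # below 1 because class 1 is split
```

In this toy case each cluster is pure, so homogeneity is 1, but class 1 is divided between two clusters, so completeness drops to 2/3.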

We can also visualize the result of the clustering and compare the assignments with the real labels visually:

from pylab import figure, subplot, plot, show
figure()
subplot(211) # top figure with the real classes
plot(data[t==1,0],data[t==1,2],'bo')
plot(data[t==2,0],data[t==2,2],'ro')
plot(data[t==3,0],data[t==3,2],'go')
subplot(212) # bottom figure with classes assigned automatically
# cluster ids are arbitrary, so the colors need not match the top figure
plot(data[c==1,0],data[c==1,2],'bo',alpha=.7)
plot(data[c==2,0],data[c==2,2],'go',alpha=.7)
plot(data[c==0,0],data[c==0,2],'mo',alpha=.7)
show()

The following graph shows the result:

Graph 3

Observing the graph, we see that the cluster in the bottom left corner has been completely identified by k-means, while the two clusters at the top have been identified with some errors.
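The visual impression can also be checked numerically by counting how many samples of each class land in each cluster. The sketch below does this with a plain Counter; the helper name contingency and the label lists t_toy and c_toy are invented for illustration and are not the data used in this section.

```python
from collections import Counter

def contingency(classes, clusters):
    """Count how many samples fall into each (true class, cluster) pair."""
    return Counter(zip(classes, clusters))

# Made-up labels for illustration only: class 1 is recovered exactly,
# while classes 2 and 3 are partly mixed between clusters 0 and 2.
t_toy = [1, 1, 1, 2, 2, 2, 3, 3, 3]
c_toy = [1, 1, 1, 0, 0, 2, 2, 2, 0]

table = contingency(t_toy, c_toy)
for (cls, clu), n in sorted(table.items()):
    print("class", cls, "-> cluster", clu, ":", n, "samples")
```

A class that appears in a single row of this listing has been identified completely; a class spread over several clusters corresponds to the mixed regions of the plot.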