Often we don't have labels attached to the data that tell us the class of the samples; we have to analyze the data in order to group them on the basis of a similarity criterion, where groups (or clusters) are sets of similar samples. This kind of analysis is called unsupervised data analysis. One of the most famous clustering tools is the k-means algorithm, which we can run as follows:

from sklearn.cluster import KMeans 
kmeans = KMeans(n_clusters=3, init='random') # initialization
kmeans.fit(data) # actual execution

The snippet above runs the algorithm and groups the data into 3 clusters (as specified by the parameter n_clusters). Now we can use the model to assign each sample to one of the clusters:

c = kmeans.predict(data)
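Putting the steps above together, here is a minimal end-to-end sketch. The `data` and `t` arrays are illustrative stand-ins (synthetic 2-D blobs generated with NumPy), since the original example assumes they were defined earlier:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# three well-separated 2-D blobs, 50 points each (illustrative data)
data = np.vstack([
    rng.randn(50, 2) + [0, 0],
    rng.randn(50, 2) + [10, 10],
    rng.randn(50, 2) + [0, 10],
])
t = np.repeat([0, 1, 2], 50)  # true labels, kept aside for evaluation

kmeans = KMeans(n_clusters=3, init='random', n_init=10, random_state=0)
kmeans.fit(data)          # actual execution
c = kmeans.predict(data)  # cluster index assigned to each sample
```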

And we can evaluate the result of the clustering by comparing it with the labels that we already have, using the completeness and homogeneity scores:

from sklearn.metrics import completeness_score, homogeneity_score
print(completeness_score(t, c))
print(homogeneity_score(t, c))
The completeness score approaches 1 when most of the data points that are members of a given class are elements of the same cluster. The homogeneity score approaches 1 when all the clusters contain almost only data points that are members of a single class.

We can also visualize the result of the clustering and compare the assignments with the real labels visually:

subplot(211) # top figure with the real classes
scatter(data[:, 0], data[:, 1], c=t)
subplot(212) # bottom figure with classes assigned automatically
scatter(data[:, 0], data[:, 1], c=c)

The following graph shows the result:

Graph 3

Observing the graph, we see that the cluster in the bottom-left corner has been completely identified by k-means, while the two clusters on the top have been identified with some errors.