Often we don't have labels attached to the data that tell us the class of the samples; we have to analyze the data in order to group them on the basis of a similarity criterion, where groups (or clusters) are sets of similar samples. This kind of analysis is called unsupervised data analysis. One of the most famous clustering tools is the k-means algorithm, which we can run as follows:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, init='random') # initialization
kmeans.fit(data) # actual execution
The snippet above runs the algorithm and groups the data into 3 clusters (as specified by the parameter n_clusters; very old scikit-learn releases called it k). Now we can use the model to assign each sample to one of the clusters:
c = kmeans.predict(data)
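For readers without the original `data` array at hand, the two steps above can be tried end to end on synthetic data. This is only a sketch: the `make_blobs` dataset, the chosen centers, and the `n_init` and `random_state` values are assumptions made for illustration, not part of the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for `data`: 150 two-dimensional points
# scattered around 3 assumed centers.
data, _ = make_blobs(n_samples=150, centers=[[-5, -5], [0, 5], [5, 0]],
                     cluster_std=0.8, random_state=0)

# n_init and random_state are fixed here only to make the run repeatable.
kmeans = KMeans(n_clusters=3, init='random', n_init=10, random_state=0)
kmeans.fit(data)

c = kmeans.predict(data)        # one cluster index per sample
print(c[:10])                   # cluster indices of the first 10 samples
print(kmeans.cluster_centers_)  # coordinates of the 3 centroids
```

The cluster indices returned by `predict` are arbitrary identifiers: which group ends up as 0, 1 or 2 depends on the initialization.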
We can then evaluate the clustering by comparing it with the labels that we already have, using the completeness and homogeneity scores:
from sklearn.metrics import completeness_score, homogeneity_score
The completeness score approaches 1 when most of the data points that are members of a given class are elements of the same cluster, while the homogeneity score approaches 1 when all the clusters contain almost only data points that are members of a single class.
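The difference between the two scores is easiest to see on a toy case. The sketch below uses four hand-made labels; the names `t`, `one_cluster` and `singletons` are hypothetical and chosen only for this illustration. Both functions take the true labels first and the cluster assignments second.

```python
from sklearn.metrics import completeness_score, homogeneity_score

t = [0, 0, 1, 1]            # hypothetical true classes

one_cluster = [0, 0, 0, 0]  # everything lumped into a single cluster
print(completeness_score(t, one_cluster))  # each class sits in one cluster
print(homogeneity_score(t, one_cluster))   # but that cluster mixes both classes

singletons = [0, 1, 2, 3]   # every sample in its own cluster
print(homogeneity_score(t, singletons))    # each cluster holds a single class
print(completeness_score(t, singletons))   # but each class is split in two
```

The single-cluster assignment is fully complete (score 1) but not homogeneous at all (score 0), while the one-sample-per-cluster assignment is fully homogeneous (score 1) but only half complete (score 0.5). A good clustering needs both scores to be high.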
We can also visualize the result of the clustering and compare the assignments with the real labels visually:
subplot(211) # top figure with the real classes
subplot(212) # bottom figure with classes assigned automatically
The following graph shows the result:
Observing the graph, we see that the cluster in the bottom left corner has been completely identified by k-means, while the two clusters at the top have been identified with some errors.
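A two-panel comparison of this kind can be sketched with matplotlib as follows. The dataset here is a synthetic stand-in generated with `make_blobs`, and the variable name `t` for the true labels is an assumption, so the resulting picture will not match the graph discussed above.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the labeled dataset (assumed, not the original data).
data, t = make_blobs(n_samples=150, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, init='random', n_init=10, random_state=0)
c = kmeans.fit_predict(data)  # fit the model and assign clusters in one call

fig = plt.figure()
top = plt.subplot(211)     # top figure with the real classes
top.scatter(data[:, 0], data[:, 1], c=t)
bottom = plt.subplot(212)  # bottom figure with classes assigned automatically
bottom.scatter(data[:, 0], data[:, 1], c=c)
```

Calling `plt.show()` afterwards displays the figure in an interactive session. Since cluster indices are arbitrary, the colors of matching groups may differ between the two panels; what matters is whether the same points are grouped together.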