Clustering: Grouping Data to Single Out Results
The term “data mining” is something of a misnomer: the goal is to mine relevant patterns and information from the data, not the data itself. The data mining process is carried out in three stages:
The above template is valid for most data mining techniques, and I will use it as the basis for explaining clustering.
In layman’s terms, clustering is a technique for grouping chunks of data together based on their similarities. Clustering is not a single process but a collection of techniques that can be deployed according to our needs. Cluster analysis is not a fully automatic task either; it is an iterative process of knowledge discovery, in which the choices made are revisited until the result is useful. Some of the families of algorithms that fall under cluster analysis are described below:
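- Hierarchical: Data points are grouped according to their distance from one another, building a nested hierarchy of clusters either by repeatedly merging the closest clusters or by repeatedly splitting larger ones.
- Centroid-based: Each cluster is represented by a central point (its centroid), and every data point is assigned to the nearest centroid, which divides the vector space into regions. k-means is the best-known example.
- Distribution-based: Clusters are modelled as statistical distributions, such as Gaussians, and the model is fitted iteratively so that each data point is assigned to the distribution it most likely belongs to.
- Density-based: Regions of the vector space where data points are densely packed are grouped into clusters, while points in sparsely populated regions are typically treated as noise or as the boundary between clusters.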
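To make the hierarchical approach concrete, here is a minimal sketch of bottom-up (agglomerative) clustering using single linkage, where the distance between two clusters is taken to be the distance between their closest members. The function name `single_linkage`, the sample points, and the choice of single linkage are assumptions made for illustration; other linkage rules exist, and this brute-force version is only suitable for small data sets.

```python
from itertools import combinations
from math import dist


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters whose closest members are nearest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        # combinations() guarantees i < j, so index i stays valid after the pop.
        clusters[i].extend(clusters.pop(j))
    return clusters


# Hypothetical 2-D data points, chosen only for illustration.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0)]
groups = single_linkage(points, n_clusters=3)
for group in groups:
    print(sorted(group))
```

Stopping at a chosen number of clusters is one way to cut the hierarchy; recording the order of the merges instead would give the full tree.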
Real-life example: Suppose we need to find out which tax bracket each employee of an organization falls under. Clustering offers a convenient way to approach this task:
- First, the data set must be selected carefully; for this task, we focus on the salaries.
- Next, these salaries are plotted as points in a vector space.
- Then the data points are clustered with the k-means algorithm, with the number of clusters set to the number of tax slabs, which separates the space into Voronoi cells.
Finally, by looking at the resulting model, we can ascertain the tax bracket under which each employee falls.
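The salary example can be sketched in a few lines of plain Python. This is a minimal, illustrative version of k-means (Lloyd's algorithm) in one dimension; the function name `kmeans_1d`, the salary figures, the choice of three clusters, and the deterministic initialisation are all assumptions rather than part of the original example. Note that k-means derives its groups from the data itself, so the boundaries it finds will only approximate real tax slabs if the salaries happen to cluster around them.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on a 1-D list; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # Converged: no centroid moved.
        centroids = updated
    return centroids, labels


# Hypothetical annual salaries, chosen only for illustration.
salaries = [22_000, 25_000, 27_000, 48_000, 52_000, 55_000, 110_000, 120_000, 125_000]
centroids, labels = kmeans_1d(salaries, k=3)

# In one dimension, the Voronoi cell boundaries are the midpoints
# between neighbouring centroids (the centroids are in ascending order here).
boundaries = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]

for salary, label in zip(salaries, labels):
    print(salary, "-> cluster", label)
print("cell boundaries:", boundaries)
```

Each label identifies the Voronoi cell, and so the group, that a salary falls into; mapping those groups to named tax brackets remains a separate, manual step.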
By: Yogesh Das