Handling and processing large volumes of data requires efficient data mining algorithms. k-means is a very popular clustering algorithm for data mining, but its performance suffers because of the initial seeding problem. The computation time of the k-means algorithm is directly proportional to the number of data points, the number of dimensions, and the number of iterations, so processing a large number of data points sequentially is very expensive. We propose an efficient parallel framework that includes dimensionality-reduction as well as data-size reduction techniques to reduce k-means processing time and mitigate the initial seeding problem. The framework leverages the multi-node and multi-core architectures of a typical commodity cluster. We have validated the proposed approaches on real and synthetic datasets in a parallel environment. The experimental results show significant improvements in k-means performance.
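
The cost argument above can be illustrated with a minimal sequential sketch. It is not the framework from the paper: the Gaussian random projection stands in for the dimensionality-reduction step, plain uniform sampling stands in for a coreset construction, and the seeding is simple random selection. All function names and parameters (`random_projection`, `reduce_then_cluster`, `target_dim`, `sample_size`) are hypothetical and chosen only for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_projection(points, target_dim, rng):
    """Map d-dimensional points to target_dim dimensions using a Gaussian
    random matrix (an assumed stand-in for dimensionality reduction)."""
    d = len(points[0])
    scale = target_dim ** -0.5
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
              for _ in range(d)]
    return [[sum(p[i] * matrix[i][j] for i in range(d))
             for j in range(target_dim)]
            for p in points]


def lloyd(points, k, iterations, rng):
    """Plain Lloyd iterations; each pass costs about n * k * d distance terms."""
    dim = len(points[0])
    centers = [list(c) for c in rng.sample(points, k)]  # naive random seeding
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p in points:
            # Assign the point to its nearest current center.
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            counts[j] += 1
            for i, x in enumerate(p):
                sums[j][i] += x
        # Recompute each center as the mean of its points; keep empty ones.
        centers = [[s / counts[j] for s in sums[j]] if counts[j] else centers[j]
                   for j in range(k)]
    return centers


def reduce_then_cluster(points, k, target_dim, sample_size,
                        iterations=10, seed=0):
    """Shrink d (projection) and n (uniform sample), then run k-means.
    The returned centers live in the projected space."""
    rng = random.Random(seed)
    projected = random_projection(points, target_dim, rng)
    sample = rng.sample(projected, min(sample_size, len(projected)))
    return lloyd(sample, k, iterations, rng)
```

Because the per-iteration work scales with both the number of points and the number of dimensions, clustering 50 sampled points in 2 dimensions instead of 200 points in 8 dimensions cuts the distance computations per iteration by roughly a factor of 16 in this toy setting. A real coreset would weight the sampled points to preserve the clustering cost, which uniform sampling does not guarantee.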

Additional Metadata
Keywords Cluster, Coreset, Dimensionality-reduction, Initial seeding, K-means, Multicore, Multinode, Parallel computing
Persistent URL dx.doi.org/10.1145/2835043.2835060
Conference 8th ACM COMPUTE INDIA Conference, Compute 2015
Kumari, S. (Sonal), Maheshwari, A., Goyal, P. (Poonam), & Goyal, N. (Navneet). (2015). Parallel framework for efficient k-means clustering. Presented at the 8th ACM COMPUTE INDIA Conference, Compute 2015. doi:10.1145/2835043.2835060