Optimizing Map Reduce Scheduling Using Parallel Processing Model On Data Nodes With K-Mean Clustering Algorithm In Hadoop Environment

Prof. Abhishek Pandey; Ankita Malviya

doi:10.32628/IJSRSET184979

Authors

Prof. Abhishek Pandey, Takshshila Institute of Engineering and Technology, Jabalpur, Madhya Pradesh, India
Ankita Malviya, Takshshila Institute of Engineering and Technology, Jabalpur, Madhya Pradesh, India

Keywords:

K-Means Clustering, MapReduce, Hadoop, Data Mining, Distributed Computing.

Abstract

A cluster is a group of data items that share similar characteristics. Data mining is the process of establishing relationships in, or extracting information from, raw data by performing operations such as clustering on the data set. Data collected in practical settings is often arbitrary and unstructured, so there is a constant need to analyze unstructured data sets to derive meaningful information. Unsupervised algorithms address this need by processing unstructured or semi-structured data sets. K-Means Clustering is one such technique: it imposes structure on unstructured data so that meaningful information can be extracted. This paper discusses the implementation of the K-Means Clustering Algorithm over a distributed environment using Apache Hadoop. The key to the implementation is the design of the Mapper and Reducer routines, which are discussed in the later part of the paper. The steps involved in executing the K-Means Algorithm are also described, based on a small-scale implementation on an experimental setup, to serve as a guide for practical implementations.
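
The Mapper and Reducer roles mentioned above can be illustrated with a minimal, single-process sketch. It is not the authors' Hadoop implementation. It assumes the common formulation in which the mapper assigns each point to its nearest centroid and the reducer averages the points assigned to each centroid. The function names (mapper, reducer, kmeans_iteration, kmeans) and the max_iter parameter are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def mapper(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reducer(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce pass, simulated in a single process."""
    groups = defaultdict(list)
    for point in points:
        key, value = mapper(point, centroids)
        groups[key].append(value)  # the "shuffle": group values by key
    # Assumption: a centroid that receives no points keeps its old position.
    return [
        reducer(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20):
    """Repeat iterations until the centroids stop changing or max_iter is hit."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

On a real cluster, each mapper would see only its own input split and the current centroids would be distributed to every node before each iteration. The sketch keeps everything in memory to show only the data flow between the two routines.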

