Effective Distribution of Large Scale Datasets Clustering Based on Map Reduce

T Vignesh Kumar; M Yuvaraj; S Anusha

doi:10.32628/IJSRSET1622136

Authors

T Vignesh Kumar, Department of Computer Science and Engineering, Dhanalakshmi College of Engineering, Chennai, Tamilnadu, India
M Yuvaraj, Department of Computer Science and Engineering, Dhanalakshmi College of Engineering, Chennai, Tamilnadu, India
S Anusha, Department of Computer Science and Engineering, Dhanalakshmi College of Engineering, Chennai, Tamilnadu, India

Keywords:

Clustering, Datasets, Map Reduce, Big Data, ICT, MongoDB, NoSQL Database, ERP, LSBT, LHC

Abstract

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction, and reduced risk. Data generated from a variety of sources with massive volumes, high rates, and different data structures are collectively known as Big Data. The MapReduce framework was built as a parallel distributed programming model to process such large-scale datasets effectively and efficiently. This paper describes Big Data analysis solutions implemented on the MapReduce framework, including their dataset structures and how they were implemented with MongoDB as the NoSQL database. NoSQL encompasses a wide variety of database technologies that were developed in response to the demands of building modern applications. MongoDB stores data using a flexible document data model. Documents contain one or more fields, including arrays, binary data, and sub-documents.
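As a rough illustration of the ideas above, the following sketch runs the map, shuffle, and reduce phases in a single Python process over a few document-shaped records. It is not the implementation described in this paper and does not use MongoDB or a cluster; the documents, the field names (`tags`, `meta`, `site`, `size_mb`), and the function names are all hypothetical, chosen only to show fields, an array, and a sub-document in one record.

```python
from collections import defaultdict

# Hypothetical records in a document shape: scalar fields, an array ("tags"),
# and a sub-document ("meta"). The names are assumptions for illustration.
docs = [
    {"_id": 1, "tags": ["sensor", "raw"], "meta": {"site": "A"}, "size_mb": 120},
    {"_id": 2, "tags": ["sensor"], "meta": {"site": "B"}, "size_mb": 80},
    {"_id": 3, "tags": ["raw"], "meta": {"site": "A"}, "size_mb": 50},
]

def map_phase(doc):
    """Emit one (key, value) pair per document: site -> size in MB."""
    yield doc["meta"]["site"], doc["size_mb"]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def map_reduce(documents):
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

totals = map_reduce(docs)
```

In a real MapReduce deployment the map and reduce calls would run in parallel on different nodes; here they run one after another so the data flow is easy to follow.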

Thus, the demand for a service stack to distribute, manage, and process massive data sets has risen drastically. In this paper, we investigate the Big Data Broadcasting problem, in which a single source node broadcasts a large chunk of data to a set of nodes with the objective of minimizing the maximum completion time. Big-data computing is a new critical challenge for the ICT industry. Engineers and researchers are dealing with data sets of petabyte scale in the cloud computing paradigm.
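To make the objective concrete, the sketch below counts broadcast rounds under a deliberately simplified model. It is not the algorithm proposed in this paper. The assumptions are: all nodes are identical, sending the whole chunk to one peer takes exactly one round, and each node can send to at most one peer per round. The function names are hypothetical. Under these assumptions, a source that serves every receiver itself needs one round per receiver, while letting every node that already holds the data forward it doubles the number of holders each round.

```python
def sequential_rounds(n_receivers: int) -> int:
    """Source alone sends the chunk to each receiver, one per round."""
    return n_receivers

def doubling_rounds(n_receivers: int) -> int:
    """Smallest r with 2**r >= n_receivers + 1 (holders double each round)."""
    return n_receivers.bit_length()

def simulate_doubling(n_receivers: int) -> int:
    """Step through the rounds explicitly to cross-check doubling_rounds."""
    total = n_receivers + 1      # receivers plus the source
    holders, rounds = 1, 0       # initially only the source holds the data
    while holders < total:
        holders = min(total, holders * 2)  # every holder serves one new node
        rounds += 1
    return rounds

if __name__ == "__main__":
    for n in (1, 7, 100):
        print(n, sequential_rounds(n), doubling_rounds(n), simulate_doubling(n))
```

The maximum completion time is the round in which the last receiver finishes, so in this toy model it grows linearly with the number of receivers for the sequential schedule and logarithmically for the doubling schedule. Real clusters have heterogeneous bandwidths and can split the chunk into blocks, which this sketch ignores.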

