Effective Distribution of Large Scale Datasets Clustering Based on Map Reduce

Authors

  • T Vignesh Kumar, Department of Computer Science and Engineering, Dhanalakshmi College of Engineering, Chennai, Tamilnadu, India
  • M Yuvaraj, Department of Computer Science and Engineering, Dhanalakshmi College of Engineering, Chennai, Tamilnadu, India
  • S Anusha, Department of Computer Science and Engineering, Dhanalakshmi College of Engineering, Chennai, Tamilnadu, India

Keywords:

Clustering, Datasets, Map Reduce, Big Data, ICT, MongoDB, NoSQL Database, ERP, LSBT, LHC

Abstract

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction, and reduced risk. Data generated from a variety of sources with massive volumes, high rates, and different data structures are collectively known as Big Data. The MapReduce framework was built as a parallel distributed programming model to process such large-scale datasets effectively and efficiently. Big Data software analysis solutions were implemented on the MapReduce framework; we describe their dataset structures and how they were implemented with MongoDB as the NoSQL database. NoSQL encompasses a wide variety of database technologies that were developed in response to the demands of building modern applications. MongoDB stores data using a flexible document data model. Documents contain one or more fields, including arrays, binary data, and sub-documents.
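As a rough illustration of the MapReduce model applied to clustering, the following sketch runs one centroid-update step over one-dimensional points entirely in memory. It is not the authors' implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `kmeans_step`) and the choice of a k-means style update are assumptions made for this example, and no Hadoop or MongoDB calls are involved.

```python
from collections import defaultdict


def map_phase(points, centroids):
    """Emit (nearest_centroid_index, point) pairs, one per input point."""
    for x in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        yield nearest, x


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group to its mean, giving an updated centroid per key."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def kmeans_step(points, centroids):
    """One hypothetical clustering iteration expressed as map, shuffle, reduce."""
    return reduce_phase(shuffle(map_phase(points, centroids)))


if __name__ == "__main__":
    print(kmeans_step([1.0, 2.0, 9.0, 11.0], [0.0, 10.0]))
```

In a real deployment the map and reduce functions would run on separate workers and the shuffle would be handled by the framework; the sketch only shows how the work divides between the two phases.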

Thus, the demand for a service stack to distribute, manage, and process massive data sets has risen drastically. In this paper, we investigate the Big Data Broadcasting problem, in which a single source node broadcasts a large chunk of data to a set of nodes with the objective of minimizing the maximum completion time. Big-data computing is a new critical challenge for the ICT industry: engineers and researchers are dealing with data sets of petabyte scale in the cloud computing paradigm.
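To make the completion-time objective concrete, the sketch below counts broadcast rounds under a deliberately simplified model. The assumptions are ours, not the paper's: all links are identical, sending the whole chunk to one node takes exactly one round, and a node can send to only one other node per round. The function names are hypothetical, and this is not the algorithm the paper proposes.

```python
def rounds_source_only(num_receivers):
    """Source sends the chunk to one receiver per round; nobody relays."""
    return num_receivers


def rounds_with_relay(num_receivers):
    """Every node already holding the chunk sends it to one new node per round."""
    holders = 1                  # only the source holds the chunk at first
    total = num_receivers + 1    # receivers plus the source
    rounds = 0
    while holders < total:
        holders = min(total, holders * 2)  # each holder serves one new node
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (1, 7, 8, 1000):
        print(n, rounds_source_only(n), rounds_with_relay(n))
```

Under these assumptions, letting receivers relay the data reduces the maximum completion time from linear to logarithmic in the number of receivers, which is why the choice of broadcast topology matters for large node sets.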

Published

2017-12-31

Section

Research Articles

How to Cite

[1]
T Vignesh Kumar, M Yuvaraj, S Anusha, "Effective Distribution of Large Scale Datasets Clustering Based on Map Reduce," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 2, Issue 2, pp. 505-508, March-April 2016.