Hadoop is an open source implementation of Google’s MapReduce framework, and MapReduce is the heart of Apache Hadoop. Hadoop stores files in the Hadoop Distributed File System (HDFS), an open source implementation of the Google File System (GFS). Hadoop processes large data sets in parallel by splitting them into smaller partitions; the job tracker feeds each partition to a separate task on a data node. The data node is the node where the data actually resides. The task tracker also resides on the data node; it runs the tasks and reports their status to the job tracker. In a MapReduce job, the slowest running task determines the job completion time: a slow task delays the progress of the entire job. This slowest running task is known as the straggler. Stragglers can occur for many reasons, one of which is data skew. This paper reviews the different types of data skew, where in MapReduce data skew can occur, and what measures are taken to overcome these problems.
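The straggler effect described above can be illustrated with a small sketch. This is not Hadoop code and not the authors' method; it is a toy model that assumes a CRC32-based hash partitioner, a fixed cost per record, and made-up keys (`hot`, `key0` ... `key9`). The function names `hash_partition`, `reducer_loads`, and `job_completion_time` are hypothetical.

```python
import zlib


def hash_partition(key, num_reducers):
    # Toy stand-in for a hash partitioner: a deterministic hash modulo the reducer count.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers):
    # Count how many records each reducer receives.
    loads = [0] * num_reducers
    for key in keys:
        loads[hash_partition(key, num_reducers)] += 1
    return loads


def job_completion_time(loads, seconds_per_record=1.0):
    # The job finishes only when the most heavily loaded task (the straggler) finishes.
    return max(loads) * seconds_per_record


# Assumed skewed input: one key accounts for 90 of 100 records.
skewed_keys = ["hot"] * 90 + [f"key{i}" for i in range(10)]
num_reducers = 4

loads = reducer_loads(skewed_keys, num_reducers)
skewed_time = job_completion_time(loads)
balanced_time = len(skewed_keys) / num_reducers  # ideal case: perfectly even split

print("records per reducer:", loads)
print("completion time with skew:", skewed_time)
print("completion time if balanced:", balanced_time)
```

Because every record with the same key goes to the same reducer, the reducer that receives `hot` processes at least 90 records, while a perfectly balanced split would give each reducer 25. In this model the job takes more than three times as long as the balanced case even though three of the four reducers are nearly idle.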
Reetesh Rai, Shravan Kumar
MapReduce, HDFS, Straggler, Data Skew
Published in: Volume 2, Issue 3, May-June 2016
Cite This Article
Reetesh Rai, Shravan Kumar, "A Survey on Hadoop Storage Issues", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 2, Issue 3, pp. 499-505, May-June 2016.
URL : http://ijsrset.com/IJSRSET1623136.php