Survey on Resource Management Solutions to Speed up Processing Small Files in Hadoop Cluster
DOI:
https://doi.org/10.32628/IJSRSET3214668
Keywords:
Internet of Things, Big data analytics, resource allocation and scheduling
Abstract
High performance data analytics is a computing paradigm in which data, analytics, and other computational resources are placed optimally so that superior performance is achieved with lower resource consumption. Resource allocation and scheduling are the two major functions that Hadoop clusters must address to satisfy users' service level agreements for high performance data analytics applications. Although many solutions have been proposed for optimal resource allocation and scheduling, those schemes are designed for large Hadoop files. With the recent convergence of the Internet of Things (IoT) and big data, there is a need to process large volumes of small files, that is, files smaller than the Hadoop block size. Such files create a huge storage overhead and exhaust the computational resources of Hadoop clusters. This survey analyzes existing work on resource allocation and scheduling in Hadoop clusters and its suitability for small files. The aim is to identify the problems that existing resource allocation and scheduling approaches face when handling small files. Based on the problems identified, a prospective solution architecture is proposed.
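To make the small-file overhead concrete, the following minimal sketch estimates how much NameNode metadata a set of files occupies before and after being merged into one large file. It is an illustration under stated assumptions, not a method taken from the survey: the 128 MiB block size and the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per file or block object are assumed values, and the function names are hypothetical.

```python
import math

# Assumed values for illustration; neither figure comes from the survey.
BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size (128 MiB)
BYTES_PER_OBJECT = 150           # rule-of-thumb NameNode heap per file/block object


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count metadata objects: one per file plus one per block it occupies."""
    # A file smaller than the block size still occupies one block entry.
    blocks = sum(max(1, math.ceil(size / block_size)) for size in file_sizes)
    return len(file_sizes) + blocks


def namenode_bytes(file_sizes, block_size=BLOCK_SIZE):
    """Estimate NameNode heap used by the metadata of the given files."""
    return namenode_objects(file_sizes, block_size) * BYTES_PER_OBJECT


def merged(file_sizes):
    """Model merging all files into a single large file of the same total size."""
    return [sum(file_sizes)]


if __name__ == "__main__":
    small_files = [100 * 1024] * 1_000_000   # one million files of 100 KiB each
    print("separate:", namenode_objects(small_files), "objects,",
          namenode_bytes(small_files), "bytes")
    print("merged:  ", namenode_objects(merged(small_files)), "objects,",
          namenode_bytes(merged(small_files)), "bytes")
```

Under these assumptions, one million 100 KiB files account for two million metadata objects (about 300 MB of heap), while the same data merged into a single file needs 764 objects. The sketch covers only metadata; it does not model the per-file task scheduling cost that resource allocation schemes also have to absorb.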
License
Copyright (c) IJSRSET

This work is licensed under a Creative Commons Attribution 4.0 International License.