Survey on Resource Management Solutions to Speed up Processing Small Files in Hadoop Cluster

Prof. Shwetha K S; Dr. Chandramouli H

doi:10.32628/IJSRSET3214668

Authors

Prof. Shwetha K S, Ph.D. Research Scholar, Department of Computer Science and Engineering, East Point College of Engineering and Technology, Bengaluru, Karnataka, India
Dr. Chandramouli H, Professor, Department of Computer Science and Engineering, East Point College of Engineering and Technology, Bengaluru, Karnataka, India

DOI:

https://doi.org/10.32628/IJSRSET3214668

Keywords:

Internet of Things, Big data analytics, resource allocation and scheduling

Abstract

High-performance data analytics is a computing paradigm that places data, analytics, and other computational resources so as to achieve superior performance with lower resource consumption. Resource allocation and scheduling are the two major functions that Hadoop clusters must address to satisfy users' service level agreements for high-performance data analytics applications. Although many solutions have been proposed for optimal resource allocation and scheduling, these schemes are designed for large Hadoop files. With the recent convergence of the Internet of Things (IoT) and big data, there is a need to process large volumes of small files, each smaller than the Hadoop block size. Such files create a large storage overhead and exhaust the cluster's computational resources. This survey analyzes existing work on resource allocation and scheduling in Hadoop clusters and its suitability for small files. The aim is to identify the problems that existing resource allocation and scheduling approaches face when handling small files. Based on the problems identified, a prospective solution architecture is proposed.
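
The storage overhead attributed to small files can be illustrated with a rough estimate of NameNode metadata. The sketch below is not taken from the survey: it assumes a 128 MiB block size and the commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object, and the function names are hypothetical. It compares 1 TiB of data stored as 1 MiB files with the same data stored as block-sized files.

```python
import math

# Assumption: 128 MiB HDFS block size (configurable per cluster).
BLOCK_SIZE = 128 * 1024 * 1024
# Assumption: rule-of-thumb NameNode heap cost per file or block object.
BYTES_PER_OBJECT = 150


def namenode_objects(num_files: int, file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Count the file and block objects the NameNode would track."""
    # Every file occupies at least one block, even when it is far smaller than a block.
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return num_files * (1 + blocks_per_file)


def namenode_bytes(num_files: int, file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Estimate NameNode heap used for the metadata of these files."""
    return namenode_objects(num_files, file_size, block_size) * BYTES_PER_OBJECT


ONE_MIB = 1024 * 1024
ONE_TIB = 1024 ** 4

# 1 TiB stored as 1 MiB files versus as 128 MiB (block-sized) files.
small = namenode_bytes(ONE_TIB // ONE_MIB, ONE_MIB)
large = namenode_bytes(ONE_TIB // BLOCK_SIZE, BLOCK_SIZE)

print(f"small files: {small / ONE_MIB:.1f} MiB of metadata")
print(f"large files: {large / ONE_MIB:.1f} MiB of metadata")
print(f"overhead ratio: {small // large}x")
```

Under these assumptions the same volume of data needs 128 times more NameNode metadata when stored as 1 MiB files, and each small file would also typically be scheduled as its own map task, which is the computational side of the problem the survey addresses.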
