Detection and Classification of Malicious Websites Using Natural Language Processing (NLP) and Machine Learning (ML) Techniques

Authors

  • Michael Doorumun Ishima, Department of Computer Science, Informatics & Cybersecurity, Faculty of Science, Federal University Otuoke, PMB 126 Yenagoa, Nigeria
  • Samuel Apigi Ikirigo (Ph.D), Department of Computer Science, Informatics & Cybersecurity, Faculty of Science, Federal University Otuoke, PMB 126 Yenagoa, Nigeria

DOI:

https://doi.org/10.32628/IJSRSET2411449

Keywords:

Hashing Vectorizer, Machine Learning, Malicious Websites, Natural Language Processing (NLP)

Abstract

The internet, while offering extensive services and information, has also become a platform for malicious activities, particularly through harmful websites that threaten cybersecurity. Detecting and classifying these websites is crucial for protecting users from online threats. Traditional detection methods, based primarily on blacklists and signature-based techniques, struggle to keep pace with the dynamic, evolving strategies of cybercriminals. Recent advances in Machine Learning (ML) show promise, though they remain works in progress. This research addressed the challenge by exploring the use of Natural Language Processing (NLP) and ML techniques to classify websites as benign or malicious. Unlike many existing studies that relied on URL features alone, this study incorporated a more comprehensive feature set, including URL, content, and additional web attributes, which enhanced classification accuracy. The dataset was imbalanced, skewed towards malicious sites; the study addressed the class imbalance with SMOTE (Synthetic Minority Over-sampling Technique), which improved model performance. Hashing Vectorizer (HashingV) and TF-IDF (Term Frequency-Inverse Document Frequency) were used to transform textual features into vector representations, while PCA (Principal Component Analysis) and truncated Singular Value Decomposition (truncSVD) were then used to optimize the feature representation across different dimensions. Five ML classifiers, namely Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN), and Logistic Regression (LR), were tested for classification, and performance was evaluated using metrics such as accuracy, precision, recall, and F1-score. The results showed that the Random Forest classifier with HashingV performed best, with accuracies of 99.9563% using truncSVD and 99.9561% using PCA.
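
To make two of the steps named in the abstract concrete, the following is a minimal, dependency-free sketch of the hashing trick behind a hashing vectorizer and of SMOTE-style interpolation. It is an illustration under stated assumptions, not the authors' implementation: the function names `hash_vectorize` and `smote`, the MD5-based bucket assignment, the bucket count, and the toy page texts are all hypothetical, and the study itself most likely relied on library implementations whose details (hash function, normalization, neighbour search) differ.

```python
import hashlib
import math
import random


def hash_vectorize(text, n_features=16):
    """Hashing trick: map whitespace tokens to a fixed-length count vector.

    Assumption: MD5 is used only as a stable hash for illustration;
    library vectorizers typically use a different hash and normalize.
    """
    vec = [0.0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_features
        vec[bucket] += 1.0
    return vec


def smote(minority, n_new, k=2, seed=0):
    """SMOTE-style oversampling: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbours.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy texts standing in for the minority class
pages = ["free prize click now", "win free prize", "claim prize now"]
minority = [hash_vectorize(p) for p in pages]
new_points = smote(minority, n_new=4)
```

Because every synthetic point is a convex combination of two existing minority vectors, oversampling adds no values outside the range already seen in that class. In a full pipeline, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into evaluation.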

Published

30-11-2024

Section

Research Articles

How to Cite

[1]
Michael Doorumun Ishima and Samuel Apigi Ikirigo (Ph.D), “Detection and Classification of Malicious Websites Using Natural Language Processing (NLP) and Machine Learning (ML) Techniques”, Int J Sci Res Sci Eng Technol, vol. 11, no. 6, pp. 206–221, Nov. 2024, doi: 10.32628/IJSRSET2411449.
