Detection and Classification of Malicious Websites Using Natural Language Processing (NLP) and Machine Learning (ML) Techniques
DOI:
https://doi.org/10.32628/IJSRSET2411449
Keywords:
Hashing Vectorizer, Machine Learning, Malicious websites, Natural Language Processing (NLP)
Abstract
The internet, while offering extensive services and information, has also become a platform for malicious activities, particularly through harmful websites that threaten cybersecurity. Detecting and classifying these websites is crucial for protecting users from online threats. Traditional detection methods, primarily based on blacklists and signature-based techniques, struggle to keep pace with the rapidly evolving strategies of cybercriminals. Recent advancements in Machine Learning (ML) show promise, though they remain works in progress. This research addressed this challenge by exploring the use of Natural Language Processing (NLP) and Machine Learning techniques to classify websites as benign or malicious. Unlike many existing studies that relied on URL features alone, this study incorporated a more comprehensive feature set, including URL, content, and additional web attributes, which enhanced classification accuracy. Because the dataset was imbalanced and skewed towards malicious sites, this study applied SMOTE (Synthetic Minority Over-sampling Technique) to address the class imbalance problem, improving model performance. Hashing Vectorizer (HashingV) and TF-IDF (Term Frequency-Inverse Document Frequency) were adopted to transform textual features into their vector representations, while PCA (Principal Component Analysis) and truncated Singular Value Decomposition (truncSVD) were then used to optimize feature representation across different dimensions. Five ML classifiers, namely Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN), and Logistic Regression (LR), were tested for classification, and performance was evaluated using accuracy, precision, recall, and F1-score. The results revealed that the Random Forest classifier using HashingV recorded the best results, with accuracies of 99.9563% using truncSVD and 99.9561% with PCA.
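As a rough illustration of two steps named in the abstract, the sketch below shows the hashing trick that underlies a hashing vectorizer, followed by SMOTE-style oversampling of whichever class is under-represented. It is a minimal standard-library sketch and not the authors' implementation: the function names `hash_vectorize` and `smote`, the 16-dimensional feature size, the choice of CRC32 as the hash function, and the toy URLs are all assumptions made for illustration. Library implementations differ in detail (for example, in the hash function and in how hash collisions are handled).

```python
import math
import random
import re
import zlib


def hash_vectorize(text, n_features=16):
    """Map a string to a fixed-length token-count vector via the hashing trick."""
    vec = [0.0] * n_features
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Assumption: CRC32 stands in for the hash function; it is
        # deterministic across runs, unlike Python's salted hash().
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vec[index] += 1.0
    return vec


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: these URLs are made up and are not from the paper's dataset.
majority_urls = [
    "http://example.com/home",
    "http://example.com/about/team",
    "http://example.org/docs/index.html",
    "http://example.org/blog/2024/post",
    "http://example.net/contact",
    "http://example.net/products/list",
]
minority_urls = [
    "http://example.com/login/verify-account",
    "http://example.org/secure/update-password",
    "http://example.net/free/prize/claim",
]

X_majority = [hash_vectorize(u) for u in majority_urls]
X_minority = [hash_vectorize(u) for u in minority_urls]

# Oversample the minority class until both classes have the same size.
X_synthetic = smote(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = X_minority + X_synthetic

print(len(X_majority), len(X_minority_balanced))
```

Because every synthetic point lies on a line segment between two real minority samples, the oversampled class stays inside the region already occupied by that class rather than duplicating existing rows. In a full pipeline, oversampling would normally be applied to the training split only, so that synthetic points do not leak into the evaluation data.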
Copyright (c) 2024 International Journal of Scientific Research in Science, Engineering and Technology
This work is licensed under a Creative Commons Attribution 4.0 International License.