Web Page Categorization through Data Mining Classification Techniques on URL Information

Authors

  • R. GeethaRamani  Department of Information Science and Technology, College of Engineering, Anna University, Chennai, India
  • P. Revathy  Department of Computer Science and Engineering, Rajalakshmi Engineering College, Chennai, India

Keywords:

Web Page Categorization, Web Usage Mining, C4.5, NASA Dataset, URL Information

Abstract

Web usage is growing rapidly, and the large volume of data generated by web access can be mined to reveal interesting patterns. Web mining has therefore gained popularity in recent years. This paper addresses web page categorization. Classifying web pages provides useful information for advertising and recommendations. Earlier techniques have used context, source and URL-based information to classify web pages. In this work, web page classification is performed using URL information. The proposed methodology involves data pre-processing, feature vector formulation, C4.5 classification, performance evaluation and prediction of the class of a web page. The experiments were carried out on the NASA log dataset. Several classifiers were evaluated, and C4.5 yielded the best results, achieving an accuracy of 99.80% using 3-fold cross-validation. The results obtained support the proposed methodology.
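The pre-processing and feature vector formulation steps can be illustrated with a minimal sketch. It assumes the log entries follow the Common Log Format used by the NASA-HTTP traces; the feature names (top_level, depth, extension, is_directory) and the sample line are hypothetical choices for illustration and are not the features reported in the paper.

```python
import re
from pathlib import PurePosixPath

# Common Log Format: host, identity, user, [timestamp], "request", status, bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)


def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it is malformed."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def url_features(url):
    """Build a small, hypothetical feature vector from the URL path only."""
    path = PurePosixPath(url.split("?", 1)[0])      # drop any query string
    parts = [p for p in path.parts if p != "/"]     # path segments without the root
    is_dir = url.endswith("/")
    return {
        "top_level": parts[0] if parts else "",     # first directory, e.g. "history"
        "depth": len(parts),                        # number of path segments
        "extension": "" if is_dir else path.suffix.lstrip(".").lower(),
        "is_directory": is_dir,
    }


# Illustrative log line in Common Log Format (assumed, not taken from the paper)
sample = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
record = parse_log_line(sample)
features = url_features(record["url"])
```

Feature dictionaries of this kind could then be passed, together with class labels, to a decision tree learner such as C4.5 and evaluated with k-fold cross-validation; that stage is omitted here because the abstract does not specify the class labels or the tool configuration.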

Published

2017-12-31

Section

Research Articles

How to Cite

R. GeethaRamani, P. Revathy, "Web Page Categorization through Data Mining Classification Techniques on URL Information," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 3, Issue 8, pp. 941-946, November-December 2017.