A Research on Web Content Extraction and Noise Reduction through Text Density Using Malicious URL Pattern Detection

Authors (2): Charmi Patel, Prof. Hiteishi Diwanji

A web page contains a large amount of information, including additional content such as hyperlinks, headers and footers, navigation panels, and advertisements, which can complicate content extraction. Page segmentation is used to detect noisy content blocks by detecting malicious URLs in web pages. The main aim of this research is to detect malicious URLs during content extraction by checking different URL patterns. The performance of the proposed algorithm is analysed in terms of precision, recall, execution time, and noise detected.
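As a rough illustration of the idea summarised above, the following Python sketch scores an HTML block by text density and flags it as noise when the density is low or when it links to a suspicious-looking URL. It is only a sketch under stated assumptions, not the authors' algorithm: text density is taken here as text characters per tag, and the URL patterns (raw IP host, `@` in the authority part, deep paths, very long URLs), the thresholds, and all function names are hypothetical choices made for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class DensityCounter(HTMLParser):
    """Counts text characters and tags, and collects link targets, for one block."""

    def __init__(self):
        super().__init__()
        self.chars = 0
        self.tags = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.chars += len(data.strip())


def text_density(block_html):
    """Return (characters per tag, list of hrefs) for an HTML fragment."""
    counter = DensityCounter()
    counter.feed(block_html)
    counter.close()
    return counter.chars / max(counter.tags, 1), counter.links


def looks_suspicious(url, max_depth=5, max_length=100):
    """Hypothetical URL pattern checks; the thresholds are illustrative only."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    depth = len([part for part in parsed.path.split("/") if part])
    return (
        re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None  # raw IP host
        or "@" in parsed.netloc                                   # user-info trick
        or depth > max_depth                                      # deep path
        or len(url) > max_length                                  # very long URL
    )


def is_noise_block(block_html, density_threshold=10.0):
    """Flag a block as noise if it is sparse in text or links to a suspicious URL."""
    density, links = text_density(block_html)
    return density < density_threshold or any(looks_suspicious(u) for u in links)


def precision_recall(predicted, actual):
    """Precision and recall of predicted noise-block ids against labelled ones."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

In this sketch, a long paragraph of article text yields a high density and is kept, while a short block wrapping a link to a bare IP address is flagged on both criteria. The thresholds would need tuning against labelled pages before any real use.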

Authors and Affiliations

Charmi Patel
Information Technology, L. D. Engineering College, Ahmedabad, Gujarat, India
Prof. Hiteishi Diwanji
Information Technology, L. D. Engineering College, Ahmedabad, Gujarat, India

Keywords: Page segmentation, Malicious URL, URL patterns, Text density

Publication Details

Published in : Volume 2 | Issue 3 | May-June 2016
Date of Publication : 2016-06-30
License:  This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 128-132
Manuscript Number : IJSRSET162361
Publisher : Technoscience Academy

Print ISSN : 2395-1990, Online ISSN : 2394-4099

Cite This Article :

Charmi Patel, Prof. Hiteishi Diwanji, "A Research on Web Content Extraction and Noise Reduction through Text Density Using Malicious URL Pattern Detection", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN : 2395-1990, Online ISSN : 2394-4099, Volume 2, Issue 3, pp. 128-132, May-June 2016.
Journal URL : https://ijsrset.com/IJSRSET162361
