A Research on Web Content Extraction and Noise Reduction through Text Density Using Malicious URL Pattern Detection

Authors(2):

Charmi Patel, Prof. Hiteishi Diwanji

Abstract

A Web page contains a large amount of information, including additional content such as hyperlinks, headers, footers, navigation panels, and advertisements, which can complicate content extraction. Page segmentation is used to identify noisy content blocks by detecting malicious URLs in Web pages. The main aim of this research is to detect malicious URLs during content extraction by checking different URL patterns. The performance of the proposed algorithm is analysed in terms of precision, recall, execution time, and the amount of noise detected.
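
The abstract does not list the URL patterns that are checked, so the following is only a minimal sketch of the general idea, not the authors' algorithm. The pattern set (IP-address host, `@` in the authority part, many subdomains, deep path), the thresholds, and the function names `suspicious_reasons` and `precision_recall` are all assumptions made for illustration. The second function shows how precision and recall could be computed for a set of flagged URLs against a labelled set.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns and thresholds, chosen only for illustration;
# the paper's actual pattern set is not given in the abstract.
IP_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
MAX_PATH_DEPTH = 4       # assumed threshold
MAX_HOST_DOTS = 3        # assumed threshold


def suspicious_reasons(url):
    """Return the names of the assumed patterns that the URL matches."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    reasons = []
    if IP_HOST.match(host):
        reasons.append("ip-host")          # raw IP address instead of a domain
    if "@" in parsed.netloc:
        reasons.append("at-sign")          # user-info part can hide the real host
    if host.count(".") > MAX_HOST_DOTS:
        reasons.append("many-subdomains")
    depth = len([part for part in parsed.path.split("/") if part])
    if depth > MAX_PATH_DEPTH:
        reasons.append("deep-path")
    return reasons


def precision_recall(predicted, actual):
    """Precision and recall for a set of flagged URLs against a labelled set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

In a content extraction pipeline, a block whose links return a non-empty list from `suspicious_reasons` could be treated as a candidate noise block; how such a signal is combined with text density is left open here.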

Keywords: Page segmentation, Malicious URL, URL patterns, Text density

Publication Details

Published in: Volume 2 | Issue 3 | May-June 2016
Date of Publication: 2016-06-30
Print ISSN: 2395-1990
Online ISSN: 2394-4099
Page(s): 128-132
Manuscript Number: IJSRSET162361
Publisher: Technoscience Academy

Cite This Article

Charmi Patel, Prof. Hiteishi Diwanji, "A Research on Web Content Extraction and Noise Reduction through Text Density Using Malicious URL Pattern Detection", International Journal of Scientific Research in Science, Engineering and Technology(IJSRSET), Print ISSN : 2395-1990, Online ISSN : 2394-4099, Volume 2, Issue 3, pp.128-132, May-June-2016.
URL : http://ijsrset.com/IJSRSET162361.php
