A Survey on Web Content Extraction and Noise Reduction from Webpage

Authors: Charmi Patel, Hiteishi Diwanji

A web page contains a large amount of information, but only some of it is useful in real-world applications. Additional content such as hyperlinks, headers, footers, navigation panels, and advertisements can complicate content extraction. This irrelevant data, which appears alongside the original content, is known as the noisy data of a website. This paper discusses various approaches for extracting informative content from web pages and removing noisy data.

Authors and Affiliations

Charmi Patel
Information Technology Department, L. D. College of Engineering, Ahmedabad, Gujarat, India
Hiteishi Diwanji
Information Technology Department, L. D. College of Engineering, Ahmedabad, Gujarat, India

Content Extraction, Text Density, Visual Importance, DOM Tree Generation, Noisy data

Publication Details

Published in : Volume 1 | Issue 6 | November-December 2015
Date of Publication : 2015-12-25
License:  This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 127-130
Manuscript Number : IJSRSET151619
Publisher : Technoscience Academy

Print ISSN : 2395-1990, Online ISSN : 2394-4099

Cite This Article :

Charmi Patel, Hiteishi Diwanji, "A Survey on Web Content Extraction and Noise Reduction from Webpage", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN : 2395-1990, Online ISSN : 2394-4099, Volume 1, Issue 6, pp. 127-130, November-December 2015.
URL : http://ijsrset.com/IJSRSET151619
