A Survey on Web Content Extraction and Noise Reduction from Webpage

Authors

  • Charmi Patel, Information Technology Department, L. D. College of Engineering, Ahmedabad, Gujarat, India
  • Hiteishi Diwanji, Information Technology Department, L. D. College of Engineering, Ahmedabad, Gujarat, India

Keywords:

Content Extraction, Text Density, Visual Importance, DOM Tree Generation, Noisy Data

Abstract

A web page contains a large amount of information, but only part of it is useful in real-world applications. Additional content such as hyperlinks, headers, footers, navigation panels, and advertisements complicates content extraction. This irrelevant material, which appears alongside the original content, is known as the noisy data of a website. This paper discusses various approaches for extracting informative content from web pages and removing noisy data.
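
As a rough illustration of the text-density idea named in the keywords, the following Python sketch builds a small DOM-like tree with the standard library and returns the text of the node with the highest density. It is not the method of any particular surveyed paper. The names Node, TreeBuilder, annotate, density, and extract_main_text are hypothetical, and the scoring rule (non-link characters divided by the number of descendant tags) is a simplifying assumption.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag
SKIP_TAGS = {"script", "style"}  # text inside these is never content


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.items = []  # child Nodes and text strings, in document order


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree (assumption: roughly well-formed HTML)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.current.tag not in SKIP_TAGS:
            self.current.items.append(text)


def annotate(node, in_link=False):
    """Store character, link-character, and descendant-tag counts on each node."""
    in_link = in_link or node.tag == "a"
    node.chars = node.link_chars = node.tags = 0
    for item in node.items:
        if isinstance(item, str):
            node.chars += len(item)
            node.link_chars += len(item) if in_link else 0
        else:
            annotate(item, in_link)
            node.chars += item.chars
            node.link_chars += item.link_chars
            node.tags += item.tags + 1


def density(node):
    # Assumed score: non-link characters per descendant tag (at least 1).
    return (node.chars - node.link_chars) / max(node.tags, 1)


def walk(node):
    """Yield every element below node in document order."""
    for item in node.items:
        if isinstance(item, Node):
            yield item
            yield from walk(item)


def text_of(node):
    parts = [i if isinstance(i, str) else text_of(i) for i in node.items]
    return " ".join(p for p in parts if p)


def extract_main_text(html):
    """Return the text of the densest node, or "" if the page has no elements."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    best = max(walk(builder.root), key=density, default=None)
    return text_of(best) if best is not None else ""
```

Calling extract_main_text on a page that has a link-only navigation block, an article paragraph, and a link-only footer would be expected to return only the paragraph text, because link text is excluded from the score. Practical extractors typically combine such a score with thresholds or visual cues rather than taking a single best node.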

Published

2015-12-25

Section

Research Articles

How to Cite

[1]
Charmi Patel, Hiteishi Diwanji, "A Survey on Web Content Extraction and Noise Reduction from Webpage", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 1, Issue 6, pp. 127-130, November-December 2015.