A Survey on Web Content Extraction and Noise Reduction from Webpage

Charmi Patel; Hiteishi Diwanji

doi:10.32628/IJSRSET151619

Authors

Charmi Patel, Information Technology Department, L. D. College of Engineering, Ahmedabad, Gujarat, India
Hiteishi Diwanji, Information Technology Department, L. D. College of Engineering, Ahmedabad, Gujarat, India

Keywords:

Content Extraction, Text Density, Visual Importance, DOM Tree Generation, Noisy Data

Abstract

A web page contains a large amount of information, but only some of it is useful in real-world applications. Additional content such as hyperlinks, headers, footers, navigation panels, and advertisements can complicate content extraction. This irrelevant data, which appears alongside the original content, is known as the noisy data of a website. This paper discusses various approaches for extracting informative content from web pages and removing noisy data.

