A Survey on Web Content Extraction and Noise Reduction from Webpage

Authors(2):

Charmi Patel, Hiteishi Diwanji

Abstract

A web page contains a large amount of information, but only some of it is useful in real-world applications. Additional content such as hyperlinks, headers and footers, navigation panels, and advertisements can complicate content extraction. This irrelevant data, which appears alongside the original content, is known as the noisy data of a website. This paper discusses various approaches for extracting informative content from web pages and removing noisy data.

Keywords: Content Extraction, Text Density, Visual Importance, DOM Tree Generation, Noisy Data

Publication Details

Published in: Volume 1 | Issue 6 | November-December 2015
Date of Publication: 2015-12-25
Print ISSN: 2395-1990
Online ISSN: 2394-4099
Page(s): 127-130
Manuscript Number: IJSRSET151619
Publisher: Technoscience Academy

Cite This Article

Charmi Patel, Hiteishi Diwanji, "A Survey on Web Content Extraction and Noise Reduction from Webpage", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 1, Issue 6, pp. 127-130, November-December 2015.
URL : http://ijsrset.com/IJSRSET151619.php