A web page carries a large amount of information, including additional content such as hyperlinks, headers, footers, navigational panels, and advertisements, which can complicate content extraction. Page segmentation is used to detect noisy content blocks by detecting malicious URLs in web pages. The main aim of this research is to detect malicious URLs during content extraction by checking different URL patterns. The performance of the proposed algorithm is analysed in terms of precision, recall, execution time, and noise detected.
Charmi Patel, Prof. Hiteishi Diwanji
Page segmentation, Malicious URL, URL patterns, Text density
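As a rough illustration of the approach the abstract describes, the sketch below flags links by URL pattern and then scores the result with precision and recall. The abstract does not list the patterns the authors check, so the heuristics in `looks_suspicious`, the thresholds, and the sample links are all assumptions made for illustration only, not the proposed algorithm.

```python
import re
from urllib.parse import urlparse

# Matches a host that is a raw IPv4 address rather than a domain name.
IPV4_HOST = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}$")


def looks_suspicious(url: str) -> bool:
    """Flag a URL using illustrative heuristics (assumed, not the paper's rules)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return (
        bool(IPV4_HOST.match(host))  # raw IP address instead of a domain name
        or "@" in parsed.netloc      # userinfo part that can hide the real host
        or host.count(".") > 3       # unusually deep subdomain nesting
        or len(url) > 100            # unusually long URL
    )


def precision_recall(predicted: set, actual: set) -> tuple:
    """Precision and recall of the flagged set against a hand-labelled set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical links taken from one page, with a hypothetical ground-truth labelling.
links = [
    "http://example.com/news/article.html",
    "http://192.0.2.10/login.php",
    "http://user@evil.example.net/",
    "http://a.b.c.d.example.org/ad",
]
flagged = {u for u in links if looks_suspicious(u)}
labelled = {links[1], links[2]}
p, r = precision_recall(flagged, labelled)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy run three links are flagged but only two are labelled malicious, so recall is perfect while precision drops, which is the trade-off the two metrics are meant to expose.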
Published in: Volume 2, Issue 3, May-June 2016
Cite This Article
Charmi Patel, Prof. Hiteishi Diwanji, "A Research on Web Content Extraction and Noise Reduction through Text Density Using Malicious URL Pattern Detection", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 2, Issue 3, pp. 128-132, May-June 2016.
URL : http://ijsrset.com/IJSRSET162361.php