Review-FoCUS: Learning to Crawl Web Forums

Authors (5): Rakesh S. Mane, Gopal B. Bagga, Devendra U. Bhute, Abhijeet D. Nikam, Prof. Sonali Gaikwad

A generic web crawler can crawl ordinary websites efficiently, but it is inefficient on web forums. When crawling a forum, a generic crawler fetches every page, including unnecessary ones such as user profile pages, advertisement pages, and redirection pages, which can result in duplicate content. A crawler designed specifically for forums is therefore required. The proposed system crawls only the relevant content of a forum, with minimal overhead and maximum efficiency. Although different kinds of forums have different page layouts, they always have similar indirect navigation paths, connected by specific URL types, that lead users from entry pages to thread pages. Based on this observation, the forum crawling problem is reduced to a URL-type recognition problem: the crawler follows only useful URLs (index, thread, and page-flipping URLs) and ignores unnecessary ones (user profile pages, external links). To recognize the URL types, an ITF-regex (a regular expression that matches only index, thread, and page-flipping URLs) is learned from URL training sets. The training sets contain only the detected URLs of index, thread, and page-flipping pages. To build them, index, thread, and page-flipping pages must first be detected and distinguished from one another; the common characteristics of each kind of page are used to detect the page type.
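The URL-type recognition step described above can be illustrated with a minimal Python sketch. It is not the FoCUS learning algorithm: it assumes a much simpler generalization rule (keep the host, path, and query parameter names literal; replace digit runs in the path and all parameter values with wildcards). The forum host, the URLs, and the names `generalize`, `learn_itf_regex`, and `is_itf_url` are hypothetical and chosen only for illustration.

```python
import re
from urllib.parse import urlsplit

def generalize(url):
    """Turn one training URL into a regex string (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep the host literal so links to other sites never match.
    host = re.escape(parts.netloc)
    # In the path, digit runs become \d+; everything else is matched literally.
    path = "".join(r"\d+" if tok.isdigit() else re.escape(tok)
                   for tok in re.split(r"(\d+)", parts.path))
    if not parts.query:
        return host + path
    # Keep query parameter names; accept any non-empty value for each.
    params = [re.escape(p.split("=", 1)[0]) + "=[^&]+"
              for p in parts.query.split("&")]
    return host + path + r"\?" + "&".join(params)

def learn_itf_regex(training_urls):
    """Combine the generalized training URLs into one anchored alternation."""
    patterns = sorted({generalize(u) for u in training_urls})
    return re.compile("^(?:" + "|".join(patterns) + ")$")

def is_itf_url(itf_regex, url):
    """Return True if the URL (scheme ignored) matches the learned regex."""
    parts = urlsplit(url)
    target = parts.netloc + parts.path
    if parts.query:
        target += "?" + parts.query
    return itf_regex.match(target) is not None

# Assumed training set: one index, one thread, and one page-flipping URL.
TRAINING_URLS = [
    "http://forum.example.com/forumdisplay.php?f=12",
    "http://forum.example.com/showthread.php?t=4567",
    "http://forum.example.com/showthread.php?t=4567&page=2",
]
ITF_REGEX = learn_itf_regex(TRAINING_URLS)
```

With this regex, a crawler would enqueue a discovered link only when `is_itf_url(ITF_REGEX, link)` is true, so user profile URLs such as `member.php?u=42` and links to other hosts are skipped. The sketch treats the order of query parameters as fixed and needs at least one training URL per URL shape; a real learner would have to generalize more robustly.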

Authors and Affiliations

Rakesh S. Mane
D.Y.Patil COE, Pune, Maharashtra, India
Gopal B. Bagga
D.Y.Patil COE, Pune, Maharashtra, India
Devendra U. Bhute
D.Y.Patil COE, Pune, Maharashtra, India
Abhijeet D. Nikam
D.Y.Patil COE, Pune, Maharashtra, India
Prof. Sonali Gaikwad
D.Y.Patil COE, Pune, Maharashtra, India

Keywords : EIT path, forum crawling, ITF-regex, page classification, page type, URL pattern learning, URL type


Publication Details

Published in : Volume 3 | Issue 2 | March-April 2017
Date of Publication : 2017-04-30
License:  This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 332-335
Manuscript Number : IJSRSET1732101
Publisher : Technoscience Academy

Print ISSN : 2395-1990, Online ISSN : 2394-4099

Cite This Article :

Rakesh S. Mane, Gopal B. Bagga, Devendra U. Bhute, Abhijeet D. Nikam, Prof. Sonali Gaikwad, "Review-FoCUS: Learning to Crawl Web Forums", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN : 2395-1990, Online ISSN : 2394-4099, Volume 3, Issue 2, pp. 332-335, March-April 2017.
Journal URL : http://ijsrset.com/IJSRSET1732101
