Review-FoCUS: Learning to Crawl Web Forums


Rakesh S. Mane, Gopal B. Bagga, Devendra U. Bhute, Abhijeet D. Nikam, Prof. Sonali Gaikwad
A generic web crawler can crawl ordinary websites efficiently, but it performs poorly on web forums. When crawling a forum, a generic crawler fetches every page, including irrelevant ones such as user profile pages, advertisement pages, and redirection pages, which can also result in duplicate content. A crawler designed specifically for forums is therefore required. The proposed system crawls only the relevant content of a forum, with minimal overhead. Although different forums have different page layouts, they always have similar indirect navigation paths, connected by specific URL types, that lead users from entry pages to thread pages. Based on this observation, the forum crawling problem is reduced to a URL-type recognition problem: the crawler follows only useful URLs (index, thread, and page-flipping URLs) and ignores unnecessary ones (user profile URLs, external links). To recognize the URL types, an ITF-regex, which matches only Index, Thread, and page-Flipping URLs, is learned from URL training sets. These training sets contain only URLs that have already been detected as index, thread, or page-flipping URLs. That detection relies on the common characteristics of the corresponding pages, which are used to determine the page type.
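To make the idea of URL-type recognition concrete, the following is a minimal sketch in Python. It is not the learning procedure of the reviewed paper: it assumes a much simpler generalization rule (every run of digits in a training URL becomes a numeric wildcard), and the function names, the `forum.example.com` host, and the sample URLs are all hypothetical.

```python
import re
from urllib.parse import urlparse


def path_and_query(url: str) -> str:
    """Return the path of a URL plus its query string, if it has one."""
    parsed = urlparse(url)
    return parsed.path + ("?" + parsed.query if parsed.query else "")


def generalize(url: str) -> str:
    """Turn one training URL into a regex: digit runs become a wildcard."""
    # re.split with a capturing group keeps the digit runs at odd indices.
    parts = re.split(r"(\d+)", path_and_query(url))
    return "".join(
        r"\d+" if i % 2 else re.escape(part) for i, part in enumerate(parts)
    )


def learn_itf_regex(training_urls):
    """Combine the generalized training URLs into a single alternation."""
    patterns = sorted({generalize(u) for u in training_urls})
    return re.compile("(?:" + "|".join(patterns) + ")")


def is_itf_url(itf_regex, url: str) -> bool:
    """True if the URL looks like an index, thread, or page-flipping URL.

    Assumption of this sketch: the host is ignored, so a real crawler
    would also have to filter out external links by host name.
    """
    return itf_regex.fullmatch(path_and_query(url)) is not None


# Hypothetical training set: one index, one thread, one page-flipping URL.
TRAINING_URLS = [
    "http://forum.example.com/board/12",
    "http://forum.example.com/thread/345",
    "http://forum.example.com/thread/345?page=2",
]
ITF_REGEX = learn_itf_regex(TRAINING_URLS)

# Hypothetical URLs discovered while crawling; only ITF URLs are followed.
CANDIDATES = [
    "http://forum.example.com/thread/9981?page=17",
    "http://forum.example.com/member/42",
]
TO_FOLLOW = [u for u in CANDIDATES if is_itf_url(ITF_REGEX, u)]
```

In this sketch the crawler would follow the thread URL and skip the member profile URL, because only the former matches a pattern generalized from the training set. A single combined regex keeps the per-URL check cheap, which fits the goal of crawling with minimal overhead.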

Keywords: EIT path, forum crawling, ITF-regex, page classification, page type, URL pattern learning, URL type

  1. Jingtian Jiang, Xinying Song, Nenghai Yu, and Chin-Yew Lin, FoCUS: Learning to Crawl Web Forums, IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, June 2013.
  2. R. Cai, J.-M. Yang, W. Lai, Y. Wang, and L. Zhang, iRobot: An Intelligent Crawler for Web Forums, Proc. 17th Int'l Conf. World Wide Web, pp. 447-456, April 2008.
  3. Y. Wang, J.-M. Yang, W. Lai, R. Cai, L. Zhang, and W.-Y. Ma, Exploring Traversal Strategy for Web Forum Crawling, Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 459-466, 2008.
  4. Y. Guo, K. Li, K. Zhang, and G. Zhang, Board Forum Crawling: A Web Crawling Method for Web Forum, Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence, pp. 475-478, 2006.
  5. Márcio L. A. Vidal, Altigran S. da Silva, Edleno S. de Moura, and João M. B. Cavalcanti, GoGetIt!: A Tool for Generating Structure-Driven Web Crawlers, May 2006.

Publication Details

Published in: Volume 3 | Issue 2 | March-April 2017
Date of Publication: 2017-04-30
Print ISSN: 2395-1990
Online ISSN: 2394-4099
Page(s): 332-335
Manuscript Number: IJSRSET1732101
Publisher: Technoscience Academy

Cite This Article

Rakesh S. Mane, Gopal B. Bagga, Devendra U. Bhute, Abhijeet D. Nikam, Prof. Sonali Gaikwad, "Review-FoCUS: Learning to Crawl Web Forums", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 3, Issue 2, pp. 332-335, March-April 2017.
URL: http://ijsrset.com/IJSRSET1732101.php