Review-FoCUS: Learning to Crawl Web Forums

Authors

  • Rakesh S. Mane  D.Y.Patil COE, Pune, Maharashtra, India
  • Gopal B. Bagga  D.Y.Patil COE, Pune, Maharashtra, India
  • Devendra U. Bhute  D.Y.Patil COE, Pune, Maharashtra, India
  • Abhijeet D. Nikam  D.Y.Patil COE, Pune, Maharashtra, India
  • Prof. Sonali Gaikwad  D.Y.Patil COE, Pune, Maharashtra, India

Keywords:

EIT path, forum crawling, ITF-regex, page classification, page type, URL pattern learning, URL type

Abstract

A generic web crawler can crawl ordinary websites efficiently, but it is inefficient on web forums. When crawling a forum, a generic crawler fetches every page, including unnecessary ones such as user profile pages, advertisement pages and redirection pages, which can lead to duplicate content. A dedicated crawler is therefore required for efficient forum crawling. The system reviewed here crawls only the relevant content of a forum, with minimal overhead. Although different kinds of forums have different page layouts, they always have similar indirect navigation paths, connected by specific URL types, that lead users from entry pages to thread pages. Based on this observation, the forum crawling problem is reduced to a URL-type recognition problem: the crawler follows only useful URLs (index, thread and page-flipping pages) and ignores unnecessary ones (user profiles, external links). To recognize the URL types, ITF-regexes (regular expressions that match only Index, Thread and page-Flipping URLs) are learned from URL training sets. These training sets contain only the detected URLs of index, thread and page-flipping pages. To build them, index, thread and page-flipping pages are first detected and told apart, using the characteristics common to each page type.
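
The URL-type recognition idea described in the abstract can be illustrated with a minimal sketch. The three patterns below are hand-written for an imaginary forum and are assumptions made purely for illustration; in the approach described above they would be learned from URL training sets. The names ITF_PATTERNS, url_type and filter_links are hypothetical and do not come from the paper.

```python
import re

# Hypothetical ITF patterns for an imaginary forum layout. These are
# hand-written stand-ins for regexes that would normally be learned
# from URL training sets.
ITF_PATTERNS = {
    "index": re.compile(r"^/forum/\d+/?$"),
    "thread": re.compile(r"^/thread/\d+/?$"),
    "page_flipping": re.compile(r"^/(?:forum|thread)/\d+/page/\d+/?$"),
}


def url_type(path):
    """Return the ITF type of a URL path, or None if no pattern matches."""
    for name, pattern in ITF_PATTERNS.items():
        if pattern.match(path):
            return name
    return None


def filter_links(paths):
    """Keep only links recognized as index, thread or page-flipping URLs."""
    return [p for p in paths if url_type(p) is not None]


if __name__ == "__main__":
    links = [
        "/forum/12",
        "/thread/3456",
        "/thread/3456/page/2",
        "/user/alice",                      # user profile: ignored
        "/redirect?to=http://example.com",  # redirection: ignored
    ]
    for link in filter_links(links):
        print(url_type(link), link)
```

A crawler built along these lines would enqueue only the links returned by filter_links, so profile, advertisement and redirection pages are never fetched.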

References

  1. Jingtian Jiang, Xinying Song, Nenghai Yu, and Chin-Yew Lin, FoCUS: Learning to Crawl Web Forums, IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, June 2013.
  2. R. Cai, J.-M. Yang, W. Lai, Y. Wang, and L. Zhang, iRobot: An Intelligent Crawler for Web Forums, Proc. 17th Int'l Conf. World Wide Web, pp. 447-456, April 2008.
  3. Y. Wang, J.-M. Yang, W. Lai, R. Cai, L. Zhang, and W.-Y. Ma, Exploring Traversal Strategy for Web Forum Crawling, Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 459-466, 2008.
  4. Y. Guo, K. Li, K. Zhang, and G. Zhang, Board Forum Crawling: A Web Crawling Method for Web Forum, Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence, pp. 475-478, 2006.
  5. Márcio L. A. Vidal, Altigran S. da Silva, Edleno S. de Moura, and João M. B. Cavalcanti, GoGetIt!: A Tool for Generating Structure-Driven Web Crawlers, May 2006.

Published

2017-04-30

Section

Research Articles

How to Cite

[1]
Rakesh S. Mane, Gopal B. Bagga, Devendra U. Bhute, Abhijeet D. Nikam, Prof. Sonali Gaikwad, "Review-FoCUS: Learning to Crawl Web Forums", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 3, Issue 2, pp. 332-335, March-April 2017.