Forum Crawler

Authors

  • Rakesh S. Mane  D.Y.Patil COE, Pune, Maharashtra, India
  • Gopal B. Bagga  D.Y.Patil COE, Pune, Maharashtra, India
  • Devendra U. Bhute  D.Y.Patil COE, Pune, Maharashtra, India
  • Abhijeet D. Nikam  D.Y.Patil COE, Pune, Maharashtra, India
  • Prof. Sonali Gaikwad  D.Y.Patil COE, Pune, Maharashtra, India

Keywords:

Forum crawling, page classification, page type, URL pattern learning, URL type, signatures

Abstract

A generic web crawler can crawl the open web efficiently, but it is inefficient on forums. When crawling a forum, a generic crawler fetches every page, including irrelevant ones, and it cannot maintain the relationship between posts that span multiple pages. Existing forum crawlers are hard to configure and require too much user interaction. A new type of crawler is therefore needed for efficient forum crawling. This system aims to crawl only the relevant pages of a forum with minimal overhead. To achieve that, it uses a signature-based approach to generate regular expressions for the URLs of relevant pages. Different forum software packages use different page layouts, but their navigation paths are mostly similar. By generating regular expressions for the relevant paths, the system ensures that it crawls only relevant pages. To generate a regular expression, the forum software is first identified using predefined signatures; predefined URL patterns are then selected for the identified software. These patterns are used to generate regular expressions for the URLs of thread and index pages. Page-flipping URLs are identified using predefined signatures. This approach allows accurate and fast crawling of forums with minimal configuration.
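
The pipeline outlined in the abstract (identify the forum software from signatures, select its predefined URL patterns, compile them into regular expressions, then keep only index, thread, and page-flipping URLs) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the signature strings, the URL patterns, and the function names `identify_software`, `build_regexes`, and `classify_url` are all hypothetical, and the two software entries are merely examples of what a predefined table could contain.

```python
import re

# Hypothetical signatures: substrings assumed to appear in a forum's HTML.
# The abstract does not list the actual signatures used by the system.
SOFTWARE_SIGNATURES = {
    "phpbb": ["Powered by phpBB", "viewtopic.php"],
    "vbulletin": ["Powered by vBulletin", "showthread.php"],
}

# Hypothetical predefined URL patterns per forum software and page type.
URL_PATTERNS = {
    "phpbb": {
        "index": r"viewforum\.php\?f=\d+",
        "thread": r"viewtopic\.php\?(?:f=\d+&)?t=\d+",
        "flip": r"[?&]start=\d+",
    },
    "vbulletin": {
        "index": r"forumdisplay\.php\?f=\d+",
        "thread": r"showthread\.php\?t=\d+",
        "flip": r"[?&]page=\d+",
    },
}


def identify_software(html):
    """Return the first software whose signature occurs in the page, else None."""
    for name, signatures in SOFTWARE_SIGNATURES.items():
        if any(sig in html for sig in signatures):
            return name
    return None


def build_regexes(software):
    """Compile the predefined URL patterns for the identified software."""
    return {kind: re.compile(p) for kind, p in URL_PATTERNS[software].items()}


def classify_url(url, regexes):
    """Label a URL as 'flip', 'thread', or 'index'; None means skip it."""
    is_thread = regexes["thread"].search(url) is not None
    is_index = regexes["index"].search(url) is not None
    # A page-flipping URL is a thread or index URL carrying a paging parameter.
    if (is_thread or is_index) and regexes["flip"].search(url):
        return "flip"
    if is_thread:
        return "thread"
    if is_index:
        return "index"
    return None


if __name__ == "__main__":
    html = "<footer>Powered by phpBB</footer>"
    regexes = build_regexes(identify_software(html))
    for url in [
        "http://example.com/viewforum.php?f=2",
        "http://example.com/viewtopic.php?f=2&t=15",
        "http://example.com/viewtopic.php?f=2&t=15&start=20",
        "http://example.com/memberlist.php?mode=viewprofile&u=3",
    ]:
        print(url, "->", classify_url(url, regexes))
```

A crawler built on such a sketch would enqueue only URLs for which `classify_url` returns a label, which is how irrelevant pages such as member profiles would be skipped without per-forum configuration.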

Published

2017-06-30

Section

Research Articles

How to Cite

[1]
Rakesh S. Mane, Gopal B. Bagga, Devendra U. Bhute, Abhijeet D. Nikam, Prof. Sonali Gaikwad, "Forum Crawler," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 3, Issue 3, pp. 392-395, May-June 2017.