Forum Crawler

Authors: Rakesh S. Mane, Gopal B. Bagga, Devendra U. Bhute, Abhijeet D. Nikam, Prof. Sonali Gaikwad

A generic web crawler can crawl the web efficiently, but it is inefficient when crawling a forum. When crawling a forum, a generic crawler fetches every page, including unnecessary ones. Generic crawlers also cannot maintain the relationship between posts spread across different pages. Existing forum crawlers are not easy to configure and require too much user interaction. A new type of crawler is therefore needed for efficient forum crawling. This system aims to crawl only the relevant pages of a forum with minimal overhead. To achieve this, the system uses a signature-based approach to generate regular expressions for the URLs of relevant pages. Different forum software packages use different page layouts, but their navigation paths are mostly similar. By generating regular expressions for the relevant paths, the system ensures that it crawls only relevant pages. To generate the regular expressions, the forum software is first identified using predefined signatures, and the predefined URL patterns for the identified software are then selected. These patterns are used to generate regular expressions for thread and index page URLs. Page-flipping URLs are identified using predefined signatures. This approach allows accurate and faster crawling of forums with minimal configuration.
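The pipeline described in the abstract (identify the forum software from signatures, select its predefined URL patterns, then build regular expressions for index, thread, and page-flipping URLs) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the signature table, the marker strings, the flip parameters, and the names `FORUM_SIGNATURES`, `identify_forum_software`, `build_url_regexes`, and `classify_url` are all assumptions made for illustration. The URL patterns only loosely follow the default URL shapes of phpBB and vBulletin.

```python
import re

# Hypothetical signature table (an assumption, not taken from the paper).
# Each entry holds text markers used to recognise the forum software and
# URL patterns for its index (board) and thread pages.
FORUM_SIGNATURES = {
    "phpbb": {
        "markers": ["powered by phpbb"],
        "index": r"viewforum\.php\?(?:.*&)?f=\d+",
        "thread": r"viewtopic\.php\?(?:.*&)?t=\d+",
        "flip_params": ["start"],
    },
    "vbulletin": {
        "markers": ["powered by vbulletin"],
        "index": r"forumdisplay\.php\?(?:.*&)?f=\d+",
        "thread": r"showthread\.php\?(?:.*&)?t=\d+",
        "flip_params": ["page"],
    },
}


def identify_forum_software(html):
    """Return the first software whose marker occurs in the page, else None."""
    lowered = html.lower()
    for name, sig in FORUM_SIGNATURES.items():
        if any(marker in lowered for marker in sig["markers"]):
            return name
    return None


def build_url_regexes(software):
    """Compile index, thread, and page-flipping regexes for one software."""
    sig = FORUM_SIGNATURES[software]
    # A flipping URL carries a numeric paging parameter such as start=10.
    flip = r"[?&](?:%s)=\d+" % "|".join(map(re.escape, sig["flip_params"]))
    return {
        "index": re.compile(sig["index"]),
        "thread": re.compile(sig["thread"]),
        "flip": re.compile(flip),
    }


def classify_url(url, regexes):
    """Return (page_type, is_flip); page_type is 'index', 'thread', or None."""
    page_type = None
    for kind in ("index", "thread"):
        if regexes[kind].search(url):
            page_type = kind
            break
    # Only relevant pages can count as page-flipping URLs.
    is_flip = page_type is not None and bool(regexes["flip"].search(url))
    return page_type, is_flip
```

A crawler built on this sketch would call `identify_forum_software` once on the forum's entry page, compile the regexes, and then enqueue only those links for which `classify_url` returns a page type. Links that classify as `None`, such as profile or login pages, would be skipped.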

Authors and Affiliations

Rakesh S. Mane
D.Y.Patil COE, Pune, Maharashtra, India
Gopal B. Bagga
D.Y.Patil COE, Pune, Maharashtra, India
Devendra U. Bhute
D.Y.Patil COE, Pune, Maharashtra, India
Abhijeet D. Nikam
D.Y.Patil COE, Pune, Maharashtra, India
Prof. Sonali Gaikwad
D.Y.Patil COE, Pune, Maharashtra, India

Keywords: Forum crawling, page classification, page type, URL pattern learning, URL type, signatures

Publication Details

Published in : Volume 3 | Issue 3 | May-June 2017
Date of Publication : 2017-06-30
License:  This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 392-395
Manuscript Number : IJSRSET173392
Publisher : Technoscience Academy

Print ISSN : 2395-1990, Online ISSN : 2394-4099

Cite This Article :

Rakesh S. Mane, Gopal B. Bagga, Devendra U. Bhute, Abhijeet D. Nikam, Prof. Sonali Gaikwad, "Forum Crawler", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN : 2395-1990, Online ISSN : 2394-4099, Volume 3, Issue 3, pp. 392-395, May-June 2017.
Journal URL : http://ijsrset.com/IJSRSET173392
