A Framework for Collaborative Document Classification with GA-SVM

Authors

  • S. Chakraverty  Netaji Subhas Institute of Technology Dwarka, New Delhi, Delhi, India
  • U. Pandey  IMS Engineering College, Ghaziabad, Ghaziabad, Uttar Pradesh, India
  • P. Dutt  Netaji Subhas Institute of Technology Dwarka, New Delhi, Delhi, India

Keywords:

Lexical semantics, WordNet ontology, Wikipedia categories, Genetic Algorithm, Multiclass SVM, Category-keyword Strength, Statistical and Contextual Document Classification

Abstract

Text classification has been addressed both by purely statistical approaches that use the frequency of occurrence of significant terms and by approaches that tap a range of semantic features conveyed by the text. Both have proved their strengths, yet each has limitations when applied to corpora of different sizes and expressive styles. This raises two interesting problems: given a corpus, how can we automate the process of (i) finding an optimum blend of statistical and contextual contributions for the most appropriate classification, and (ii) determining the relative importance of the different kinds of contextual features employed? In this paper, we address these issues by developing a Collaborative Document Classification (CDC) system that adapts, for a given corpus, the weighted contributions of statistical features, an array of lexical-semantic features derived from the WordNet ontology, and categorical-semantic features obtained from the hierarchical organization of Wikipedia category pages.

Given the complexity of this multivariate problem, it is judicious to seek approximate solutions using metaheuristics. We employ a Genetic Algorithm (GA) that embeds a multi-class SVM classifier in its fitness function evaluator to find an optimal mix of statistical and semantic features tailored to a given corpus. We experimented on small as well as large data sets derived from three sources: the 20 Newsgroups corpus, the Reuters-21578 corpus, and a Creative corpus that we handcrafted by collecting news articles from the Times of India news portal. Results indicate that the CDC system was able to balance the statistical and contextual approaches and to increase the contributions of the most relevant semantic features for each corpus, achieving a high classification accuracy ranging from 88% to 100% with an average of 95.55%. The results highlight the significance of a collaborative document classification approach that taps the power of ontological databases and adapts seamlessly to varying corpora. The final population output by the GA contains a set of non-inferior solutions that offer trade-off possibilities between recall and precision.
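
The idea of a GA whose fitness evaluator wraps a classifier can be illustrated with a minimal sketch. It is not the authors' implementation. Each chromosome holds one weight per feature group (statistical, lexical-semantic, categorical-semantic), and its fitness is the validation accuracy of a classifier trained on the weighted features. To stay dependency-free, a nearest-centroid classifier stands in for the multi-class SVM. The toy corpus, the function names, and all GA settings (population size, tournament selection, uniform crossover, Gaussian mutation, elitism) are assumptions made for illustration only.

```python
import random

def make_toy_corpus(rng, n_per_class=30):
    """Synthetic documents with three hypothetical feature groups of 2 values each."""
    docs = []
    for label in range(3):
        for _ in range(n_per_class):
            stat = [rng.gauss(label * 0.3, 1.0), rng.gauss(0.0, 1.0)]          # weak signal
            lex = [rng.gauss(label * 2.0, 1.0), rng.gauss(-label * 2.0, 1.0)]  # strong signal
            cat = [rng.gauss(0.0, 3.0), rng.gauss(0.0, 3.0)]                   # pure noise
            docs.append(((stat, lex, cat), label))
    rng.shuffle(docs)
    return docs

def weighted(groups, w):
    """Scale every feature by the weight of the group it belongs to."""
    return [w[g] * x for g, grp in enumerate(groups) for x in grp]

def centroid_accuracy(train, valid, w):
    """Stand-in classifier (nearest centroid, NOT an SVM): accuracy on `valid`."""
    sums, counts = {}, {}
    for groups, label in train:
        v = weighted(groups, w)
        acc = sums.setdefault(label, [0.0] * len(v))
        for i, x in enumerate(v):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    cents = {c: [s / counts[c] for s in sums[c]] for c in sums}
    hits = 0
    for groups, label in valid:
        v = weighted(groups, w)
        pred = min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(v, cents[c])))
        hits += pred == label
    return hits / len(valid)

def evolve(fitness, n_genes, rng, pop_size=20, generations=15, mut_rate=0.3):
    """Simple generational GA over weight vectors in [0, 1]."""
    # Seed the population with the uniform-weight baseline plus random vectors.
    pop = [[1.0] * n_genes]
    pop += [[rng.random() for _ in range(n_genes)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        nxt = scored[:2]  # elitism: the two best survive unchanged
        while len(nxt) < pop_size:
            p1 = max(rng.sample(scored, 3), key=fitness)  # tournament selection
            p2 = max(rng.sample(scored, 3), key=fitness)
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]  # uniform crossover
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, 0.2)))
                     if rng.random() < mut_rate else g for g in child]        # Gaussian mutation
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

rng = random.Random(0)
corpus = make_toy_corpus(rng)
train, valid = corpus[:60], corpus[60:]

def fitness(w):
    return centroid_accuracy(train, valid, w)

best = evolve(fitness, 3, rng)
print("group weights:", [round(g, 2) for g in best], "validation accuracy:", fitness(best))
```

Because the uniform-weight vector is seeded into the initial population and the best individuals are carried over unchanged, the returned solution can never score below the unweighted baseline on the validation split. A real system would replace `centroid_accuracy` with a multi-class SVM and would report results on a separate test set, since fitness measured on the validation data is optimistic.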

Published

2016-12-30

Section

Research Articles

How to Cite

[1]
S. Chakraverty, U. Pandey, P. Dutt, "A Framework for Collaborative Document Classification with GA-SVM," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 2, Issue 6, pp. 104-114, November-December 2016.