Multi-Class Document Classification: Effective and Systematized Method to Categorize Documents

Authors

  • Kaushika Pal  Assistant Professor, Sarvajanik College of Engineering and Technology, Surat, Gujarat, India
  • Dr. Biraj V. Patel  G. H. Patel, P.G. Department of Computer Science and Technology, Sardar Patel University, V.V. Nagar, Gujarat, India

DOI:

https://doi.org/10.32628/IJSRSET207117

Keywords:

Classification, NLP, Machine Learning, Feature Set, Accuracy

Abstract

A large part of the World Wide Web consists of documents and content of many kinds: big data, formatted and unformatted data, and unstructured and unorganized data. We need an information infrastructure that is useful and easily accessible as and when required. This research work combines Natural Language Processing and Machine Learning for content-based classification of documents. Natural Language Processing divides the problem of understanding an entire document at once into smaller chunks and yields only the useful tokens needed for Feature Extraction, a machine learning technique that creates the Feature Set used to train a classifier, which then predicts the label of a new document and places it in the appropriate location. Machine Learning, a subset of Artificial Intelligence, offers sophisticated algorithms such as Support Vector Machine, K-Nearest Neighbor, and Naïve Bayes, which work well for classifying content in many Indian and foreign languages. The model classifies documents with more than 70% accuracy for major Indian languages and more than 80% accuracy for English.
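The pipeline the abstract describes (tokenize, build a feature set, train a classifier, predict a label for a new document) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple regex tokenizer in place of the language-specific NLP preprocessing, bag-of-words counts as the feature set, and Naïve Bayes (one of the algorithms named above) with add-one smoothing. The training sentences, the class names, and the names `tokenize` and `NaiveBayesClassifier` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into word tokens (a stand-in for real NLP preprocessing)."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_doc_counts = Counter(labels)   # documents per class (for priors)
        self.token_counts = defaultdict(Counter)  # per-class token frequencies
        self.vocabulary = set()                   # the feature set
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen in training carry no evidence, so they are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_doc_counts.items():
            # Log prior plus summed log likelihoods avoids floating-point underflow.
            score = math.log(doc_count / self.total_docs)
            class_total = sum(self.token_counts[label].values())
            for token in tokens:
                count = self.token_counts[label][token]
                score += math.log((count + 1) / (class_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus with three classes (illustration only).
training_data = [
    ("the team won the cricket match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the budget bill", "politics"),
    ("the minister announced an election date", "politics"),
    ("the new phone has a faster processor", "technology"),
    ("developers released a software update", "technology"),
]
texts, labels = zip(*training_data)
classifier = NaiveBayesClassifier().fit(texts, labels)

print(classifier.predict("the bowler won the match"))
print(classifier.predict("a software update for the phone"))
```

A Support Vector Machine or K-Nearest Neighbor classifier could be trained on the same feature set; Naïve Bayes is used here only because it fits in a few lines without external libraries.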

Published

2020-02-15

Section

Research Articles

How to Cite

[1]
Kaushika Pal, Dr. Biraj V. Patel, "Multi-Class Document Classification: Effective and Systematized Method to Categorize Documents", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 7, Issue 1, pp. 118-123, January-February 2020. Available at DOI: https://doi.org/10.32628/IJSRSET207117