Classifying Blocks of Page Layout of a Document

Authors (4): Prof. Prashant Gadakh, Prof. Ramkrushna M, Prof. Bailappa Bhovi, Prof. Malayaj Kumar

In this work we address a classification task using the page-blocks dataset from the University of California, Irvine (UCI) Machine Learning Repository. The dataset describes the blocks of a document's page layout, as detected by a segmentation process. After visualizing the data, we first classified the blocks with the Naive Bayes algorithm and found its accuracy unsatisfactory. We then classified the data again with a Decision Tree algorithm. In this report we discuss the structure of the dataset, its visualization, and the two classification algorithms, and we contrast their outputs. The report ends with a brief section on possible future work with this dataset.
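The two-classifier comparison described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the data are synthetic rather than the UCI page-blocks file, the two features (a block height and a black-pixel fraction) and the labels "text" and "graphic" are hypothetical stand-ins, and the tree is a small greedy Gini-based tree chosen for brevity.

```python
import math
import random
from collections import Counter

def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature (mean, variance) for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            # Small floor keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / len(X), stats)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior."""
    def log_posterior(label):
        prior, stats = model[label]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            for v, (mean, var) in zip(x, stats))
    return max(model, key=log_posterior)

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def fit_tree(X, y, depth=0, max_depth=3):
    """Grow a small binary tree by greedy Gini-impurity splits."""
    if depth == max_depth or len(set(y)) == 1:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    best = None
    for f in range(len(X[0])):
        # Dropping the largest value keeps both sides of a split non-empty.
        for threshold in sorted(set(x[f] for x in X))[:-1]:
            left = [t for x, t in zip(X, y) if x[f] <= threshold]
            right = [t for x, t in zip(X, y) if x[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    if best is None:  # every row identical: no split possible
        return Counter(y).most_common(1)[0][0]
    _, f, threshold = best
    lo = [i for i, x in enumerate(X) if x[f] <= threshold]
    hi = [i for i, x in enumerate(X) if x[f] > threshold]
    return (f, threshold,
            fit_tree([X[i] for i in lo], [y[i] for i in lo], depth + 1, max_depth),
            fit_tree([X[i] for i in hi], [y[i] for i in hi], depth + 1, max_depth))

def predict_tree(node, x):
    """Walk from the root to a leaf; inner nodes are tuples, leaves are labels."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if x[f] <= threshold else right
    return node

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(x) == t for x, t in zip(rows, labels)) / len(labels)

rng = random.Random(0)  # fixed seed so the synthetic data are reproducible

def make_blocks(n):
    """Synthetic stand-in data: [height, black-pixel fraction] per block."""
    rows, labels = [], []
    for _ in range(n):
        label = "text" if rng.random() < 0.8 else "graphic"
        centre = (10.0, 0.3) if label == "text" else (40.0, 0.6)
        rows.append([rng.gauss(centre[0], 8.0), rng.gauss(centre[1], 0.15)])
        labels.append(label)
    return rows, labels

train_X, train_y = make_blocks(300)
test_X, test_y = make_blocks(100)
nb = fit_gaussian_nb(train_X, train_y)
tree = fit_tree(train_X, train_y)
print("Naive Bayes accuracy:", accuracy(lambda x: predict_gaussian_nb(nb, x), test_X, test_y))
print("Decision tree accuracy:", accuracy(lambda x: predict_tree(tree, x), test_X, test_y))
```

The accuracies printed for this toy data say nothing about the results reported in the paper; the sketch only shows the shape of the experiment, in which both models are fitted on the same training split and scored on the same held-out split.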

Authors and Affiliations

Prof. Prashant Gadakh
International Institute of Information Technology, Hinjewadi, Pune, Maharashtra, India
Prof. Ramkrushna M
International Institute of Information Technology, Hinjewadi, Pune, Maharashtra, India
Prof. Bailappa Bhovi
International Institute of Information Technology, Hinjewadi, Pune, Maharashtra, India
Prof. Malayaj Kumar
International Institute of Information Technology, Hinjewadi, Pune, Maharashtra, India

Keywords: Classification Algorithm, Page-Blocks, Blackand, Blackpix


Publication Details

Published in : Volume 4 | Issue 1 | January-February 2018
Date of Publication : 2018-01-31
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 23-27
Manuscript Number : IJSRSET184124
Publisher : Technoscience Academy

Print ISSN : 2395-1990, Online ISSN : 2394-4099

Cite This Article :

Prof. Prashant Gadakh, Prof. Ramkrushna M, Prof. Bailappa Bhovi, Prof. Malayaj Kumar, "Classifying Blocks of Page Layout of a Document", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN : 2395-1990, Online ISSN : 2394-4099, Volume 4, Issue 1, pp. 23-27, January-February 2018.
Journal URL : http://ijsrset.com/IJSRSET184124
