Classifying Blocks of Page Layout of a Document

Authors

  • Prof. Prashant Gadakh  International Institute of Information Technology, Hinjewadi, Pune, Maharashtra, India
  • Prof. Ramkrushna M  International Institute of Information Technology, Hinjewadi, Pune, Maharashtra, India
  • Prof. Bailappa Bhovi  International Institute of Information Technology, Hinjewadi, Pune, Maharashtra, India
  • Prof. Malayaj Kumar  International Institute of Information Technology, Hinjewadi, Pune, Maharashtra, India

Keywords:

Classification Algorithm, Page-Blocks, Blackand, Blackpix

Abstract

In this work, we perform classification on a dataset from the University of California, Irvine (UCI) Machine Learning Repository. We use the page-blocks dataset, which contains the blocks of a document's page layout as produced by a segmentation process. After visualizing the data, we first ran the Naive Bayes classification algorithm and found its accuracy to be poor. We then classified the data again using a Decision Tree algorithm. In this report, we discuss the structure of the dataset, its visualization, and the classification algorithms, and we contrast their outputs. The report ends with a brief section on possible future work on this dataset.
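
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes scikit-learn is available, uses GaussianNB as one possible Naive Bayes variant, and replaces the real page-blocks data with a synthetic, imbalanced stand-in generated by make_classification. The feature count, class count, class weights, and the helper name compare_classifiers are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, seed=0):
    """Fit both classifiers on one stratified split; return test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    models = {
        "naive_bayes": GaussianNB(),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in for the page-blocks data (assumption): 10 numeric
# features, 5 classes, and hypothetical class weights with one dominant class.
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    n_classes=5,
    n_clusters_per_class=1,
    weights=[0.90, 0.06, 0.01, 0.01, 0.02],
    random_state=0,
)

scores = compare_classifiers(X, y)
for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

To try this on the real data, X and y would be loaded from the UCI page-blocks file instead of being generated. With one dominant class, per-class metrics may be more informative than overall accuracy.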

References

  1. B565 (Data Mining) class notes
  2. Bishop, Christopher (2007): Pattern Recognition and Machine Learning, Springer
  3. https://en.wikipedia.org/wiki/Document_classification
  4. Pang-Ning Tan, Michael Steinbach, Vipin Kumar (2007): Introduction to Data Mining, Pearson
  5. http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
  6. R. Longadge, S. S. Dongre, and L. Malik, "Class Imbalance Problem in Data Mining: Review," International Journal of Computer Science and Network (IJCSN), vol. 2, 2013.
  7. F. Provost, "Machine learning from imbalanced data sets 101," Proceedings of the AAAI 2000 Workshop on Imbalanced Data Sets, 2000, pp. 1-3.

Published

2018-01-31

Section

Research Articles

How to Cite

[1]
Prof. Prashant Gadakh, Prof. Ramkrushna M, Prof. Bailappa Bhovi, Prof. Malayaj Kumar, "Classifying Blocks of Page Layout of a Document," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 4, Issue 1, pp. 23-27, January-February 2018.