Hadoop based Information Extract from Text Document

Authors: Deepak Motwani, V. K. Chaubey, A. S. Saxena

Hadoop is one of the most widely adopted cluster computing frameworks for processing Big Data. Although Hadoop has arguably become the standard solution for managing Big Data, it is not free from limitations. Researchers and students today prefer to work with documents in txt and doc formats, yet most text files, including nearly all research papers, are distributed as PDF, and extracting text from PDF is one of the most difficult tasks. Extracting text from multiple PDF files therefore requires algorithms that make the extraction process convenient. Text extraction is the basic step that must be completed before any further processing. We begin with a concise discussion of keywords and of the steps involved in extracting text from a txt file. In this paper, we use a keyword-based extraction method to extract text from a txt file; with the help of these keywords, we can retrieve all the details in the relevant part of a research paper or any PDF file. We also use a multithreading approach. Our approach extracts text in very little time, so the time complexity is very low. The aim of this paper is to extract text on the basis of a particular keyword, which is useful for new researchers.
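As a rough illustration of the idea described above, the following Python sketch pulls out the sentences that mention a keyword from several text documents in parallel. It is not the authors' implementation: a thread pool stands in for Hadoop map tasks, the documents are held in memory rather than read from txt or PDF files, and the names extract_sentences and extract_from_documents are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def extract_sentences(text, keyword):
    """Return the sentences of `text` that contain `keyword` as a whole word."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Case-insensitive whole-word match; assumes the keyword is a plain word or phrase.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


def extract_from_documents(documents, keyword, max_workers=4):
    """Scan each document in a worker thread and merge the per-document hits.

    `documents` maps a document name to its text. Documents without a
    match are left out of the result.
    """
    names = list(documents)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # "Map" step: one extraction task per document.
        results = pool.map(lambda n: extract_sentences(documents[n], keyword), names)
        # "Reduce" step: collect the non-empty results by document name.
        return {name: hits for name, hits in zip(names, results) if hits}


if __name__ == "__main__":
    docs = {
        "a.txt": "Hadoop processes Big Data. It runs on clusters.",
        "b.txt": "Text extraction comes first. Hadoop can parallelize it!",
        "c.txt": "Nothing relevant here.",
    }
    print(extract_from_documents(docs, "hadoop"))
```

In an actual Hadoop job, the per-document scan would typically live in a mapper and the merge in a reducer; the thread pool here only mirrors that division of work on a single machine.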

Authors and Affiliations

Deepak Motwani
Department of Computer Science and Engineering, Mewar University, Rajasthan, India
V. K. Chaubey
Department of Computer Science and Engineering, Mewar University, Rajasthan, India
A. S. Saxena
Department of Computer Science and Engineering, Mewar University, Rajasthan, India

Keywords: Hadoop Big Data, Text Extraction, Keyword Based Extraction, Map Reduce

Publication Details

Published in : Volume 2 | Issue 1 | January-February 2016
Date of Publication : 2015-02-25
License:  This work is licensed under a Creative Commons Attribution 4.0 International License.
Page(s) : 156-160
Manuscript Number : IJSRSET162146
Publisher : Technoscience Academy

Print ISSN : 2395-1990, Online ISSN : 2394-4099

Cite This Article :

Deepak Motwani, V. K. Chaubey, A. S. Saxena, "Hadoop based Information Extract from Text Document", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN : 2395-1990, Online ISSN : 2394-4099, Volume 2, Issue 1, pp. 156-160, January-February 2016.
Journal URL : http://ijsrset.com/IJSRSET162146
