Hadoop based Information Extract from Text Document

Deepak Motwani; V. K. Chaubey; A. S. Saxena

doi:10.32628/IJSRSET162146

Authors

Deepak Motwani, Department of Computer Science and Engineering, Mewar University, Rajasthan, India
V. K. Chaubey, Department of Computer Science and Engineering, Mewar University, Rajasthan, India
A. S. Saxena, Department of Computer Science and Engineering, Mewar University, Rajasthan, India

Keywords:

Hadoop, Big Data, Text Extraction, Keyword Based Extraction, MapReduce

Abstract

Hadoop is one of the most widely adopted cluster computing frameworks for processing Big Data. Although Hadoop has arguably become the standard solution for managing Big Data, it is not free from limitations. Researchers and students today prefer to work with documents in txt and doc formats, yet most text is distributed as PDF. Research papers in particular are typically available only as PDF, and extracting text from PDF is one of the more difficult tasks. Extracting text from multiple PDF files therefore requires algorithms that make the process manageable. Text extraction is the basic step that must be completed before any further processing. We begin with a concise discussion of keywords and of the steps involved in extracting text from a txt file. In this paper, we use a keyword-based extraction method to extract text from txt files; with these keywords we can retrieve the details of the corresponding part of a research paper or any other PDF file. We also use a multithreading approach, which allows our method to extract text in very little time. The aim of this paper is to extract text on the basis of a particular keyword, which is useful for new researchers.
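The abstract does not give the algorithm itself, so the following is only a minimal sketch of what keyword-based extraction with multithreading might look like, not the authors' implementation. It assumes that "extraction" means returning the sentences that contain the keyword, and that the documents are already available as plain text. The function names (`split_sentences`, `extract_by_keyword`, `extract_from_documents`) and the sample documents are hypothetical. Each document is searched in its own thread (a map-like step), and the per-document matches are then collected into one dictionary (a reduce-like step).

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_sentences(text):
    """Split text into sentences on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_by_keyword(text, keyword):
    """Return the sentences of `text` that contain `keyword` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [s for s in split_sentences(text) if pattern.search(s)]


def extract_from_documents(documents, keyword, max_workers=4):
    """Search every document in a thread pool and collect the matches.

    `documents` maps a (hypothetical) file name to its plain-text content.
    Documents without any match are left out of the result.
    """
    names = list(documents)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map-like step: one keyword search per document.
        results = pool.map(lambda n: extract_by_keyword(documents[n], keyword), names)
        # Reduce-like step: gather non-empty results by document name.
        return {name: hits for name, hits in zip(names, results) if hits}


# Illustrative input only; these texts are not taken from the paper.
docs = {
    "a.txt": "Hadoop stores data in HDFS. MapReduce processes it in parallel.",
    "b.txt": "Keyword extraction is a first step. Hadoop can scale it out.",
    "c.txt": "This file mentions nothing relevant.",
}
matches = extract_from_documents(docs, "hadoop")
for name, hits in matches.items():
    print(name, hits)
```

On a real Hadoop cluster the same per-document search would presumably run inside a mapper, with the collection step handled by a reducer; the thread pool here only stands in for that parallelism on a single machine.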
