A Novel Framework for Tweet Segmentation and its Application to Named Entity Recognition

Authors

  • Anuja A. Thete, Department of Computer Science and Engineering, Jagadambha College of Engineering and Technology, Yavatmal, Sant Gadge Baba Amravati University, Amravati, Maharashtra, India
  • J. S. Karnewar, Department of Computer Science and Engineering, Jagadambha College of Engineering and Technology, Yavatmal, Sant Gadge Baba Amravati University, Amravati, Maharashtra, India

Keywords:

Twitter Stream, Tweet Segmentation, Named Entity Recognition, Linguistic Processing

Abstract

Twitter has become one of the most important communication channels because it provides up-to-date and newsworthy information. Given the wide use of Twitter as a source of information, finding an interesting tweet among a large number of tweets is challenging for a user. With a huge number of tweets sent per day by hundreds of millions of users, information overload is inevitable. To extract information from a large volume of tweets, one option is to apply Named Entity Recognition (NER) methods developed for formal texts. However, many applications in Information Retrieval (IR) and Natural Language Processing (NLP) suffer severely from the noisy and short nature of tweets. In this paper, we propose a novel framework for tweet segmentation in batch mode, called HybridSeg. By splitting tweets into meaningful segments, the semantic or context information is well preserved and easily extracted by downstream applications. HybridSeg finds the optimal segmentation of a tweet by maximizing the sum of the stickiness scores of its candidate segments. The stickiness score considers the probability of a segment being a phrase in English (i.e., global context) and the probability of a segment being a phrase within the batch of tweets (i.e., local context). For the latter, we propose and evaluate two models that derive local context from the linguistic features and the term dependency in a batch of tweets, respectively. HybridSeg is also designed to learn iteratively from confident segments as pseudo feedback. As an application, we show that high accuracy is achieved in named entity recognition by applying segment-based part-of-speech (POS) tagging.
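
The segmentation objective described in the abstract (choose the split of a tweet that maximizes the sum of the stickiness scores of its segments) can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the function names, the `max_len` limit, the toy probability tables, the fixed score for single words, and the linear mix of global and local context with weight `lam` are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[str, ...]

# Hypothetical phrase probabilities (illustrative values, not from the paper).
GLOBAL_PROB: Dict[Segment, float] = {   # "is a phrase in English"
    ("new", "york"): 0.8,
    ("new", "york", "city"): 0.9,
}
LOCAL_PROB: Dict[Segment, float] = {    # "is a phrase in this batch of tweets"
    ("new", "york"): 0.4,
    ("new", "york", "city"): 0.7,
}


def toy_stickiness(seg: Segment, lam: float = 0.5) -> float:
    """Assumed scoring: single words get a small constant, longer segments
    get a linear mix of global and local phrase probabilities."""
    if len(seg) == 1:
        return 0.1
    return (1 - lam) * GLOBAL_PROB.get(seg, 0.0) + lam * LOCAL_PROB.get(seg, 0.0)


def segment(
    words: Sequence[str],
    stickiness: Callable[[Segment], float],
    max_len: int = 3,
) -> Tuple[List[Segment], float]:
    """Return the segmentation with the highest total stickiness.

    best[i] holds the best score for the first i words; back[i] holds the
    start index of the last segment in that best segmentation.
    """
    n = len(words)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + stickiness(tuple(words[start:end]))
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end of the tweet to recover segments.
    segments: List[Segment] = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(words[start:end]))
        end = start
    segments.reverse()
    return segments, best[n]


if __name__ == "__main__":
    segs, total = segment("i love new york city".split(), toy_stickiness)
    print(segs, total)
```

With these toy tables, "new york city" is kept as one segment because its combined score exceeds the score of any split of those three words. In a real setting the two probability tables would be replaced by estimates from an external English corpus and from the current batch of tweets, respectively.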

Published

2017-12-31

Section

Research Articles

How to Cite

[1]
Anuja A. Thete, J. S. Karnewar, "A Novel Framework for Tweet Segmentation and its Application to Named Entity Recognition," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 2, Issue 2, pp. 397-402, March-April 2016.