Similarity Measures to identify Text Similarity

Authors

  • Manpreet Singh Lehal  Department of Computer Science and IT, Lyallpur Khalsa College, Jalandhar, Punjab, India

Keywords:

Parallel Data, EBMT, NGD

Abstract

Similarity and distance measures express the similarity between words, sentences and documents as numeric scores, indicating the degree of parallelism or the distance between them. Researchers have used a number of similarity measures, but their effectiveness varies from one language pair to another and with the quality of the corpus. Selecting the right similarity measure is crucial to the performance of translation tasks and the extraction of parallel data.
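
As a rough illustration of how such measures turn a pair of sentences into a numeric score, the sketch below computes two widely used measures, Jaccard similarity and cosine similarity, over whitespace tokens. This is not the method evaluated in the article: the function names, the lowercase whitespace tokenizer and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace only.
    return text.lower().split()


def jaccard_similarity(a, b):
    # Shared distinct tokens divided by all distinct tokens.
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts are identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    # Cosine of the angle between the two term-frequency vectors.
    vec_a, vec_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no tokens in one text, so no measurable overlap
    return dot / (norm_a * norm_b)


s1 = "the cat sat on the mat"
s2 = "the cat lay on the rug"
print(round(jaccard_similarity(s1, s2), 2))
print(round(cosine_similarity(s1, s2), 2))
```

For these two sentences the scores come out at roughly 0.43 (Jaccard) and 0.75 (cosine). The two measures rank the same pair differently because cosine counts the repeated token "the" twice, which is one small example of why the choice of measure matters.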

Published

2017-08-30

Section

Research Articles

How to Cite

[1]
Manpreet Singh Lehal, "Similarity Measures to identify Text Similarity", International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 3, Issue 5, pp. 788-794, July-August 2017.