Comparison of Cosine, Euclidean Distance and Jaccard Distance

Authors

  • Manpreet Singh Lehal, Department of Computer Science and IT, Lyallpur Khalsa College, Jalandhar, Punjab, India

Keywords

Euclidean Distance, Cosine Similarity, Jaccard Distance

Abstract

Measuring sentence similarity means determining how close two sentences are in meaning: the higher the score, the more similar the meaning of the two sentences. The task is not an easy one, because natural language can express the same idea in many different ways. As a result, similarity metrics often give differing results for the same pair of sentences, and choosing the right measure is crucial to the efficiency of the system. This paper compares and analyses three similarity measures, Euclidean Distance, Cosine Similarity and Jaccard Distance, and points out the usage of each metric.
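
To make the three measures concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a simple representation that the abstract does not specify: lowercase whitespace tokens, word-count vectors for Euclidean Distance and Cosine Similarity, and word sets for Jaccard Distance. The function names and the example sentences are illustrative. Note that for the two distances a lower value means more similar, while for Cosine Similarity a higher value means more similar.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase whitespace tokens; real systems usually do more.
    return sentence.lower().split()


def euclidean_distance(a, b):
    # Straight-line distance between the two word-count vectors.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    vocab = set(ca) | set(cb)
    return math.sqrt(sum((ca[w] - cb[w]) ** 2 for w in vocab))


def cosine_similarity(a, b):
    # Cosine of the angle between the two word-count vectors.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty input
    return dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    # One minus the ratio of shared words to all distinct words.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    if not union:
        return 0.0  # convention chosen here: two empty sentences are identical
    return 1 - len(sa & sb) / len(union)


s1 = "the cat sat on the mat"
s2 = "the cat lay on the mat"
print(euclidean_distance(s1, s2))  # about 1.414
print(cosine_similarity(s1, s2))   # 0.875
print(jaccard_distance(s1, s2))    # about 0.333
```

The two sentences differ in a single word, yet the three measures report this on different scales, which is one reason the choice of metric matters.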

Published

2017-12-30

Section

Research Articles

How to Cite

[1]
Manpreet Singh Lehal, "Comparison of Cosine, Euclidean Distance and Jaccard Distance," International Journal of Scientific Research in Science, Engineering and Technology (IJSRSET), Print ISSN: 2395-1990, Online ISSN: 2394-4099, Volume 3, Issue 8, pp. 1376-1381, November-December 2017.