Semantic text similarity using corpus-based word similarity and string similarity
Source
ACM Transactions on Knowledge Discovery from Data (TKDD)
Volume 2, Issue 2 (July 2008)
Article No. 10  
Year of Publication: 2008
ISSN:1556-4681
Authors
Aminul Islam, University of Ottawa, ON, Canada
Diana Inkpen, University of Ottawa, ON, Canada
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1376815.1376819

ABSTRACT

We present a method for measuring the semantic similarity of texts using a corpus-based measure of semantic word similarity and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Existing methods for computing text similarity have focused mainly on either large documents or individual words. We focus on computing the similarity between two sentences or two short paragraphs. The proposed method can be exploited in a variety of applications involving textual knowledge representation and knowledge discovery. Evaluation results on two different data sets show that our method outperforms several competing methods.
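
To make the string-matching half of the abstract concrete, here is a minimal sketch of a normalized LCS score between two words. The abstract does not give the normalization or the modifications the authors apply, so the squared-length normalization and the function names `lcs_length` and `normalized_lcs` are assumptions chosen for illustration, not the paper's definition. The corpus-based word similarity component is not shown.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (standard DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def normalized_lcs(a: str, b: str) -> float:
    """Assumed normalization: LCS length squared over the product of lengths.

    Returns a score in [0, 1]; identical non-empty strings score 1.0.
    """
    if not a or not b:
        return 0.0
    n = lcs_length(a, b)
    return (n * n) / (len(a) * len(b))
```

Under this assumed normalization, two identical strings score 1.0 and two strings with no characters in common score 0.0, so the value can be combined with a word-similarity score on the same scale.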

