ACM Home Page
Please provide us with feedback. Feedback
Exploiting hierarchical domain structure to compute similarity
Full text PdfPdf (286 KB)
Source ACM Transactions on Information Systems (TOIS) archive
Volume 21 ,  Issue 1  (January 2003) table of contents
Pages: 64 - 93  
Year of Publication: 2003
ISSN:1046-8188
Authors
Prasanna Ganesan  Stanford University, Stanford, CA
Hector Garcia-Molina  Stanford University, Stanford, CA
Jennifer Widom  Stanford University, Stanford, CA
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 17,   Downloads (12 Months): 190,   Citation Count: 38
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/635484.635487
What is a DOI?

ABSTRACT

The notion of similarity between objects finds use in many contexts, for example, in search engines, collaborative filtering, and clustering. Objects being compared often are modeled as sets, with their similarity traditionally determined based on set intersection. Intersection-based measures do not accurately capture similarity in certain domains, such as when the data is sparse or when there are known relationships between items within sets. We propose new measures that exploit a hierarchical domain structure in order to produce more intuitive similarity scores. We extend our similarity measures to provide appropriate results in the presence of multisets (also handled unsatisfactorily by traditional measures), for example, to correctly compute the similarity between customers who buy several instances of the same product (say milk), or who buy several products in the same category (say dairy products). We also provide an experimental comparison of our measures against traditional similarity measures, and report on a user study that evaluated how well our measures match human intuition.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
Breese, J. S., Heckerman, D., and Kadie, C. 1998. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence.
 
3
 
4
Das, G., Mannila, H., and Ronkainen, P. 1998. Similarity of attributes by external probes. In Proceedings of Knowledge Discovery and Data Mining (KDD), 23--29.
 
5
de Buenaga Rodríguez, M., Gómez-Hidalgo, J. M., and Díaz-Agudo, B. 1997. Using WordNet to complement training information in text categorization. In Proceedings of the Second International Conference on Recent Advances in Natural Language Processing.
 
6
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. 1990. Indexing by latent semantic indexing. J. Amer. Soc. Inf. Sci. 41, 6, 391--407.
 
7
Feldman, R. and Dagan, I. 1995. Knowledge discovery in textual databases. In Proceedings of KDD-95.
 
8
Ganesan, P., Garcia-Molina, H., and Widom, J. 2002. Exploiting hierarchical domain structure to compute similarity. Tech. Rep., Available at http://dbpubs.stanford.edu/pub/2001-27.
9
 
10
 
11
 
12
 
13
Jeh, G. and Widom, J. 2001. Simrank: A measure of structural-context similarity. Tech. Rep. Stanford University. Available at http://dbpubs.stanford.edu/pub/2001-41.
 
14
Joshi, A. and Krishnapuram, R. 2000. On mining Web access logs. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 63--69.
 
15
 
16
Lee, J. and Kim, M. 1993. Information retrieval based on a conceptual distance in is-a heirarchy. J. Doc. 49, 2, 188--207.
 
17
Lustig, G. 1967. A new class of association factors in mechanized information storage, retrieval and dissemination. In F. I. D./I. F. I. P. Joint Conference (Rome).
 
18
McGill, M. J. 1983. Introduction to Modern Information Retrieval. McGraw-Hill, New York.
 
19
Melnik, S., Garcia-Molina, H., and Rahm, E. 2002. Similarity flooding: A versatile graph matching algorithm. In Proceedings of ICDE 2002.
 
20
Miller, G. R. B. Fellbaum, C., Gross, D., and Miller, K. 1990. Introduction to WordNet: An on-line lexical database. J. Lexicog. 3, 4, 234--244.
 
21
Nasraoui, O., Frigui, H., Joshi, A., and Krishnapuram, R. 1999. Mining Web access logs using relational competitive fuzzy clustering. In Proceedings of the Eighth International Fuzzy Systems Association World Congress---IFSA 99 (Taipei).
 
22
OPD. The Open Directory Project. Available at http://dmoz.org/.
 
23
Rada, R., Mili, H., Bicknell, E., and Blettner, M. 1989. Development and application of a metric on semantic nets. IEEE Trans. Syst. Man Cybern. 19, 1, 17--30.
 
24
Resnick, P. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 448--453.
25
 
26
Richardson, R. and Smeaton, A. F. 1995. Using WordNet in a knowledge-based approach to information retrieval. In Proceedings of the Seventeenth BCS-IRSG Colloquium on Information Retrieval.
 
27
 
28
 
29
 
30
Sankoff, D. and Kruskal, J. B. 1983. Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA.
 
31
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. 2000. Application of dimensionality reduction in recommender system---a case study. In Proceedings of the ACM WebKDD 2000 Workshop.
32
33
 
34
Scott, S. and Matwin, S. 1998. Text classification using WordNet hypernyms. In Proceedings of the Use of WordNet in Natural Language Processing Systems. Association for Computational Linguistics.
 
35
Shasha, D. and Zhang, K. 1997. Approximate tree pattern matching. In Pattern Matching Algorithms, Oxford University Press, New York.
 
36
Shivakumar, N. and Garcia-Molina, H. 1995. Scam: A copy detection mechanism for digital documents. In Proceedings of the Second International Conference in Theory and Practice of Digital Libraries.
 
37
Sibson, R. 1972. Order invariant methods for data analysis. J. Roy. Stat. Soc. 34, 3, 311--349.
 
38
Sneath, P. and Sokal, R. 1973. Numerical Taxonomy. W. H. Freeman, San Francisco.
 
39
Soergel, D. 1967. Mathematical analysis of documentation systems. An attempt at a theory of classification and search request formulation. Inf. Stor. Retrieval 3, 3, 129--173.
 
40
 
41
Strehl, A., Ghosh, J., and Mooney, R. 2000. Impact of similarity measures on Web-page clustering. In Proceedings of the AAAI Workshop on AI for Web Search.
 
42

CITED BY  38

Collaborative Colleagues:
Prasanna Ganesan: colleagues
Hector Garcia-Molina: colleagues
Jennifer Widom: colleagues