ACM Home Page
Please provide us with feedback. Feedback
The power of two min-hashes for similarity search among hierarchical data objects
Full text PdfPdf (231 KB)
Source
Symposium on Principles of Database Systems archive
Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems table of contents
Vancouver, Canada
SESSION: Searching & clustering table of contents
Pages 211-220  
Year of Publication: 2008
ISBN:978-1-60558-152-1
Authors
Sreenivas Gollapudi  Microsoft Research, Mountain View, CA, USA
Rina Panigrahy  Micsrosoft Research, Mountain View, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 10,   Downloads (12 Months): 120,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1376916.1376946
What is a DOI?

ABSTRACT

In this study we propose sketching algorithms for computing similarities between hierarchical data. Specifically, we look at data objects that are represented using leaf-labeled trees denoting a set of elements at the leaves organized in a hierarchy. Such representations are richer alternatives to a set. For example, a document can be represented as a hierarchy of sets wherein chapters, sections, and paragraphs represent different levels in the hierarchy. Such a representation is richer than viewing the document simply as a set of words. We measure distance between trees using the best possible super-imposition that minimizes the number of mismatched leaf labels. Our distance measure is equivalent to an Earth Mover's Distance measure since the leaf-labeled trees of height one can be viewed as sets and can be recursively extended to trees of larger height by viewing them as set of sets. We compute sketches of arbitrary weighted trees and analyze them in the context of locality-sensitive hashing (LSH) where the probability of two sketches matching is high when two trees are similar and low when the two trees are far under the given distance measure. Specifically, we compute sketches of such trees by propagating min-hash computations up the tree. Furthermore, we show that propagating one min-hash results in poor sketch properties while propagating two min-hashes results in good sketches.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
N. Augsten, M. Böhlen, C. Dyreson, and J. Gamper. Approximate joins for data-centric XML. In Proceedings of the International Conference on Data Engineering (ICDE), Cancún, Mexico, April 2008. To appear.
 
2
 
3
 
4
5
6
 
7
8
9
 
10
11
 
12
Taher H. Haveliwala, Aristides Gionis, and Piotr Indyk. Scalable techniques for clustering the web. In WebDB (Informal Proceedings), pages 129--134, 2000.
13
 
14
 
15
K. Kailing, H-P. Kriegel, S. Schonauer, and T. Seidl. Efficient similarity search for hierarchical data in large databases. In Proc. 9th Intl Conference on Extending Database Technology, pages 676--693, 2004.
 
16
T. Margush and F. R. McMorris. Consensus n-trees. Bulletin of Mathematical Biology, 3:239--244, 1981.
17
18
 
19
K. Zhang. A constrained editing distance between unordered labeled trees. Algorithmica, 15:205--222, 1996.
 
20
 
21
Li Zhang. On matching nodes between trees. Technical Report HPL-2003-67, HP Laboratories, Palo Alto, CA, April 2003.

Collaborative Colleagues:
Sreenivas Gollapudi: colleagues
Rina Panigrahy: colleagues