ACM Home Page
Please provide us with feedback. Feedback
Substructure similarity search in graph databases
Full text PdfPdf (386 KB)
Source International Conference on Management of Data archive
Proceedings of the 2005 ACM SIGMOD international conference on Management of data table of contents
Baltimore, Maryland
SESSION: Research papers: graph and tree-structured data table of contents
Pages: 766 - 777  
Year of Publication: 2005
ISBN:1-59593-060-4
Authors
Xifeng Yan  University of Illinois at Urbana-Champaign
Philip S. Yu  IBM T. J. Watson Research Center
Jiawei Han  University of Illinois at Urbana-Champaign
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 23,   Downloads (12 Months): 179,   Citation Count: 11
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1066157.1066244
What is a DOI?

ABSTRACT

Advanced database systems face a great challenge raised by the emergence of massive, complex structural data in bioinformatics, chem-informatics, and many other applications. The most fundamental support needed in these applications is the efficient search of complex structured data. Since exact matching is often too restrictive, similarity search of complex structures becomes a vital operation that must be supported efficiently.In this paper, we investigate the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation ratio of a query graph into the maximum allowed missing features, our structural filtering algorithm, called Grafil, can filter many graphs without performing pairwise similarity computations. It is further shown that using either too few or too many features can result in poor filtering performance. Thus the challenge is to design an effective feature set selection strategy for filtering. By examining the effect of different feature selection mechanisms, we develop a multi-filter composition strategy, where each filter uses a distinct and complementary subset of the features. We identify the criteria to form effective feature sets for filtering, and demonstrate that combining features with similar size and selectivity can improve the filtering and search performance significantly. Moreover, the concept presented in Grafil can be applied to searching approximate non-consecutive sequences, trees, and other complicated structures as well.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
H. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. Shindyalov, and P. Bourne. The protein data bank. Nucleic Acids Research, 28:235--242, 2000.
 
3
 
4
5
 
6
 
7
L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava. Using q-grams in a dbms for approximate string processing. Data Engineering Bulletin, 24:28--37, 2001.
 
8
T. Hagadone. Molecular substructure similarity searching: efficient retrieval in two-dimensional structure databases. J. Chem. Inf. Comput. Sci., 32:515--521, 1992.
 
9
L. Holder, D. Cook, and S. Djoko. Substructure discovery in the subdue system. In Proc. AAAI'94 Workshop on Knowledge Discovery in Databases (KDD'94), pages 169--180, 1994.
 
10
K. Kailing, H. Kriegel, S. Schnauer, and T. Seidl. Efficient similarity search for hierarchical data in large databases. In Proc: 9th Int. Conf. on Extending Database Technology (EDBT'04), pages 676--693, 2004.
 
11
M. Kanehisa and S. Goto. Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28:27--30, 2000.
 
12
 
13
14
 
15
 
16
National Library of Medicine. http://chem.sis.nlm.nih.gov/chemidplus.
 
17
 
18
J. Raymond, E. Gardiner, and P. Willett. Rascal: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45:631--644, 2002.
19
 
20
S. Srinivasa and S. Kumar. A platform based on the multi-dimensional data model for analysis of bio-molecular structures. In Proc. 2003 Int. Conf. on Very Large Data Bases, pages 975--986, 2003.
 
21
 
22
J. Ullmann. Binary n-gram technique for automatic correction of substitution, deletion, insertion, and reversal errors in words. The Computer Journal, 20:141--147, 1977.
 
23
 
24
P. Willett, J. Barnard, and G. Downs. Chemical similarity searching. J. Chem. Inf. Comput. Sci., 38:983--996, 1998.
25

CITED BY  11
Collaborative Colleagues:
Xifeng Yan: colleagues
Philip S. Yu: colleagues
Jiawei Han: colleagues