|
ABSTRACT
Advanced database systems face a great challenge raised by the emergence of massive, complex structural data in bioinformatics, chem-informatics, and many other applications. The most fundamental support needed in these applications is the efficient search of complex structured data. Since exact matching is often too restrictive, similarity search of complex structures becomes a vital operation that must be supported efficiently.In this paper, we investigate the issues of substructure similarity search using indexed features in graph databases. By transforming the edge relaxation ratio of a query graph into the maximum allowed missing features, our structural filtering algorithm, called Grafil, can filter many graphs without performing pairwise similarity computations. It is further shown that using either too few or too many features can result in poor filtering performance. Thus the challenge is to design an effective feature set selection strategy for filtering. By examining the effect of different feature selection mechanisms, we develop a multi-filter composition strategy, where each filter uses a distinct and complementary subset of the features. We identify the criteria to form effective feature sets for filtering, and demonstrate that combining features with similar size and selectivity can improve the filtering and search performance significantly. Moreover, the concept presented in Grafil can be applied to searching approximate non-consecutive sequences, trees, and other complicated structures as well.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
H. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. Shindyalov, and P. Bourne. The protein data bank. Nucleic Acids Research, 28:235--242, 2000.
|
| |
3
|
|
| |
4
|
|
 |
5
|
|
| |
6
|
|
| |
7
|
L. Gravano, P. Ipeirotis, H. Jagadish, N. Koudas, S. Muthukrishnan, L. Pietarinen, and D. Srivastava. Using q-grams in a dbms for approximate string processing. Data Engineering Bulletin, 24:28--37, 2001.
|
| |
8
|
T. Hagadone. Molecular substructure similarity searching: efficient retrieval in two-dimensional structure databases. J. Chem. Inf. Comput. Sci., 32:515--521, 1992.
|
| |
9
|
L. Holder, D. Cook, and S. Djoko. Substructure discovery in the subdue system. In Proc. AAAI'94 Workshop on Knowledge Discovery in Databases (KDD'94), pages 169--180, 1994.
|
| |
10
|
K. Kailing, H. Kriegel, S. Schnauer, and T. Seidl. Efficient similarity search for hierarchical data in large databases. In Proc: 9th Int. Conf. on Extending Database Technology (EDBT'04), pages 676--693, 2004.
|
| |
11
|
M. Kanehisa and S. Goto. Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28:27--30, 2000.
|
| |
12
|
|
| |
13
|
|
 |
14
|
|
| |
15
|
|
| |
16
|
National Library of Medicine. http://chem.sis.nlm.nih.gov/chemidplus.
|
| |
17
|
|
| |
18
|
J. Raymond, E. Gardiner, and P. Willett. Rascal: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45:631--644, 2002.
|
 |
19
|
|
| |
20
|
S. Srinivasa and S. Kumar. A platform based on the multi-dimensional data model for analysis of bio-molecular structures. In Proc. 2003 Int. Conf. on Very Large Data Bases, pages 975--986, 2003.
|
| |
21
|
|
| |
22
|
J. Ullmann. Binary n-gram technique for automatic correction of substitution, deletion, insertion, and reversal errors in words. The Computer Journal, 20:141--147, 1977.
|
| |
23
|
|
| |
24
|
P. Willett, J. Barnard, and G. Downs. Chemical similarity searching. J. Chem. Inf. Comput. Sci., 38:983--996, 1998.
|
 |
25
|
|
CITED BY 11
|
|
Liping Wang , Qing Li , Na Li , Guozhu Dong , Yu Yang, Substructure similarity measurement in chinese recipes, Proceeding of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
|
|
|
|
|
|
|
|
|
Chen Chen , Xifeng Yan , Philip S. Yu , Jiawei Han , Dong-Qing Zhang , Xiaohui Gu, Towards graph containment search and indexing, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Ruoming Jin , Yang Xiang , Ning Ruan , David Fuhry, 3-HOP: a high-compression indexing scheme for reachability query, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|