| Combinatorial pattern discovery for scientific data: some preliminary results |
| Source
|
International Conference on Management of Data
Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Minneapolis, Minnesota, United States
Pages: 115 - 125
Year of Publication: 1994
ISBN:0-89791-639-5
|
|
Authors
|
|
Jason Tsong-Li Wang
|
Computer and Information Science, New Jersey Institute of Technology, Newark, NJ
|
|
Gung-Wei Chirn
|
Computer and Information Science, New Jersey Institute of Technology, Newark, NJ
|
|
Thomas G. Marr
|
Cold Spring Harbor Laboratory, 100 Bungtown Road, Cold Spring Harbor, NY
|
|
Bruce Shapiro
|
Image Processing Section, Laboratory of Mathematical Biology, Division of Cancer Biology and Diagnosis, National Cancer Institute, National Institutes of Health, Frederick, MD
|
|
Dennis Shasha
|
Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY
|
|
Kaizhong Zhang
|
Department of Computer Science, The University of Western Ontario, London, Ontario, Canada N6A 5B7
|
|
| Bibliometrics |
Downloads (6 Weeks): 9, Downloads (12 Months): 54, Citation Count: 41
|
|
|
ABSTRACT
Suppose you are given a set of natural entities (e.g., proteins, organisms, or weather patterns) that possess some important common externally observable properties. You also have a structural description of the entities (e.g., sequence, topological, or geometrical data) and a distance metric. Combinatorial pattern discovery is the activity of finding patterns in the structural data that might explain these common properties based on the metric.

This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases. The structural representations we consider are strings, and the distance metric is string edit distance permitting variable length don't cares. Our techniques incorporate string matching algorithms and novel heuristics for discovery and optimization, most of which generalize to other combinatorial structures. Experimental results of applying the techniques to both generated data and functionally related protein families obtained from the Cold Spring Harbor Laboratory show the effectiveness of the proposed techniques. When we apply the discovered patterns to perform protein classification, they give information that is complementary to the best protein classifier available today.
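To make the distance metric concrete, the following is a minimal sketch of string edit distance with variable length don't cares. It is an illustration under stated assumptions, not the authors' implementation: the `*` wildcard symbol, the unit costs for insertion, deletion, and substitution, and the names `vldc_distance` and `occurrence_count` are all hypothetical choices made for this example. A don't care is assumed to absorb zero or more letters of the string at no cost.

```python
def vldc_distance(pattern: str, text: str, wildcard: str = "*") -> int:
    """Edit distance between a pattern and a text, where each wildcard in the
    pattern may absorb zero or more text characters at no cost.

    Assumption for this sketch: unit cost for insert, delete, and substitute.
    """
    rows, cols = len(pattern) + 1, len(text) + 1
    # dist[i][j] = cheapest way to match pattern[:i] against text[:j]
    dist = [[0] * cols for _ in range(rows)]

    # An empty pattern can only match text[:j] by inserting j characters.
    for j in range(1, cols):
        dist[0][j] = j

    for i in range(1, rows):
        is_wild = pattern[i - 1] == wildcard
        # Against empty text, a wildcard costs nothing; a letter must be deleted.
        dist[i][0] = dist[i - 1][0] + (0 if is_wild else 1)
        for j in range(1, cols):
            if is_wild:
                # Wildcard matches nothing, or absorbs one more text character.
                dist[i][j] = min(dist[i - 1][j], dist[i][j - 1])
            else:
                substitution = dist[i - 1][j - 1] + (pattern[i - 1] != text[j - 1])
                dist[i][j] = min(
                    dist[i - 1][j] + 1,  # delete the pattern letter
                    dist[i][j - 1] + 1,  # insert the text letter
                    substitution,        # match or substitute
                )
    return dist[-1][-1]


def occurrence_count(pattern: str, sequences: list[str], max_distance: int) -> int:
    """Hypothetical helper: how many sequences lie within max_distance of the pattern."""
    return sum(vldc_distance(pattern, seq) <= max_distance for seq in sequences)
```

Under these assumptions, a pattern such as `*AB*` matches any sequence containing `AB` at distance 0, and counting the sequences of a family that fall within a chosen distance gives one simple way to score a candidate pattern.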