| Extracting redundancy-aware top-k patterns |
| Full text |
Pdf
(831 KB)
|
| Source
|
International Conference on Knowledge Discovery and Data Mining
archive
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
table of contents
Philadelphia, PA, USA
SESSION: Research track papers
table of contents
Pages: 444 - 453
Year of Publication: 2006
ISBN:1-59593-339-5
|
|
Authors
|
|
Dong Xin
|
University of Illinois at Urbana-Champaign, Urbana, IL
|
|
Hong Cheng
|
University of Illinois at Urbana-Champaign, Urbana, IL
|
|
Xifeng Yan
|
University of Illinois at Urbana-Champaign, Urbana, IL
|
|
Jiawei Han
|
University of Illinois at Urbana-Champaign, Urbana, IL
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 8, Downloads (12 Months): 95, Citation Count: 10
|
|
|
ABSTRACT
Observed in many applications, there is a potential need of extracting a small set of frequent patterns having not only high significance but also low redundancy. The significance is usually defined by the context of applications. Previous studies have been concentrating on how to compute top-k significant patterns or how to remove redundancy among patterns separately. There is limited work on finding those top-k patterns which demonstrate high-significance and low-redundancy simultaneously.In this paper, we study the problem of extracting redundancy-aware top-k patterns from a large collection of frequent patterns. We first examine the evaluation functions for measuring the combined significance of a pattern set and propose the MMS (Maximal Marginal Significance) as the problem formulation. The problem is known as NP-hard. We further present a greedy algorithm which approximates the optimal solution with performance bound O(log k) (with conditions on redundancy), where k is the number of reported patterns. The direct usage of redundancy-aware top-k patterns is illustrated through two real applications: disk block prefetch and document theme extraction. Our method can also be applied to processing redundancy-aware top-k queries in traditional database.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
S. Agrawal, S. Chaudhuri, G. Das, and A. Gionis. Automated Ranking of Database Query Results. Proc. of 2003 Biennial Conf. on Innovative Data Systems Research (CIDR'03), pages 1--12, 2003.
|
| |
3
|
|
| |
4
|
|
 |
5
|
|
| |
6
|
|
| |
7
|
|
| |
8
|
R. Hassin, S. Rubinstein, and A. Tamir. Approximation algorithms for maximum dispersion. Operations Research Let., 21:133--137, 1997.
|
| |
9
|
Magnús M. Halldórsson , Kazuo Iwano , Naoki Katoh , Takeshi Tokuyama, Finding subsets maximizing minimum structures, Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms, p.150-159, January 22-24, 1995, San Francisco, California, United States
|
| |
10
|
|
| |
11
|
|
 |
12
|
|
| |
13
|
|
| |
14
|
|
 |
15
|
|
| |
16
|
T. Mielikäinen and H. Mannila. The pattern ordering problem. Proc. of 2003 European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD'03), pages 327--338, 2003.
|
| |
17
|
D. Mount. Bioinformatics: Sequence and genome analysis. Cold Spring Harbor Lab., 2001.
|
| |
18
|
|
| |
19
|
S. Ravi, D. Rosenkrantz, and G. Tayi. Heuristic and special case algorithms for dispersion problems. Operations Research, 42:299--310, 1994.
|
| |
20
|
C. Ruemmler and J. Wilkes. Unix disk access patterns. Usenix Conf. (USENIX'93), pages 405--420, Winter, 1993.
|
 |
21
|
|
| |
22
|
|
| |
23
|
A. Singhal. Modern information retrieval: A brief overview. Bull. IEEE CS Tech. Comm. Data Eng., 24(4):35--43, 2001.
|
 |
24
|
|
| |
25
|
|
 |
26
|
Xifeng Yan , Hong Cheng , Jiawei Han , Dong Xin, Summarizing itemset patterns: a profile-based approach, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
[doi> 10.1145/1081870.1081907]
|
| |
27
|
X. Yan, J. Han and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. Proc. of the Third SIAM Int. Conf. on Data Mining, San Francisco (SDM'03), 2003.
|
CITED BY 10
|
|
|
|
|
|
|
|
|
|
|
Ming Hua , Jian Pei , Ada W. C. Fu , Xuemin Lin , Ho-Fung Leung, Efficiently answering top-k typicality queries on large databases, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
|
|
|
|
|
Jinfeng Zhuang , Steven C.H. Hoi , Aixin Sun , Rong Jin, Representative entry selection for profiling blogs, Proceeding of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
|
|
|
Ruoming Jin , Muad Abu-Ata , Yang Xiang , Ning Ruan, Effective and efficient itemset pattern summarization: regression-based approaches, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|
|
|
|
|
|
REVIEW
"Apostolos N Papadopoulos : Reviewer"
To be considered useful, top-k discovered patterns should not only contain a high degree of significance, but also a low degree of redundancy. This is true in many application domains, such as Web search engines and bioinformatics. The state of th
more...
|