ACM Home Page
Please provide us with feedback. Feedback
Extracting redundancy-aware top-k patterns
Full text PdfPdf (831 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Philadelphia, PA, USA
SESSION: Research track papers table of contents
Pages: 444 - 453  
Year of Publication: 2006
ISBN:1-59593-339-5
Authors
Dong Xin  University of Illinois at Urbana-Champaign, Urbana, IL
Hong Cheng  University of Illinois at Urbana-Champaign, Urbana, IL
Xifeng Yan  University of Illinois at Urbana-Champaign, Urbana, IL
Jiawei Han  University of Illinois at Urbana-Champaign, Urbana, IL
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 8,   Downloads (12 Months): 95,   Citation Count: 10
Additional Information:

abstract   references   cited by   index terms   review   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1150402.1150452
What is a DOI?

ABSTRACT

Observed in many applications, there is a potential need of extracting a small set of frequent patterns having not only high significance but also low redundancy. The significance is usually defined by the context of applications. Previous studies have been concentrating on how to compute top-k significant patterns or how to remove redundancy among patterns separately. There is limited work on finding those top-k patterns which demonstrate high-significance and low-redundancy simultaneously.In this paper, we study the problem of extracting redundancy-aware top-k patterns from a large collection of frequent patterns. We first examine the evaluation functions for measuring the combined significance of a pattern set and propose the MMS (Maximal Marginal Significance) as the problem formulation. The problem is known as NP-hard. We further present a greedy algorithm which approximates the optimal solution with performance bound O(log k) (with conditions on redundancy), where k is the number of reported patterns. The direct usage of redundancy-aware top-k patterns is illustrated through two real applications: disk block prefetch and document theme extraction. Our method can also be applied to processing redundancy-aware top-k queries in traditional database.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
S. Agrawal, S. Chaudhuri, G. Das, and A. Gionis. Automated Ranking of Database Query Results. Proc. of 2003 Biennial Conf. on Innovative Data Systems Research (CIDR'03), pages 1--12, 2003.
 
3
 
4
5
 
6
 
7
 
8
R. Hassin, S. Rubinstein, and A. Tamir. Approximation algorithms for maximum dispersion. Operations Research Let., 21:133--137, 1997.
 
9
 
10
 
11
12
 
13
 
14
15
 
16
T. Mielikäinen and H. Mannila. The pattern ordering problem. Proc. of 2003 European Conf. on Principles of Data Mining and Knowledge Discovery (PKDD'03), pages 327--338, 2003.
 
17
D. Mount. Bioinformatics: Sequence and genome analysis. Cold Spring Harbor Lab., 2001.
 
18
 
19
S. Ravi, D. Rosenkrantz, and G. Tayi. Heuristic and special case algorithms for dispersion problems. Operations Research, 42:299--310, 1994.
 
20
C. Ruemmler and J. Wilkes. Unix disk access patterns. Usenix Conf. (USENIX'93), pages 405--420, Winter, 1993.
21
 
22
 
23
A. Singhal. Modern information retrieval: A brief overview. Bull. IEEE CS Tech. Comm. Data Eng., 24(4):35--43, 2001.
24
 
25
26
 
27
X. Yan, J. Han and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. Proc. of the Third SIAM Int. Conf. on Data Mining, San Francisco (SDM'03), 2003.

CITED BY  10


REVIEW

"Apostolos N Papadopoulos : Reviewer"

To be considered useful, top-k discovered patterns should not only contain a high degree of significance, but also a low degree of redundancy. This is true in many application domains, such as Web search engines and bioinformatics. The state of th  more...

Collaborative Colleagues:
Dong Xin: colleagues
Hong Cheng: colleagues
Xifeng Yan: colleagues
Jiawei Han: colleagues