|
ABSTRACT
We present a generalization of frequent itemsets allowing for the notion of errors in the itemset definition. We motivate the problem and present an efficient algorithm that identifies error-tolerant frequent clusters of items in transactional data (customer-purchase data, web browsing data, text, etc.). The algorithm exploits sparseness of the underlying data to find large groups of items that are correlated over database records (rows). The notion of transaction coverage allows us to extend the algorithm and view it as a fast clustering algorithm for discovering segments of similar transactions in binary sparse data. We evaluate the new algorithm on three real-world applications: clustering high-dimensional data, query selectivity estimation and collaborative filtering. Results show that the algorithm consistently uncovers structure in large sparse databases that other traditional clustering algorithms fail to find.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
AGGR98
|
Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.94-105, June 01-04, 1998, Seattle, Washington, United States
|
 |
AIS93
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
AMSTV96
|
Rakesh Agrawal , Heikki Mannila , Ramakrishnan Srikant , Hannu Toivonen , A. Inkeri Verkamo, Fast discovery of association rules, Advances in knowledge discovery and data mining, American Association for Artificial Intelligence, Menlo Park, CA, 1996
|
| |
AS94
|
|
 |
B98
|
|
 |
BFG99
|
Kristin P. Bennett , Usama Fayyad , Dan Geiger, Density-based indexing for approximate nearest-neighbor queries, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.233-243, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312236]
|
| |
BFR98
|
P. S. Bradley, U. M. Fayyad and C. Reina. Scaling EM (Expectation-Maximization) Clustering to Large Databases. Technical Report MSR-TR-98-35, Microsoft Research, 1998.
|
| |
BHK98
|
J. Breese, D. Heckerman and C. Kadie. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Technical Report MSR-TR-98-12, Microsoft Research, 1998.
|
| |
CS96
|
|
| |
DLR77
|
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1973.
|
| |
DH73
|
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.
|
| |
FI93
|
U.M. Fayyad and K.B. Irani. "Multi-interval Discretization of Continuous-valued Attributes for Classification Learning." Proc. of the 13th Intl. Joint Conf. on Artificial Intelligence. HCAI-93: Chambery, France (1993).
|
 |
GGR99
|
Venkatesh Ganti , Johannes Gehrke , Raghu Ramakrishnan, CACTUS—clustering categorical data using summaries, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.73-83, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312201]
|
| |
GKP89
|
|
| |
GKR98
|
|
| |
GMS97
|
|
 |
GRS98
|
Sudipto Guha , Rajeev Rastogi , Kyuseok Shim, CURE: an efficient clustering algorithm for large databases, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.73-84, June 01-04, 1998, Seattle, Washington, United States
|
| |
GRS99
|
|
| |
MH98
|
M. Meila and D. Heckerman. An Experimental Comparison of Several Clustering and Initialization Methods. Technical Report MSR-TR-98-06, Microsoft Research, 1998.
|
| |
MRK97
|
B. Miller, J. Riedl and J. Konstan. Experiences with GroupLens: Making Usenet Useful Again. In Proc. USENIX 1997 Tech. Conf., pp. 219-231, Anaheim, CA, 1997.
|
| |
NH94
|
|
| |
RG99
|
R. Ramakrishnan and J. Gehrke. Principles of Database Management (2 ad Edition). 1999.
|
 |
R*97
|
Paul Resnick , Neophytos Iacovou , Mitesh Suchak , Peter Bergstrom , John Riedl, GroupLens: an open architecture for collaborative filtering of netnews, Proceedings of the 1994 ACM conference on Computer supported cooperative work, p.175-186, October 22-26, 1994, Chapel Hill, North Carolina, United States
[doi> 10.1145/192844.192905]
|
 |
RV97
|
|
 |
SFB99
|
Jayavel Shanmugasundaram , Usama Fayyad , P. S. Bradley, Compressed data cubes for OLAP aggregate query approximation on continuous dimensions, Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, p.223-232, August 15-18, 1999, San Diego, California, United States
[doi> 10.1145/312129.312231]
|
 |
SA96
|
|
| |
YFB00
|
C. Yang, U. Fayyad and P. S. Bradley. Efficient Discovery of Error-Tolerant Frequent Itemsets in High Dimensions. Technical Report MSR-TR-2000-20, Microsoft Research, 2000.
|
| |
ZPOL97
|
M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. New Algorithms for Fast Discovery of Association Rules. In Proc. of the Third lnt'l Conf. On Knowledge Discovery in Databases and Data Mining, pp. 283-286, 1997.
|
 |
ZRL96
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
| |
Z49
|
G.E. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley Press, Inc, 1949.
|
CITED BY 19
|
|
|
|
|
|
|
|
Michael Steinbach , Pang-Ning Tan , Hui Xiong , Vipin Kumar, Generalizing the notion of support, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
|
|
|
|
|
|
Xifeng Yan , Hong Cheng , Jiawei Han , Dong Xin, Summarizing itemset patterns: a profile-based approach, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
|
|
|
Cheng-Ru Lin , Chang-Hung Lee , Ming-Syan Chen , Philip S. Yu, Distributed data mining in a chain store database of short transactions, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Rohit Gupta , Gang Fang , Blayne Field , Michael Steinbach , Vipin Kumar, Quantitative evaluation of approximate frequent pattern mining algorithms, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|