Efficient clustering of high-dimensional data sets with application to reference matching

Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Boston, Massachusetts, United States
Pages: 169-178
Year of Publication: 2000
ISBN: 1-58113-233-6

Authors
Andrew McCallum
  WhizBang! Labs - Research, 4616 Henry Street, Pittsburgh, PA and School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Kamal Nigam
  School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Lyle H. Ungar
  Computer and Info. Science, University of Pennsylvania, Philadelphia, PA

Bibliometrics
Downloads (6 Weeks): 30, Downloads (12 Months): 212, Citation Count: 82

REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
1. H. Akaike. On entropy maximization principle. Applications of Statistics, pages 27-41, 1977.
2. M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
3. P. S. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. 4th International Conf. on Knowledge Discovery and Data Mining (KDD-98). AAAI Press, August 1998.
4. I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183-1210, 1969.
5.
6. C. Lee Giles, Kurt D. Bollacker, Steve Lawrence, CiteSeer: an automatic citation indexing system, Proceedings of the third ACM conference on Digital libraries, p.89-98, June 23-26, 1998, Pittsburgh, Pennsylvania, United States. doi:10.1145/276675.276685
7.
8. H. Hirsh. Integrating multiple sources of information in text classification using WHIRL. In Snowbird Learning Conference, April 2000.
9. J. Hylton. Identifying and merging related bibliographic records. MIT LCS Masters Thesis, 1996.
10. B. Kilss and W. Alvey, editors. Record Linkage Techniques-1985, 1985. Statistics of Income Division, Internal Revenue Service Publication 1299-2-96. Available from http://www.fcsm.gov/.
11.
12. A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
13. A. Monge and C. Elkan. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, August 1996.
14. A. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 workshop on data mining and knowledge discovery, May 1997.
15.
16. H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954-959, 1959.
17. S. Omohundro. Five balltree construction algorithms. Technical report 89-063, International Computer Science Institute, Berkeley, California, 1989.
18. K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210-2239, 1998.
19.
20. M. Sankaran, S. Suresh, M. Wong, and D. Nesamoney. Method for incremental aggregation of dynamically increasing database data sets. U.S. Patent 5,794,246, 1998.
21. D. Sankoff and J. B. Kruskal. Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.
22. J. W. Tukey and J. O. Pedersen. Method and apparatus for information access employing overlapping clusters. U.S. Patent 5,787,422, 1998.
23. Tian Zhang, Raghu Ramakrishnan, Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada

CITED BY 83
Hui Han, Lee Giles, Hongyuan Zha, Cheng Li, Kostas Tsioutsiouliklis, Two supervised learning approaches for name disambiguation in author citations, Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries, June 07-11, 2004, Tucson, AZ, USA
Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas Griffiths, Probabilistic author-topic models for information discovery, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA
Moninder Singh, Jayant R. Kalagnanam, Sudhir Verma, Amit J. Shah, Swaroop K. Chalasani, Automated cleansing for spend analytics, Proceedings of the 14th ACM international conference on Information and knowledge management, October 31-November 05, 2005, Bremen, Germany
Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit Mitra, Comparative study of name disambiguation problem using a scalable blocking-based framework, Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries, June 07-11, 2005, Denver, CO, USA
Isaac G. Councill, Huajing Li, Ziming Zhuang, Sandip Debnath, Levent Bolelli, Wang Chien Lee, Anand Sivasubramaniam, C. Lee Giles, Learning metadata from the evidence in an on-line citation matching scheme, Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries, June 11-15, 2006, Chapel Hill, NC, USA
Su Yan, Dongwon Lee, Min-Yen Kan, Lee C. Giles, Adaptive sorted neighborhood methods for efficient record linkage, Proceedings of the 2007 conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada
Xiaoxun Zhang, Xueying Wang, Honglei Guo, Zhili Guo, Xian Wu, Zhong Su, Floatcascade learning for fast imbalanced web mining, Proceedings of the 17th international conference on World Wide Web, April 21-25, 2008, Beijing, China
In-Su Kang, Seung-Hoon Na, Seungwoo Lee, Hanmin Jung, Pyung Kim, Won-Kyung Sung, Jong-Hyeok Lee, On co-authorship for author disambiguation, Information Processing and Management: an International Journal, v.45 n.1, p.84-97, January, 2009
Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, Jennifer Widom, Swoosh: a generic approach to entity resolution, The VLDB Journal — The International Journal on Very Large Data Bases, v.18 n.1, p.255-276, January 2009
Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, Hector Garcia-Molina, Entity resolution with iterative blocking, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
Carina F. Dorneles, Marcos Freitas Nunes, Carlos A. Heuser, Viviane P. Moreira, Altigran S. da Silva, Edleno S. de Moura, A strategy for allowing meaningful and comparable scores in approximate matching, Information Systems, v.34 n.8, p.740-756, December, 2009