ABSTRACT
We describe and experimentally evaluate a method for clustering words according to their distribution in particular syntactic contexts. Words are represented by the relative frequency distributions of the contexts in which they appear, and the relative entropy between those distributions is used as the similarity measure for clustering. Clusters are represented by average context distributions derived from the given words according to their probabilities of cluster membership. In many cases, the clusters can be thought of as encoding coarse sense distinctions. Deterministic annealing is used to find lowest-distortion sets of clusters: as the annealing parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data. Clusters are used as the basis for class models of word co-occurrence, and the models are evaluated with respect to held-out test data.
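The clustering step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, the fixed number of clusters, the exponential form of the membership rule, and the assumption of uniform word priors are all illustrative choices made here. Each word is a distribution over contexts, relative entropy to a cluster centroid serves as the distortion, memberships are soft, and `beta` plays the role of the annealing parameter.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D(p || q) in nats; eps guards against log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def memberships(word_dists, centroids, beta):
    """Soft assignment p(c | w), proportional to exp(-beta * D(p_w || centroid_c))."""
    result = []
    for p in word_dists:
        d = [kl_divergence(p, c) for c in centroids]
        d_min = min(d)  # subtract the minimum for numerical stability
        weights = [math.exp(-beta * (di - d_min)) for di in d]
        total = sum(weights)
        result.append([w / total for w in weights])
    return result

def update_centroids(word_dists, member, previous):
    """Each centroid becomes the membership-weighted average of the word
    distributions (uniform word priors assumed for this sketch)."""
    n_contexts = len(word_dists[0])
    centroids = []
    for c in range(len(previous)):
        mass = sum(m[c] for m in member)
        if mass == 0.0:  # no word supports this cluster: keep the old centroid
            centroids.append(list(previous[c]))
            continue
        centroids.append([
            sum(m[c] * p[k] for m, p in zip(member, word_dists)) / mass
            for k in range(n_contexts)
        ])
    return centroids

def soft_cluster(word_dists, centroids, beta, iterations=50):
    """Alternate membership and centroid updates at a fixed beta."""
    for _ in range(iterations):
        member = memberships(word_dists, centroids, beta)
        centroids = update_centroids(word_dists, member, centroids)
    return memberships(word_dists, centroids, beta), centroids

# Toy data (hypothetical): four "words", each a distribution over three contexts.
toy_words = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
# Two slightly perturbed starting centroids.
toy_start = [[0.4, 0.3, 0.3], [0.3, 0.3, 0.4]]

# A crude annealing schedule: reuse the centroids while beta increases.
centroids = toy_start
for beta in (0.5, 2.0, 5.0):
    member, centroids = soft_cluster(toy_words, centroids, beta)
```

In this sketch the memberships stay soft at small `beta` and sharpen as `beta` grows. A fuller treatment would also split clusters as they become unstable, which is what produces the hierarchy described in the abstract; that step is omitted here.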