ABSTRACT
Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval.
We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm.