Exploiting asymmetry in hierarchical topic extraction
Source: Conference on Information and Knowledge Management
Proceedings of the 15th ACM international conference on Information and knowledge management
Arlington, Virginia, USA
SESSION: Taxonomy and similarity
Pages: 475 - 482  
Year of Publication: 2006
ISBN:1-59593-433-2
Authors
Sreenivas Gollapudi  Microsoft Search Labs
Rina Panigrahy  Stanford University
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1183614.1183683

ABSTRACT

Topic or feature extraction is often an important step in document classification and text mining. Topics are a succinct representation of the content in a document collection and hence are very effective when used as content identifiers in peer-to-peer systems and other large-scale distributed content management systems. Effective topic extraction depends on the accuracy of term clustering, which often has to deal with problems such as synonymy and polysemy. Retrieval techniques based on spectral analysis, such as Latent Semantic Indexing (LSI), are often used to solve these problems effectively. Most spectral retrieval schemes produce term similarity measures that are symmetric and often not an accurate characterization of term relationships. Another drawback of LSI is its running time, which is polynomial in the dimensions of the m x n matrix A and can become prohibitively large for some IR applications. In this paper, we present efficient algorithms that use Locality-Sensitive Hashing (LSH) to extract topics from a document collection based on the asymmetric relationships between terms in the collection. The relationship is characterized by term co-occurrences and other higher-order similarity measures. Our LSH-based scheme can be viewed as a simple alternative to LSI. We show the efficacy of our algorithms via experiments on a set of large documents. An interesting feature of our algorithms is that they produce a natural hierarchical decomposition of the topic space instead of a flat clustering.
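
The abstract does not spell out the hashing scheme, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes min-hash as the LSH family, explicit random permutations of a tiny document universe, and a simple containment ratio as a stand-in for an asymmetric term relationship. The postings data and all function names are hypothetical.

```python
import random

def make_permutations(universe, k, seed=0):
    """Build k random permutations of the document ids (assumed small universe)."""
    rng = random.Random(seed)
    docs = sorted(universe)
    perms = []
    for _ in range(k):
        shuffled = docs[:]
        rng.shuffle(shuffled)
        # Map each document id to its rank under this permutation.
        perms.append({doc: rank for rank, doc in enumerate(shuffled)})
    return perms

def minhash_signature(doc_ids, perms):
    """Min-hash signature: the smallest rank of the term's documents per permutation."""
    return [min(perm[d] for d in doc_ids) for perm in perms]

def estimate_jaccard(sig_x, sig_y):
    """Fraction of agreeing signature positions; estimates (symmetric) Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_x, sig_y)) / len(sig_x)

def containment(docs_x, docs_y):
    """Asymmetric measure: share of x's documents that also contain y."""
    return len(docs_x & docs_y) / len(docs_x)

# Hypothetical postings: term -> ids of the documents it occurs in.
postings = {
    "jaguar": {1, 2, 3, 4},
    "animal": {1, 2, 3, 4, 5, 6, 7, 8},
    "engine": {9, 10},
}
universe = set().union(*postings.values())
perms = make_permutations(universe, k=200)
sigs = {term: minhash_signature(docs, perms) for term, docs in postings.items()}

# Symmetric view: one number for the pair, regardless of direction.
print(estimate_jaccard(sigs["jaguar"], sigs["animal"]))
# Asymmetric view: every "jaguar" document mentions "animal", but not the reverse.
print(containment(postings["jaguar"], postings["animal"]))
print(containment(postings["animal"], postings["jaguar"]))
```

In this toy data the narrower term is fully contained in the broader one while the reverse containment is only one half, which is the kind of directional signal a symmetric similarity score cannot express and which could plausibly place one term above another in a hierarchy.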



