|
ABSTRACT
Topic or feature extraction is often used as an important step in document classification and text mining. Topics are succinct representation of content in a document collection and hence are very effective when used as content identifiers in peer-to-peer systems and other large scale distributed content management systems. Effective topic extraction is dependent on the accuracy of term clustering that often has to deal with problems like synonymy and polysemy. Retrieval techniques based on spectral analysis like Latent Semantic Indexing (LSI) are often used to effectively solve these problems. Most of the spectral retrieval schemes produce term similarity measures that are symmetric and often, not an accurate characterization of term relationships. Another drawback of LSI is its running time that is polynomial in the dimensions of the m x n matrix, A. This can get prohibitively large for some IR applications. In this paper, we present efficient algorithms using the technique of Locality-Sensitive Hashing (LSH) to extract topics from a document collection based on the asymmetric relationships between terms in a collection. The relationship is characterized by the term co-occurrences and other higher-order similarity measures. Our LSH based scheme can be viewed as a simple alternative to LSI. We show the efficacy of our algorithms via experiments on a set of large documents. An interesting feature of our algorithms is that it produces a natural hierarchical decomposition of the topic space instead of a flat clustering.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
|
| |
3
|
A. Budanitsky and G. Hirst. Semantic distance in wordnet: An experimental, application-oriented evaluation of five measures. In Proc. of the North American Chapter of the Association for Computational Linguistics (NAACL-2001), Pittsburgh, PA, June 2001., 2001.
|
 |
4
|
|
| |
5
|
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.
|
 |
6
|
|
 |
7
|
|
| |
8
|
|
| |
9
|
|
| |
10
|
G. W. Flake, R. E. Tarjan, and K. Tsioutsiouliklis. Graph clustering and minimum-cut trees. Internet Mathematics, 1(4):385--408, 2004.
|
| |
11
|
|
| |
12
|
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In WebDB (Informal Proceedings), pages 129--134, 2000.
|
| |
13
|
|
 |
14
|
Piotr Indyk , Rajeev Motwani , Prabhakar Raghavan , Santosh Vempala, Locality-preserving hashing in multidimensional spaces, Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, p.618-625, May 04-06, 1997, El Paso, Texas, United States
[doi> 10.1145/258533.258656]
|
 |
15
|
|
 |
16
|
|
 |
17
|
|
| |
18
|
A. Kontostathis and W. Pottenger. Detecting patterns in the LSIterm-term matrix. In Proc. Workshop on the Foundation of Data Mining and Discovery, IEEE International Conference on Data Mining (ICDM'02), 2002.
|
| |
19
|
T. K. Landauer and S. T. Dumais. A solution to plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104:211--240, 1997.
|
| |
20
|
|
| |
21
|
L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
|
| |
22
|
|
|