ACM Home Page
Please provide us with feedback. Feedback
On scaling latent semantic indexing for large peer-to-peer systems
Full text PdfPdf (209 KB)
Source Annual ACM Conference on Research and Development in Information Retrieval archive
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval table of contents
Sheffield, United Kingdom
SESSION: Dimensionality reduction table of contents
Pages: 112 - 121  
Year of Publication: 2004
ISBN:1-58113-881-4
Authors
Chunqiang Tang  University of Rochester, Rochester, NY
Sandhya Dwarkadas  University of Rochester, Rochester, NY
Zhichen Xu  Yahoo! Inc., Sunnyvale, CA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 18,   Downloads (12 Months): 91,   Citation Count: 22
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1008992.1009014
What is a DOI?

ABSTRACT

The exponential growth of data demands scalable infrastructures capable of indexing and searching rich content such as text, music, and images. A promising direction is to combine information re-trieval with peer-to-peer technology for scalability, fault-tolerance, and low administration cost. One pioneering work along this di-rection is pSearch [32, 33]. pSearch places documents onto a peer-to- peer overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI). The search cost for a query is reduced since documents related to the query are likely to be co-located on a small number of nodes. Unfortunately, because of its reliance on LSI, pSearch also inherits the limitations of LSI. (1) When the corpus is large and heterogeneous, LSI's retrieval quality is inferior to methods such as Okapi. (2) The Singular Value Decomposition (SVD) used in LSI is unscalable in terms of both memory consumption and computation time.This paper addresses the above limitations of LSI and makes the following contributions. (1) To reduce the cost of SVD, we reduce the size of its input matrix through document clustering and term selection. Our method retains the retrieval quality of LSI but is several orders of magnitude more efficient. (2) Through extensive experimentation, we found that proper normalization of semantic vectors for terms and documents improves recall by 76%. (3) To further improve retrieval quality, we use low-dimensional subvectors of semantic vectors to cluster documents in the overlay and then use Okapi to guide the search and document selection.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
 
3
 
4
 
5
6
 
7
 
8
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.
 
9
 
10
S. Dumais. Using LSI for information filtering: TREC-3 experiments. In Third Text REtrieval Conference (TREC-3), 1995.
 
11
 
12
G. Golub and C. V. Loan. Matrix Computations. The Jason Hopkins University Press, Baltimore, Maryland, second edition edition, 1989.
13
 
14
P. Husbands, H. Simon, and C. Ding. the use of singular value decomposition for text retrieval. In M. Berry, editor, Proc. of SIAM Comp. Info. Retrieval Workshop, October 2000.
15
16
17
 
18
 
19
J. Li, B. T. Loo, J. Hellerstein, F. Kaashoek, D. R. Karger, and R. Morris. On the Feasibility of Peer-to-Peer Web Indexing and Search. In IPTPS'03, February 2003.
 
20
D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-peer computing. Technical Report HPL-2002-57, HP Lab, 2002.
21
22
 
23
H. Park, M. Jeon, and J. Rosen. Lower dimensional representation of text data based on centroids and least squares. BIT, 43(2):1--22, 2003.
 
24
C. D. Prete, J. T. McArthur, R. L. Villars, I. L. Nathan Redmond, and D. Reinsel. Industry developments and models, Disruptive Innovation in Enterprise Computing: storage. IDC, February 2003.
25
 
26
S. Rhea and J. Kubiatowicz. Probabilistic Location and Routing. In INFOCOM'02, 2002.
 
27
S. E. Robertson, S. Walker, S. Jones, M. M. HancockBeaulieu, and M. Gatford. Okapi at TREC-3. In TREC-3, 1994.
28
29
 
30
SVDPACK. http://www.netlib.org/svdpack.
 
31
C. Tang and S. Dwarkadas. Peer-to-Peer Information Retrieval in Distributed Hashtable Systems. In NSDI'04, 2004.
32
 
33
C. Tang, Z. Xu, and M. Mahalingam. pSearch: Information Retrieval in Structured Overlays. In The First Workshop on Hot Topics in Networks (HotNets I), 2002. Older but partially expanded version available as technical report HPL-2002-198, "PeerSearch: Efficient Information Retrieval in Peer- to-Peer Networks".
 
34
Text Retrieval Conference (TREC). http://trec.nist.gov.
 
35
36

CITED BY  22

Collaborative Colleagues:
Chunqiang Tang: colleagues
Sandhya Dwarkadas: colleagues
Zhichen Xu: colleagues