|
ABSTRACT
The exponential growth of data demands scalable infrastructures capable of indexing and searching rich content such as text, music, and images. A promising direction is to combine information re-trieval with peer-to-peer technology for scalability, fault-tolerance, and low administration cost. One pioneering work along this di-rection is pSearch [32, 33]. pSearch places documents onto a peer-to- peer overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI). The search cost for a query is reduced since documents related to the query are likely to be co-located on a small number of nodes. Unfortunately, because of its reliance on LSI, pSearch also inherits the limitations of LSI. (1) When the corpus is large and heterogeneous, LSI's retrieval quality is inferior to methods such as Okapi. (2) The Singular Value Decomposition (SVD) used in LSI is unscalable in terms of both memory consumption and computation time.This paper addresses the above limitations of LSI and makes the following contributions. (1) To reduce the cost of SVD, we reduce the size of its input matrix through document clustering and term selection. Our method retains the retrieval quality of LSI but is several orders of magnitude more efficient. (2) Through extensive experimentation, we found that proper normalization of semantic vectors for terms and documents improves recall by 76%. (3) To further improve retrieval quality, we use low-dimensional subvectors of semantic vectors to cluster documents in the overlay and then use Okapi to guide the search and document selection.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
|
| |
3
|
|
| |
4
|
|
| |
5
|
|
 |
6
|
|
| |
7
|
|
| |
8
|
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391--407, 1990.
|
| |
9
|
|
| |
10
|
S. Dumais. Using LSI for information filtering: TREC-3 experiments. In Third Text REtrieval Conference (TREC-3), 1995.
|
| |
11
|
|
| |
12
|
G. Golub and C. V. Loan. Matrix Computations. The Jason Hopkins University Press, Baltimore, Maryland, second edition edition, 1989.
|
 |
13
|
|
| |
14
|
P. Husbands, H. Simon, and C. Ding. the use of singular value decomposition for text retrieval. In M. Berry, editor, Proc. of SIAM Comp. Info. Retrieval Workshop, October 2000.
|
 |
15
|
|
 |
16
|
|
 |
17
|
Leah S. Larkey , Margaret E. Connell , Jamie Callan, Collection selection and results merging with topically organized U.S. patents and TREC data, Proceedings of the ninth international conference on Information and knowledge management, p.282-289, November 06-11, 2000, McLean, Virginia, United States
[doi> 10.1145/354756.354830]
|
| |
18
|
|
| |
19
|
J. Li, B. T. Loo, J. Hellerstein, F. Kaashoek, D. R. Karger, and R. Morris. On the Feasibility of Peer-to-Peer Web Indexing and Search. In IPTPS'03, February 2003.
|
| |
20
|
D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-peer computing. Technical Report HPL-2002-57, HP Lab, 2002.
|
 |
21
|
|
 |
22
|
Christos H. Papadimitriou , Hisao Tamaki , Prabhakar Raghavan , Santosh Vempala, Latent semantic indexing: a probabilistic analysis, Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, p.159-168, June 01-04, 1998, Seattle, Washington, United States
[doi> 10.1145/275487.275505]
|
| |
23
|
H. Park, M. Jeon, and J. Rosen. Lower dimensional representation of text data based on centroids and least squares. BIT, 43(2):1--22, 2003.
|
| |
24
|
C. D. Prete, J. T. McArthur, R. L. Villars, I. L. Nathan Redmond, and D. Reinsel. Industry developments and models, Disruptive Innovation in Enterprise Computing: storage. IDC, February 2003.
|
 |
25
|
Sylvia Ratnasamy , Paul Francis , Mark Handley , Richard Karp , Scott Schenker, A scalable content-addressable network, Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, p.161-172, August 2001, San Diego, California, United States
|
| |
26
|
S. Rhea and J. Kubiatowicz. Probabilistic Location and Routing. In INFOCOM'02, 2002.
|
| |
27
|
S. E. Robertson, S. Walker, S. Jones, M. M. HancockBeaulieu, and M. Gatford. Okapi at TREC-3. In TREC-3, 1994.
|
 |
28
|
|
 |
29
|
|
| |
30
|
SVDPACK. http://www.netlib.org/svdpack.
|
| |
31
|
C. Tang and S. Dwarkadas. Peer-to-Peer Information Retrieval in Distributed Hashtable Systems. In NSDI'04, 2004.
|
 |
32
|
Chunqiang Tang , Zhichen Xu , Sandhya Dwarkadas, Peer-to-peer information retrieval using self-organizing semantic overlay networks, Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, August 25-29, 2003, Karlsruhe, Germany
[doi> 10.1145/863955.863976]
|
| |
33
|
C. Tang, Z. Xu, and M. Mahalingam. pSearch: Information Retrieval in Structured Overlays. In The First Workshop on Hot Topics in Networks (HotNets I), 2002. Older but partially expanded version available as technical report HPL-2002-198, "PeerSearch: Efficient Information Retrieval in Peer- to-Peer Networks".
|
| |
34
|
Text Retrieval Conference (TREC). http://trec.nist.gov.
|
| |
35
|
|
 |
36
|
|
CITED BY 22
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Gleb Skobeltsyn , Toan Luu , Ivana Podnar Zarko , Martin Rajman , Karl Aberer, Web text retrieval with a P2P query-driven index, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
|
|
|
|
|
|
Igor Mekterovic , Krešimir Križanovic , Mirta Baranovic, Content-based search using self-organizing peer-to-peer network, Proceedings of the 7th WSEAS International Conference on Software Engineering, Parallel and Distributed Systems, p.50-55, February 20-22, 2008, Cambridge, UK
|
|
|
Gleb Skobeltsyn , Toan Luu , Ivana Podnar arko , Martin Rajman , Karl Aberer, Query-driven indexing for scalable peer-to-peer text retrieval, Future Generation Computer Systems, v.25 n.1, p.89-99, January, 2009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|