|
|||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||
ABSTRACT
Indexing is an important Information Retrieval (IR) operation, which must be parallelised to support large-scale document corpora. We propose a novel adaptation of the state-of-the-art single-pass indexing algorithm in terms of the MapReduce programming model. We then experiment with this adaptation, in the context of the Hadoop MapReduce implementation. In particular, we explore the scale of improvements that can be achieved when using firstly more processing hardware and secondly larger corpora. Our results show that indexing speed increases in a close to linear fashion when scaling corpus size or number of processing machines. This suggests that the proposed indexing implementation is viable to support upcoming large-scale corpora. REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
INDEX TERMS
Primary Classification:
General Terms:
Collaborative Colleagues:
|
|||||||||||||||||||||||||||||||||||||