Simple BM25 extension to multiple weighted fields
Source: Conference on Information and Knowledge Management
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Washington, D.C., USA
SESSION: IR-1 (information retrieval): information retrieval models
Pages: 42-49
Year of Publication: 2004
ISBN:1-58113-874-1
Authors
Stephen Robertson  Microsoft Research, Cambridge, U.K.
Hugo Zaragoza  Microsoft Research, Cambridge, U.K.
Michael Taylor  Microsoft Research, Cambridge, U.K.
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1031171.1031181

ABSTRACT

This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g., title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection.
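
The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example document, and the value k1 = 1.2 are assumptions made for illustration, and both the idf factor and document-length normalisation are omitted so that only the effect of term-frequency saturation is visible.

```python
from collections import Counter

K1 = 1.2  # illustrative saturation parameter; an assumption, not taken from the paper


def saturate(tf, k1=K1):
    """BM25-style term-frequency saturation (length normalisation omitted)."""
    return tf * (k1 + 1) / (tf + k1)


def linear_combination_score(field_tfs, weights, k1=K1):
    """Baseline: saturate each field's tf separately, then sum the weighted scores."""
    return sum(weights[f] * saturate(tf, k1) for f, tf in field_tfs.items())


def weighted_tf_score(field_tfs, weights, k1=K1):
    """Alternative: combine the weighted tfs first, then saturate once."""
    combined_tf = sum(weights[f] * tf for f, tf in field_tfs.items())
    return saturate(combined_tf, k1)


def expand(doc, weights):
    """Map a structured document to an unstructured token list by repeating
    each field's tokens `weight` times (integer weights only)."""
    tokens = []
    for field, text in doc.items():
        tokens.extend(text.split() * weights[field])
    return tokens


# Hypothetical example: title weight of two, body weight of one.
weights = {"title": 2, "body": 1}
doc = {"title": "bm25 fields", "body": "bm25 ranking with bm25 and more bm25"}
term = "bm25"

# Per-field term frequencies, and the tf of the same term in the expanded document.
field_tfs = {f: Counter(text.split())[term] for f, text in doc.items()}
expanded_tf = Counter(expand(doc, weights))[term]

print(field_tfs, expanded_tf)
print(linear_combination_score(field_tfs, weights))
print(weighted_tf_score(field_tfs, weights))
```

Under these assumptions, the weighted term frequency equals the term's count in the document with the title repeated twice, and the resulting score stays below the saturation ceiling of k1 + 1. The linear combination applies the ceiling per field, so its single-term score can exceed k1 + 1, which is the sense in which it breaks the saturation behaviour.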


