ABSTRACT
This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g., title and body) independently and then combine these scores, typically linearly, to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a more intuitive alternative which weights term frequencies before the non-linear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Corpus Volume 1 and the TREC dotGov collection.
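The contrast between the two schemes can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the value of k1, the unit idf and the field weights are all assumptions chosen for illustration, and document length normalization is omitted to keep the saturation effect visible.

```python
import math

K1 = 1.2  # assumed value for the BM25 saturation parameter k1


def saturate(tf, k1=K1):
    """BM25 term-frequency saturation; length normalization is omitted here."""
    return tf * (k1 + 1) / (tf + k1)


def linear_combination_score(field_tfs, weights, idf=1.0):
    """Score each field independently, then combine the field scores linearly."""
    return sum(weights[f] * idf * saturate(tf) for f, tf in field_tfs.items())


def weighted_tf_score(field_tfs, weights, idf=1.0):
    """Weight the per-field term frequencies first, then saturate once."""
    combined_tf = sum(weights[f] * tf for f, tf in field_tfs.items())
    return idf * saturate(combined_tf)


# Hypothetical document: the query term occurs once in the title, once in the body.
weights = {"title": 2.0, "body": 1.0}
doc = {"title": 1, "body": 1}

# The same document with the title content repeated twice and merged into
# a single unstructured field.
flattened = {"text": 3}

# Weighting before saturation matches ranking the flattened document.
assert math.isclose(
    weighted_tf_score(doc, weights),
    weighted_tf_score(flattened, {"text": 1.0}),
)

# Combining per-field scores linearly can exceed the single-term ceiling
# idf * (k1 + 1); weighting term frequencies first cannot.
assert linear_combination_score(doc, weights) > K1 + 1
assert weighted_tf_score(doc, weights) < K1 + 1
```

Under these assumptions, the linear combination lets a single query term contribute more than the saturation ceiling, which is the behaviour the abstract describes as breaking the non-linear saturation of term frequency.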