Unsupervised query segmentation using generative language models and Wikipedia
Source
International World Wide Web Conference
Proceeding of the 17th international conference on World Wide Web
Beijing, China
SESSION: Search: query analysis
Pages 347-356  
Year of Publication: 2008
ISBN:978-1-60558-085-2
Authors
Bin Tan  University of Illinois at Urbana-Champaign, Urbana, IL, USA
Fuchun Peng  Yahoo! Inc., Sunnyvale, CA, USA
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1367497.1367545
ABSTRACT

In this paper, we propose a novel unsupervised approach to query segmentation, an important task in Web search. We use a generative query model to recover the underlying concepts that compose a query's original segmented form. The model's parameters are estimated using an expectation-maximization (EM) algorithm, optimizing the minimum description length objective function on a partial corpus that is specific to the query. To augment this unsupervised learning, we incorporate evidence from Wikipedia.
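To make the idea of recovering concepts from a query concrete, here is a minimal sketch. It assumes a query is generated by drawing concepts independently, so the probability of a segmentation is the product of its concept probabilities, and it finds the most probable segmentation by dynamic programming. The function name `segment`, the `max_len` cap, and every probability in `TOY_PROBS` are hypothetical illustrations. The EM estimation, the minimum description length objective, and the Wikipedia evidence described in the abstract are not reproduced here.

```python
import math
from typing import Dict, List, Tuple


def segment(
    words: List[str], concept_prob: Dict[str, float], max_len: int = 4
) -> Tuple[List[str], float]:
    """Return the most probable segmentation and its log-probability.

    Assumes concepts are generated independently, so a segmentation's
    probability is the product of its concept probabilities.
    """
    n = len(words)
    # best[i] holds the best log-probability of segmenting words[:i];
    # back[i] holds the start index of the last concept in that segmentation.
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            concept = " ".join(words[start:end])
            p = concept_prob.get(concept)
            if p is None or best[start] == -math.inf:
                continue  # unknown concept or unreachable prefix
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("no segmentation covers the query with the given concepts")
    # Follow the back-pointers from the end of the query to recover segments.
    segments: List[str] = []
    i = n
    while i > 0:
        segments.append(" ".join(words[back[i]:i]))
        i = back[i]
    segments.reverse()
    return segments, best[n]


# Made-up concept probabilities, for illustration only.
TOY_PROBS = {
    "new": 0.02,
    "york": 0.005,
    "times": 0.01,
    "subscription": 0.004,
    "new york": 0.003,
    "york times": 0.0005,
    "new york times": 0.002,
}

segments, logp = segment("new york times subscription".split(), TOY_PROBS)
print(segments)  # ['new york times', 'subscription']
```

With these toy numbers the three-word concept wins because its single probability (0.002) is larger than the product of the probabilities of any way of splitting it.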

Experiments show that our approach dramatically improves performance over the traditional approach based on mutual information, and produces results comparable to those of a supervised method. In particular, the basic generative language model contributes a 7.4% improvement over the mutual-information-based method (measured by segment F1 on the Intersection test set). EM optimization further improves the performance by 14.3%. Additional knowledge from Wikipedia provides another improvement of 24.3%, adding up to a total improvement of 46% (from 0.530 to 0.774).
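The figures reported above can be cross-checked with a few lines of arithmetic. The sketch assumes that each percentage is a relative gain measured against the mutual-information baseline. That reading is an assumption, although it is the one under which the three steps sum to the stated total. The variable names are illustrative.

```python
# Segment F1 values and step gains as quoted in the abstract.
baseline_f1 = 0.530
final_f1 = 0.774
step_gains_pct = [7.4, 14.3, 24.3]  # language model, EM, Wikipedia

# Relative improvement of the final system over the baseline.
relative_gain_pct = (final_f1 - baseline_f1) / baseline_f1 * 100

# Sum of the three reported step gains.
total_step_gain_pct = sum(step_gains_pct)

print(f"relative gain: {relative_gain_pct:.1f}%")        # relative gain: 46.0%
print(f"sum of step gains: {total_step_gain_pct:.1f}%")  # sum of step gains: 46.0%
```

Both quantities come out at 46.0%, which matches the total stated in the abstract.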

