Unsupervised Query Segmentation Using Generative Language Models and Wikipedia
Source: Proceedings of the 17th International Conference on World Wide Web (International World Wide Web Conference), Beijing, China
Session: Search: query analysis
Pages: 347-356
Year of Publication: 2008
ISBN: 978-1-60558-085-2
Authors
Bin Tan, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Fuchun Peng, Yahoo! Inc., Sunnyvale, CA, USA
ABSTRACT
In this paper, we propose a novel unsupervised approach to query segmentation, an important task in Web search. We use a generative query model to recover a query's underlying concepts, which compose its original segmented form. The model's parameters are estimated using an expectation-maximization (EM) algorithm, optimizing the minimum description length objective function on a partial corpus that is specific to the query. To augment this unsupervised learning, we incorporate evidence from Wikipedia. Experiments show that our approach dramatically improves performance over the traditional approach based on mutual information and produces results comparable to those of a supervised method. In particular, the basic generative language model contributes a 7.4% improvement over the mutual-information-based method (measured by segment F1 on the Intersection test set). EM optimization further improves the performance by 14.3%. Additional knowledge from Wikipedia provides another improvement of 24.3%, adding up to a total improvement of 46% (from 0.530 to 0.774).
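To make the idea of a generative query model concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a query is generated as a sequence of independently drawn concepts and uses dynamic programming to find the segmentation with the highest total log-probability. The table `CONCEPT_PROB`, the constant `UNSEEN_PROB`, the function `segment`, and its `max_len` parameter are all hypothetical names with made-up values; the EM parameter estimation and the Wikipedia evidence described in the abstract are omitted.

```python
import math

# Hypothetical concept probabilities, for illustration only.
# In the paper these parameters would be estimated (via EM); here they are made up.
CONCEPT_PROB = {
    "new york times": 0.008,
    "new york": 0.02,
    "new": 0.01,
    "york": 0.001,
    "times": 0.005,
    "subscription": 0.003,
}
UNSEEN_PROB = 1e-9  # assumed floor probability for concepts not in the table


def segment(words, max_len=3):
    """Return the most probable segmentation of `words` into concepts.

    Assumes concepts are drawn independently, so the probability of a
    segmentation is the product of its concept probabilities.
    """
    n = len(words)
    # best[i] = (log-probability, segments) of the best split of words[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        # Try every candidate last segment words[start:end] of length <= max_len.
        for start in range(max(0, end - max_len), end):
            concept = " ".join(words[start:end])
            prob = CONCEPT_PROB.get(concept, UNSEEN_PROB)
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [concept])
    return best[n][1]


print(segment("new york times subscription".split()))
```

With the made-up probabilities above, the query "new york times subscription" is split into "new york times" and "subscription", because that split has a higher product of concept probabilities than any alternative, such as "new york" followed by "times" and "subscription".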