ABSTRACT
Many modern natural language-processing applications utilize search engines to locate large numbers of Web documents or to compute statistics over the Web corpus. Yet Web search engines are designed and optimized for simple human queries; they are not well suited to support such applications. As a result, these applications are forced to issue millions of successive queries, resulting in unnecessary search engine load and in slow applications with limited scalability.

In response, this paper introduces the Bindings Engine (BE), which supports queries containing typed variables and string-processing functions. For example, in response to the query "powerful ‹noun›", BE will return all the nouns in its index that immediately follow the word "powerful", sorted by frequency. In response to the query "Cities such as ProperNoun(Head(‹NounPhrase›))", BE will return a list of proper nouns likely to be city names.

BE's novel neighborhood index enables it to do so with O(k) random disk seeks and O(k) serial disk reads, where k is the number of non-variable terms in its query. As a result, BE can yield several orders of magnitude speedup for large-scale language-processing applications. The main cost is a modest increase in space to store the index. We report on experiments validating these claims, and analyze how BE's space-time tradeoff scales with the size of its index and the number of variable types. Finally, we describe how a BE-based application extracts thousands of facts from the Web at interactive speeds in response to simple user queries.
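
To make the "powerful ‹noun›" example concrete, the following is a minimal in-memory sketch of the general idea of storing each term's typed neighbor alongside its posting, so that a query with one concrete term reads only that term's posting list. It is an illustration under stated assumptions, not BE's actual implementation: the toy corpus, the hand-assigned type tags, and the names `build_neighborhood_index` and `query_term_then_variable` are all hypothetical, and BE's real index lives on disk rather than in a Python dictionary.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged toy corpus: each document is a list of (token, type) pairs.
DOCS = [
    [("a", "det"), ("powerful", "adj"), ("engine", "noun")],
    [("powerful", "adj"), ("engine", "noun"), ("runs", "verb")],
    [("powerful", "adj"), ("magnet", "noun")],
    [("powerful", "adj"), ("and", "conj"), ("fast", "adj")],
]


def build_neighborhood_index(docs):
    """Map each term to postings that also carry the right-hand neighbor.

    Each posting is (doc_id, position, neighbor_type, neighbor_token).
    Storing the neighbor inline is the assumption that lets a query avoid
    looking up the posting lists of every candidate binding.
    """
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for pos, (token, _type) in enumerate(doc):
            if pos + 1 < len(doc):
                next_token, next_type = doc[pos + 1]
            else:
                next_token, next_type = None, None
            index[token].append((doc_id, pos, next_type, next_token))
    return index


def query_term_then_variable(index, term, var_type):
    """Answer '<term> <var_type>' by scanning only the posting list of `term`."""
    counts = Counter(
        next_token
        for _doc_id, _pos, next_type, next_token in index.get(term, [])
        if next_type == var_type
    )
    # most_common() returns bindings sorted by descending frequency.
    return counts.most_common()


INDEX = build_neighborhood_index(DOCS)
RESULT = query_term_then_variable(INDEX, "powerful", "noun")
print(RESULT)
```

In this sketch the query touches a single posting list, which mirrors the abstract's claim that the work grows with the number of non-variable terms k rather than with the number of candidate bindings. The price is the extra neighbor data stored with every posting, which corresponds to the space increase the abstract mentions.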