ABSTRACT
Manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. Search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. This paper introduces KnowItAll, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner. The paper describes preliminary experiments in which an instance of KnowItAll, running for four days on a single machine, was able to automatically extract 54,753 facts. KnowItAll associates a probability with each fact, enabling it to trade off precision and recall. The paper analyzes KnowItAll's architecture and reports on lessons learned for the design of large-scale information extraction systems.
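The precision/recall trade-off mentioned in the abstract can be illustrated with a minimal sketch. This is not KnowItAll's implementation: the facts, probabilities, correctness labels, and the function name `precision_recall` are all hypothetical, chosen only to show how raising or lowering a probability threshold shifts the balance between precision and recall.

```python
from typing import List, Tuple

# Hypothetical extracted facts: (fact, assigned probability, actually correct?).
# These values are illustrative assumptions, not KnowItAll output.
Fact = Tuple[str, float, bool]

FACTS: List[Fact] = [
    ("City(Paris)", 0.95, True),
    ("City(Seattle)", 0.90, True),
    ("City(Budapest)", 0.70, True),
    ("City(Monday)", 0.60, False),
    ("City(Madison)", 0.40, True),
    ("City(Home)", 0.20, False),
]


def precision_recall(facts: List[Fact], threshold: float) -> Tuple[float, float]:
    """Keep facts whose probability is at least `threshold`, then score them."""
    accepted = [f for f in facts if f[1] >= threshold]
    total_correct = sum(1 for f in facts if f[2])
    true_positives = sum(1 for f in accepted if f[2])
    # With nothing accepted, no error was made, so precision is taken as 1.0.
    precision = true_positives / len(accepted) if accepted else 1.0
    recall = true_positives / total_correct if total_correct else 0.0
    return precision, recall


if __name__ == "__main__":
    for threshold in (0.8, 0.5, 0.3):
        p, r = precision_recall(FACTS, threshold)
        print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data a high threshold keeps only the most confident facts (higher precision, lower recall), while a low threshold admits more facts, including some incorrect ones (higher recall, lower precision).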