ABSTRACT
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude. We describe the WEBTABLES system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power can be derived by analyzing such a huge corpus? To address the first question, we develop new techniques for keyword search over a corpus of tables, and show that they can achieve substantially higher relevance than solutions based on a traditional search engine. To address the second, we introduce a new object derived from the database corpus: the attribute correlation statistics database (AcsDB), which records corpus-wide statistics on co-occurrences of schema elements. In addition to improving search relevance, the AcsDB makes possible several novel applications: schema auto-complete, which helps a database designer choose schema elements; attribute synonym finding, which automatically computes attribute synonym pairs for schema matching; and join-graph traversal, which allows a user to navigate between extracted schemas using automatically generated join links.
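To make the idea of corpus-wide co-occurrence statistics concrete, the following is a minimal sketch, not the WEBTABLES implementation. It assumes each extracted schema is simply a list of attribute names, and it ranks auto-complete suggestions by an estimated conditional probability p(candidate | given attribute); that scoring rule, the function names (`build_acsdb`, `cooccurrence`, `suggest`), and the toy corpus are all illustrative assumptions rather than details taken from the abstract.

```python
from collections import Counter
from itertools import combinations


def build_acsdb(schemas):
    """Count how often each attribute and each attribute pair occurs.

    `schemas` is an iterable of attribute-name lists (an assumed input shape).
    """
    attr_counts = Counter()
    pair_counts = Counter()
    for schema in schemas:
        attrs = sorted(set(schema))  # ignore duplicates within one schema
        attr_counts.update(attrs)
        # Each unordered pair is stored once, keyed in sorted order.
        pair_counts.update(combinations(attrs, 2))
    return attr_counts, pair_counts


def cooccurrence(pair_counts, a, b):
    """Number of schemas in which attributes a and b appear together."""
    return pair_counts[tuple(sorted((a, b)))]


def suggest(attr_counts, pair_counts, given, k=3):
    """Rank other attributes by the estimate p(candidate | given).

    This scoring rule is an assumption made for illustration.
    """
    if attr_counts[given] == 0:
        return []
    scores = {
        cand: cooccurrence(pair_counts, given, cand) / attr_counts[given]
        for cand in attr_counts
        if cand != given
    }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(cand, score) for cand, score in ranked[:k] if score > 0]


# Hypothetical toy corpus standing in for schemas extracted from web tables.
toy_corpus = [
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["make", "model", "year", "color"],
    ["name", "address", "phone"],
]
attr_counts, pair_counts = build_acsdb(toy_corpus)
suggestions = suggest(attr_counts, pair_counts, "make")
print(suggestions)
```

On this toy corpus, a designer who starts a schema with `make` would be offered `model` first, since the two attributes appear together in every schema that contains `make`. A real system at this scale would need a far more compact representation than in-memory counters.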