Efficiently incorporating user feedback into information extraction and integration programs
Source
International Conference on Management of Data
Proceedings of the 35th SIGMOD international conference on Management of data
Providence, Rhode Island, USA
SESSION: Research session 3: information extraction
Pages 87-100
Year of Publication: 2009
ISBN:978-1-60558-551-2
Authors
Xiaoyong Chai, University of Wisconsin-Madison, Madison, WI, USA
Ba-Quy Vuong, University of Wisconsin-Madison, Madison, WI, USA
AnHai Doan, University of Wisconsin-Madison, Madison, WI, USA
Jeffrey F. Naughton, University of Wisconsin-Madison, Madison, WI, USA
ABSTRACT
Many applications increasingly employ information extraction and integration (IE/II) programs to infer structures from unstructured data. Automatic IE/II is inherently imprecise, so such programs often make mistakes and can benefit significantly from user feedback. Today, however, there is no good way to automatically provide and process such feedback. Users who find an IE/II mistake often must alert the developer team (e.g., via email or a Web form) and then wait for the team to manually examine the program internals to locate and fix the mistake, a slow, error-prone, and frustrating process. In this paper we propose a solution that lets users provide feedback directly and lets IE/II programs process that feedback automatically. In our solution a developer U uses hlog, a declarative IE/II language, to write an IE/II program P. Next, U writes declarative user feedback rules that specify which parts of P's data (e.g., input, intermediate, or output data) users can edit, and via which user interfaces. The augmented program P is then executed and enters a loop of waiting for and incorporating user feedback. Given user feedback F on a data portion of P, we show how to automatically propagate F to the rest of P and to seamlessly combine F with prior user feedback. We describe the syntax and semantics of hlog, a baseline execution strategy, and then various optimization techniques. Finally, we describe experiments with real-world data that demonstrate the promise of our solution.
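The execute-then-incorporate-feedback loop described in the abstract can be illustrated with a small sketch. This is not hlog and not the paper's execution strategy: it is a plain Python toy that assumes a two-step program (a regex extractor followed by a grouping step), keeps user edits on the intermediate data, and naively re-runs the downstream step after each edit. All names here (`extract_names`, `integrate`, `FeedbackProgram`, `DOCS`) and the sample documents are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Hypothetical input documents; d1 is chosen so the extractor makes a mistake.
DOCS: Dict[str, str] = {
    "d1": "Seminar Speaker is Alice Smith",
    "d2": "Alice Smith joins the lab",
    "d3": "Bob Jones gave a talk",
}


def extract_names(docs: Dict[str, str]) -> Dict[str, str]:
    """Imprecise extraction step: the first pair of capitalized words per document."""
    names = {}
    for doc_id, text in docs.items():
        match = re.search(r"[A-Z][a-z]+ [A-Z][a-z]+", text)
        if match:
            names[doc_id] = match.group(0)
    return names


def integrate(names: Dict[str, str]) -> Dict[str, List[str]]:
    """Stand-in integration step: group document ids by extracted name."""
    groups: Dict[str, List[str]] = {}
    for doc_id, name in sorted(names.items()):
        groups.setdefault(name, []).append(doc_id)
    return groups


class FeedbackProgram:
    """Holds user edits on the intermediate data and re-applies them on every run."""

    def __init__(self, docs: Dict[str, str]) -> None:
        self.docs = docs
        # doc_id -> corrected name, or None to delete the extracted tuple.
        self.feedback: Dict[str, Optional[str]] = {}

    def run(self) -> Dict[str, List[str]]:
        names = extract_names(self.docs)
        # Combine the extractor output with all feedback received so far.
        for doc_id, corrected in self.feedback.items():
            if corrected is None:
                names.pop(doc_id, None)
            else:
                names[doc_id] = corrected
        # Propagate the edited intermediate data to the downstream step.
        return integrate(names)

    def give_feedback(self, doc_id: str, corrected: Optional[str]) -> Dict[str, List[str]]:
        self.feedback[doc_id] = corrected
        return self.run()


program = FeedbackProgram(DOCS)
first = program.run()                                   # contains the mistake for d1
second = program.give_feedback("d1", "Alice Smith")     # user corrects d1
third = program.give_feedback("d3", None)               # user deletes d3; d1 fix persists
```

In this toy, the correction to `d1` survives the later edit to `d3` because feedback is stored and replayed on each run. Re-running everything is the simplest possible choice; the optimization techniques the abstract mentions are outside the scope of this sketch.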