Optimizing complex extraction programs over evolving text data
Source
International Conference on Management of Data
Proceedings of the 35th SIGMOD international conference on Management of data
Providence, Rhode Island, USA
SESSION: Research session 9: data on the web
Pages 321-334
Year of Publication: 2009
ISBN: 978-1-60558-551-2
Authors
Fei Chen, University of Wisconsin-Madison, Madison, WI, USA
Byron J. Gao, Texas State University-San Marcos, San Marcos, TX, USA
AnHai Doan, University of Wisconsin-Madison, Madison, WI, USA
Jun Yang, Duke University, Durham, NC, USA
Raghu Ramakrishnan, Yahoo! Research, Santa Clara, CA, USA
ABSTRACT
Most information extraction (IE) approaches have considered only static text corpora, over which we apply IE only once. Many real-world text corpora, however, are dynamic. They evolve over time, so to keep extracted information up to date we must often apply IE repeatedly to consecutive corpus snapshots. Applying IE from scratch to each snapshot can take a lot of time. To avoid doing this, we have recently developed Cyclex, a system that recycles previous IE results to speed up IE over subsequent corpus snapshots. Cyclex clearly demonstrated the promise of the recycling idea. That work, however, is limited in that it considers only IE programs that contain a single IE "blackbox." In practice, many IE programs are far more complex, containing multiple IE blackboxes connected in a compositional "workflow." In this paper, we present Delex, a system that removes the above limitation. First, we identify many difficult challenges raised by Delex, including modeling complex IE programs for recycling purposes, implementing the recycling process efficiently, and searching for an optimal execution plan in a vast plan space with different recycling alternatives. Next, we describe our solutions to these challenges. Finally, we describe extensive experiments with both rule-based and learning-based IE programs over two real-world data sets, which demonstrate the utility of our approach.
CITED BY
Xiaoyong Chai, Ba-Quy Vuong, AnHai Doan, Jeffrey F. Naughton, Efficiently incorporating user feedback into information extraction and integration programs, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA