Optimizing complex extraction programs over evolving text data
Source
International Conference on Management of Data
Proceedings of the 35th SIGMOD International Conference on Management of Data
Providence, Rhode Island, USA
SESSION: Research session 9: data on the web
Pages 321-334  
Year of Publication: 2009
ISBN:978-1-60558-551-2
Authors
Fei Chen  University of Wisconsin-Madison, Madison, WI, USA
Byron J. Gao  Texas State University-San Marcos, San Marcos, TX, USA
AnHai Doan  University of Wisconsin-Madison, Madison, WI, USA
Jun Yang  Duke University, Durham, NC, USA
Raghu Ramakrishnan  Yahoo! Research, Santa Clara, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 43,   Downloads (12 Months): 194,   Citation Count: 1
DOI: http://doi.acm.org/10.1145/1559845.1559881

ABSTRACT

Most information extraction (IE) approaches have considered only static text corpora, over which we apply IE only once. Many real-world text corpora however are dynamic. They evolve over time, and so to keep extracted information up to date we often must apply IE repeatedly, to consecutive corpus snapshots. Applying IE from scratch to each snapshot can take a lot of time. To avoid doing this, we have recently developed Cyclex, a system that recycles previous IE results to speed up IE over subsequent corpus snapshots. Cyclex clearly demonstrated the promise of the recycling idea. The work itself however is limited in that it considers only IE programs that contain a single IE "blackbox." In practice, many IE programs are far more complex, containing multiple IE blackboxes connected in a compositional "workflow."

In this paper, we present Delex, a system that removes the above limitation. First we identify many difficult challenges raised by Delex, including modeling complex IE programs for recycling purposes, implementing the recycling process efficiently, and searching for an optimal execution plan in a vast plan space with different recycling alternatives. Next we describe our solutions to these challenges. Finally, we describe extensive experiments with both rule-based and learning-based IE programs over two real-world data sets, which demonstrate the utility of our approach.
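
The recycling idea described in the abstract can be illustrated with a deliberately simplified sketch. It is not the Cyclex or Delex algorithm: it assumes recycling at whole-page granularity, keyed by a content hash, and a single extractor treated as a blackbox. The names `RecyclingRunner` and `extract_years` and the sample snapshots are hypothetical, introduced only for illustration.

```python
import hashlib
import re
from typing import Callable, Dict, List, Tuple

# An IE "blackbox": takes page text, returns extracted mentions.
Extractor = Callable[[str], List[str]]


class RecyclingRunner:
    """Re-runs an extractor only on pages whose text changed since the
    previous snapshot (page-level recycling; an illustrative assumption)."""

    def __init__(self, extractor: Extractor) -> None:
        self.extractor = extractor
        # url -> (content digest, mentions) recorded for the previous snapshot
        self.cache: Dict[str, Tuple[str, List[str]]] = {}
        self.calls = 0  # how many times the extractor actually ran

    def run(self, snapshot: Dict[str, str]) -> Dict[str, List[str]]:
        results: Dict[str, List[str]] = {}
        new_cache: Dict[str, Tuple[str, List[str]]] = {}
        for url, text in snapshot.items():
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            cached = self.cache.get(url)
            if cached is not None and cached[0] == digest:
                mentions = cached[1]  # unchanged page: recycle old result
            else:
                mentions = self.extractor(text)  # new or changed page
                self.calls += 1
            new_cache[url] = (digest, mentions)
            results[url] = mentions
        self.cache = new_cache  # pages absent from this snapshot are dropped
        return results


def extract_years(text: str) -> List[str]:
    """Toy blackbox: extracts four-digit years starting with 19 or 20."""
    return re.findall(r"\b(?:19|20)\d{2}\b", text)


runner = RecyclingRunner(extract_years)
snapshot_1 = {"a": "SIGMOD 2009 in Providence", "b": "KDD 2006 tutorial"}
snapshot_2 = {"a": "SIGMOD 2009 in Providence", "b": "KDD 2006 and KDD 2003 tutorials"}
result_1 = runner.run(snapshot_1)
calls_after_first = runner.calls
result_2 = runner.run(snapshot_2)
calls_after_second = runner.calls
```

In this sketch only page "b" changes between the two snapshots, so the second run invokes the extractor once instead of twice. Extending this to programs with several blackboxes, where each could reuse results in a different way, is the kind of plan-space question the abstract attributes to Delex.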



