NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents
Source: International Conference on Management of Data
Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data
Seattle, Washington, United States
Pages: 283-294
Year of Publication: 1998
ISBN: 0-89791-995-5
Author
Brad Adelberg, Northwestern University, Computer Science Department
Sponsors
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGART: ACM Special Interest Group on Artificial Intelligence
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/276304.276330

ABSTRACT

Often interesting structured or semistructured data is not in database systems but in HTML pages, text files, or on paper. The data in these formats is not usable by standard query processing engines and hence users need a way of extracting data from these sources into a DBMS or of writing wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, which is an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that have been developed by the author. The prototype, which is written in Java, is described and experiences parsing a variety of documents are reported.


