ABSTRACT
Interesting structured or semistructured data often resides not in database systems but in HTML pages, text files, or on paper. Data in these formats is not usable by standard query processing engines, so users need a way either to extract data from these sources into a DBMS or to write wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that have been developed by the author. The prototype, which is written in Java, is described, and experiences parsing a variety of documents are reported.