Clustering web documents with tables for information extraction
Source
International Conference On Knowledge Capture
Proceedings of the 4th international conference on Knowledge capture
Whistler, BC, Canada
POSTER SESSION: Posters
Pages: 169-170
Year of Publication: 2007
ISBN: 978-1-59593-643-1
Authors
Kostyantyn Shchekotykhin  University of Klagenfurt, Klagenfurt, Austria
Dietmar Jannach  University of Klagenfurt, Klagenfurt, Austria
Gerhard Friedrich  University of Klagenfurt, Klagenfurt, Austria
Sponsors
SIGART: ACM Special Interest Group on Artificial Intelligence
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1298406.1298438

ABSTRACT

One common approach to extracting high-quality knowledge from Web sources is to exploit the redundancy of published information. A Web mining system therefore has to not only search for relevant Web pages but also determine whether two pages describe the same entity, so that it can extract as much knowledge as possible about that entity. Statistical clustering techniques have been shown to be generally suitable for this task, because they group documents that are likely to contain similar information. However, when data is given in tabular form, which is a typical way of describing items in online shops, existing document clustering algorithms show limited performance: documents containing tabular descriptions typically share a large common set of tokens even though they describe different entities. In this paper we therefore propose a new document clustering approach that exploits hyperlinks and document metadata to extract candidates for entity names. These candidate names are then used to cluster the documents and are further improved; the improved names are finally used to determine whether two documents describe the same entity. A detailed evaluation of our approach in two popular example domains showed high accuracy in terms of precision and recall (F-Measure > 0.9).
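
The idea of clustering on candidate entity names rather than on full page content can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the document fields (`url`, `title`, `anchors`), the token-level Jaccard similarity, the greedy single-pass clustering, and the threshold of 0.5 are all assumptions chosen for illustration, and the name-improvement step mentioned in the abstract is omitted.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a candidate name."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-level Jaccard similarity (an assumed stand-in metric)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidate_names(doc):
    """Candidate entity names: the page title (document metadata)
    plus the anchor texts of hyperlinks pointing to the page."""
    return [doc["title"]] + list(doc.get("anchors", []))


def name_similarity(doc_a, doc_b):
    """Best similarity over all pairs of candidate names."""
    return max(
        jaccard(x, y)
        for x in candidate_names(doc_a)
        for y in candidate_names(doc_b)
    )


def cluster_by_name(docs, threshold=0.5):
    """Greedy clustering: a document joins the first cluster that
    contains a member with a sufficiently similar candidate name,
    otherwise it starts a new cluster."""
    clusters = []
    for doc in docs:
        for cluster in clusters:
            if any(name_similarity(doc, other) >= threshold for other in cluster):
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters


# Hypothetical pages from two online shops; the specification tables
# themselves are ignored, only names derived from titles and links count.
docs = [
    {"url": "a", "title": "Canon PowerShot A550 specifications",
     "anchors": ["Canon PowerShot A550"]},
    {"url": "b", "title": "PowerShot A550 by Canon - shop",
     "anchors": ["Canon PowerShot A550 review"]},
    {"url": "c", "title": "Nikon Coolpix L11 specifications",
     "anchors": ["Nikon Coolpix L11"]},
]
clusters = cluster_by_name(docs)
```

Because the comparison uses only names drawn from titles and anchor texts, the shared vocabulary of the tabular descriptions (attribute labels, units, and so on) cannot pull pages about different products into the same cluster.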


