Learning to extract information from large domain-specific websites using sequential models
Source: ACM SIGKDD Explorations Newsletter
Volume 6, Issue 2 (December 2004)
Pages: 61-66
Year of Publication: 2004
ISSN: 1931-0145
Authors: Sunita Sarawagi, V. G. Vinod Vydiswaran
Publisher: ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1046456.1046464

ABSTRACT

In this article we describe a novel information extraction task on the web and show how it can be solved effectively using emerging conditional exponential models. The task involves learning to find specific goal pages on large domain-specific websites; for example, finding computer science publications starting from university root pages. We encode this as a sequential labeling problem and solve it using Conditional Random Fields (CRFs). These models let us exploit, in a unified probabilistic framework, a wide variety of features: keywords and patterns extracted from and around hyperlinks and HTML pages, dependencies among the labels of adjacent pages, and existing databases of named entities. This is an important advantage over previous rule-based or generative models in tackling the diversity of web data.
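
The sequential labeling view in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the label set, the keywords, and all weights below are hypothetical values chosen by hand, whereas a trained CRF would learn its weights from labeled paths. The sketch treats a path of pages from a root page to a goal page as a sequence, scores each page with keyword features and each pair of adjacent pages with a transition feature, and picks the best label sequence with Viterbi decoding, the standard decoder for linear-chain models.

```python
# Hypothetical labels for pages on a path from a university root to a goal page.
LABELS = ["root", "department", "faculty", "publications"]

# Illustrative, hand-set weights (assumptions; a real CRF learns these).
# State features: (keyword seen on or around the link, label) -> weight
STATE_W = {
    ("university", "root"): 2.0,
    ("department", "department"): 2.0,
    ("computer", "department"): 1.0,
    ("faculty", "faculty"): 2.0,
    ("people", "faculty"): 1.5,
    ("publications", "publications"): 2.5,
    ("papers", "publications"): 2.0,
}
# Transition features: (label of previous page, label of this page) -> weight
TRANS_W = {
    ("root", "department"): 1.0,
    ("department", "faculty"): 1.0,
    ("faculty", "publications"): 1.0,
    ("department", "publications"): 0.5,
}


def state_score(tokens, label):
    """Sum the weights of all keyword features that fire for this label."""
    return sum(STATE_W.get((t, label), 0.0) for t in tokens)


def viterbi(path):
    """Return the highest-scoring label sequence for a non-empty path.

    path is a list of token lists, one per page along the path.
    """
    # best[i][y]: best score of any labeling of pages 0..i that ends in y
    best = [{y: state_score(path[0], y) for y in LABELS}]
    back = []  # back[i][y]: best previous label when page i+1 is labeled y
    for tokens in path[1:]:
        scores, ptr = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANS_W.get((p, y), 0.0))
            scores[y] = best[-1][prev] + TRANS_W.get((prev, y), 0.0) + state_score(tokens, y)
            ptr[y] = prev
        best.append(scores)
        back.append(ptr)
    # Follow back-pointers from the best final label.
    labels = [max(LABELS, key=lambda y: best[-1][y])]
    for ptr in reversed(back):
        labels.append(ptr[labels[-1]])
    return labels[::-1]


example_path = [
    ["university", "home"],
    ["computer", "science", "department"],
    ["faculty", "people"],
    ["publications"],
]
decoded = viterbi(example_path)
```

Because the score combines page-level evidence with label dependencies between adjacent pages, a page with weak keywords can still be labeled correctly when its neighbors are labeled confidently. That combination is the property the abstract attributes to CRFs.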


Collaborative Colleagues:
Sunita Sarawagi: colleagues
V. G. Vinod Vydiswaran: colleagues