Towards combining web classification and web information extraction: a case study
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Industrial track papers
Pages 1235-1244
Year of Publication: 2009
ISBN: 978-1-60558-495-9
Authors
Ping Luo, HP Labs China, Beijing, China
Fen Lin, Institute of Computing Technology, CAS, Beijing, China
Yuhong Xiong, HP Labs China, Beijing, China
Yong Zhao, HP Labs China, Beijing, China
Zhongzhi Shi, Institute of Computing Technology, CAS, Beijing, China
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1557019.1557152

ABSTRACT

Web content analysis often has two sequential and separate steps: Web classification, which identifies the target Web pages, and Web information extraction, which extracts the metadata contained in those pages. This decoupled strategy is highly ineffective because errors in Web classification propagate to Web information extraction and eventually accumulate to a high level. In this paper we study the mutual dependencies between these two steps and propose to combine them in a Conditional Random Fields (CRFs) model. This model can be used to simultaneously recognize the target Web pages and extract the corresponding metadata. Systematic experiments in our project OfCourse, for online course search, show that this model significantly improves the F1 value for both steps. We believe that our model can be easily generalized to many Web applications.
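
The error propagation described in the abstract can be illustrated with a small arithmetic sketch. It is not taken from the paper: the function name `pipeline_recall` and the 0.9 figures are hypothetical, and the sketch assumes that the extractor only sees pages the classifier accepted and that the two stages err independently.

```python
def pipeline_recall(classifier_recall: float, extractor_recall: float) -> float:
    """End-to-end recall of a decoupled classify-then-extract pipeline.

    Assumption (not from the paper): a metadata field is recovered only if
    its page is first accepted by the classifier and the extractor then
    finds the field, with the two stages making independent errors.
    """
    return classifier_recall * extractor_recall


# Hypothetical stage recalls, chosen only for illustration.
end_to_end = pipeline_recall(0.9, 0.9)
print(f"end-to-end recall: {end_to_end:.2f}")
```

Under these assumptions the end-to-end recall can never exceed that of the weaker stage, which is one way to read the motivation for modelling both steps jointly.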


