| Towards combining web classification and web information extraction: a case study |
Source
|
International Conference on Knowledge Discovery and Data Mining
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Paris, France
SESSION: Industrial track papers
Pages 1235-1244
Year of Publication: 2009
ISBN: 978-1-60558-495-9
|
|
Authors
|
Ping Luo | HP Labs China, Beijing, China
Fen Lin | Institute of Computing Technology, CAS, Beijing, China
Yuhong Xiong | HP Labs China, Beijing, China
Yong Zhao | HP Labs China, Beijing, China
Zhongzhi Shi | Institute of Computing Technology, CAS, Beijing, China
|
ABSTRACT
Web content analysis often consists of two separate, sequential steps: Web classification, which identifies the target Web pages, and Web information extraction, which extracts the metadata contained in those pages. This decoupled strategy is highly ineffective, since errors in Web classification propagate to Web information extraction and accumulate there. In this paper we study the mutual dependencies between the two steps and propose to combine them in a single model based on Conditional Random Fields (CRFs). The model simultaneously recognizes the target Web pages and extracts the corresponding metadata. Systematic experiments in our project OfCourse, an online course search system, show that the model significantly improves the F1 value for both steps. We believe that the model can be easily generalized to many Web applications.
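The error propagation described above can be illustrated with a small arithmetic sketch. It is not taken from the paper: the function name `pipeline_recall`, the independence assumption, and the numbers are all hypothetical, chosen only to show how losses in the first step cap what the second step can recover.

```python
def pipeline_recall(classification_recall: float, extraction_recall: float) -> float:
    """Recall of a decoupled two-step pipeline.

    Assumption (not from the paper): the two steps err independently, so a
    metadata item is recovered only if its page is first classified as a
    target page AND the extractor then finds the item on that page.
    """
    return classification_recall * extraction_recall


# Hypothetical numbers, not results reported by the authors.
end_to_end = pipeline_recall(0.9, 0.9)
print(f"end-to-end recall: {end_to_end:.2f}")
```

Under this simplified assumption, two steps that each look good in isolation still give a noticeably weaker end-to-end result, which is the motivation for modeling both steps jointly.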