ABSTRACT
In web classification, most researchers assume that the objects to classify are individual web pages from one or more web sites. In practice, this assumption is too restrictive, since a single web page does not always correspond to an instance of a semantic concept (or category) defined for the classification task. In this paper, we relax this assumption and allow a concept instance to be represented by a subgraph of web pages or a set of web pages. We identify several new issues that arise when the assumption is removed, and formulate the web unit mining problem. We also propose an iterative web unit mining (iWUM) method that first finds subgraphs of web pages using knowledge about web site structure. From these web subgraphs, web units are constructed and classified into semantic concepts (or categories) in an iterative manner. Our experiments on the WebKB dataset show that iWUM improves overall classification performance and works very well on the more structured parts of a web site.
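The idea of classifying a group of pages rather than a single page can be illustrated with a minimal sketch. It is not the iWUM algorithm: it assumes that pages sharing a URL directory form one candidate web unit (a crude stand-in for "knowledge about web site structure"), it replaces a trained classifier with hypothetical keyword lists, and it does not reproduce the iterative refinement step. The names `CONCEPT_KEYWORDS`, `group_by_directory`, and `classify_unit`, as well as the example URLs, are illustrative assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse
import posixpath

# Hypothetical keyword lists standing in for a trained text classifier.
CONCEPT_KEYWORDS = {
    "course": {"syllabus", "lecture", "homework", "exam"},
    "faculty": {"professor", "research", "publications"},
}


def group_by_directory(pages):
    """Group page URLs by (host, directory) as candidate web units.

    Assumption: a shared directory approximates a web subgraph.
    """
    groups = defaultdict(list)
    for url in pages:
        parts = urlparse(url)
        directory = posixpath.dirname(parts.path) or "/"
        groups[(parts.netloc, directory)].append(url)
    return dict(groups)


def classify_unit(urls, pages):
    """Label a candidate web unit by keyword overlap over its pooled text.

    Returns None when no concept keyword occurs in the unit.
    """
    words = set()
    for url in urls:
        words.update(pages[url].lower().split())
    scores = {c: len(words & kw) for c, kw in CONCEPT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Toy site: the course home page alone carries no course keywords.
pages = {
    "http://cs.example.edu/cs101/index.html": "CS101 home page",
    "http://cs.example.edu/cs101/syllabus.html": "syllabus and lecture schedule",
    "http://cs.example.edu/cs101/hw.html": "homework and exam dates",
    "http://cs.example.edu/~smith/index.html": "professor smith research publications",
}

units = group_by_directory(pages)
labels = {key: classify_unit(urls, pages) for key, urls in units.items()}
single_page_label = classify_unit(
    ["http://cs.example.edu/cs101/index.html"], pages
)
```

In this toy data, the course home page on its own matches no concept keyword, so `single_page_label` is `None`, while the pooled text of the three pages under `/cs101` is labeled as a course. This mirrors the motivation in the abstract: a concept instance may only be recognizable from a set of pages.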