Web site mining: a new way to spot competitors, customers and suppliers in the world wide web
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
SESSION: Web page classification
Pages: 249 - 258
Year of Publication: 2002
ISBN: 1-58113-567-X
ABSTRACT
When automatically extracting information from the world wide web, most established methods focus on spotting single HTML documents. However, the problem of spotting complete web sites has not yet been handled adequately, in spite of its importance for various applications. Therefore, this paper discusses the classification of complete web sites. First, we point out the main differences from page classification by discussing a very intuitive approach and its weaknesses. This approach treats a web site as one large HTML document and applies the well-known methods for page classification. Next, we show how accuracy can be improved by employing a preprocessing step that assigns each web page of a site to its most likely topic. The determined topics then represent the information the web site contains and can be used to classify it more accurately. We pursue this in two directions. First, we apply well-established classification algorithms to a feature space of occurring topics. The second direction treats a site as a tree of occurring topics and uses a Markov tree model for further classification. To improve the efficiency of this approach, we additionally introduce a powerful pruning method that reduces the number of web pages considered. Our experiments show the superiority of the Markov tree approach regarding classification accuracy. In particular, we demonstrate that our pruning method not only reduces the processing time but also improves the classification accuracy.
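The idea of treating a site as a tree of page topics can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each page has already been assigned a topic, models a site class by smoothed probabilities of a child page's topic given its parent's topic, assumes uniform class priors, and omits the pruning step. The names `ROOT`, `edges`, `train`, `log_likelihood`, `classify` and the smoothing parameter `alpha` are hypothetical.

```python
import math
from collections import defaultdict

# A site is a tree of page topics, written as (topic, [child subtrees]).
ROOT = "<root>"  # hypothetical pseudo-topic acting as parent of the start page


def edges(tree, parent=ROOT):
    """Yield (parent_topic, child_topic) for every page in the tree."""
    topic, children = tree
    yield parent, topic
    for child in children:
        yield from edges(child, topic)


def train(sites_by_class, topics, alpha=1.0):
    """Estimate add-alpha smoothed P(child topic | parent topic) per site class."""
    model = {}
    for label, sites in sites_by_class.items():
        counts = defaultdict(lambda: defaultdict(float))
        for site in sites:
            for parent, child in edges(site):
                counts[parent][child] += 1
        probs = {}
        for parent in [ROOT] + list(topics):
            total = sum(counts[parent].values()) + alpha * len(topics)
            probs[parent] = {t: (counts[parent][t] + alpha) / total for t in topics}
        model[label] = probs
    return model


def log_likelihood(site, probs):
    """Log-probability of the topic tree under one class's transition table."""
    return sum(math.log(probs[p][c]) for p, c in edges(site))


def classify(site, model):
    """Return the class whose transition table best explains the topic tree."""
    return max(model, key=lambda label: log_likelihood(site, model[label]))
```

With training sites grouped by class, for example `train({"business": [...], "other": [...]}, ["products", "contact", "other"])`, the resulting model can be passed to `classify` together with the topic tree of an unseen site. The simpler direction mentioned in the abstract would instead count how often each topic occurs in a site and feed that vector to a standard classifier.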