Web site mining: a new way to spot competitors, customers and suppliers in the world wide web
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
SESSION: Web page classification
Pages: 249-258
Year of Publication: 2002
ISBN: 1-58113-567-X
Authors
Martin Ester  Simon Fraser University, Burnaby, BC, Canada
Hans-Peter Kriegel  University of Munich (LMU), Munich, Germany
Matthias Schubert  University of Munich (LMU), Munich, Germany
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
AAAI
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 7,   Downloads (12 Months): 68,   Citation Count: 13
DOI: http://doi.acm.org/10.1145/775047.775084

ABSTRACT

When automatically extracting information from the World Wide Web, most established methods focus on spotting single HTML documents. The problem of spotting complete web sites, however, has not yet been handled adequately, in spite of its importance for various applications. This paper therefore discusses the classification of complete web sites. First, we point out the main differences from page classification by discussing a very intuitive approach and its weaknesses: this approach treats a web site as one large HTML document and applies the well-known methods for page classification. Next, we show how accuracy can be improved by a preprocessing step that assigns each web page of a site to its most likely topic. The determined topics then represent the information the web site contains and can be used to classify it more accurately. We pursue this in two directions. The first applies well-established classification algorithms to a feature space of occurring topics. The second treats a site as a tree of occurring topics and uses a Markov tree model for classification. To improve the efficiency of this approach, we additionally introduce a powerful pruning method that reduces the number of web pages considered. Our experiments show the superiority of the Markov tree approach with regard to classification accuracy. In particular, we demonstrate that our pruning method not only reduces the processing time but also improves the classification accuracy.
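The idea of treating a site as a tree of page topics can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that every page has already been assigned a topic by some page classifier, that a first-order model (each page's topic depends only on its parent's topic) is sufficient, that class priors are uniform, and that add-one smoothing is acceptable. The topic names, the site classes `company` and `non_company`, and all function names are hypothetical, and the pruning method mentioned in the abstract is not covered.

```python
import math
from collections import defaultdict

# A site is a tree of page topics: (topic, [child subtrees]).
# Topic labels are assumed to come from a separate page classifier.

def edges(tree, parent="ROOT"):
    """Yield one (parent_topic, topic) pair for every page in the tree."""
    topic, children = tree
    yield parent, topic
    for child in children:
        yield from edges(child, topic)

def train(sites_by_class, topics, alpha=1.0):
    """Estimate smoothed P(topic | parent topic) for each site class."""
    models = {}
    for label, sites in sites_by_class.items():
        counts = defaultdict(lambda: defaultdict(float))
        for site in sites:
            for parent, child in edges(site):
                counts[parent][child] += 1
        model = {}
        for parent in ["ROOT", *topics]:
            total = sum(counts[parent].values()) + alpha * len(topics)
            model[parent] = {t: (counts[parent][t] + alpha) / total
                             for t in topics}
        models[label] = model
    return models

def log_likelihood(model, site):
    """Log-probability of the whole topic tree under one class model."""
    return sum(math.log(model[p][c]) for p, c in edges(site))

def classify(models, site):
    """Pick the class whose model gives the site the highest likelihood
    (uniform class priors assumed)."""
    return max(models, key=lambda label: log_likelihood(models[label], site))

# Hypothetical toy data, for illustration only.
TOPICS = ["home", "product", "contact", "other"]
training = {
    "company": [
        ("home", [("product", []), ("product", []), ("contact", [])]),
        ("home", [("product", [("product", [])]), ("contact", [])]),
    ],
    "non_company": [
        ("home", [("other", []), ("other", [])]),
        ("other", [("other", [("other", [])])]),
    ],
}
models = train(training, TOPICS)
predicted = classify(models, ("home", [("product", []), ("contact", [])]))
print(predicted)
```

Counting how often each topic occurs in a site, instead of scoring parent-child transitions, would give the topic-frequency feature vector used in the first direction described above; the tree model additionally exploits where in the link structure each topic appears.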

