ABSTRACT
This paper explores the use of hierarchical structure for classifying a large, heterogeneous collection of web content. The hierarchical structure is initially used to train different second-level classifiers. In the hierarchical case, a model is learned to distinguish a second-level category from other categories within the same top level. In the flat non-hierarchical case, a model distinguishes a second-level category from all other second-level categories. Scoring rules can further take advantage of the hierarchy by considering only second-level categories that exceed a threshold at the top level.
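The difference between the hierarchical and flat training setups comes down to which documents serve as negative examples for a given second-level category. The following is a minimal sketch of that selection, not the authors' implementation: the `training_split` function, the tuple layout, and the toy category names are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical document record: (doc_id, top_level_category, second_level_category)
Doc = Tuple[str, str, str]


def training_split(
    docs: List[Doc], top: str, second: str, hierarchical: bool
) -> Tuple[List[str], List[str]]:
    """Return (positive_ids, negative_ids) for the model of category top/second.

    hierarchical=True : negatives are the other second-level categories
                        under the same top-level category.
    hierarchical=False: negatives are all other second-level categories,
                        regardless of top-level category (the flat case).
    """
    positives = [d for d, t, s in docs if (t, s) == (top, second)]
    if hierarchical:
        negatives = [d for d, t, s in docs if t == top and s != second]
    else:
        negatives = [d for d, t, s in docs if (t, s) != (top, second)]
    return positives, negatives


# Toy collection (assumed data, not from the paper).
DOCS: List[Doc] = [
    ("d1", "Computers", "Hardware"),
    ("d2", "Computers", "Software"),
    ("d3", "Sports", "Soccer"),
    ("d4", "Computers", "Hardware"),
]

hier_pos, hier_neg = training_split(DOCS, "Computers", "Hardware", hierarchical=True)
flat_pos, flat_neg = training_split(DOCS, "Computers", "Hardware", hierarchical=False)
```

In this sketch the positives are the same in both cases; only the negative set shrinks in the hierarchical case, which is what makes each second-level model a narrower discrimination problem.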
We use support vector machine (SVM) classifiers, which have been shown to be efficient and effective for classification but have not previously been explored in the context of hierarchical classification. We found small advantages in accuracy for hierarchical models over flat models. For the hierarchical approach, we found the same accuracy using a sequential Boolean decision rule and a multiplicative decision rule. Since the sequential approach is much more efficient, requiring only 14%-16% of the comparisons used in the other approaches, we find it to be a good choice for classifying text into large hierarchical structures.
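The two decision rules can be sketched as follows. This is an illustrative reading of the abstract under stated assumptions, not the paper's exact formulation: it assumes each classifier emits a probability-like score, that the sequential Boolean rule evaluates second-level models only under top-level categories whose score passes a threshold, and that the multiplicative rule thresholds the product of the two scores. The function names, thresholds, and toy scores are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Hierarchy = Dict[str, List[str]]  # top-level category -> second-level categories
Label = Tuple[str, str]


def sequential_boolean(
    hierarchy: Hierarchy,
    p_top: Callable[[str], float],
    p_second: Callable[[str, str], float],
    t_top: float,
    t_second: float,
) -> Tuple[List[Label], int]:
    """Evaluate second-level models only where the top-level score passes t_top."""
    assigned: List[Label] = []
    evaluations = 0
    for top, children in hierarchy.items():
        evaluations += 1  # one top-level model evaluation
        if p_top(top) < t_top:
            continue  # prune: no second-level model under this top is evaluated
        for child in children:
            evaluations += 1
            if p_second(top, child) >= t_second:
                assigned.append((top, child))
    return assigned, evaluations


def multiplicative(
    hierarchy: Hierarchy,
    p_top: Callable[[str], float],
    p_second: Callable[[str, str], float],
    t_joint: float,
) -> Tuple[List[Label], int]:
    """Threshold the product of top- and second-level scores; nothing is pruned."""
    assigned: List[Label] = []
    evaluations = 0
    for top, children in hierarchy.items():
        evaluations += 1
        top_score = p_top(top)
        for child in children:
            evaluations += 1
            if top_score * p_second(top, child) >= t_joint:
                assigned.append((top, child))
    return assigned, evaluations


# Toy scores for a single document (assumed values, not from the paper).
HIERARCHY: Hierarchy = {
    "Computers": ["Hardware", "Software"],
    "Sports": ["Soccer", "Tennis"],
}
TOP_SCORES = {"Computers": 0.9, "Sports": 0.1}
SECOND_SCORES = {
    ("Computers", "Hardware"): 0.8,
    ("Computers", "Software"): 0.2,
    ("Sports", "Soccer"): 0.7,
    ("Sports", "Tennis"): 0.3,
}

seq_labels, seq_evals = sequential_boolean(
    HIERARCHY, TOP_SCORES.__getitem__, lambda t, s: SECOND_SCORES[(t, s)], 0.5, 0.5
)
mul_labels, mul_evals = multiplicative(
    HIERARCHY, TOP_SCORES.__getitem__, lambda t, s: SECOND_SCORES[(t, s)], 0.25
)
```

On this toy input both rules assign the same label, but the sequential rule skips every second-level model under the rejected top-level category. The saving grows with the number of top-level categories that are pruned; the toy counts here are not meant to reproduce the 14%-16% figure reported above.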