Multi-evidence, multi-criteria, lazy associative document classification
Source
Conference on Information and Knowledge Management
Proceedings of the 15th ACM international conference on Information and knowledge management
Arlington, Virginia, USA
SESSION: Classification - 1
Pages: 218 - 227
Year of Publication: 2006
ISBN: 1-59593-433-2
Authors
Adriano Veloso, Federal University of Minas Gerais, Belo Horizonte, Brazil
Wagner Meira, Jr., Federal University of Minas Gerais, Belo Horizonte, Brazil
Marco Cristo, Federal University of Minas Gerais, Belo Horizonte, Brazil
Marcos Gonçalves, Federal University of Minas Gerais, Belo Horizonte, Brazil
Mohammed Zaki, Rensselaer Polytechnic Institute, Troy
ABSTRACT
We present a novel approach to document classification that transparently combines different pieces of evidence (e.g., textual features of documents, links, and citations) through a data mining technique that generates rules associating these pieces of evidence with predefined classes. These rules can contain any number and mixture of the available evidence and are associated with several quality criteria, which can be used in conjunction to choose the "best" rule to apply at classification time. Our method performs evidence enhancement by link forwarding/backwarding (i.e., navigating among documents related through citation), so that new pieces of link-based evidence are derived when necessary. Furthermore, instead of inducing a single model (or rule set) that is good on average for all predictions, the proposed approach employs a lazy method that delays the inductive process until a document is given for classification, thereby taking advantage of better-quality evidence coming from the document itself. We conducted a systematic evaluation of the proposed approach using documents from the ACM Digital Library and from a Brazilian Web directory. In both collections, our approach outperformed all classifiers based on the best available evidence in isolation, as well as state-of-the-art multi-evidence classifiers. We also evaluated our approach on the standard WebKB collection, where it showed gains of 1% in accuracy while being 25 times faster. Further, our approach is extremely efficient computationally, showing gains of more than one order of magnitude when compared against other multi-evidence classifiers.
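The lazy, rule-based idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function name `lazy_classify`, the `term:` and `cites:` feature prefixes, the toy training set, and the use of confidence and support as the quality criteria are all assumptions made for illustration, since the abstract does not name the criteria or the rule-generation procedure. The sketch only shows the general pattern of deferring rule induction until a document arrives, restricting the training data to the evidence that document carries, and choosing one rule by several criteria at once.

```python
from collections import Counter
from itertools import combinations


def lazy_classify(training, query_features, max_rule_size=2, min_support=1):
    """Pick the best rule for one query document (illustrative sketch only).

    training: list of (frozenset of features, class label) pairs.
    query_features: features of the document to classify; may mix kinds of
    evidence (hypothetical "term:" and "cites:" prefixes are used here).
    """
    query = frozenset(query_features)
    # Lazy step: project every training document onto the query's features,
    # so only rules that can fire on this particular document are generated.
    projected = [(feats & query, label) for feats, label in training]

    best_key, best_rule = None, None
    for size in range(1, max_rule_size + 1):
        for antecedent in combinations(sorted(query), size):
            items = frozenset(antecedent)
            # Count class labels among training documents covered by the rule.
            label_counts = Counter(
                label for feats, label in projected if items <= feats
            )
            covered = sum(label_counts.values())
            for label in sorted(label_counts):
                support = label_counts[label]
                if support < min_support:
                    continue
                # Assumed criteria, compared in order:
                # confidence, then support, then preference for shorter rules.
                key = (support / covered, support, -size)
                if best_key is None or key > best_key:
                    best_key, best_rule = key, (antecedent, label)

    if best_rule is None:
        return None  # no rule matches the evidence in this document
    antecedent, label = best_rule
    return {
        "label": label,
        "antecedent": antecedent,
        "confidence": best_key[0],
        "support": best_key[1],
    }


# Toy data: textual terms and citation links treated as interchangeable items.
TRAINING = [
    (frozenset({"term:mining", "term:rules", "cites:d7"}), "databases"),
    (frozenset({"term:mining", "term:rules"}), "databases"),
    (frozenset({"term:mining", "term:retrieval", "cites:d9"}), "ir"),
    (frozenset({"term:retrieval", "term:ranking", "cites:d9"}), "ir"),
    (frozenset({"term:ranking", "cites:d9"}), "ir"),
]

result = lazy_classify(TRAINING, {"term:mining", "cites:d9", "term:unseen"})
print(result)
```

In this toy run the term `term:mining` alone is ambiguous between the two classes, while the citation feature `cites:d9` yields a rule that covers only one class, which is the kind of situation where mixing evidence types helps. A fuller treatment would also need the link forwarding/backwarding step, which this sketch omits.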
CITED BY 2
Leonardo Rocha, Fernando Mourão, Adriano Pereira, Marcos André Gonçalves, Wagner Meira, Jr., Exploiting temporal contexts in text classification, Proceedings of the 17th ACM conference on Information and knowledge management, October 26-30, 2008, Napa Valley, California, USA
Adriano Veloso, Wagner Meira, Jr., Mohammed Zaki, Calibrated lazy associative classification, Proceedings of the 23rd Brazilian symposium on Databases, October 13-17, 2008, Campinas, Sao Paulo, Brazil