Topic model methods for automatically identifying out-of-scope resources
Source
International Conference on Digital Libraries
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Austin, TX, USA
Session 1
Pages 19-28  
Year of Publication: 2009
ISBN: 978-1-60558-322-8
Authors
Steven Bethard  Stanford University, Stanford, CA, USA
Soumya Ghosh  University of Colorado, Boulder, CO, USA
James H. Martin  University of Colorado, Boulder, CO, USA
Tamara Sumner  University of Colorado, Boulder, CO, USA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1555400.1555405

ABSTRACT

Recent years have seen the rise of subject-themed digital libraries, such as the NSDL pathways and the Digital Library for Earth System Education (DLESE). These libraries often need to manually verify that contributed resources cover topics that fit within the theme of the library. We show that such scope judgments can be automated using a combination of text classification techniques and topic modeling. Our models address two significant challenges in making scope judgments: only a small number of out-of-scope resources are typically available, and the topic distinctions required for digital libraries are much more subtle than those in classic text classification problems. To meet these challenges, our models combine support vector machine learners optimized for different performance metrics with semantic topics induced by unsupervised statistical topic models. Our best model distinguishes resources that belong in DLESE from resources that do not with an accuracy of around 70%. We see these models as first steps towards increasing the scalability of digital libraries and dramatically reducing the workload required to maintain them.
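
The abstract notes that out-of-scope resources are scarce and that the learners are optimized to different performance metrics. The short Python sketch below illustrates why the choice of metric can matter under that kind of class imbalance. It is not the authors' code: the function name `scope_metrics`, the labels "in" and "out", and the 90/10 toy collection are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def scope_metrics(gold, predicted, positive="out"):
    """Score scope predictions, treating `positive` (out-of-scope) as the class of interest."""
    counts = Counter()
    for g, p in zip(gold, predicted):
        if p == positive:
            counts["tp" if g == positive else "fp"] += 1
        else:
            counts["fn" if g == positive else "tn"] += 1
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))

    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy collection: 90 in-scope resources followed by 10 out-of-scope ones.
gold = ["in"] * 90 + ["out"] * 10

# Baseline that never flags anything as out-of-scope.
baseline = scope_metrics(gold, ["in"] * 100)

# A made-up classifier that wrongly flags 20 in-scope resources
# but catches 8 of the 10 out-of-scope ones.
flagger_predictions = ["out"] * 20 + ["in"] * 70 + ["out"] * 8 + ["in"] * 2
flagger = scope_metrics(gold, flagger_predictions)

print("baseline:", baseline)
print("flagger: ", flagger)
```

On this toy data the always-in-scope baseline has the higher accuracy even though it never finds an out-of-scope resource, while the second classifier has lower accuracy but nonzero recall and F1 on the out-of-scope class. A learner tuned only for accuracy could therefore favor the useless baseline, which is one plausible reason to optimize for other metrics.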

