Topic model methods for automatically identifying out-of-scope resources
Source
International Conference on Digital Libraries
Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
Austin, TX, USA
Pages 19-28
Year of Publication: 2009
ISBN: 978-1-60558-322-8
Authors
Steven Bethard, Stanford University, Stanford, CA, USA
Soumya Ghosh, University of Colorado, Boulder, CO, USA
James H. Martin, University of Colorado, Boulder, CO, USA
Tamara Sumner, University of Colorado, Boulder, CO, USA
ABSTRACT
Recent years have seen the rise of subject-themed digital libraries, such as the NSDL pathways and the Digital Library for Earth System Education (DLESE). These libraries often need to manually verify that contributed resources cover topics that fit within the theme of the library. We show that such scope judgments can be automated using a combination of text classification techniques and topic modeling. Our models address two significant challenges in making scope judgments: only a small number of out-of-scope resources are typically available, and the topic distinctions required for digital libraries are much more subtle than those in classic text classification problems. To meet these challenges, our models combine support vector machine learners optimized to different performance metrics and semantic topics induced by unsupervised statistical topic models. Our best model is able to distinguish resources that belong in DLESE from resources that do not with an accuracy of around 70%. We see these models as the first steps towards increasing the scalability of digital libraries and dramatically reducing the workload required to maintain them.
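The abstract notes that out-of-scope resources are scarce and that the learners are optimized to different performance metrics. The sketch below is only an illustration of why the choice of metric matters under that kind of class imbalance; it is not the authors' method or data. The label names, the 90/10 split, and both sets of predictions are hypothetical.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def f1_score(gold, predicted, positive):
    """F1 (harmonic mean of precision and recall) for one target label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical collection: 90 in-scope resources, 10 out-of-scope ones.
gold = ["in"] * 90 + ["out"] * 10

# Baseline that never flags anything as out-of-scope.
always_in = ["in"] * 100

# Hypothetical classifier: 12 false alarms among the in-scope items,
# and 8 of the 10 out-of-scope items correctly flagged.
flagging = ["in"] * 78 + ["out"] * 12 + ["out"] * 8 + ["in"] * 2

for name, predicted in [("always-in", always_in), ("flagging", flagging)]:
    print(name,
          "accuracy=%.2f" % accuracy(gold, predicted),
          "out-of-scope F1=%.2f" % f1_score(gold, predicted, "out"))
```

On this made-up data the baseline scores higher on accuracy even though it never finds an out-of-scope resource, while the flagging classifier scores higher on F1 for the out-of-scope class. This is one reason a learner tuned for accuracy and one tuned for a class-specific metric can behave quite differently on the same collection.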