Subject metadata enrichment using statistical topic models
Source
International Conference on Digital Libraries
Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries
Vancouver, BC, Canada
SESSION: Large-scale collections
Pages: 366-375
Year of Publication: 2007
ISBN: 978-1-59593-644-8
Authors
David Newman  UC Irvine, Irvine, CA
Kat Hagedorn  University of Michigan, Ann Arbor, MI
Chaitanya Chemudugunta  UC Irvine, Irvine, CA
Padhraic Smyth  UC Irvine, Irvine, CA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 104,   Citation Count: 2
DOI: http://doi.acm.org/10.1145/1255175.1255248

ABSTRACT

Creating a collection of metadata records from disparate and diverse sources often results in uneven, unreliable and variable quality subject metadata. Having uniform, consistent and enriched subject metadata allows users to more easily discover material, browse the collection, and limit keyword search results by subject. We demonstrate how statistical topic models are useful for subject metadata enrichment. We describe some of the challenges of metadata enrichment on a huge scale (10 million metadata records from 700 repositories in the OAIster Digital Library) when the metadata is highly heterogeneous (metadata about images and text, and both cultural heritage material and scientific literature). We show how to improve the quality of the enriched metadata, using both manual and statistical modeling techniques. Finally, we discuss some of the challenges of the production environment, and demonstrate the value of the enriched metadata in a prototype portal.
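The abstract does not spell out the modeling pipeline, so the following is only a minimal sketch of the general idea: fit a topic model to the text of metadata records, then attach each record's dominant topic (summarized by its most frequent words) as a candidate subject label. It assumes a plain latent Dirichlet allocation model fitted by collapsed Gibbs sampling, which may differ from the models the authors used. The toy records, the function names `fit_topics` and `enrich`, and all parameter values are hypothetical illustrations, not taken from the paper or from OAIster.

```python
import random
from collections import Counter


def fit_topics(docs, num_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a small LDA model with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word), the
    per-record topic counts and per-topic word counts after sampling.
    All hyperparameters here are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def enrich(doc_topic, topic_word, top_n=3):
    """Label each record with its dominant topic and that topic's top words."""
    labels = []
    for counts in doc_topic:
        k = max(range(len(counts)), key=counts.__getitem__)
        words = [w for w, c in topic_word[k].most_common(top_n) if c > 0]
        labels.append((k, words))
    return labels


# Hypothetical toy records standing in for harvested titles/descriptions.
records = [
    "protein gene sequence expression genome",
    "gene expression protein cell genome",
    "photograph portrait museum painting archive",
    "painting museum portrait exhibition photograph",
]
docs = [r.lower().split() for r in records]
doc_topic, topic_word = fit_topics(docs)
labels = enrich(doc_topic, topic_word)
for record, (topic, words) in zip(records, labels):
    print(f"topic {topic} {words}: {record}")
```

On a real collection the learned topics would still need review before use as subject headings, which is consistent with the abstract's mention of combining manual and statistical techniques.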



