Subject metadata enrichment using statistical topic models
Source
International Conference on Digital Libraries
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Vancouver, BC, Canada
SESSION: Large-scale collections
Pages: 366 - 375
Year of Publication: 2007
ISBN: 978-1-59593-644-8
ABSTRACT
Creating a collection of metadata records from disparate and diverse sources often results in uneven, unreliable and variable-quality subject metadata. Having uniform, consistent and enriched subject metadata allows users to more easily discover material, browse the collection, and limit keyword search results by subject. We demonstrate how statistical topic models are useful for subject metadata enrichment. We describe some of the challenges of metadata enrichment on a huge scale (10 million metadata records from 700 repositories in the OAIster Digital Library) when the metadata is highly heterogeneous (metadata about images and text, covering both cultural heritage material and scientific literature). We show how to improve the quality of the enriched metadata, using both manual and statistical modeling techniques. Finally, we discuss some of the challenges of the production environment, and demonstrate the value of the enriched metadata in a prototype portal.
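As a rough illustration of the idea in the abstract, the sketch below fits a small topic model to a handful of metadata records and attaches each record's dominant topic, plus that topic's most frequent words, as candidate subject terms. It is not the authors' pipeline: the choice of collapsed Gibbs sampling for latent Dirichlet allocation, the names `fit_lda`, `top_words` and `enrich`, the hyperparameters, and the toy `RECORDS` are all assumptions made for this example.

```python
import random

# Toy metadata records, invented for illustration only.
RECORDS = [
    "oil painting portrait museum",
    "museum sculpture bronze portrait",
    "protein gene sequence genome",
    "gene expression genome protein",
]


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized docs with a plain collapsed Gibbs sampler."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables: topic counts per document, word counts per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, topic, n=3):
    """Return the n words with the highest count in the given topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: topic_word[topic][i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


def enrich(records, num_topics=2):
    """Attach the dominant topic and its top words to each record."""
    docs = [r.lower().split() for r in records]
    vocab, doc_topic, topic_word = fit_lda(docs, num_topics)
    enriched = []
    for d, record in enumerate(records):
        best = max(range(num_topics), key=lambda k: doc_topic[d][k])
        enriched.append({
            "record": record,
            "topic": best,
            "subject_terms": top_words(vocab, topic_word, best),
        })
    return enriched


if __name__ == "__main__":
    for row in enrich(RECORDS):
        print(row)
```

On a collection of the size the abstract mentions, a pure-Python sampler like this would be far too slow; it is meant only to show the shape of the record-to-topic-to-subject-term mapping.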