ABSTRACT
A major challenge for content management in intranets and other large-scale document storage and retrieval services is the generation of high-quality metadata. Manual generation of metadata is resource-demanding and is often viewed by collection managers and document authors as an inefficient use of their time, so there is a desire for other ways to create the needed metadata. Automatic Metadata Generation (AMG) comprises methods for generating metadata without manual interaction, using computer programs to interpret the document and possibly the document context. Current AMG research has been limited to collections of similarly formatted documents. The research presented in this paper expands the field of AMG by presenting an approach that is independent of a common visualization scheme: AMG based on document code analysis. This is done by showing AMG possibilities for LaTeX, Word and PowerPoint documents and how this approach can significantly increase the quality of the generated metadata. The approach gives AMG algorithms direct access to the user-specified intellectual content and the file formatting, and thereby avoids common quality-reducing factors such as incompleteness, low accuracy, and poor logical consistency, coherence and timeliness. This research also shows how this AMG approach can be combined with other AMG approaches, drawing on their strengths in order to achieve the desired high-quality metadata entities.
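As a rough illustration of what AMG based on document code analysis could look like for LaTeX, the sketch below reads the title and author directly from the document source rather than from its rendered layout. It is a minimal sketch and not the paper's implementation: the function name `extract_latex_metadata`, the sample document, and the choice of the `\title` and `\author` commands are all assumptions made for this example, and the regular expression does not handle nested braces.

```python
import re


def extract_latex_metadata(source: str) -> dict:
    """Hypothetical helper: read title and author from LaTeX document code."""
    metadata = {}
    for field in ("title", "author"):
        # Matches \title{...} or \author{...}; nested braces are not handled.
        match = re.search(r"\\" + field + r"\{([^{}]*)\}", source)
        if match:
            metadata[field] = match.group(1).strip()
    return metadata


# Illustrative sample document (an assumption, not taken from the paper).
sample = r"""
\documentclass{article}
\title{Automatic Metadata Generation}
\author{A. Author}
\begin{document}
\maketitle
\end{document}
"""

print(extract_latex_metadata(sample))
```

Because the values come from the author's own markup, they do not depend on fonts, positions, or any shared visual template, which is the property the abstract attributes to the document code approach.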
INDEX TERMS
Primary Classification: H. Information Systems; H.3 INFORMATION STORAGE AND RETRIEVAL; H.3.7 Digital Libraries
General Terms: Algorithms, Experimentation, Reliability, Verification
Keywords: PDF, automatic metadata generation, document code, extraction, harvesting, latex, metadata quality, openXML, powerpoint, word