Test collection management and labeling system
Source
Document Engineering archive
Proceedings of the 9th ACM symposium on Document engineering
Munich, Germany
SESSION: Experiments and methodology
Pages: 39-42  
Year of Publication: 2009
ISBN: 978-1-60558-575-8
Authors
Eunyee Koh, Adobe Systems Inc., San Jose, CA, USA
Andruid Kerne, Texas A&M University, College Station, TX, USA
Sarah Berry, Texas A&M University, College Station, TX, USA
Sponsors
SIGDOC: ACM Special Interest Group for Design of Communications
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1600193.1600203

ABSTRACT

Evaluating the performance of information retrieval and extraction algorithms requires test collections. A test collection consists of a set of documents, a clearly formulated problem that an algorithm is supposed to solve, and the answers that the algorithm should produce when executed on the documents. Defining the association between elements in the test collection and answers is known as labeling. For mainstream information retrieval problems, publicly available test collections have been maintained for years. However, the scope of these problems, and thus of the associated test collections, is limited. In other cases, researchers need to build, label, and manage their own test collections, which can be a tedious and error-prone task. We were building test collections of HTML documents for problems in which the answer that the algorithm supplies is a sub-tree of the DOM (Document Object Model). To lighten the burden of this task, we developed a test collection management and labeling system (TCMLS) that makes it easier to build test collections, apply them to validate algorithms, and potentially share them across the research community.
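
As a rough illustration of the three parts of a test collection described above (documents, a problem statement, and labeled answers), the following Python sketch stores each label as a path to a sub-tree of the parsed document and scores an algorithm against those labels. It is not the TCMLS implementation: the names `TestCollection`, `answer`, `accuracy`, and `longest_div`, the use of well-formed XHTML with the standard-library `xml.etree.ElementTree` parser, and the restricted path syntax are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional
import xml.etree.ElementTree as ET


@dataclass
class TestCollection:
    """Hypothetical container: a problem, its documents, and one label per document."""
    problem: str
    documents: Dict[str, str] = field(default_factory=dict)  # doc id -> well-formed XHTML
    labels: Dict[str, str] = field(default_factory=dict)     # doc id -> path to the answer sub-tree

    def answer(self, doc_id: str) -> Optional[ET.Element]:
        """Return the labeled sub-tree of one document (None if the path matches nothing)."""
        root = ET.fromstring(self.documents[doc_id])
        return root.find(self.labels[doc_id])

    def accuracy(self, algorithm: Callable[[ET.Element], Optional[ET.Element]]) -> float:
        """Fraction of documents for which the algorithm returns exactly the labeled node."""
        if not self.documents:
            return 0.0
        correct = 0
        for doc_id, source in self.documents.items():
            root = ET.fromstring(source)
            expected = root.find(self.labels[doc_id])
            predicted = algorithm(root)
            # Both nodes come from the same parsed tree, so identity comparison is enough.
            if expected is not None and predicted is expected:
                correct += 1
        return correct / len(self.documents)


def longest_div(root: ET.Element) -> Optional[ET.Element]:
    """Toy algorithm under test: pick the <div> containing the most text."""
    divs = list(root.iter("div"))
    if not divs:
        return None
    return max(divs, key=lambda d: len("".join(d.itertext())))


collection = TestCollection(
    problem="Find the sub-tree that holds the main content of the page",
    documents={
        "doc1": "<html><body><div id='nav'>Menu</div>"
                "<div id='main'><p>Story text</p></div></body></html>",
    },
    # Labeling: associate the document with the path of its answer sub-tree.
    labels={"doc1": "body/div[2]"},
)

print(collection.answer("doc1").get("id"))
print(collection.accuracy(longest_div))
```

Storing a path rather than a copy of the sub-tree keeps each label small and ties it to the document's structure. Real-world HTML is rarely well-formed, so a practical version would need a tolerant parser or a cleanup step before the path can be resolved.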



