ABSTRACT
To evaluate the performance of information retrieval and extraction algorithms, we need test collections. A test collection consists of a set of documents, a clearly defined problem that an algorithm is supposed to solve, and the answers that the algorithm should produce when executed on the documents. Defining the association between elements in the test collection and answers is known as labeling. For mainstream information retrieval problems, there are publicly available test collections that have been maintained for years. However, the scope of these problems, and thus of the associated test collections, is limited. In other cases, researchers need to build, label, and manage their own test collections, which can be a tedious and error-prone task. We were building test collections of HTML documents for problems in which the answer that the algorithm supplies is a sub-tree of the DOM (Document Object Model). To lighten the burden of this task, we developed a test collection management and labeling system (TCMLS) that makes it easier to build test collections, apply them to validate algorithms, and potentially share them across the research community.
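As a rough illustration of the three parts of a test collection named above (documents, a problem, and labeled answers), the following Python sketch stores each label as a path to a sub-tree of the parsed document. It is not the TCMLS implementation: the class `LabeledCollection`, its fields, the helper `largest_div`, and the use of ElementTree path expressions are all assumptions made for this example, and the documents are assumed to be well-formed XHTML so that the standard-library parser can read them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional
import xml.etree.ElementTree as ET


@dataclass
class LabeledCollection:
    """Hypothetical test collection: a problem, documents, and labels."""
    problem: str
    documents: Dict[str, str] = field(default_factory=dict)  # doc id -> XHTML markup
    labels: Dict[str, str] = field(default_factory=dict)     # doc id -> path to answer sub-tree

    def add(self, doc_id: str, markup: str, answer_path: str) -> None:
        """Store a document and label it with the path of its answer sub-tree."""
        self.documents[doc_id] = markup
        self.labels[doc_id] = answer_path

    def answer(self, doc_id: str) -> ET.Element:
        """Resolve a label to the sub-tree it points at."""
        root = ET.fromstring(self.documents[doc_id])
        node = root.find(self.labels[doc_id])
        if node is None:
            raise ValueError(f"label for {doc_id!r} matches no node")
        return node

    def accuracy(self, algorithm: Callable[[ET.Element], Optional[ET.Element]]) -> float:
        """Fraction of documents for which the algorithm returns the labeled sub-tree."""
        if not self.documents:
            return 0.0
        hits = 0
        for doc_id, markup in self.documents.items():
            root = ET.fromstring(markup)
            expected = root.find(self.labels[doc_id])
            # Compare node identity within the same parsed tree.
            if expected is not None and algorithm(root) is expected:
                hits += 1
        return hits / len(self.documents)


def largest_div(root: ET.Element) -> Optional[ET.Element]:
    """Toy extraction algorithm (an assumption): pick the <div> with the most text."""
    divs = list(root.iter("div"))
    if not divs:
        return None
    return max(divs, key=lambda d: len("".join(d.itertext())))


if __name__ == "__main__":
    collection = LabeledCollection(problem="find the main content block")
    collection.add(
        "d1",
        "<html><body><div id='nav'>Menu</div>"
        "<div id='story'><p>Body text of the article</p></div></body></html>",
        "./body/div[@id='story']",
    )
    print(collection.accuracy(largest_div))
```

Storing labels as paths rather than as copies of the sub-tree keeps each label small and tied to a position in the document, at the cost of breaking if the document markup changes.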