Similarity measures for tracking information flow
Source
Conference on Information and Knowledge Management
Proceedings of the 14th ACM international conference on Information and knowledge management
Bremen, Germany
SESSION: Paper session IR-6 (information retrieval): IR models 1
Pages: 517 - 524
Year of Publication: 2005
ISBN: 1-59593-140-6
Authors
Donald Metzler, University of Massachusetts, Amherst, MA
Yaniv Bernstein, RMIT University, Melbourne, Australia
W. Bruce Croft, University of Massachusetts, Amherst, MA
Alistair Moffat, University of Melbourne, Melbourne, Australia
Justin Zobel, RMIT University, Melbourne, Australia
ABSTRACT
Text similarity spans a spectrum, with broad topical similarity near one extreme and document identity at the other. Intermediate levels of similarity -- resulting from summarization, paraphrasing, copying, and stronger forms of topical relevance -- are useful for applications such as information flow analysis and question-answering tasks. In this paper, we explore mechanisms for measuring such intermediate kinds of similarity, focusing on the task of identifying where a particular piece of information originated. We consider both sentence-to-sentence and document-to-document comparison, and have incorporated these algorithms into RECAP, a prototype information flow analysis tool. Our experimental results with RECAP indicate that new mechanisms such as those we propose are likely to be more appropriate than existing methods for identifying the intermediate forms of similarity.