Identifying the original contribution of a document via language modeling
Source
Annual ACM Conference on Research and Development in Information Retrieval
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Boston, MA, USA
POSTER SESSION: Posters
Pages 696-697
Year of Publication: 2009
ISBN: 978-1-60558-483-6
Authors
Benyah Shaparenko  Cornell University, Ithaca, NY, USA
Thorsten Joachims  Cornell University, Ithaca, NY, USA
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1571941.1572083

ABSTRACT

One goal of text mining is to provide readers with automatic methods for quickly finding the key ideas in individual documents and whole corpora. To this end, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and it can be used to identify the most original passages in a document. Unlike heuristic approaches, this statistical model is extensible and open to analysis. We evaluate the approach on both synthetic and real data, showing that the passage impact model outperforms a heuristic baseline method.


