ABSTRACT
We are experiencing an unprecedented increase in user-contributed content in forums such as blogs, social networking sites, and microblogging services. This abundance of content complements content on web sites and in traditional media such as newspapers and news and financial streams. Given such a plethora of information, there is a pressing need to cross-reference information across textual services. For example, we commonly read a news item and wonder whether any blogs report related content, or vice versa. In this paper, we present techniques to automate the process of cross-referencing online information content. We introduce methodologies to extract phrases from a given "query document" to be used as queries to search interfaces, with the goal of retrieving content related to the query document. In particular, we consider two techniques to extract and score key phrases. We also consider techniques to complement extracted phrases with information present in external sources such as Wikipedia, and we introduce an algorithm called RelevanceRank for this purpose. We discuss these techniques in detail and provide an experimental study utilizing a large number of human judges from Amazon's Mechanical Turk service. Detailed experiments demonstrate the effectiveness and efficiency of the proposed techniques for the task of automating retrieval of documents related to a query document.