ABSTRACT
We present a novel method for extracting key terms from text documents. The method models a document as a graph of semantic relationships between the terms of that document. We exploit the following feature of the graph: terms related to the main topics of the document tend to cluster into densely interconnected subgraphs, or communities, while unimportant terms fall into weakly interconnected communities or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms, and we introduce a criterion function that selects the groups containing key terms and discards the groups of unimportant terms. To weight terms and determine the semantic relatedness between them, we exploit information extracted from Wikipedia. This approach has two advantages. First, it processes multi-theme documents effectively. Second, it is good at filtering out noise in the document, such as navigation bars or headers in web pages. Evaluations show that the method outperforms existing methods, producing key terms with higher precision and recall. Additional experiments on web pages indicate that it is substantially more effective than existing methods on noisy and multi-theme documents.
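The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract names neither the community detection algorithm nor the criterion function, so the sketch makes several assumptions: the relatedness scores are invented (the method derives them from Wikipedia), communities are found by greedy modularity merging, and the criterion is a simple internal-density threshold. The names `RELATEDNESS`, `greedy_communities`, `density_score`, and the 0.5 cutoff are hypothetical.

```python
from itertools import combinations

# Hypothetical term-to-term relatedness scores in [0, 1], invented for
# illustration; the method described above derives them from Wikipedia.
RELATEDNESS = {
    ("apple", "iphone"): 0.9,
    ("apple", "macbook"): 0.8,
    ("iphone", "macbook"): 0.7,
    ("football", "goalkeeper"): 0.9,
    ("football", "league"): 0.8,
    ("goalkeeper", "league"): 0.6,
    ("macbook", "league"): 0.1,   # weak cross-topic link
    ("home", "contact"): 0.2,     # navigation-bar noise
}


def modularity(partition, edges):
    """Weighted modularity Q of a partition (list of sets of terms)."""
    m = sum(edges.values())
    community_of = {t: i for i, c in enumerate(partition) for t in c}
    inside = [0.0] * len(partition)   # edge weight inside each community
    degree = [0.0] * len(partition)   # summed weighted degree per community
    for (u, v), w in edges.items():
        degree[community_of[u]] += w
        degree[community_of[v]] += w
        if community_of[u] == community_of[v]:
            inside[community_of[u]] += w
    return sum(inside[i] / m - (degree[i] / (2 * m)) ** 2
               for i in range(len(partition)))


def greedy_communities(edges):
    """Assumed stand-in for community detection: start from singletons and
    repeatedly apply the merge that raises modularity most, until none does."""
    terms = sorted({t for pair in edges for t in pair})
    partition = [frozenset([t]) for t in terms]
    current = modularity(partition, edges)
    while len(partition) > 1:
        best_q, best_partition = current, None
        for a, b in combinations(partition, 2):
            candidate = [c for c in partition if c not in (a, b)] + [a | b]
            q = modularity(candidate, edges)
            if q > best_q + 1e-12:
                best_q, best_partition = q, candidate
        if best_partition is None:
            break
        partition, current = best_partition, best_q
    return partition


def density_score(community, edges):
    """Hypothetical criterion: internal edge weight per possible term pair."""
    n = len(community)
    if n < 2:
        return 0.0
    inside = sum(w for (u, v), w in edges.items()
                 if u in community and v in community)
    return inside / (n * (n - 1) / 2)


communities = greedy_communities(RELATEDNESS)
# Keep only terms from densely connected communities (threshold is assumed).
key_terms = sorted(t for c in communities
                   if density_score(c, RELATEDNESS) >= 0.5 for t in c)
print(key_terms)
```

On this toy input the two topical groups are kept as key terms, while the weakly connected navigation terms `home` and `contact` are discarded, which mirrors the noise-filtering behavior claimed above.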
CITED BY
Maxim Grinev, Maria Grineva, Alexander Boldakov, Leonid Novak, Andrey Syssoev, Dmitry Lizorkin. Sifting micro-blogging stream for events of user interest. Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, July 19-23, 2009, Boston, MA, USA.