ABSTRACT
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest-scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.