ABSTRACT
We make the case for developing a web of concepts by starting with the current view of the web (composed of hyperlinked pages, or documents, each seen as a bag of words), extracting concept-centric metadata, and stitching it together to create a semantically rich aggregate view of all the information available on the web for each concept instance. Building and maintaining such a web of concepts presents many challenges, but also promises to enable many powerful applications, including novel search and information discovery paradigms. We present the goal, motivate it with example usage scenarios and some analysis of Yahoo! logs, and discuss the challenges in building and leveraging such a web of concepts. We place this ambitious research agenda in the context of the state of the art in the literature, and describe various related ongoing efforts at Yahoo! Research.