|
ABSTRACT
Text corpora are often enhanced by additional metadata which relate real-world entities, with each document in which such entities are discussed. Such relationships are typically obtained through widely available Information Extraction tools. At the same time, interesting known associations typically hold among these entities. For instance, a corpus might contain discussions on hotels, cities and airlines; fixed associations among these entities may include: airline A operates a flight to city C, hotel H is located in city C. A plethora of applications necessitate the identification of associated entities, each best matching a given set of keywords. Consider the sample query: Find a holiday package in a "pet-friendly" hotel, located in a "historical" yet "lively" city, with travel operated by an "economical" and "safe" airline. These keywords are unlikely to occur in the textual description of entities themselves, (e.g., the actual hotel name or the city name or the airline name). Consequently to answer such queries, one needs to exploit both relationships between entities and documents (e.g., keyword "pet-friendly" occurs in a document that contains an entity specifying a hotel name H), and the known associations between entities (e.g., hotel H is located in city C). In this work, we focus on the class of "entity package finder" queries outlined above. We demonstrate that existing techniques cannot be efficiently adapted to solve this problem, as the resulting algorithm relies on estimations with excessive runtime and/or storage overheads. We propose an efficient algorithm to process such queries, over large corpora. We devise early pruning and termination strategies, in the presence of joins and aggregations (executed on entities extracted from text), that do not depend on any estimates. Our analysis and experimental evaluation on real and synthetic data demonstrates the efficiency and scalability of our approach.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Opencalais. http://www.opencalais.com. Retrieved on June 23, 2008.
|
 |
2
|
Swarup Acharya , Phillip B. Gibbons , Viswanath Poosala , Sridhar Ramaswamy, Join synopses for approximate query answering, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.275-286, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
3
|
A. Angel, S. Chaudhuri, G. Das, and N. Koudas. Ranking objects based on relationships and fixed associations. Tech.report, 2008. Available at http://www.cs.toronto.edu/albert/docs/acdk-edbt09.pdf.
|
| |
4
|
D. E. Appelt and D. Israel. Introduction to information extraction. In IJCAI Tutorial, 1999.
|
 |
5
|
|
| |
6
|
|
| |
7
|
Holger Bast , Debapriyo Majumdar , Ralf Schenkel , Martin Theobald , Gerhard Weikum, IO-Top-k: index-access optimized top-k query processing, Proceedings of the 32nd international conference on Very large data bases, September 12-15, 2006, Seoul, Korea
|
 |
8
|
Kurt Bollacker , Colin Evans , Praveen Paritosh , Tim Sturge , Jamie Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 09-12, 2008, Vancouver, Canada
[doi> 10.1145/1376616.1376746]
|
 |
9
|
Kaushik Chakrabarti , Venkatesh Ganti , Jiawei Han , Dong Xin, Ranking objects based on relationships, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA
[doi> 10.1145/1142473.1142516]
|
 |
10
|
|
| |
11
|
|
 |
12
|
Ihab F. Ilyas , Walid G. Aref , Ahmed K. Elmagarmid , Hicham G. Elmongui , Rahul Shah , Jeffrey Scott Vitter, Adaptive rank-aware query optimization in relational databases, ACM Transactions on Database Systems (TODS), v.31 n.4, p.1257-1304, December 2006
[doi> 10.1145/1189769.1189772]
|
 |
13
|
|
 |
14
|
|
 |
15
|
|
| |
16
|
|
| |
17
|
A. Singhal. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35--43, 2001.
|
| |
18
|
M. A. Soliman, I. F. Ilyas, and K. C.-C. Chang. Top-k query processing in uncertain databases. In ICDE, pages 896--905, 2007.
|
|