Resource-adaptive real-time new event detection
Source
International Conference on Management of Data
Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data
Beijing, China
SESSION: Distributed data management
Pages: 497-508
Year of Publication: 2007
ISBN: 978-1-59593-686-8
Authors
Gang Luo, IBM T.J. Watson Research Center, Hawthorne, NY
Chunqiang Tang, IBM T.J. Watson Research Center, Hawthorne, NY
Philip S. Yu, IBM T.J. Watson Research Center, Hawthorne, NY
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1247480.1247536

ABSTRACT

In a document streaming environment, online detection of the first documents that mention previously unseen events is an open challenge. For this online new event detection (ONED) task, existing studies usually assume that enough resources are always available and focus entirely on detection accuracy without considering efficiency. Moreover, none of the existing work addresses the issue of providing an effective and friendly user interface. As a result, there is a significant gap between the existing systems and a system that can be used in practice. In this paper, we propose an ONED framework with the following prominent features. First, a combination of indexing and compression methods is used to improve the document processing rate by orders of magnitude without sacrificing much detection accuracy. Second, when resources are tight, a resource-adaptive computation method is used to maximize the benefit that can be gained from the limited resources. Third, when the new event arrival rate is beyond the processing capability of the consumer of the ONED system, new events are further filtered and prioritized before they are presented to the consumer. Fourth, implicit citation relationships are created among all the documents and used to compute the importance of document sources. This importance information can guide the selection of document sources. We implemented a prototype of our framework on top of IBM's Stream Processing Core middleware. We also evaluated the effectiveness of our techniques on the standard TDT5 benchmark. To the best of our knowledge, this is the first implementation of a real application in a large-scale stream processing system.
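To make the ONED task concrete, the following is a minimal sketch of a generic first-story detector, not the framework proposed in the paper. It assumes a common baseline: each incoming document becomes a normalized term-frequency vector, and the document is flagged as a new event when its best cosine similarity to any earlier document falls below a threshold. The inverted index only illustrates the general idea of using indexing to avoid comparing against every earlier document. The class name, the tokenizer, and the threshold value of 0.3 are all hypothetical choices made for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokens; a real system would also stem and drop stopwords."""
    return re.findall(r"[a-z]+", text.lower())


class NewEventDetector:
    """Flags a document as a new event when no earlier document is similar enough."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # hypothetical cutoff, not taken from the paper
        self.vectors = []           # one normalized term-frequency vector per document
        self.index = {}             # term -> set of ids of documents containing it

    def _vectorize(self, text):
        counts = Counter(tokenize(text))
        norm = math.sqrt(sum(c * c for c in counts.values()))
        return {t: c / norm for t, c in counts.items()} if norm else {}

    def process(self, text):
        """Return True if the document looks like the first story of a new event."""
        vec = self._vectorize(text)

        # Only documents sharing at least one term can have non-zero similarity,
        # so the inverted index narrows the set of comparisons.
        candidates = set()
        for term in vec:
            candidates.update(self.index.get(term, ()))

        # Vectors are unit length, so the dot product is the cosine similarity.
        best = 0.0
        for doc_id in candidates:
            other = self.vectors[doc_id]
            sim = sum(w * other.get(t, 0.0) for t, w in vec.items())
            best = max(best, sim)

        # Store the document so later arrivals are compared against it.
        doc_id = len(self.vectors)
        self.vectors.append(vec)
        for term in vec:
            self.index.setdefault(term, set()).add(doc_id)

        return best < self.threshold


detector = NewEventDetector(threshold=0.3)
stream = [
    "earthquake strikes coastal city",
    "strong earthquake strikes the coastal city overnight",
    "central bank raises interest rates",
]
flags = [detector.process(doc) for doc in stream]
```

In this toy stream the first and third documents share no terms with anything seen before them, so they are flagged as new events, while the second overlaps heavily with the first and is not. A sketch like this keeps every document forever and does the same amount of work regardless of load; bounding that cost is the kind of efficiency problem the abstract says existing studies ignore.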


