|
ABSTRACT
A fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. E-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. The published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. Underlying much of the text mining work in this area is the following intuitive premise --- that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges.The goal of the present work is to develop a formal approach for modeling such "bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. The approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; in some ways, it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. The resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
J. Allan, J.G. Carbonell, G. Doddington, J. Yamron, Y. Yang, 'Topic Detection and Tracking Pilot Study: Final Report," Proc. DARPA Broadcast News Transcription and Understanding Workshop, Feb. 1998.
|
 |
3
|
|
 |
4
|
|
| |
5
|
|
 |
6
|
|
| |
7
|
A. Birrell, S. Perl, M. Schroeder, T. Wobber, The Pachyderm E-mail System, 1997, at http://www.research.compaq.com/SRC/pachyderm/.
|
| |
8
|
T. Blanton, Ed., White House E-mail, New Press, 1995.
|
 |
9
|
|
| |
10
|
C. Chatfield, The Analysis of time Series: An Introduction, Chapman and Hall, 1996.
|
| |
11
|
S. Chatman, Story and Discourse: Narrative Structure in Fiction and Film, Cornell Univ. Press, 1978.
|
| |
12
|
D. Chudova, P. Smyth, "Unsupervised identification of sequential patterns under a Markov assumption," KDD Workshop on Temporal Data Mining, 2001.
|
| |
13
|
W. Cohen. "Learning rules that classify e-mail." Proc. AAAI Spring Symp. Machine Learning and Information Access, 1996.
|
| |
14
|
T. Cover, P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Information Theory IT-13(1967), pp. 21--27.
|
| |
15
|
W. Davison, L. Wall, S. Barber, trn, 1993 http://web.mit.edu/afs/sipb/project/trn/src/trn-3.6/.
|
| |
16
|
R. Ehrich, J. Foith, "Representation of Random Waveforms by Relational 'Trees," IEEE Trans. Computers, C25:7(1976).
|
| |
17
|
|
| |
18
|
E.M. Forster, Aspects of the Novel, Harcourt, Brace, and World, Inc. 1927.
|
| |
19
|
G. Gay, M. Stefanone, M. Grace-Martin, H. Hembrooke, "The effect of wireless computing in collaborative learning environments," Intl. J. Human-Computer Interaction, to appear.
|
| |
20
|
G. Genette, Narrative Discourse: An Essay in Method, English translation (J.E. Lewin), Cornell Univ. Press, 1980.
|
| |
21
|
G. Genette, Narrative Discourse Revisited, English translation (J.E. Lewin), Cornell Univ. Press, 1988.
|
| |
22
|
|
| |
23
|
T. Gruber, Hypermail, Enterprise Integration Technologies.
|
 |
24
|
|
| |
25
|
J. Han, W. Gong, Y. Yin, "Mining Segment-Wise Periodic Patterns in Time-Related Databases", Proc. Intl. Conf. Knowledge Discovery and Data Mining, 1998.
|
| |
26
|
|
| |
27
|
|
| |
28
|
D. Hawkins, "Point estimation of the parameters of piecewise regression models," Applied Statistics 25(1976)
|
 |
29
|
|
| |
30
|
J. Helfman, C. Isbell, "Ishmail: Immediate identification of important information," AT&T Labs Technical Report, 1995.
|
 |
31
|
|
| |
32
|
D. Hudson, "Fitting segmented curves whose join points have to be estimated," Journal of the American Statistical Association 61(1966) pp. 1097--1129.
|
| |
33
|
F. P. Kelly, "Notes on effective bandwidths," in Stochastic Networks: Theory and Applications, (F.P. Kelly, S. Zachary, I. Ziedins, eds.) Oxford Univ. Press, 1996.
|
| |
34
|
E. Keogh, P. Smyth, "A probabilistic approach to fast pattern matching in time series databases," Proc. Intl. Conf. Knowledge Discovery and Data Mining, 1997.
|
| |
35
|
J.l. Klein et al., Plaintiffs' Memorandum in Support of Proposed Final Judgment, United States of America v. Microsoft Corporation and State of New York, ex rel. Attorney General Eliot Spitzer, et al., v. Microsoft Corporation, Civil Actions No. 98-1232 (TPJ) and 98-1233 (TPJ), April 2000.
|
| |
36
|
M. Last, Y. Klein, A. Kandel, "Knowledge Discovery in Time Series Databases," IEEE Transactions on Systems, Man, and Cybernetics 31B(2001).
|
| |
37
|
V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, J. Allan, "Mining of Concurrent Text and Time-Series," KDD-2000 Workshop on Text Mining, 2000.
|
| |
38
|
|
| |
39
|
S.S. Lukesh, "E-mail and potential loss to future archives and scholarship, or, The dog that didn't bark," First Monday 4(9) (September 1999), at http://firstmonday.org
|
 |
40
|
|
 |
41
|
|
| |
42
|
H. Mannila, H. Toivonen, A.I. Verkamo, "Discovering frequent episodes in sequences," Proc. Intl. Conf. on Knowledge Discovery and Data Mining, 1995.
|
| |
43
|
R. Martin, V. Yohai, "Data mining for unusual movements in temporal data," KDD Wkshp. Temporal Data Mining, 2001.
|
 |
44
|
|
| |
45
|
Nancy E. Miller , Pak Chung Wong , Mary Brewster , Harlan Foote, TOPIC ISLANDS—a wavelet-based text visualization system, Proceedings of the conference on Visualization '98, p.189-196, October 18-23, 1998, Research Triangle Park, North Carolina, United States
|
| |
46
|
R. Moore, C. Baru, A. Rajasekar, B. Ludaescher, R. Marciano, M. Wan, W. Schroeder, A. Gupta, "Collection-Based Persistent Digital Archives -- Part 2," D-Lib Magazine, 6(2000).
|
| |
47
|
K. Murphy, M. Paskin, "Linear time inference in hierarchical HMMs," Advances in Neural Information Processing Systems (NIPS) 14, 2001.
|
| |
48
|
F. Olsen, "Facing Flood of E-Mail, Archives Seeks Help From Supercomputer Researchers," Chronicle of Higher Education, August 24, 1999.
|
| |
49
|
T. Payne, P. Edwards, "Interface agents that learn: An investigation of learning issues in a mail agent interface," Applied Artificial Intelligence 11(1997), pp. 1--32.
|
 |
50
|
|
| |
51
|
L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE 77(1989).
|
| |
52
|
M. Redmond, B. Adelson, "AlterEgo e-mail filtering agent," Proc. AAAI Workshop on Case-Based Reasoning, 1998.
|
| |
53
|
J. Rennie, "ifile: An application of machine learning to e-mail filtering," Proc. KDD Workshop on Text Mining, 2000.
|
| |
54
|
M. Sahami, S. Dumais, D. Heckerman, E. Horvitz. "A Bayesian approach to filtering junk email," Proc. AAAI Workshop on Learning for Text Categorization, 1998.
|
| |
55
|
B. Schneier, Applied Cryptography Wiley, 1996.
|
 |
56
|
|
| |
57
|
|
| |
58
|
S. Shaw, R. DeFigueiredo, "Structural Processing of Waveforms as Trees," IEEE Transactions on Acoustics, Speech, and Signal Processing 38:2(1990)
|
 |
59
|
|
 |
60
|
|
| |
61
|
R. Swan, D. Jensen, "TimeMines: Constructing Timelines with Statistical Models of Word Usage," KDD-2000 Workshop on Text Mining, 2000.
|
 |
62
|
|
| |
63
|
|
 |
64
|
Yiming Yang , Tom Ault , Thomas Pierce , Charles W. Lattimer, Improving text categorization methods for event tracking, Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, p.65-72, July 24-28, 2000, Athens, Greece
[doi> 10.1145/345508.345550]
|
 |
65
|
|
CITED BY 85
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Daniel Gruhl , R. Guha , Ravi Kumar , Jasmine Novak , Andrew Tomkins, The predictive power of online chatter, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
|
|
|
Tomoyuki Nanno , Toshiaki Fujiki , Yasuhiro Suzuki , Manabu Okumura, Automatically collecting, monitoring, and mining japanese weblogs, Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, May 19-21, 2004, New York, NY, USA
|
|
|
|
|
|
|
|
|
R. Guha , Ravi Kumar , D. Sivakumar , Ravi Sundaram, Unweaving a web of documents, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Gilad Mishne , David Carmel , Ron Hoory , Alexey Roytman , Aya Soffer, Automatic analysis of call-center conversations, Proceedings of the 14th ACM international conference on Information and knowledge management, October 31-November 05, 2005, Bremen, Germany
|
|
|
|
|
|
Michail Vlachos , Christopher Meek , Zografoula Vagena , Dimitrios Gunopulos, Identifying similarities, periodicities and bursts for online search queries, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
|
|
|
|
|
|
Lars Backstrom , Dan Huttenlocher , Jon Kleinberg , Xiangyang Lan, Group formation in large social networks: membership, growth, and evolution, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Deepak Agarwal , Dhiman Barman , Dimitrios Gunopulos , Neal E. Young , Flip Korn , Divesh Srivastava, Efficient and effective explanation of change in hierarchical summaries, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Salvatore J. Stolfo , Shlomo Hershkop , Chia-Wei Hu , Wei-Jen Li , Olivier Nimeskern , Ke Wang, Behavior-based modeling and its application to Email analysis, ACM Transactions on Internet Technology (TOIT), v.6 n.2, p.187-221, May 2006
|
|
|
|
|
|
|
|
|
Eytan Adar , Daniel S. Weld , Brian N. Bershad , Steven S. Gribble, Why we search: visualizing and predicting user behavior, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
|
|
|
Chaomei Chen , Jian Zhang , Weizhong Zhu , Michael Vogeley, Delineating the citation impact of scientific discoveries, Proceedings of the 2007 conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada
|
|
|
|
|
|
|
|
|
|
|
|
Graham Cormode , Flip Korn , S. Muthukrishnan , Divesh Srivastava, Finding hierarchical heavy hitters in data streams, Proceedings of the 29th international conference on Very large data bases, p.464-475, September 09-12, 2003, Berlin, Germany
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Fabian Mörchen , Mathäus Dejori , Dmitriy Fradkin , Julien Etienne , Bernd Wachmann , Markus Bundschus, Anticipating annotations and emerging trends in biomedical literature, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
|
|
|
|
|
|
|
|
|
Xuanhui Wang , ChengXiang Zhai , Xiao Hu , Richard Sproat, Mining correlated bursty topic patterns from coordinated text streams, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
Gabriel Pui Cheong Fung , Jeffrey Xu Yu , Huan Liu , Philip S. Yu, Time-dependent event hierarchy construction, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 12-15, 2007, San Jose, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Shizhu Liu , Yuval Merhav , Wai Gen Yee , Nazli Goharian , Ophir Frieder, A sentence level probabilistic model for evolutionary theme pattern mining from news corpora, Proceedings of the 2009 ACM symposium on Applied Computing, March 08-12, 2009, Honolulu, Hawaii
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Theodoros Lappas , Benjamin Arai , Manolis Platakis , Dimitrios Kotsakos , Dimitrios Gunopulos, On burstiness-aware search for document sequences, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France
|
|
|
|
|
|
|
|
|
Yabin Zheng , Zhiyuan Liu , Maosong Sun , Liyun Ru , Yang Zhang, Incorporating user behaviors in new word detection, Proceedings of the 21st international jont conference on Artifical intelligence, p.2101-2106, July 11-17, 2009, Pasadena, California, USA
|
|
|
Ranieri Baraglia , Carlos Castillo , Debora Donato , Franco Maria Nardini , Raffaele Perego , Fabrizio Silvestri, Aging effects on query flow graphs for query suggestion, Proceeding of the 18th ACM conference on Information and knowledge management, November 02-06, 2009, Hong Kong, China
|
|
|
|
|
|
Syed Toufeeq Ahmed , Ruchi Bhindwale , Hasan Davulcu, Tracking terrorism news threads by extracting eventsignatures, Proceedings of the 2009 IEEE international conference on Intelligence and security informatics, p.182-184, June 08-11, 2009, Richardson, Texas, USA
|
|
|
|
|