ABSTRACT
Massive transaction streams present a number of opportunities for data mining techniques. The transactions in such streams might represent calls on a telephone network, commercial credit card purchases, stock market trades, or HTTP requests to a web server. While historically such data have been collected for billing or security purposes, they are now being used to discover how the transactors, for example, credit-card numbers or IP addresses, use the associated services.

Over the past 5 years, we have computed evolving profiles (called signatures) of transactors in several very large data streams. The signature for each transactor captures the salient features of his or her behavior through time. Programs for processing signatures must be highly optimized because of the size of the data stream (several gigabytes per day) and the number of signatures to maintain (hundreds of millions). Originally, we wrote such programs directly in C, but because these programs often sacrificed readability for performance, they were difficult to verify and maintain.

Hancock is a domain-specific language we created to express computationally efficient signature programs cleanly. In this paper, we describe the obstacles to computing signatures from massive streams and explain how Hancock addresses these problems. For expository purposes, we present Hancock using a running example from the telecommunications industry; however, the language itself is general and applies equally well to other data sources.