Hancock: A language for analyzing transactional data streams
Source: ACM Transactions on Programming Languages and Systems (TOPLAS)
Volume 26, Issue 2 (March 2004)
Pages: 301-338
Year of Publication: 2004
ISSN: 0164-0925
Authors
Corinna Cortes  AT&T Labs, New York, NY
Kathleen Fisher  AT&T Labs, NJ
Daryl Pregibon  AT&T Labs, New York, NY
Anne Rogers  AT&T Labs, Chicago, IL
Frederick Smith  AT&T Labs, Natick, MA
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/973097.973100

ABSTRACT

Massive transaction streams present a number of opportunities for data mining techniques. The transactions in such streams might represent calls on a telephone network, commercial credit card purchases, stock market trades, or HTTP requests to a web server. While historically such data have been collected for billing or security purposes, they are now being used to discover how the transactors, for example, credit-card numbers or IP addresses, use the associated services.

Over the past 5 years, we have computed evolving profiles (called signatures) of transactors in several very large data streams. The signature for each transactor captures the salient features of his or her behavior through time. Programs for processing signatures must be highly optimized because of the size of the data stream (several gigabytes per day) and the number of signatures to maintain (hundreds of millions). Originally, we wrote such programs directly in C, but because these programs often sacrificed readability for performance, they were difficult to verify and maintain.

Hancock is a domain-specific language we created to express computationally efficient signature programs cleanly. In this paper, we describe the obstacles to computing signatures from massive streams and explain how Hancock addresses these problems. For expository purposes, we present Hancock using a running example from the telecommunications industry; however, the language itself is general and applies equally well to other data sources.


