ABSTRACT
Large-scale data sets are sometimes logically and physically distributed across separate databases. The difficulty in mining such data sets lies not only in their size but also in their distributed nature: communicating all the data to a central database would be too slow. To reduce communication costs, one could compress the data during transmission; another option is random sampling. We propose an approach to distributed multivariate regression based on sampling and discuss its relationship to the compression method. The central idea is motivated by the observation that, although communication is limited, each individual site can still scan and process all the data it holds. A site can therefore communicate only its influential samples, without seeing the data held at other sites. We exploit this observation to derive a method that provides a tradeoff between communication cost and accuracy. Experimental results show that it outperforms both the compression method and random sampling.
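The idea of sending only locally influential samples can be illustrated with a minimal sketch. The abstract does not state the influence criterion, so the code below assumes leverage (the diagonal of the local hat matrix) as a stand-in; the function names `select_influential` and `central_fit`, the per-site budget `k`, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def select_influential(X, k):
    """Return indices of the k local rows with the highest leverage.

    Assumption: leverage stands in for "influence"; the paper's actual
    criterion is not given in the abstract.
    """
    A = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept
    Q, _ = np.linalg.qr(A)                      # reduced QR, Q has shape (n, p)
    leverage = np.sum(Q ** 2, axis=1)           # diagonal of the hat matrix
    return np.argsort(leverage)[-k:]

def central_fit(samples):
    """Fit a multivariate least-squares model on the pooled samples."""
    X = np.vstack([x for x, _ in samples])
    Y = np.vstack([y for _, y in samples])
    A = np.column_stack([np.ones(len(X)), X])
    B, *_ = np.linalg.lstsq(A, Y, rcond=None)   # one coefficient column per response
    return B

# Synthetic data: 3 sites, 2 predictors, 2 responses (illustrative only).
rng = np.random.default_rng(0)
B_true = np.array([[1.0, -2.0], [0.5, 0.0], [3.0, 1.5]])
sites = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    Y = np.column_stack([np.ones(200), X]) @ B_true + 0.1 * rng.normal(size=(200, 2))
    sites.append((X, Y))

# Each site scans all of its own data but transmits only k rows.
k = 20
sent = []
for X, Y in sites:
    idx = select_influential(X, k)
    sent.append((X[idx], Y[idx]))

B_hat = central_fit(sent)
rows_sent = sum(len(x) for x, _ in sent)        # 60 of the 600 rows overall
```

Varying `k` changes how many rows cross the network, which is one simple way to picture the tradeoff between communication cost and accuracy that the abstract describes.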