Distributed multivariate regression based on influential observations
Source: International Conference on Knowledge Discovery and Data Mining
Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Washington, D.C.
POSTER SESSION: Research track
Pages: 679-684
Year of Publication: 2003
ISBN: 1-58113-737-0
Authors
Hang Yu  National University of Singapore
Ee-Chien Chang  National University of Singapore
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/956750.956839

ABSTRACT

Large-scale data sets are sometimes logically and physically distributed across separate databases. Mining such data sets is difficult not only because of their size but also because of their distributed nature: communicating all the data to a central database would be too slow. One way to reduce the communication cost is to compress the data for transmission; another is random sampling. We propose an approach to distributed multivariate regression that is based on sampling, and we discuss its relationship with the compression method. The central idea is motivated by the observation that, although communication is limited, each individual site can still scan and process all the data it holds. A site can therefore communicate only its influential samples, without seeing the data held at other sites. We exploit this observation to derive a method that provides a tradeoff between communication cost and accuracy. Experimental results show that it outperforms both the compression method and random sampling.
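
The idea of sending only influential samples can be illustrated with a minimal sketch. The abstract does not say how influence is measured, so the sketch assumes Cook's distance from a local least-squares fit as a stand-in; it is not the authors' method. All names (solve, gram, fit, cooks_distance, select_influential, make_site), the synthetic data, and the per-site budget k are hypothetical.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def gram(xs):
    """Return X^T X for a list of feature rows."""
    p = len(xs[0])
    return [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]

def fit(xs, ys):
    """Ordinary least squares through the normal equations."""
    p = len(xs[0])
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(p)]
    return solve(gram(xs), xty)

def cooks_distance(xs, ys):
    """Cook's distance of each local row (ASSUMED influence score)."""
    n, p = len(xs), len(xs[0])
    beta = fit(xs, ys)
    g = gram(xs)
    resid = [y - sum(b * v for b, v in zip(beta, x)) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    scores = []
    for x, e in zip(xs, resid):
        h = sum(v * w for v, w in zip(x, solve(g, x)))  # leverage h_ii
        scores.append(e * e * h / (p * s2 * (1 - h) ** 2))
    return scores

def select_influential(xs, ys, k):
    """Site-side step: keep the k rows with the largest influence score."""
    scores = cooks_distance(xs, ys)
    top = sorted(range(len(xs)), key=lambda i: scores[i], reverse=True)[:k]
    return [xs[i] for i in top], [ys[i] for i in top]

def make_site(n, rng):
    """Synthetic site data: y = 2 + 3*x1 - x2 + noise (illustration only)."""
    rows = [[1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    ys = [2.0 + 3.0 * x[1] - 1.0 * x[2] + rng.gauss(0, 0.1) for x in rows]
    return rows, ys

rng = random.Random(0)
sites = [make_site(200, rng) for _ in range(3)]

# Each site scans all of its own data but transmits only k rows.
sent_x, sent_y = [], []
for xs, ys in sites:
    sx, sy = select_influential(xs, ys, k=20)
    sent_x += sx
    sent_y += sy

# The central site fits the regression on the transmitted rows only.
central_beta = fit(sent_x, sent_y)
print(len(sent_x), central_beta)
```

In this sketch the three sites hold 600 rows in total and transmit 60, so k is the knob that trades communication cost against accuracy. Each site scores its rows using only its own local fit, which matches the constraint that a site never sees data held elsewhere.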

