ACM Home Page
Please provide us with feedback. Feedback
Nearest neighbors in high-dimensional data: the emergence and influence of hubs
Full text PdfPdf (959 KB)
Source ACM International Conference Proceeding Series; Vol. 382 archive
Proceedings of the 26th Annual International Conference on Machine Learning table of contents
Montreal, Quebec, Canada
Pages 865-872  
Year of Publication: 2009
ISBN:978-1-60558-516-1
Authors
Miloš Radovanović  University of Novi Sad, Novi Sad, Serbia
Alexandros Nanopoulos  University of Hildesheim, Hildesheim, Germany
Mirjana Ivanović  University of Novi Sad, Novi Sad, Serbia
Sponsors
: MITACS
: NSF
Microsoft Research : Microsoft Research
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 7,   Downloads (12 Months): 38,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1553374.1553485
What is a DOI?

ABSTRACT

High dimensionality can pose severe difficulties, widely recognized as different aspects of the curse of dimensionality. In this paper we study a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set. We show that, as dimensionality increases, this distribution becomes considerably skewed and hub points emerge (points with very high k-occurrences). We examine the origin of this phenomenon, showing that it is an inherent property of high-dimensional vector space, and explore its influence on applications based on measuring distances in vector spaces, notably classification, clustering, and information retrieval.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
 
2
 
3
 
4
5
 
6
Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. The MIT Press.
 
7
Demartines, P. (1994). Analyse de données par réseaux de neurones auto-organisés. Doctoral dissertation, Institut Nat'l Polytechnique de Grenoble, France.
 
8
Doddington, G., Liggett, W., Martin, A., Przybocki, M., & Reynolds, D. (1998). SHEEP, GOATS, LAMBS and WOLVES: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. Proc. Int. Conf. on Spoken Language Processing. Paper 0608.
 
9
Erdős, P., & Réényi, A. (1959). On random graphs. Publicationes Mathematicae Debrecen, 6, 290--297.
 
10
 
11
Hicklin, A., Watson, C., & Ulery, B. (2005). The myth of goats: How many people have fingerprints that are hard to match? (Technical Report). National Institute of Standards and Technology.
 
12
Levina, E., & Bickel, P. J. (2005). Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems 17 (pp. 777--784).
 
13
Meilă, M., & Shi, J. (2001). Learning segmentation by random walks. Advances in Neural Information Processing Systems 13 (pp. 873--879).
 
14
Newman, C. M., Rinott, Y., & Tversky, A. (1983). Nearest neighbors and voronoi regions in certain point processes. Advances in Applied Probability, 15, 726--751.
15
 
16
 
17
Yao, Y.-C., & Simons, G. (1996). A large-dimensional independent and identically distributed property for nearest neighbor counts in Poisson processes. Annals of Applied Probability, 6, 561--571.

Collaborative Colleagues:
Miloš Radovanović: colleagues
Alexandros Nanopoulos: colleagues
Mirjana Ivanović: colleagues