Robust nearest-neighbor methods for classifying high-dimensional data

Chan, Yao-Ban and Hall, Peter (2009) Robust nearest-neighbor methods for classifying high-dimensional data. Annals of Statistics, 37(6A): 3186-3203. doi:10.1214/08-AOS591

Attached file: UQ308958_OA.pdf — Full text (open access), application/pdf, 189.25 KB

Author Chan, Yao-Ban
Hall, Peter
Title Robust nearest-neighbor methods for classifying high-dimensional data
Journal name Annals of Statistics
ISSN 0090-5364
Publication date 2009-12-01
Year available 2009
Sub-type Article (original research)
DOI 10.1214/08-AOS591
Open Access Status File (Publisher version)
Volume 37
Issue 6A
Start page 3186
End page 3203
Total pages 18
Place of publication Beachwood, OH, United States
Publisher Institute of Mathematical Statistics
Language eng
Formatted abstract
We suggest a robust nearest-neighbor approach to classifying high-dimensional data. The method enhances sensitivity by employing a threshold and truncates to a sequence of zeros and ones in order to reduce the deleterious impact of heavy-tailed data. Empirical rules are suggested for choosing the threshold. They require the bare minimum of data; only one data vector is needed from each population. Theoretical and numerical aspects of performance are explored, paying particular attention to the impacts of correlation and heterogeneity among data components. On the theoretical side, it is shown that our truncated, thresholded, nearest-neighbor classifier enjoys the same classification boundary as more conventional, nonrobust approaches, which require finite moments in order to achieve good performance. In particular, the greater robustness of our approach does not come at the price of reduced effectiveness. Moreover, when both training sample sizes equal 1, our new method can have performance equal to that of optimal classifiers that require independent and identically distributed data with known marginal distributions; yet, our classifier does not itself need conditions of this type.
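The abstract's core idea — truncate each data vector to a 0/1 sequence by thresholding, then classify by nearest neighbor with one training vector per population — can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions: a fixed user-supplied threshold (the paper develops empirical rules for choosing it, which are not reproduced here), a two-population setting, and Hamming distance between the truncated sequences; all names are illustrative.

```python
import numpy as np

def truncate(x, t):
    """Reduce a data vector to 0/1 indicators of whether each
    component exceeds the threshold t in absolute value."""
    return (np.abs(x) > t).astype(int)

def classify(z, x_pop1, x_pop2, t):
    """Assign z to population 1 or 2, whichever single training
    vector is closer in Hamming distance after truncation.
    Thresholding caps the influence of any one heavy-tailed component."""
    zt = truncate(z, t)
    d1 = np.sum(zt != truncate(x_pop1, t))
    d2 = np.sum(zt != truncate(x_pop2, t))
    return 1 if d1 <= d2 else 2

# Illustrative call: z shares its large components with x1, not x2.
x1 = np.array([2.8, 0.2, 3.1, 0.1])
x2 = np.array([0.1, 0.0, 0.2, 0.1])
z = np.array([3.0, 0.1, 2.5, 0.0])
print(classify(z, x1, x2, t=1.0))  # → 1
```

Because only the binary exceedance pattern enters the distance, an arbitrarily extreme component contributes at most 1 to the Hamming distance — the mechanism behind the robustness claim.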
Keyword Classification boundary
Detection boundary
False discovery rate
Heterogeneous components
Higher criticism
Optimal classification
Q-Index Code C1
Q-Index Status Provisional
Institutional Status Non-UQ

Document type: Journal Article
Sub-type: Article (original research)
Collection: School of Mathematics and Physics
Citation counts: Cited 3 times in Thomson Reuters Web of Science; cited 4 times in Scopus.
Created: Fri, 13 Sep 2013, 01:18:19 EST by Kay Mackie on behalf of School of Mathematics & Physics