Perspectives on Cross-Validation

Cross-validation is probably the most widely used method for risk estimation in machine learning and statistics. However, analyzing it and comparing it to the data splitting estimator has proved difficult. In the first part of the talk, I will present a new analysis which characterizes the exact asymptotic of cross-validation in the form of a central limit theorem for estimators which satisfy certain stability conditions. In particular, parametric estimators automatically satisfy these conditions, and the theorems characterize the cross-validated risk for such estimators fully. I will demonstrate that they exhibit a wide variety of behaviours: in the case of a parametric empirical risk minimizer, the folds behave as if independent if the evaluation loss is the same as the training loss. However, if a surrogate loss is used, different behaviours may occur. In the second part, I will move on to discuss issues which arise when using cross-validation for high-dimensional estimators: in the regime where the number of parameters is comparable to the number of observations, cross-validation (and data splitting) may introduce serious bias in the estimate of the risk when the amount of data left out is high (i.e. the number of folds is low). A natural approach may thus be to alleviate this problem by leaving out as little data as possible: a single observation, leading to leave-one-out cross-validation (LOOCV). I will show that indeed, such a result holds and the LOOCV estimator is consistent in the high-dimensional asymptotic. Unfortunately, the LOOCV estimator is computationally prohibitive, and cannot be used in practice. Finally, I will discuss a general framework, approximate LOOCV, from which closed-formed approximate estimators can be derived for penalized GLMs, including non-smooth ones such as the LASSO or SVMs.

[Slides]

Date:
Haut-parleurs:
Wenda Zhou
Affiliation:
Columbia University