### Transcription of Bias and variance estimation with the Bootstrap …

1 L13: cross-validation Resampling methods Cross validation **Bootstrap** **bias** and **variance** **estimation** **with** the **Bootstrap** Three-way data partitioning CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 1. Introduction Almost invariably, all the pattern recognition techniques that we have introduced have one or more free parameters The number of neighbors in a kNN classifier The bandwidth of the kernel function in kernel density **estimation** The number of features to preserve in a subset selection problem Two issues arise at this point Model Selection: How do we select the optimal parameter(s) for a given classification problem? Validation: Once we have chosen a model, how do we estimate its true error rate? The true error rate is the classifier's error rate when tested on the ENTIRE POPULATION. If we had access to an unlimited number of examples, these questions would have a straightforward answer Choose the model that provides the lowest error rate on the entire population And, of course, that error rate is the true error rate However, in real applications only a finite set of examples is available This number is usually smaller than we would hope for!

2 Why? Data collection is a very expensive process CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 2. One may be tempted to use the entire training data to select the optimal classifier, then estimate the error rate This na ve approach has two fundamental problems The final model will normally overfit the training data: it will not be able to generalize to new data The problem of overfitting is more pronounced **with** models that have a large number of parameters The error rate estimate will be overly optimistic (lower than the true error rate). In fact, it is not uncommon to achieve 100% correct classification on training data The techniques presented in this lecture will allow you to make the best use of your (limited) data for Training Model selection and Performance **estimation** CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 3. The holdout method Split dataset into two groups Training set: used to train the classifier Test set: used to estimate the error rate of the trained classifier Total number of examples Training Set Test Set The holdout method has two basic drawbacks In problems where we have a sparse dataset we may not be able to afford the luxury of setting aside a portion of the dataset for testing Since it is a single train-and-test experiment, the holdout estimate of error rate will be misleading if we happen to get an unfortunate split These limitations of the holdout can be overcome **with** a family of resampling methods at the expense of higher computational cost Cross validation Random subsampling K-fold cross-validation Leave-one-out cross-validation **Bootstrap** CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 4.

3 Random subsampling Random subsampling performs K data splits of the entire dataset Each data split randomly selects a (fixed) number of examples without replacement For each data split we retrain the classifier from scratch **with** the training examples and then estimate **with** the test examples Total number of examples Test example Experiment 1. Experiment 2. Experiment 3. The true error estimate is obtained as the average of the separate estimates . This estimate is significantly better than the holdout estimate 1 . = =1 . CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 5. K-fold cross validation Create a K-fold partition of the dataset For each of experiments, use 1 folds for training and a different fold for testing This procedure is illustrated in the following figure for = 4. Total number of examples Experiment 1. Experiment 2. Experiment 3. Test examples Experiment 4. K-Fold cross validation is similar to random subsampling The advantage of KFCV is that all the examples in the dataset are eventually used for both training and testing As before, the true error is estimated as the average error rate on test examples 1.

4 = =1 . CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 6. Leave-one-out cross validation LOO is the degenerate case of KFCV, where K is chosen as the total number of examples For a dataset **with** examples, perform experiments For each experiment use 1 examples for training and the remaining example for testing Total number of examples Experiment 1. Experiment 2. Experiment 3. Single test example Experiment N. As usual, the true error is estimated as the average error rate on test examples 1 . = =1 . CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 7. How many folds are needed? **with** a large number of folds + The **bias** of the true error rate estimator will be small (the estimator will be very accurate). The **variance** of the true error rate estimator will be large The computational time will be very large as well (many experiments). **with** a small number of folds + The number of experiments and, therefore, computation time are reduced + The **variance** of the estimator will be small The **bias** of the estimator will be large (conservative or larger than the true error rate).

5 In practice, the choice for K depends on the size of the dataset For large datasets, even 3-fold cross validation will be quite accurate For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible A common choice for is K=10. CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 8. The **Bootstrap** The **Bootstrap** is a resampling technique **with** replacement From a dataset **with** examples Randomly select ( **with** replacement) examples and use this set for training The remaining examples that were not selected for training are used for testing This value is likely to change from fold to fold Repeat this process for a specified number of folds ( ). As before, the true error is estimated as the average error rate on test data Complete dataset X1 X2 X3 X4 X5. Experiment 1 X3 X1 X3 X3 X5 X2 X4. Experiment 2 X5 X5 X3 X1 X2 X4. Experiment 3 X5 X5 X1 X2 X1 X3 X4.

6 Experiment K X4 X4 X4 X4 X1 X2 X3 X5. Training sets Test sets CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 9. Compared to basic CV, the **Bootstrap** increases the **variance** that can occur in each fold [Efron and Tibshirani, 1993]. This is a desirable property since it is a more realistic simulation of the real-life experiment from which our dataset was obtained Consider a classification problem **with** classes, a total of . examples and examples for each class . The a priori probability of choosing an example from class is / . Once we choose an example from class , if we do not replace it for the next selection, then the a priori probabilities will have changed since the probability of choosing an example from class will now be 1 / . Thus, sampling **with** replacement preserves the a priori probabilities of the classes throughout the random selection process An additional benefit is that the **Bootstrap** can provide accurate measures of BOTH the **bias** and **variance** of the true error estimate CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 10.

7 **bias** and **variance** of a statistical estimate Problem: estimate parameter of unknown distribution . To emphasize the fact that concerns , we write ( ). Solution We collect examples = { 1 , 2 } from . defines a discrete distribution ' **with** mass 1/ at each example We compute the statistic ' = ( ') as an estimator of ( ). , ( ') may is the estimate of the true error rate for a classifier How good is this estimator? **bias** : How much does it deviate from the true value = . + . = .. **variance** : how much variability does it show for different samples = 2. CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 11. Example: **bias** and **variance** of the sample mean **bias** : The sample mean is known to be an unbiased estimator How about its **variance** ? From statistics, the standard deviation of the sample mean is equal to 1 2. = =1 . ( 1). This term is also known in statistics as the STANDARD ERROR. Unfortunately, there is no such a neat algebraic formula for almost any estimate other than the sample mean CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 12.

8 **bias** and **variance** estimates **with** the **Bootstrap** The **Bootstrap** allows us to estimate **bias** and **variance** for practically any statistical estimate, be it a scalar or vector (matrix). Here we will only describe the **estimation** procedure For more details refer to Advanced algorithms for neural networks [Masters, 1995], which provides an excellent introduction Approach Consider a dataset of examples = { 1 , 2 , , } from distribution . This dataset defines a discrete distribution '. Compute ' = ( ') as our initial estimate of . Let { 1 , 2 , , } be a **Bootstrap** dataset drawn from = { 1 , 2 , , }. Estimate parameter using this **Bootstrap** dataset ( ). Generate **Bootstrap** datasets and obtain estimates { 1 ( ), 2 ( ) , ( )}. The **bias** and **variance** estimates of are 1 . = where = =1 .. 1 2. = . 1. =1. CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 13. Rationale The effect of generating a **Bootstrap** dataset from the distribution ' is similar to the effect of obtaining the dataset = { 1 , 2 , } from the original distribution.

9 In other words, the distribution { 1 ( ), 2 ( ) , ( )} is related to the initial estimate in the same fashion that multiple estimates are related to the true value . CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 14. Example Assume a small dataset = {3,5,2,1,7}, and we want to compute the **bias** and **variance** of the sample mean = We generate a number of **Bootstrap** samples (three in this case). Assume that the first **Bootstrap** yields the dataset {7,3,2,3,1}. We compute the sample mean 1 = The second **Bootstrap** sample yields the dataset {5,1,1,3,7}. We compute the sample mean 2 = The third **Bootstrap** sample yields the dataset {2,2,7,1,3}. We compute the sample mean 3 = We average these estimates and obtain an average of = What are the **bias** and **variance** of the sample mean '? ( ') = .. = .. We may conclude then that resampling introduced a downward **bias** on the mean, so we would be inclined to use + = as an unbiased estimate of.

10 ( ') = / [ .. + .. + .. ] = .. NOTES. We have done this exercise for the sample mean (so you could trace the computations), but could be any other statistical operator! How many **Bootstrap** samples should we use? As a rule of thumb, several hundred resamples will suffice for most problems [Masters,1995]. CSCE 666 Pattern **analysis** | Ricardo Gutierrez-Osuna | CSE@TAMU 15. Three-way data splits If model selection and true error estimates are to be computed simultaneously, the data should be divided into three disjoint sets [Ripley, 1996]. Training set: used for learning, , to fit the parameters of the classifier In an MLP, we would use the training set to find the optimal weights **with** the back-prop rule Validation set: used to select among several trained classifiers In an MLP, we would use the validation set to find the optimal number of hidden units or determine a stopping point for the back-propagation algorithm Test set: used only to assess the performance of a fully-trained classifier In an MLP, we would use the test to estimate the error rate after we have chosen the final model (MLP size and actual weights).