上QQ阅读APP看书，第一时间看更新

The train, val, and test datasets

For the rest of the book, I will be structuring my data into three separate sets that I'll refer to as train, val, and test. These three separate datasets, drawn as random samples from the total dataset will be structured and sized approximately like this.

The train dataset will be used for training the network, as expected.

The val dataset, or the validation dataset, will be used to find ideal hyperparameters, and to measure overfitting. At the end of an epoch, which is when the network has has the opportunity to observe every data point in the training set, we will make a prediction on the val set. That prediction will be used to watch for overfitting and will help us know when the network has finished training. Using the val set at the end of each epoch like this somewhat differs from the typical usage. For more information on Hold-Out Validation please reference The Elements of Statistical Learning by Hastie and Tibshirani (https://web.stanford.edu/~hastie/ElemStatLearn/).

The test dataset will be used once all training is complete, to accurately measure model performance on a set of data that the network hasn't seen.

It is very important that the val and test data comes from the same datasets. It is less important that the train dataset matches val and test, although that is still ideal. If image augmentation were being used (performing minor modifications to training images in an attempt to amplify the training set size) for example, the training set distribution may no longer match the val set distribution. This is acceptable and network performance can be adequately measured as long as val and test are from the same distribution.

In traditional machine learning applications it's somewhat customary to use 10-20 percent of the available data for val and test. In deep neural networks it's often the case that our data volume is so large that we can adequately measure network performance with much smaller val and test sets. When data volume goes into the 10s of millions of observations, a 98 percent, 1 percent, 1 percent split may be completely appropriate.