In machine learning, algorithms learn from data by building a mathematical model, typically trained on a training data set with known inputs and labels. Training adjusts model parameters, such as the weights in artificial neural networks, via methods like gradient descent. A separate validation data set is used to tune hyperparameters and to help prevent overfitting through techniques such as early stopping. Finally, a test data set provides an unbiased evaluation of the final model. How the data are divided into training, validation, and test sets depends on the specific problem and on data availability; when data are limited, methods such as cross-validation can make better use of them.
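As a concrete illustration of such a three-way division, a minimal sketch using scikit-learn on a synthetic data set is shown below; the 60/20/20 proportions, the synthetic data, and the use of train_test_split are illustrative assumptions rather than part of the text above.

```python
# Minimal sketch of a train/validation/test split (illustrative proportions).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a labeled data set with known inputs and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First hold out 20% as the test set, then split the remainder so that
# the overall proportions are 60% training, 20% validation, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```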
Training data set
A training data set is a set of examples used during the learning process to fit the parameters (e.g., the weights) of, for example, a classifier.[13][14]
For classification tasks, a supervised learning algorithm looks at the training data set to determine, or learn, the optimal combinations of variables that will generate a good predictive model.[15] The goal is to produce a trained (fitted) model that generalizes well to new, unknown data.[16] The fitted model is evaluated using “new” examples from the held-out data sets (validation and test data sets) to estimate the model’s accuracy in classifying new data.[17] To reduce the risk of issues such as overfitting, the examples in the validation and test data sets should not be used to train the model.[18]
Most approaches that search through training data for empirical relationships tend to overfit the data, meaning that they can identify and exploit apparent relationships in the training data that do not hold in general.
When a training data set is continuously expanded with new data, this is known as incremental learning.
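A rough sketch of this idea, assuming scikit-learn's SGDClassifier and its partial_fit method (an illustrative choice, not mentioned in the text above): the model is updated batch by batch as new training data arrive, rather than being retrained from scratch.

```python
# Illustrative sketch of incremental learning: the training set grows in
# batches and the model is updated with each new batch via partial_fit.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
classes = np.unique(y)  # partial_fit needs the full set of classes up front

model = SGDClassifier(random_state=0)
for X_batch, y_batch in zip(np.array_split(X, 3), np.array_split(y, 3)):
    model.partial_fit(X_batch, y_batch, classes=classes)  # update, do not retrain from scratch
```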
Validation data set
A validation data set is a data set of examples used to tune the hyperparameters (i.e., the architecture) of a model. It is sometimes also called the development set or the "dev set".[19] An example of a hyperparameter for artificial neural networks is the number of hidden units in each layer.[20][21] The validation data set, as well as the test data set (discussed below), should follow the same probability distribution as the training data set.
In order to avoid overfitting, when any classification parameter needs to be adjusted, it is necessary to have a validation data set in addition to the training and test data sets. For example, if the most suitable classifier for the problem is sought, the training data set is used to train the different candidate classifiers, the validation data set is used to compare their performances and decide which one to keep, and finally the test data set is used to obtain performance characteristics such as accuracy, sensitivity, specificity, and F-measure. The validation data set functions as a hybrid: it is training data used for testing, but neither as part of the low-level training nor as part of the final testing.
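As a hedged sketch of the procedure just described (the candidate models, synthetic data, and metrics here are arbitrary choices for illustration), the comparison on the validation set and the final scoring on the test set might look like this in scikit-learn:

```python
# Illustrative model selection: fit candidates on the training set, compare
# them on the validation set, and score only the chosen one on the test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

best_name, best_score = None, -1.0
for name, model in candidates.items():
    model.fit(X_train, y_train)        # low-level training
    score = model.score(X_val, y_val)  # comparison on the validation set
    if score > best_score:
        best_name, best_score = name, score

# The test set is used only for the final performance characteristics
# (accuracy, precision, recall, F-measure, ...).
best_model = candidates[best_name]
print(best_name)
print(classification_report(y_test, best_model.predict(X_test)))
```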
The basic process of using a validation data set for model selection (as part of the training data set, validation data set, and test data set) is:[22][23]
Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set.
An application of this process is in early stopping, where the candidate models are successive iterations of the same network, and training stops when the error on the validation set increases; the previously best model (the one with minimum validation error) is kept.
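A rough sketch of early stopping, under the assumptions that the model is trained iteratively and that a held-out validation set is available (scikit-learn's SGDClassifier and partial_fit are used here purely for illustration):

```python
# Illustrative early stopping: keep training while the validation error
# improves; stop after `patience` epochs without improvement and keep the
# best model seen so far.
import copy
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
classes = np.unique(y_train)

model = SGDClassifier(random_state=0)
best_model, best_error = None, float("inf")
patience, bad_epochs = 5, 0

for epoch in range(200):
    model.partial_fit(X_train, y_train, classes=classes)  # one more training pass
    val_error = 1.0 - model.score(X_val, y_val)           # error on the validation set
    if val_error < best_error:
        best_error, best_model, bad_epochs = val_error, copy.deepcopy(model), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # validation error has stopped improving
            break
```

Many libraries expose this behaviour directly; for instance, scikit-learn's MLPClassifier accepts an early_stopping flag that internally sets aside a validation fraction.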
Test data set
A test data set is a data set that is independent of the training data set, but that follows the same probability distribution as the training data set. If a model fit to the training data set also fits the test data set well, minimal overfitting has taken place. A model that fits the training data set substantially better than the test data set usually indicates overfitting.
A test set is therefore a set of examples used only to assess the performance (i.e., generalization) of a fully specified classifier.[24][25] To do this, the final model is used to predict classifications of examples in the test set. Those predictions are compared to the examples' true classifications to assess the model's accuracy.[26]
In a scenario where both validation and test data sets are used, the test data set is typically used to assess the final model that is selected during the validation process. In the case where the original data set is partitioned into two subsets (training and test data sets), the test data set might assess the model only once (e.g., in the holdout method).[27] Note that some sources advise against such a method.[28] However, when using a method such as cross-validation, two partitions can be sufficient and effective since results are averaged after repeated rounds of model training and testing to help reduce bias and variability.[29][30]
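A minimal sketch of this two-partition setup, assuming scikit-learn (the estimator and fold count are arbitrary choices for illustration): cross-validation on the training portion takes the place of a separate validation set, and the test set is scored only once.

```python
# Illustrative two-partition workflow: k-fold cross-validation on the
# training data for model assessment, one final evaluation on the test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # averaged over 5 folds
print(cv_scores.mean())

model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # the test set is used only for this final check
```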
Confusion in terminology
Testing is trying something to find out about it ("To put to the proof; to prove the truth, genuineness, or quality of by experiment", according to the Collaborative International Dictionary of English) and to validate is to prove that something is valid ("To confirm; to render valid", Collaborative International Dictionary of English). From this perspective, the most common use of the terms test set and validation set is the one described here. However, in both industry and academia, the terms are sometimes used interchangeably, on the view that the internal process is testing different models to improve them (test set as a development set) and that the final model is the one that needs to be validated before real use with unseen data (validation set). "The literature on machine learning often reverses the meaning of 'validation' and 'test' sets. This is the most blatant example of the terminological confusion that pervades artificial intelligence research."[31] Nevertheless, the key point is that the final set, whether called test or validation, should only be used in the final experiment.
Cross-validation
In order to get more stable results and to use all valuable data for training, a data set can be repeatedly split into several training and validation data sets. This is known as cross-validation. To confirm the model's performance, an additional test data set held out from cross-validation is normally used.
The procedure can also be nested: an outer cross-validation splits the data into training and test folds, and within each training fold a further (inner) cross-validation is used for hyperparameter tuning. This is known as nested cross-validation.
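A hedged sketch of nested cross-validation in scikit-learn (the estimator, parameter grid, and fold counts are illustrative assumptions): the inner grid search tunes a hyperparameter, while the outer loop estimates the generalization performance of the whole tuning procedure.

```python
# Illustrative nested cross-validation: GridSearchCV as the inner loop for
# hyperparameter tuning, cross_val_score as the outer loop for assessment.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

inner_search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)  # inner tuning loop
outer_scores = cross_val_score(inner_search, X, y, cv=5)                  # outer assessment loop
print(outer_scores.mean())
```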
Causes of error
Omissions in the training of algorithms are a major cause of erroneous outputs.[32] Types of such omissions include:[33]
- Particular circumstances or variations not included in the training data
- Obsolete data
- Ambiguous input information
- Inability to adapt to new environments
- Inability to request help from a human or another AI system when needed
An example of an omission of particular circumstances is a case in which a boy was able to unlock his mother's phone because she had registered her face under indoor, nighttime lighting, a condition that was not adequately represented in the training of the system.[34][35]
Usage of relatively irrelevant input can include situations where algorithms use the background rather than the object of interest for object detection; for example, a model trained on pictures of sheep on grasslands risks interpreting a different object as a sheep if it appears on a grassland.[36]
See also
- Statistical classification
- List of datasets for machine learning research
- Hierarchical classification
References
1. Ron Kohavi; Foster Provost (1998). "Glossary of terms". Machine Learning. 30: 271–274. doi:10.1023/A:1007411609915. https://ai.stanford.edu/~ronnyk/glossary.html
2. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. New York: Springer. p. vii. ISBN 0-387-31073-8. "Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years."
3. James, Gareth (2013). An Introduction to Statistical Learning: with Applications in R. Springer. p. 176. ISBN 978-1461471370.
4. Ripley, Brian (1996). Pattern Recognition and Neural Networks. Cambridge University Press. p. 354. ISBN 978-0521717700.
5. James, Gareth (2013). An Introduction to Statistical Learning: with Applications in R. Springer. p. 176. ISBN 978-1461471370.
6. Brownlee, Jason (2017-07-13). "What is the Difference Between Test and Validation Datasets?". Retrieved 2017-10-12. https://machinelearningmastery.com/difference-test-validation-datasets/
7. Ripley, Brian (1996). Pattern Recognition and Neural Networks. Cambridge University Press. p. 354. ISBN 978-0521717700.
8. Prechelt, Lutz; Geneviève B. Orr (2012-01-01). "Early Stopping — But When?". In Grégoire Montavon; Klaus-Robert Müller (eds.). Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 53–67. doi:10.1007/978-3-642-35289-8_5. ISBN 978-3-642-35289-8.
9. Prechelt, Lutz; Geneviève B. Orr (2012-01-01). "Early Stopping — But When?". In Grégoire Montavon; Klaus-Robert Müller (eds.). Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 53–67. doi:10.1007/978-3-642-35289-8_5. ISBN 978-3-642-35289-8.
10. Brownlee, Jason (2017-07-13). "What is the Difference Between Test and Validation Datasets?". Retrieved 2017-10-12. https://machinelearningmastery.com/difference-test-validation-datasets/
11. Brownlee, Jason (2017-07-13). "What is the Difference Between Test and Validation Datasets?". Retrieved 2017-10-12. https://machinelearningmastery.com/difference-test-validation-datasets/
12. "Machine learning - Is there a rule-of-thumb for how to divide a dataset into training and validation sets?". Stack Overflow. Retrieved 2021-08-12. https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio
13. Ripley, B.D. (1996). Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press. p. 354.
14. "Subject: What are the population, sample, training set, design set, validation set, and test set?", Neural Network FAQ, part 1 of 7: Introduction (txt), comp.ai.neural-nets, Sarle, W.S., ed. (1997, last modified 2002-05-17). ftp://ftp.sas.com/pub/neural/FAQ.html#A_data
15. Larose, D. T.; Larose, C. D. (2014). Discovering Knowledge in Data: An Introduction to Data Mining. Hoboken: Wiley. doi:10.1002/9781118874059. ISBN 978-0-470-90874-7. OCLC 869460667.
16. Xu, Yun; Goodacre, Royston (2018). "On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning". Journal of Analysis and Testing. 2 (3). Springer Science and Business Media LLC: 249–262. doi:10.1007/s41664-018-0068-2. ISSN 2096-241X. PMC 6373628. PMID 30842888. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6373628
17. Brownlee, Jason (2017-07-13). "What is the Difference Between Test and Validation Datasets?". Retrieved 2017-10-12. https://machinelearningmastery.com/difference-test-validation-datasets/
18. Brownlee, Jason (2017-07-13). "What is the Difference Between Test and Validation Datasets?". Retrieved 2017-10-12. https://machinelearningmastery.com/difference-test-validation-datasets/
19. "Deep Learning". Coursera. Retrieved 2021-05-18. https://www.coursera.org/specializations/deep-learning
20. Ripley, B.D. (1996). Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press. p. 354.
21. "Subject: What are the population, sample, training set, design set, validation set, and test set?", Neural Network FAQ, part 1 of 7: Introduction (txt), comp.ai.neural-nets, Sarle, W.S., ed. (1997, last modified 2002-05-17). ftp://ftp.sas.com/pub/neural/FAQ.html#A_data
22. "Subject: What are the population, sample, training set, design set, validation set, and test set?", Neural Network FAQ, part 1 of 7: Introduction (txt), comp.ai.neural-nets, Sarle, W.S., ed. (1997, last modified 2002-05-17). ftp://ftp.sas.com/pub/neural/FAQ.html#A_data
23. Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Oxford: Oxford University Press. p. 372. https://books.google.com/books?id=T0S0BgAAQBAJ
24. Ripley, B.D. (1996). Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press. p. 354.
25. "Subject: What are the population, sample, training set, design set, validation set, and test set?", Neural Network FAQ, part 1 of 7: Introduction (txt), comp.ai.neural-nets, Sarle, W.S., ed. (1997, last modified 2002-05-17). ftp://ftp.sas.com/pub/neural/FAQ.html#A_data
26. Larose, D. T.; Larose, C. D. (2014). Discovering Knowledge in Data: An Introduction to Data Mining. Hoboken: Wiley. doi:10.1002/9781118874059. ISBN 978-0-470-90874-7. OCLC 869460667.
27. Kohavi, Ron (2001-03-03). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection". 14. https://www.researchgate.net/publication/2352264
28. Xu, Yun; Goodacre, Royston (2018). "On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning". Journal of Analysis and Testing. 2 (3). Springer Science and Business Media LLC: 249–262. doi:10.1007/s41664-018-0068-2. ISSN 2096-241X. PMC 6373628. PMID 30842888. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6373628
29. Brownlee, Jason (2017-07-13). "What is the Difference Between Test and Validation Datasets?". Retrieved 2017-10-12. https://machinelearningmastery.com/difference-test-validation-datasets/
30. Xu, Yun; Goodacre, Royston (2018). "On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning". Journal of Analysis and Testing. 2 (3). Springer Science and Business Media LLC: 249–262. doi:10.1007/s41664-018-0068-2. ISSN 2096-241X. PMC 6373628. PMID 30842888. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6373628
31. Ripley, Brian D. (2008-01-10). "Glossary". Pattern Recognition and Neural Networks. Cambridge University Press. ISBN 9780521717700. OCLC 601063414.
32. Chanda SS, Banerjee DN (2022). "Omission and commission errors underlying AI failures". AI Soc. 39 (3): 1–24. doi:10.1007/s00146-022-01585-x. PMC 9669536. PMID 36415822. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9669536
33. Chanda SS, Banerjee DN (2022). "Omission and commission errors underlying AI failures". AI Soc. 39 (3): 1–24. doi:10.1007/s00146-022-01585-x. PMC 9669536. PMID 36415822. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9669536
34. Chanda SS, Banerjee DN (2022). "Omission and commission errors underlying AI failures". AI Soc. 39 (3): 1–24. doi:10.1007/s00146-022-01585-x. PMC 9669536. PMID 36415822. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9669536
35. Greenberg A (2017-11-14). "Watch a 10-Year-Old's Face Unlock His Mom's iPhone X". Wired. https://www.wired.com/story/10-year-old-face-id-unlocks-mothers-iphone-x/?mbid=social_fb
36. Chanda SS, Banerjee DN (2022). "Omission and commission errors underlying AI failures". AI Soc. 39 (3): 1–24. doi:10.1007/s00146-022-01585-x. PMC 9669536. PMID 36415822. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9669536