# How to use K-fold Cross Validation with TensorFlow 2 and Keras?

Last Updated on 12 February 2021

When you train supervised machine learning models, you’ll likely try multiple models in order to find out how good they are. Part of this process is likely going to be the question: how can I compare models objectively?

Training and testing datasets have been invented for this purpose. By splitting a small part off your full dataset, you create a dataset which (1) was not yet seen by the model, and which (2) you assume to approximate the distribution of the population, i.e. the real world scenario you wish to generate a predictive model for.

Now, when generating such a split, you should ensure that your splits are relatively unbiased. In this blog post, we’ll cover one technique for doing so: K-fold Cross Validation. Firstly, we’ll show you how such splits can be made naïvely – i.e., by a simple hold out split strategy. Then, we introduce K-fold Cross Validation, show you how it works, and why it can produce better results. This is followed by an example, created with Keras and Scikit-learn’s KFold functions.

Are you ready? Let’s go! 😎

Update 12/Feb/2021: added TensorFlow 2 to title; some styling changes.

Update 11/Jan/2021: added code example to start using K-fold CV straight away.

Update 04/Aug/2020: clarified the (in my view) necessity of validation set even after K-fold CV.

## Code example: K-fold Cross Validation with TensorFlow and Keras

This quick code can be used to perform K-fold Cross Validation with your TensorFlow/Keras model straight away. If you want to understand it in more detail, make sure to read the rest of the article below!

```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import KFold
import numpy as np

# This excerpt assumes that the data arrays (input_train, target_train,
# input_test, target_test) and the configuration values (num_folds,
# loss_function, batch_size, no_epochs, verbosity) have already been defined.

# Define per-fold score containers
acc_per_fold = []
loss_per_fold = []

# Merge inputs and targets
inputs = np.concatenate((input_train, input_test), axis=0)
targets = np.concatenate((target_train, target_test), axis=0)

# Define the K-fold Cross Validator
kfold = KFold(n_splits=num_folds, shuffle=True)

# K-fold Cross Validation model evaluation
fold_no = 1
for train, test in kfold.split(inputs, targets):

    # Define the model architecture
    model = Sequential()
    # ... add the layers of your own model here with model.add(...)

    # Compile the model with a fresh optimizer per fold
    # (assumption: the optimizer is not specified in the text; Adam is used here)
    model.compile(loss=loss_function,
                  optimizer=Adam(),
                  metrics=['accuracy'])

    # Generate a print
    print('------------------------------------------------------------------------')
    print(f'Training for fold {fold_no} ...')

    # Fit data to model
    history = model.fit(inputs[train], targets[train],
                        batch_size=batch_size,
                        epochs=no_epochs,
                        verbose=verbosity)

    # Generate generalization metrics
    scores = model.evaluate(inputs[test], targets[test], verbose=0)
    print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')
    acc_per_fold.append(scores[1] * 100)
    loss_per_fold.append(scores[0])

    # Increase fold number
    fold_no = fold_no + 1
```

## Evaluating and selecting models with K-fold Cross Validation

Training a supervised machine learning model involves changing model weights using a training set. Later, once training has finished, the trained model is tested with new data – the testing set – in order to find out how well it performs in real life.

When you are satisfied with the performance of the model, you train it again with the entire dataset, in order to finalize it and use it in production (Bogdanovist, n.d.).

However, when checking how well the model performs, the question of how to split the dataset is one that emerges pretty rapidly. K-fold Cross Validation, the topic of today’s blog post, is one possible approach, which we’ll discuss next.

However, let’s first take a look at the concept of generating train/test splits in the first place. Why do you need them? Why can’t you simply train the model with all your data and then compare the results with other models? We’ll answer these questions first.

Then, we take a look at the efficient but naïve simple hold-out splits. This way, when we discuss K-fold Cross Validation, you’ll understand more easily why it can be more useful when comparing performance between models. Let’s go!


### Why use train/test splits? – On finding a model that works for you

Before we dive into the approaches for generating train/test splits, I think that it’s important to take a look at why we should split the data in the first place when evaluating model performance.

For this reason, we’ll invent a model evaluation scenario first.

#### Generating many predictions

Say that we’re training a few models to classify images of digits. We train a Support Vector Machine (SVM), a Convolutional Neural Network (CNN) and a Densely-connected Neural Network (DNN) and of course, hope that each of them predicts “5” in this scenario:

Our goal here is to use the model that performs best in production, a.k.a. “really using it” 🙂

The central question then becomes: how well does each model perform?

Based on their performance, we can select a model that can be used in real life.

However, if we wish to determine model performance, we should generate a whole bunch of predictions – preferably, thousands or even more – so that we can compute metrics like accuracy, or loss. Great!
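As a minimal sketch of how such metrics are computed, the snippet below derives accuracy and the sparse categorical crossentropy loss from a handful of predictions. The probabilities and targets are made up for illustration; in practice they come from your model and your dataset.

```python
import math

# Hypothetical predicted class probabilities for three samples (each row sums to 1)
predicted_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
# Ground-truth class index per sample
targets = [0, 1, 0]

# Accuracy: fraction of samples whose most probable class equals the target
predicted_classes = [max(range(len(p)), key=p.__getitem__) for p in predicted_probs]
accuracy = sum(pc == t for pc, t in zip(predicted_classes, targets)) / len(targets)

# Sparse categorical crossentropy: mean negative log-probability of the true class
loss = sum(-math.log(p[t]) for p, t in zip(predicted_probs, targets)) / len(targets)

print(f'accuracy={accuracy:.3f}, loss={loss:.3f}')
```

With only three samples these numbers say very little, which is exactly why thousands of predictions are preferable.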

#### Don’t be the student who checks his own homework

Now, we’ll get to the core of our point – i.e., why we need to generate splits between training and testing data when evaluating machine learning models.

We’ll require an understanding of the high-level supervised machine learning process for this purpose:

It can be read as follows:

• In the first step, all the training samples (in blue on the left) are fed forward to the machine learning model, which generates predictions (blue on the right).
• In the second step, the predictions are compared with the “ground truth” (the real targets) – which results in the computation of a loss value.
• The model can subsequently be optimized by steering it away from the error: in the backward pass, the gradient of the loss value with respect to the weights is computed, and the weights are changed accordingly.
• The process then starts again. Presumably, the model performs better this time.
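The four steps above can be sketched in plain Python for the simplest possible model, a single weight `w` that predicts `w * x`. The toy data, the learning rate and the number of steps are assumptions chosen for illustration; a real neural network follows the same loop with many more weights.

```python
# Toy data following y = 2 * x; the model is a single weight w (prediction = w * x)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0               # initial weight
learning_rate = 0.01  # step size, chosen for illustration
losses = []

for step in range(50):
    # 1. Forward pass: generate predictions
    predictions = [w * x for x in xs]
    # 2. Compare with the ground truth: mean squared error loss
    errors = [p - y for p, y in zip(predictions, ys)]
    loss = sum(e ** 2 for e in errors) / len(xs)
    losses.append(loss)
    # 3. Backward pass: gradient of the loss with respect to w
    gradient = sum(2 * e * x for e, x in zip(errors, xs)) / len(xs)
    # 4. Update the weight against the gradient, then repeat
    w -= learning_rate * gradient

print(f'w={w:.4f}, first loss={losses[0]:.4f}, last loss={losses[-1]:.6f}')
```

The loss shrinks with every pass and `w` approaches 2, the value that generated the data.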

As you can imagine, the model will improve based on the loss generated by the data. This data is a sample, which means that there is always a difference between the sample distribution and the population distribution. In other words, there is always a difference between the patterns that your data suggests and the patterns that exist in the real world. This difference can be really small, but it’s there.

Now, if you let the model train for long enough, it will adapt substantially to the dataset. This also means that the impact of the difference will get larger and larger, relative to the patterns of the real-world scenario. If you’ve trained it for too long – a problem called overfitting – the difference may cause the model to fail when real-world data is fed to it.

Generating a split between training data and testing data can help you detect this issue. By training your model using the training data, you can let it train for as long as you want. Why? Simple: you have the testing data to evaluate model performance afterwards, using data that is (1) presumably representative of the real world and (2) not yet seen by the model. If the model is highly overfit, this will be clear, because it will perform very poorly during the evaluation step with the testing data.

Now, let’s take a look at how we can do this. We’ll start with simple hold-out splits 🙂

### A naïve approach: simple hold-out split

Say that you’ve got a dataset of 10,000 samples. It hasn’t been split into a training and a testing set yet. Generally speaking, an 80/20 split is acceptable. That is, 80% of your data – 8,000 samples in our case – will be used for training purposes, while 20% – 2,000 – will be used for testing.


We can thus simply draw a boundary at 8,000 samples.

We call this a simple hold-out split, as we simply “hold out” the last 2,000 samples (Chollet, 2017).
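In code, such a split is little more than slicing. The sketch below uses a list of integers as a stand-in for a real dataset; the variable names are chosen for illustration.

```python
# A stand-in dataset of 10,000 samples (integers instead of real data)
dataset = list(range(10_000))

# Draw the boundary at 80% of the data
boundary = int(0.8 * len(dataset))
train_set = dataset[:boundary]
test_set = dataset[boundary:]  # the last 2,000 samples are held out

print(len(train_set), len(test_set))
```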

It can be a highly effective approach. What’s more, it’s also very inexpensive in terms of the computational power you need. However, it’s also a very naïve approach, as you’ll have to keep these edge cases in mind all the time (Chollet, 2017):

1. Data representativeness: all datasets, which are essentially samples, must represent the patterns in the population as much as possible. This becomes especially important when you generate samples from a sample (i.e., from your full dataset). For example, if the first part of your dataset has pictures of ice cream, while the latter one only represents espressos, trouble is guaranteed when you generate the split as displayed above. Random shuffling may help you solve these issues.
2. The arrow of time: if you have a time series dataset, your dataset is likely ordered chronologically, and your goal is typically to “[predict] the future given the past” (Chollet, 2017). If you’d shuffle randomly and then perform simple hold-out validation, the model would effectively be trained on data from the future. Such temporal leaks make the measured performance look better than it will be in practice.
3. Data redundancy: if some samples appear more than once, a simple hold-out split with random shuffling may introduce redundancy between training and testing datasets. That is, identical samples belong to both datasets. This is problematic too, as data used for training thus leaks into the dataset for testing implicitly.

Now, as we can see, while a simple hold-out split based approach can be effective and will be efficient in terms of computational resources, it also requires you to monitor for these edge cases continuously.
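The data representativeness issue is easy to reproduce. The sketch below builds an ordered stand-in dataset (the labels and the seed are assumptions for illustration), splits it naively, and then splits it again after shuffling the indices.

```python
import random

# Ordered stand-in dataset: all ice cream pictures first, then all espressos
labels = ['ice cream'] * 5_000 + ['espresso'] * 5_000
boundary = int(0.8 * len(labels))

# Naive split on the ordered data: the test set contains espressos only
naive_test = labels[boundary:]

# Shuffle the indices first (fixed seed for reproducibility), then split
indices = list(range(len(labels)))
random.Random(42).shuffle(indices)
train_idx, test_idx = indices[:boundary], indices[boundary:]
shuffled_test = [labels[i] for i in test_idx]

print(set(naive_test))
print(sorted(set(shuffled_test)))
```

Keep in mind that this remedy only applies to data without a temporal order; for time series, shuffling is exactly what causes the leak described in the second edge case.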

### K-fold Cross Validation

A more expensive and less naïve approach would be to perform K-fold Cross Validation. Here, you set some value for $$K$$ and (hey, what’s in a name 😋) the dataset is split into $$K$$ partitions of equal size. $$K - 1$$ partitions are used for training, while one is used for testing. This process is repeated $$K$$ times, with a different partition used for testing each time.

For example, this would be the scenario for our dataset with $$K = 5$$ (i.e., once again the 80/20 split, but then 5 times!):

For each split, a fresh instance of the same model is trained, and performance is displayed per fold. For evaluation purposes, you can also average it across all folds. While this produces more reliable estimates than a single split, K-fold Cross Validation also increases training cost: in the $$K = 5$$ scenario above, the model must be trained 5 times.
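The splitting procedure itself can be sketched in a few lines of plain Python. The helper `k_fold_indices` is a hypothetical name introduced here for illustration; it assumes that the number of samples is divisible by $$K$$ and does not shuffle.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k  # assumes n_samples is divisible by k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test_indices = indices[start:stop]
        train_indices = indices[:start] + indices[stop:]
        yield train_indices, test_indices

folds = list(k_fold_indices(n_samples=10_000, k=5))
for fold_no, (train_indices, test_indices) in enumerate(folds, start=1):
    print(f'Fold {fold_no}: {len(train_indices)} train / {len(test_indices)} test, '
          f'test starts at index {test_indices[0]}')
```

Every sample ends up in the testing partition exactly once, and in the training partition for the other four folds.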

Let’s now extend our viewpoint with a few variations of K-fold Cross Validation 🙂

If you have no computational limitations whatsoever, you might wish to try a special case of K-fold Cross Validation, called Leave One Out Cross Validation (or LOOCV; Khandelwal, 2019). LOOCV means $$K = N$$, where $$N$$ is the number of samples in your dataset. Every model is trained on all samples but one, so the training sets are as large as they can be and the number of models trained is maximized. The price is the cost of training, due to the sheer number of models that must be trained.
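As a small sketch, Scikit-learn’s `LeaveOneOut` splitter shows what this looks like for six toy samples (the array is a stand-in for real data):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

samples = np.arange(6).reshape(-1, 1)  # six toy samples

loo = LeaveOneOut()
splits = list(loo.split(samples))
for train, test in splits:
    print(f'train={train.tolist()} test={test.tolist()}')
```

Six samples yield six splits, each with a single test sample. For a dataset the size of CIFAR-10, this would mean training tens of thousands of models.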

If you have a classification problem, and especially one with imbalanced classes, you might also wish to take a look at Stratified Cross Validation (Khandelwal, 2019). It extends K-fold Cross Validation by ensuring that each split has approximately the same distribution of target classes as the full dataset. This way, no fold ends up with too few samples of a rare class. In Scikit-learn’s StratifiedKFold, this works for binary and multiclass targets alike, but not for multilabel or continuous targets.
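The sketch below applies Scikit-learn’s `StratifiedKFold` to toy targets with three imbalanced classes, which also shows that the splitter accepts more than two classes. The class sizes and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy targets with three classes: 50 / 30 / 20 samples
targets = np.array([0] * 50 + [1] * 30 + [2] * 20)
inputs = np.zeros((len(targets), 1))  # placeholder features; only the targets drive the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_class_counts = [
    np.bincount(targets[test], minlength=3).tolist()
    for _, test in skf.split(inputs, targets)
]

for fold_no, counts in enumerate(fold_class_counts, start=1):
    print(f'Fold {fold_no}: samples per class in the test set = {counts}')
```

Each test fold receives 10, 6 and 4 samples of the three classes, i.e. the same 50/30/20 proportions as the full dataset.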

Finally, if you have a time series dataset, you might wish to use Time-series Cross Validation (Khandelwal, 2019), in which every test fold lies chronologically after the data used for training.
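Scikit-learn offers `TimeSeriesSplit` for this purpose. A minimal sketch with eight chronologically ordered toy observations:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(8).reshape(-1, 1)  # eight chronologically ordered observations

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(series)]

for train, test in splits:
    print(f'train={train} test={test}')
```

The training window grows with each split, and the test indices always come after the training indices, so no information from the future leaks into training.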

## Creating a Keras model with K-fold Cross Validation

Now that we understand how K-fold Cross Validation works, it’s time to code an example with the Keras deep learning framework 🙂

Coding it will be a multi-stage process:

• Firstly, we’ll take a look at what we need in order to run our model successfully.
• Then, we take a look at today’s model.
• Subsequently, we add K-fold Cross Validation, train the model instances, and average performance.
• Finally, we output the performance metrics on screen.

### What we’ll need to run our model

For running the model, we’ll need to install a set of software dependencies. For today’s blog post, they are as follows:

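• TensorFlow 2.0+, which includes the Keras deep learning framework;
• Scikit-learn, which provides the KFold splitter;
• NumPy;
• Matplotlib, which is only needed for the plots in the original model code.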


### Our model: a CIFAR-10 CNN classifier

Now, today’s model.

We’ll be using a convolutional neural network that can be used to classify CIFAR-10 images into a set of 10 classes. The images are varied, as you can see here:

Now, my goal is not to replicate the process of creating the model here, as we already did that in our blog post “How to build a ConvNet for CIFAR-10 and CIFAR-100 classification with Keras?”. Take a look at that post if you wish to understand the steps that lead to the model below.

(Do note that this is a small adaptation, where we removed the third convolutional block for reasons of speed.)

Here is the full model code of the original CIFAR-10 CNN classifier, which we can use when adding K-fold Cross Validation:

```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt

# Model configuration
batch_size = 50
img_width, img_height, img_num_channels = 32, 32, 3
loss_function = sparse_categorical_crossentropy
no_classes = 10  # CIFAR-10 has 10 classes
no_epochs = 100
optimizer = Adam()  # assumption: the optimizer is not specified in the text; Adam is used here
validation_split = 0.2  # needed for the val_loss / val_accuracy plots below
verbosity = 1

# Load CIFAR-10 data
(input_train, target_train), (input_test, target_test) = cifar10.load_data()

# Determine shape of the data
input_shape = (img_width, img_height, img_num_channels)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize data
input_train = input_train / 255
input_test = input_test / 255

# Create the model
# (layer sizes are illustrative: two convolutional blocks followed by dense layers)
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))

# Compile the model
model.compile(loss=loss_function,
              optimizer=optimizer,
              metrics=['accuracy'])

# Fit data to model
history = model.fit(input_train, target_train,
                    batch_size=batch_size,
                    epochs=no_epochs,
                    verbose=verbosity,
                    validation_split=validation_split)

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

# Visualize history
# Plot history: Loss
plt.plot(history.history['val_loss'])
plt.title('Validation loss history')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.show()

# Plot history: Accuracy
plt.plot(history.history['val_accuracy'])
plt.title('Validation accuracy history')
plt.ylabel('Accuracy value (%)')
plt.xlabel('No. epoch')
plt.show()
```

### Removing obsolete code

Now, let’s slightly adapt the model in order to add K-fold Cross Validation.

Firstly, we’ll strip off some code that we no longer need:

```python
import matplotlib.pyplot as plt
```

We will no longer generate the visualizations, and besides the import we thus also remove the part generating them:

```python
# Visualize history
# Plot history: Loss
plt.plot(history.history['val_loss'])
plt.title('Validation loss history')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.show()

# Plot history: Accuracy
plt.plot(history.history['val_accuracy'])
plt.title('Validation accuracy history')
plt.ylabel('Accuracy value (%)')
plt.xlabel('No. epoch')
plt.show()
```

Secondly, let’s add the KFold code from scikit-learn to the imports – as well as numpy:

```python
from sklearn.model_selection import KFold
import numpy as np
```

According to the Scikit-learn documentation, KFold “[p]rovides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default)” (Scikit-learn, n.d.).

Precisely what we want!
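To see what “consecutive folds without shuffling by default” means in practice, the sketch below splits ten toy samples twice: once with the defaults, and once with `shuffle=True`. The `random_state` value is an assumption that merely makes the shuffled result repeatable.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10).reshape(-1, 1)  # ten toy samples

# Default behaviour: consecutive folds, no shuffling
consecutive_tests = [test.tolist() for _, test in KFold(n_splits=5).split(samples)]

# With shuffle=True the fold membership is randomized; random_state makes it repeatable
shuffled_kfold = KFold(n_splits=5, shuffle=True, random_state=42)
shuffled_tests = [test.tolist() for _, test in shuffled_kfold.split(samples)]

print('Consecutive:', consecutive_tests)
print('Shuffled:   ', shuffled_tests)
```

In both cases every sample appears in exactly one test fold; shuffling only changes which samples end up together.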

We also add a new configuration value:

```python
num_folds = 10
```

This will ensure that our $$K = 10$$.

What’s more, directly after the “normalize data” step, we add two empty lists for storing the results of cross validation:

```python
# Normalize data
input_train = input_train / 255
input_test = input_test / 255

# Define per-fold score containers <-- these are new
acc_per_fold = []
loss_per_fold = []
```

This is followed by a concatenation of our ‘training’ and ‘testing’ datasets – remember that K-fold Cross Validation makes the split!

```python
# Merge inputs and targets
inputs = np.concatenate((input_train, input_test), axis=0)
targets = np.concatenate((target_train, target_test), axis=0)
```
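The effect of this merge can be checked on small stand-in arrays that follow the CIFAR-10 layout, i.e. images of shape (n, 32, 32, 3) and targets of shape (n, 1). The array sizes and the index array below are assumptions for illustration; the real dataset has 50,000 training and 10,000 testing samples.

```python
import numpy as np

# Small stand-ins with the CIFAR-10 layout
input_train = np.zeros((5, 32, 32, 3), dtype='float32')
input_test = np.zeros((2, 32, 32, 3), dtype='float32')
target_train = np.zeros((5, 1), dtype='uint8')
target_test = np.ones((2, 1), dtype='uint8')

# Merge along the sample axis
inputs = np.concatenate((input_train, input_test), axis=0)
targets = np.concatenate((target_train, target_test), axis=0)

# Index arrays, like the ones KFold yields, select matching rows from both arrays
test = np.array([0, 6])
print(inputs.shape, targets.shape, targets[test].ravel().tolist())
```

Because inputs and targets are concatenated in the same order, indexing both with the same index array keeps every image paired with its own label.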

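Based on this prior work, we can add the code for K-fold Cross Validation. We define the cross validator and loop over the splits of the merged data:

```python
# Define the K-fold Cross Validator
kfold = KFold(n_splits=num_folds, shuffle=True)

# K-fold Cross Validation model evaluation
fold_no = 1
for train, test in kfold.split(inputs, targets):
    ...  # the model related steps go here
```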

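Ensure that all the model related steps are now wrapped inside the for loop. Also make sure to add a couple of extra print statements and to replace the inputs and targets passed to model.fit:

```python
# K-fold Cross Validation model evaluation
fold_no = 1
for train, test in kfold.split(inputs, targets):

    # Define the model architecture
    # (layer sizes are illustrative: two convolutional blocks followed by dense layers)
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(no_classes, activation='softmax'))

    # Compile the model with a fresh optimizer per fold
    # (assumption: the optimizer is not specified in the text; Adam is used here)
    model.compile(loss=loss_function,
                  optimizer=Adam(),
                  metrics=['accuracy'])

    # Generate a print
    print('------------------------------------------------------------------------')
    print(f'Training for fold {fold_no} ...')

    # Fit data to model
    history = model.fit(inputs[train], targets[train],
                        batch_size=batch_size,
                        epochs=no_epochs,
                        verbose=verbosity)
```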

We next replace the “test loss” print with one related to what we’re doing. Also, we increase the fold_no:

```python
    # (still inside the for loop)
    # Generate generalization metrics
    scores = model.evaluate(inputs[test], targets[test], verbose=0)
    print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')
    acc_per_fold.append(scores[1] * 100)
    loss_per_fold.append(scores[0])

    # Increase fold number
    fold_no = fold_no + 1
```

Here, we simply print a “score for fold X” – and add the accuracy and sparse categorical crossentropy loss values to the lists.

Now, why do we do that?

Simple: at the end, we provide an overview of all scores and the averages. This allows us to easily compare the model with others, as we can simply compare these outputs. Add this code at the end of the model, but make sure that it is not wrapped inside the for loop:

```python
# == Provide average scores ==
print('------------------------------------------------------------------------')
print('Score per fold')
for i in range(0, len(acc_per_fold)):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - Accuracy: {acc_per_fold[i]}%')
print('------------------------------------------------------------------------')
print('Average scores for all folds:')
print(f'> Accuracy: {np.mean(acc_per_fold)} (+- {np.std(acc_per_fold)})')
print(f'> Loss: {np.mean(loss_per_fold)}')
print('------------------------------------------------------------------------')
```

#### Full model code

Altogether, this is the new code for your K-fold Cross Validation scenario with $$K = 10$$:

```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import KFold
import numpy as np

# Model configuration
batch_size = 50
img_width, img_height, img_num_channels = 32, 32, 3
loss_function = sparse_categorical_crossentropy
no_classes = 10  # CIFAR-10 has 10 classes
no_epochs = 25
verbosity = 1
num_folds = 10

# Load CIFAR-10 data
(input_train, target_train), (input_test, target_test) = cifar10.load_data()

# Determine shape of the data
input_shape = (img_width, img_height, img_num_channels)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize data
input_train = input_train / 255
input_test = input_test / 255

# Define per-fold score containers
acc_per_fold = []
loss_per_fold = []

# Merge inputs and targets
inputs = np.concatenate((input_train, input_test), axis=0)
targets = np.concatenate((target_train, target_test), axis=0)

# Define the K-fold Cross Validator
kfold = KFold(n_splits=num_folds, shuffle=True)

# K-fold Cross Validation model evaluation
fold_no = 1
for train, test in kfold.split(inputs, targets):

    # Define the model architecture
    # (layer sizes are illustrative: two convolutional blocks followed by dense layers)
    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(no_classes, activation='softmax'))

    # Compile the model with a fresh optimizer per fold
    # (assumption: the optimizer is not specified in the text; Adam is used here)
    model.compile(loss=loss_function,
                  optimizer=Adam(),
                  metrics=['accuracy'])

    # Generate a print
    print('------------------------------------------------------------------------')
    print(f'Training for fold {fold_no} ...')

    # Fit data to model
    # (validation_split=0.2 is an assumption: it holds out 20% of each training
    # fold, which is required to obtain val_loss and val_accuracy per epoch)
    history = model.fit(inputs[train], targets[train],
                        batch_size=batch_size,
                        epochs=no_epochs,
                        verbose=verbosity,
                        validation_split=0.2)

    # Generate generalization metrics
    scores = model.evaluate(inputs[test], targets[test], verbose=0)
    print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')
    acc_per_fold.append(scores[1] * 100)
    loss_per_fold.append(scores[0])

    # Increase fold number
    fold_no = fold_no + 1

# == Provide average scores ==
print('------------------------------------------------------------------------')
print('Score per fold')
for i in range(0, len(acc_per_fold)):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - Accuracy: {acc_per_fold[i]}%')
print('------------------------------------------------------------------------')
print('Average scores for all folds:')
print(f'> Accuracy: {np.mean(acc_per_fold)} (+- {np.std(acc_per_fold)})')
print(f'> Loss: {np.mean(loss_per_fold)}')
print('------------------------------------------------------------------------')
```

## Results

Now, it’s time to run the model, to see whether we can get some nice results 🙂

Say, for example, that you saved the model as k-fold-model.py in some folder. Open up your command prompt – for example, Anaconda Prompt – and cd to the folder where your file is stored. Make sure that your dependencies are installed and then run python k-fold-model.py.

If everything goes well, the model should start training for 25 epochs per fold.

### Evaluating the performance of your model

During training, it should produce output like this for each fold:

```
------------------------------------------------------------------------
Training for fold 3 ...
Train on 43200 samples, validate on 10800 samples
Epoch 1/25
43200/43200 [==============================] - 9s 200us/sample - loss: 1.5628 - accuracy: 0.4281 - val_loss: 1.2300 - val_accuracy: 0.5618
Epoch 2/25
43200/43200 [==============================] - 7s 165us/sample - loss: 1.1368 - accuracy: 0.5959 - val_loss: 1.0767 - val_accuracy: 0.6187
Epoch 3/25
43200/43200 [==============================] - 7s 161us/sample - loss: 0.9737 - accuracy: 0.6557 - val_loss: 0.9869 - val_accuracy: 0.6522
Epoch 4/25
43200/43200 [==============================] - 7s 169us/sample - loss: 0.8665 - accuracy: 0.6967 - val_loss: 0.9347 - val_accuracy: 0.6772
Epoch 5/25
43200/43200 [==============================] - 8s 175us/sample - loss: 0.7792 - accuracy: 0.7281 - val_loss: 0.8909 - val_accuracy: 0.6918
Epoch 6/25
43200/43200 [==============================] - 7s 168us/sample - loss: 0.7110 - accuracy: 0.7508 - val_loss: 0.9058 - val_accuracy: 0.6917
Epoch 7/25
43200/43200 [==============================] - 7s 161us/sample - loss: 0.6460 - accuracy: 0.7745 - val_loss: 0.9357 - val_accuracy: 0.6892
Epoch 8/25
43200/43200 [==============================] - 8s 184us/sample - loss: 0.5885 - accuracy: 0.7963 - val_loss: 0.9242 - val_accuracy: 0.6962
Epoch 9/25
43200/43200 [==============================] - 7s 156us/sample - loss: 0.5293 - accuracy: 0.8134 - val_loss: 0.9631 - val_accuracy: 0.6892
Epoch 10/25
43200/43200 [==============================] - 7s 164us/sample - loss: 0.4722 - accuracy: 0.8346 - val_loss: 0.9965 - val_accuracy: 0.6931
Epoch 11/25
43200/43200 [==============================] - 7s 161us/sample - loss: 0.4168 - accuracy: 0.8530 - val_loss: 1.0481 - val_accuracy: 0.6957
Epoch 12/25
43200/43200 [==============================] - 7s 159us/sample - loss: 0.3680 - accuracy: 0.8689 - val_loss: 1.1481 - val_accuracy: 0.6938
Epoch 13/25
43200/43200 [==============================] - 7s 165us/sample - loss: 0.3279 - accuracy: 0.8850 - val_loss: 1.1438 - val_accuracy: 0.6940
Epoch 14/25
43200/43200 [==============================] - 7s 171us/sample - loss: 0.2822 - accuracy: 0.8997 - val_loss: 1.2441 - val_accuracy: 0.6832
Epoch 15/25
43200/43200 [==============================] - 7s 167us/sample - loss: 0.2415 - accuracy: 0.9149 - val_loss: 1.3760 - val_accuracy: 0.6786
Epoch 16/25
43200/43200 [==============================] - 7s 170us/sample - loss: 0.2029 - accuracy: 0.9294 - val_loss: 1.4653 - val_accuracy: 0.6820
Epoch 17/25
43200/43200 [==============================] - 7s 165us/sample - loss: 0.1858 - accuracy: 0.9339 - val_loss: 1.6131 - val_accuracy: 0.6793
Epoch 18/25
43200/43200 [==============================] - 7s 171us/sample - loss: 0.1593 - accuracy: 0.9439 - val_loss: 1.7192 - val_accuracy: 0.6703
Epoch 19/25
43200/43200 [==============================] - 7s 168us/sample - loss: 0.1271 - accuracy: 0.9565 - val_loss: 1.7989 - val_accuracy: 0.6807
Epoch 20/25
43200/43200 [==============================] - 8s 190us/sample - loss: 0.1264 - accuracy: 0.9547 - val_loss: 1.9215 - val_accuracy: 0.6743
Epoch 21/25
43200/43200 [==============================] - 9s 207us/sample - loss: 0.1148 - accuracy: 0.9587 - val_loss: 1.9823 - val_accuracy: 0.6720
Epoch 22/25
43200/43200 [==============================] - 7s 167us/sample - loss: 0.1110 - accuracy: 0.9615 - val_loss: 2.0952 - val_accuracy: 0.6681
Epoch 23/25
43200/43200 [==============================] - 7s 166us/sample - loss: 0.0984 - accuracy: 0.9653 - val_loss: 2.1623 - val_accuracy: 0.6746
Epoch 24/25
43200/43200 [==============================] - 7s 168us/sample - loss: 0.0886 - accuracy: 0.9691 - val_loss: 2.2377 - val_accuracy: 0.6772
Epoch 25/25
43200/43200 [==============================] - 7s 166us/sample - loss: 0.0855 - accuracy: 0.9697 - val_loss: 2.3857 - val_accuracy: 0.6670
Score for fold 3: loss of 2.4695983460744224; accuracy of 66.46666526794434%
------------------------------------------------------------------------
```

Do note the increasing validation loss, a clear sign of overfitting.
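To make this concrete, the sketch below takes the validation losses of fold 3 from the output above, finds the epoch with the lowest value, and simulates a simple patience-based stopping rule. The helper `early_stopping_epoch` and the patience of 3 are assumptions for illustration; Keras ships an `EarlyStopping` callback with `monitor` and `patience` parameters that serves the same purpose during training.

```python
# Validation losses per epoch for fold 3, copied from the training output
val_losses = [1.2300, 1.0767, 0.9869, 0.9347, 0.8909, 0.9058, 0.9357, 0.9242,
              0.9631, 0.9965, 1.0481, 1.1481, 1.1438, 1.2441, 1.3760, 1.4653,
              1.6131, 1.7192, 1.7989, 1.9215, 1.9823, 2.0952, 2.1623, 2.2377,
              2.3857]

# Epoch (1-based) with the lowest validation loss
best_epoch = val_losses.index(min(val_losses)) + 1

def early_stopping_epoch(losses, patience):
    """Return the epoch at which training would stop after `patience`
    consecutive epochs without a new lowest loss."""
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(losses)

stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(f'Lowest validation loss at epoch {best_epoch}; stop at epoch {stop_epoch}')
```

The validation loss bottoms out at epoch 5, while the training loss keeps decreasing for another 20 epochs: from that point on, the model is mostly memorizing its training data.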

And finally, after the 10th fold, it should display the overview with results per fold and the average:

```
------------------------------------------------------------------------
Score per fold
------------------------------------------------------------------------
> Fold 1 - Loss: 2.4094747734069824 - Accuracy: 67.96666383743286%
------------------------------------------------------------------------
> Fold 2 - Loss: 1.768296229839325 - Accuracy: 67.03333258628845%
------------------------------------------------------------------------
> Fold 3 - Loss: 2.4695983460744224 - Accuracy: 66.46666526794434%
------------------------------------------------------------------------
> Fold 4 - Loss: 2.363724467277527 - Accuracy: 66.28333330154419%
------------------------------------------------------------------------
> Fold 5 - Loss: 2.083754387060801 - Accuracy: 65.51666855812073%
------------------------------------------------------------------------
> Fold 6 - Loss: 2.2160572570165 - Accuracy: 65.6499981880188%
------------------------------------------------------------------------
> Fold 7 - Loss: 1.7227793588638305 - Accuracy: 66.76666736602783%
------------------------------------------------------------------------
> Fold 8 - Loss: 2.357142448425293 - Accuracy: 67.25000143051147%
------------------------------------------------------------------------
> Fold 9 - Loss: 1.553109979470571 - Accuracy: 65.54999947547913%
------------------------------------------------------------------------
> Fold 10 - Loss: 2.426255855560303 - Accuracy: 66.03333353996277%
------------------------------------------------------------------------
Average scores for all folds:
> Accuracy: 66.45166635513306 (+- 0.7683473645622098)
> Loss: 2.1370193102995554
------------------------------------------------------------------------
```

This allows you to compare the performance across folds, and compare the averages of the folds across model types you’re evaluating 🙂
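As a quick check of how the averages come about, the sketch below recomputes the mean and standard deviation from the per-fold accuracies, rounded here to four decimals. It uses the standard library’s `statistics.pstdev`, i.e. the population standard deviation, which is also what `np.std` computes by default.

```python
import statistics

# Per-fold accuracies from the overview, rounded to four decimals
acc_per_fold = [67.9667, 67.0333, 66.4667, 66.2833, 65.5167,
                65.6500, 66.7667, 67.2500, 65.5500, 66.0333]

mean_acc = statistics.mean(acc_per_fold)
std_acc = statistics.pstdev(acc_per_fold)  # population standard deviation

print(f'Accuracy: {mean_acc:.4f} (+- {std_acc:.4f})')
```

When comparing two model types, the standard deviation gives a feeling for whether a difference in average accuracy is larger than the fold-to-fold variation.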

In our case, the model produces accuracies of roughly 65-68% per fold, with an average of about 66.5%. This is acceptable, but there is still room for improvement. But hey, that wasn’t the scope of this blog post 🙂

### Model finalization

If you’re satisfied with the performance of your model, you can finalize it. There are two options for doing so:

• Save the best performing model instance (check “How to save and load a model with Keras?” – do note that this requires retraining because you haven’t saved models with the code above), and use it for generating predictions.
• Retrain the model, but this time with all the data – i.e., without making the train/test split. Save that model, and use it for generating predictions. I do suggest to continue using a validation set, as you want to know when the model is overfitting.

Both options have advantages and disadvantages. The advantage of the first is that, once a model has been saved per fold, you don’t have to retrain: you can simply use the best-performing fold. As retraining may be expensive, this could be an option, especially when your model is large. The disadvantage is that this model never saw the part of your data that was held out for testing, and that data might have brought the model closer to the actual patterns in the population. If that’s the case, then the second option is better.

However, that’s entirely up to you! 🙂
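The second option can be sketched as a small helper. This is only an illustration under assumptions: `finalize_model` is a hypothetical name, `build_model` is assumed to be a function of your own that returns a compiled Keras model, and the 20% validation split is an arbitrary choice.

```python
def finalize_model(build_model, inputs, targets, epochs, save_path,
                   validation_split=0.2, callbacks=None):
    """Retrain a fresh model on all data and save it (hypothetical helper).

    build_model is assumed to return a compiled Keras model, i.e. an object
    with Keras-style fit() and save() methods.
    """
    model = build_model()
    history = model.fit(
        inputs, targets,
        epochs=epochs,
        validation_split=validation_split,  # hold back a slice to watch for overfitting
        callbacks=callbacks or [],
        verbose=0,
    )
    model.save(save_path)
    return model, history
```

You could call it as `finalize_model(build_model, inputs, targets, epochs=25, save_path='final_model.h5')`, where the epoch count and file name are placeholders. Passing `callbacks=[EarlyStopping(monitor='val_loss', patience=5)]` from `tensorflow.keras.callbacks` would stop training once validation loss stops improving.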

## Summary

In this blog post, we looked at the concept of model evaluation: what is it? Why would we need it in the first place? And how to do so objectively? If we can’t evaluate models without introducing bias of some sort, there’s no point in evaluating at all, is there?

We introduced simple hold-out splits for this purpose, and showed that while they are efficient in terms of the required computational resources, they are also naïve. K-fold Cross Validation is $$K$$ times more expensive, but can produce significantly better estimates because it trains the model $$K$$ times, each time with a different train/test split.
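To make the difference with a single hold-out split concrete, the sketch below runs Scikit-learn's `KFold` on toy data and counts how often each sample lands in a test fold. The array shape, `n_splits=5` and `random_state=42` are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

inputs = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features each
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

test_counts = np.zeros(len(inputs), dtype=int)
fold_sizes = []
for train_idx, test_idx in kfold.split(inputs):
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)    # one (train size, test size) pair per fold
print(test_counts)   # how often each sample was used for testing
```

Every fold trains on 8 samples and tests on 2, and every sample is used for testing exactly once. With a hold-out split, only one such subset would ever be tested.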

To illustrate this further, we provided an example implementation for the Keras deep learning framework using TensorFlow 2.0. Using a Convolutional Neural Network for CIFAR-10 classification, we obtained accuracies in the range of 60-70% across the folds.

I hope you’ve learnt something from today’s blog post. If you did, feel free to leave a comment in the comments section! If you have questions, you can add a comment or ask a question with the button on the right. Please do the same if you spotted mistakes or when you have other remarks. I’ll happily answer your comments and will improve my blog if that’s the best thing to do.

Thank you for reading MachineCurve today and happy engineering! 😎


## References

Scikit-learn. (n.d.). sklearn.model_selection.KFold — scikit-learn 0.22.1 documentation. Retrieved February 17, 2020, from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

Allibhai, E. (2018, October 3). Holdout vs. Cross-validation in Machine Learning. Retrieved from https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f

Chollet, F. (2017). Deep Learning with Python. New York, NY: Manning Publications.

Bogdanovist. (n.d.). How to choose a predictive model after k-fold cross-validation? Retrieved from https://stats.stackexchange.com/a/52277


## 50 thoughts on “How to use K-fold Cross Validation with TensorFlow 2 and Keras?”

1. Devidas

Great post, but how can I save the best-performing model among all the folds in the program itself?
Also, if I retrain on the whole dataset without using validation, will it become a robust model for unknown population samples?
Please clarify and, if possible, share a code snippet.
Thanks

1. Chris

Hi Devidas,

Question 1: how to save the best performing Keras model across all the folds in K-fold cross validation:
This cannot be done out of the box. However, as you can see in my code, using for ... in, I loop over the folds, and train the model again and again with the split made for that particular fold.

In those cases, you could use Keras ModelCheckpoint to save the best model per fold.

You would need to add to the imports: from tensorflow.keras.callbacks import ModelCheckpoint

Also make sure to import the os module: import os

…and subsequently add the callback to your code so that it runs during training:

    fold_no = 0
    for train, test in kfold.split(inputs, targets):

        # Define the model architecture
        model = Sequential()
        model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Flatten())
        model.add(Dense(256, activation='relu'))
        model.add(Dense(128, activation='relu'))
        model.add(Dense(no_classes, activation='softmax'))

        # Compile the model
        model.compile(loss=loss_function, optimizer=optimizer, metrics=['accuracy'])

        # Define callbacks (fold_no is still zero-based here, so the first fold is saved to ./some_folder/0)
        checkpoint_path = f'./some_folder/{fold_no}'
        os.makedirs(checkpoint_path, exist_ok=True)  # also creates ./some_folder and tolerates reruns
        keras_callbacks = [
            ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, mode='min')
        ]

        # Increase fold no
        fold_no += 1

        # Generate a print
        print('------------------------------------------------------------------------')
        print(f'Training for fold {fold_no} ...')

        # Fit data to model
        history = model.fit(inputs[train], targets[train],
                            batch_size=batch_size,
                            epochs=no_epochs,
                            verbose=verbosity,
                            validation_split=validation_split,
                            callbacks=keras_callbacks)

Now, all best instances of your model given the particular fold are saved. Based on how the folds perform (which you’ll see in your terminal after training), you can pick the saved model that works best.

However, I wouldn’t recommend this, as each fold is trained with a subset of your data – and it might in fact be bias that drives the better performance. Be careful when doing this.

Question 2: if you retrain without validation data, will it become a robust model for unknown samples?
That’s difficult to say, because it depends on the distribution from which you draw the samples. For example, if you cross-validate a ConvNet trained on the MNIST dataset with K-fold cross validation, and it performs well across all folds, you can be confident about training it once more with the full dataset. You might nevertheless wish to use validation data for detecting e.g. overfitting (also see https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/). Now, if you fed that model CIFAR10 data in production usage, you could obviously expect very poor performance, because those samples come from a different distribution.

Hope this helps!

Best,
Chris

1. Keerti

Hi Chris,
Can you clarify if in Question 1, validation_split in fit function is (test[splits], test[targets])?

Basically I have a 5 fold cross validation, with 4 being trained and 1 being validated. Further I have a set of test samples, using which I want to evaluate a model.

So, I have read many articles on the CV, but I could not find any on how to use validation data.

1. Chris

Hi Keerti,

There are some differing views on this topic, but I see K-fold Cross Validation as a method where you split your (shuffled) dataset into K train/test splits, so with K = 5, there will be 5 such splits, and 5 models will be trained (with train data) and evaluated (with test data).
See the figure below for a K = 3 setting.
K-fold Cross Validation does in my view not account for validation data as we know it from neural networks. It is a generic method which also works with e.g. SVMs, which don’t know the concept of validation data, because they are optimized in a different way.
This does however cause a lot of confusion for those who want to apply K-fold Cross Validation to neural networks.
Validation data can additionally be used to steer the training process and stop when the model starts overfitting. This ensures that you have the best loss score available for the model (e.g. you don’t have to specify a fixed number of epochs but can use validation-data driven EarlyStopping instead). So for neural networks, using K-fold CV you generate the train/test split, and then further subdivide the training data into true training data and validation data.
(The overlap in name is what makes it confusing.)
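As a rough sketch of that second step, the helper below takes the training indices of one fold and carves a validation part out of them. The name `split_train_validation`, the 20% fraction and the seed are assumptions for illustration only.

```python
import numpy as np

def split_train_validation(train_idx, validation_fraction=0.2, seed=0):
    """Split one fold's training indices into 'true' training and validation indices."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(train_idx))
    n_val = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Toy example: pretend these indices came from kfold.split(...) for one fold
train_idx = np.arange(80)
fit_idx, val_idx = split_train_validation(train_idx)
print(len(fit_idx), len(val_idx))
```

Within a fold you would then fit with `inputs[fit_idx]`, pass `validation_data=(inputs[val_idx], targets[val_idx])` to `model.fit`, and call `model.evaluate` on the fold's test indices. The `validation_split` argument of `model.fit` achieves a similar effect, but takes the last fraction of the data you pass in, before shuffling.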

Best,
Chris

1. Keerti

Thank you very much. Now I have a clear picture of cross validation.

2. Josseline

Hi Chris! Thanks for this post! It helps me a lot 🙂 I have some questions and I hope you can help me.
1. Do you store checkpoints per fold when val_loss is the lowest in all epochs? I am doing my own implementation on Pytorch and I don’t have clear the criteria of ModelCheckpoint.
2. I am trying with different hyperparameters of one model and I would like to choose what is the best one. I test every config doing KFolds CV to train and validation. Should I store the parameters with the best metric per fold (no matter the epoch) and then choose the best one overall folds?

1. Chris

Hi Josseline,

1. The Keras ModelCheckpoint can be configured in many ways. See https://keras.io/callbacks/#modelcheckpoint for all options. In my case, I use it to save the best-performing model instance only, by setting monitor='val_loss' and mode='min' to track the minimum validation loss, and save_best_only=True to keep only the epoch with the lowest validation loss. In practice, this means that after every epoch, it checks whether validation loss is lower than the best value so far, and if so, it saves the model. Did you know that something similar (albeit implemented differently) is available for PyTorch? https://pytorch.org/ignite/handlers.html#ignite.handlers.ModelCheckpoint It seems that you need to write your own checking logic, though.
2. If I understand you correctly, you are using different hyperparameters for every fold. I wouldn’t do it this way. Instead, I would train every fold with the same set of hyperparameters, and keep your training/validation/testing sets constant. This way, what goes in (the data) is sampled from the same distribution all the time, and (should you use the same random seed for e.g. random weight initialization) nothing much should interfere from a data point of view. Then, for every different set of hyperparameters, I would repeat K-fold Cross Validation. This way, across many K-fold cross validation instances, I can see how well one set of hyperparameters performs generally (within the folds) and how well a model performs across different sets of hyperparameters (across the folds).
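A minimal sketch of that idea: every hyperparameter set is scored on exactly the same folds. The names `compare_configs` and `score_fn` are hypothetical. In practice `folds` could be `list(kfold.split(inputs, targets))`, and `score_fn` would build, train and evaluate a model for one fold.

```python
import numpy as np

def compare_configs(configs, folds, score_fn):
    """Evaluate every configuration on the same folds.

    configs:  dict mapping a name to a hyperparameter dict
    folds:    list of (train_idx, test_idx) pairs, reused for every config
    score_fn: callable(config, train_idx, test_idx) -> float, e.g. test accuracy
    """
    results = {}
    for name, config in configs.items():
        scores = [score_fn(config, train_idx, test_idx) for train_idx, test_idx in folds]
        results[name] = {
            'scores': scores,
            'mean': float(np.mean(scores)),
            'std': float(np.std(scores)),
        }
    return results
```

Because the folds are generated once and reused, differences between the per-configuration means come from the hyperparameters rather than from a lucky or unlucky split.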

Hope this helps. Regards,
Chris

3. Josseline

For every set of hyperparameters I repeat K-Folds CV to get training and validation splits, in order to get K instances for every hyperparameters config. My doubt is how can I choose a model instance for every experiment? I mean, I don’t know what is the best way to decided between hyperparameters sets if per every one I applied K-Folds CV.

I hope I had explained it better 🙂

Regards

1. Chris

Hi Josseline,

So if I understand you correctly, if you have two experiments – say, the same architecture, same hyperparameters, but with one you use the ‘Adam’ optimizer whereas with the other you use the ‘SGD’ optimizer – you repeat K-fold cross validation twice?

So, if K = 10, you effectively make 2×10 splits, train your 2 architectures 10 times each with the different splits, then average the outcome for each fold and check whether there are abnormalities within the folds?

If that’s a correct understanding, now, would my understanding of your question be correct if I’d say your question is “what hyperparameters to choose for my model?”? If not, my apologies.

If so – there are limited general answers to that question. Often, I start with Xavier or He initialization (based on whether I do not or do use ReLU activated layers), Adam optimization, some regularization (L1/L2/Dropout) and LR Range tested learning rates with decay. Then, I start experimenting, and change a few hyperparameters here and there – also based on intuition and what I see happening during the training process. Doing so, K-fold CV can help me validate the model performance across various train/test splits each time, before training the model with the full dataset.

4. Josseline

For now, my experiments are limited to variations of a base architecture, for example, trying with different amount of filters to my convolutional layers and set my learning rate. I applied K-Fold CV because I have a small dataset (with less than 2000 samples) but after it, I don’t know yet what would be my final model, I mean I would like to know what is the strategy to how to decide what of them have the best performance during K-Fold CV.

Do you store all the trained models during your experiments? I am a beginner in this area, so my apologies if I said something wrong in my questions.

1. Chris

Hi Josseline,

Don’t worry, nothing wrong in your questions, it’s the exact opposite in fact – it would be weird for me to answer your question wrongly because I read it wrongly 🙂

I do store all the models during my experiments – but only the best ones per fold (see example code in one of my comments above for implementing this with Keras).

I would consider this strategy:
1. For every variation, train with K-fold CV with the exact same dataset. Set K = 5 for example given your number of samples. Also make sure to use the same loss metric across variations and to use validation data when training.
2. After every training ends (i.e. all the K = 5 splits have finished training), check for abnormalities in every individual fold (this could indicate an imbalanced dataset) and whether your average across the folds is acceptably high.
3. If you see no abnormalities, you can be confident that your model will generalize to data sampled from that distribution. This means that you can now train every variation again, but then with the entire dataset (i.e. no test data – you just used K-fold CV to validate that it generalizes).
4. As you trained all variations with the same dataset, an example way to choose the best final model would be to pick the trained-on-full-dataset model with lowest validation loss after training.
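The abnormality check in step 2 could look roughly like this. It is only a sketch: `flag_unusual_folds` is a hypothetical helper, and both the threshold and the example accuracies are arbitrary.

```python
import numpy as np

def flag_unusual_folds(scores, max_deviation):
    """Return 1-based fold numbers whose score deviates from the mean by more than max_deviation."""
    scores = np.asarray(scores, dtype=float)
    deviations = np.abs(scores - scores.mean())
    return [int(i) + 1 for i in np.flatnonzero(deviations > max_deviation)]

# Illustrative accuracies (percent) for K = 5; the threshold is an arbitrary choice
print(flag_unusual_folds([66.0, 65.5, 67.0, 52.0, 66.5], max_deviation=5.0))
```

Here only fold 4 is flagged, because its accuracy lies more than 5 percentage points from the mean of 63.4. A flagged fold is a reason to inspect the data in that split before trusting the average.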

That’s a general strategy I would follow. However, since you have very few samples (2000 is really small in deep learning terms, where ~60k is considered small if the data is complex), you might also wish to take a look at SVM based classification/regression with manual feature extraction. For example, one option for computer vision based problems would be using a clustering algorithm such as Mean Shift to derive more abstract characteristics followed by an SVM classifier. This setup would be better suited to smaller datasets, I’d say – because your neural networks will likely overfit pretty rapidly.

https://www.machinecurve.com/index.php/2020/04/23/how-to-perform-mean-shift-clustering-with-python-in-scikit/
https://www.machinecurve.com/index.php/2020/05/03/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn/

Regards,
Chris

1. Josseline

Thanks for your help Chris! I am going to put into practice this strategy. I am aware my dataset is pretty small, I am thinking in use data augmentation in order to increase the samples used for training. I read SVM would be another approach, I am going to check your suggestions 🙂

1. Chris

Data augmentation would absolutely be of help in your case. Best of luck! 🙂

Chris

5. Rebeen Ali

Hi thank you very much,

have you shared your code in GitHub to see all the code together

Thank you

6. Rebeen

Hi
could you please provide this code in a github to see all the code together

thank you

7. John

Hi, the formatting of this tutorial is a bit confusing at the moment, all the embedded code appears in single lines with no line breaks. Is there some way to fix this, or perhaps a link to download and view the code ourselves?
Thank you

1. Chris

Hi John,
Thanks for your comment. I am aware of the issue and am looking for a fix. Most likely, I can spend some time on the matter tomorrow.
Regards,
Chris

2. Chris

Hi John,
Things should be normal again!
Regards,
Chris

8. Ming

Hi Chris,

Thank you for sharing a very nice tutorial. But I am just curious about ‘validation_split’ inside the cross validation.

# Fit data to model
history = model.fit(inputs[train], targets[train],
batch_size=batch_size,
epochs=no_epochs,
verbose=verbosity,
validation_split=validation_split)

This will actually reserve 0.2 of inputs[train], targets[train] in your code to be used as validation data. Why do you need validation data here since all results will be averaged after cross validation?

In wikipedia, they don’t use validation data.
https://en.wikipedia.org/wiki/Cross-validation_(statistics)

1. Chris

Hi Ming,

Thanks and I agree, I’ve adapted the article.

Regards,
Chris

1. zay

if this is the case, it shouldnt report val_acc and val_loss during the training right? Also needs updating.

1. Chris

Hi Zay,

Indeed. I’ll make sure to adapt the post.

Best,
Chris

1. Laura

Hi Chris,

First of all, thank you so much for the article, it helps a lot! I just have a few questions and I hope you can help me understand. It would be very much appreciated.

I see in the above comments you previously used val_split in the model.fit, which gave you the val_acc and val_loss during the training of all epochs. With this you could see if overfitting takes place and the amount of epochs should be reduced. You removed the val_split in the model.fit and instead evaluate the models acc and loss of each fold separately after training the model, which gives you one value of accuracy and loss that indicates the overall performance of the model during that fold. However, now you miss the information about the val_acc and val_loss of every epoch during the folds? My question is, how can you check if no overfitting takes place?

You do mention overfitting at the end of the article, but I struggle a bit here.
You state that, after obtaining a satisfactory model, one should retrain the model with the whole training data without the cross validation and you suggest using a validation set to check overfitting. This is where I get confused. I have split my data into a “train set” and “test set”. With the “train set” I plan to do the k-fold cross validation, the “test set” is put aside for now and not used until I have a satisfactory model. So if I’m correct, for every fold the “training set” is split into “training data” which I use in model.fit and “validation data” which I use in model.evaluate. I should tune my hyperparameters every training and the cross validation is used to see if it is generalised, right? After obtaining a model I’m satisfied with, you mention I should thus train it on my whole “training set”. From the comments I take it I train the model without splitting this one into “training data” and “validation data”. Then you mention to use a validation set to check overfitting. This is where I get stuck, probably because I’m a little confused about the validation options (within model.fit or separately with model.evaluate). What should I do:

– train the model on the whole “training set” with model.fit and evaluate the model with model.evaluate on the “test set” (the one that has been not used until this point) (which would give me one value for val_acc and val_loss)

– train the model on the whole “training set” and include validation_data=(“test set”) in the model.fit function (which should give me val_acc and val_loss of every epoch and lets me check overfitting) (but I read on some sites the “test set” should never be used during training, I’m not sure if this is used during training but it is fed into the model.fit so I get a little confused)

– train the model on the whole “training set” with model.fit, but specify a val_split in the model.fit function (so I obtain a val_acc and val_loss of every epoch and will be able to check overfitting) AND evaluate the model with model.evaluate on the “test set” (which has not been seen by the model before, and gives me 1 value for acc and loss) (however I guess not ALL data is now used for training, as you would use a validation split. And you did suggest using all data)

Sorry for the long post, I really hope you can help!

Best,
Laura

2. Chris

Hi Laura,

1. You ask: how can you check if no overfitting takes place?
> Answer: in my answer to your previous question, I mentioned that you can use K-fold CV to generate K train/test splits, and within each such split (a.k.a. fold), you can split a bit off the training set to act as true validation data. That’s why I usually fit with the training data, then use a bit of the training data for validation purposes, but evaluate with model.evaluate after the model has finished training. Now, overfitting happens when validation loss starts increasing substantially. You could use callbacks such as EarlyStopping in Keras to automatically detect this and stop the training process.
2. You describe your approach i.e. having split off a test set and wanting to perform K-fold CV on your train set. Why would you do that? 🙂 K-fold CV can be used to generate such sets, across K folds.
3. With respect to your three choices: definitely number 3. What’s more, as you already know how well your model generalizes by having obtained good test scores from your folds – good meaning that each score does not deviate from the average very much – you can question whether you need the test set after all. You have already determined that your model generalizes well! Why not train it with the entire dataset, for a specified number of epochs (say, the number of epochs until overfitting starts, minus 10% or so)?
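One way to arrive at such a fixed number of epochs is sketched below, assuming you kept the validation loss curve of every fold (e.g. `history.history['val_loss']`). The helper name `epochs_for_final_training` and the example curves are made up for illustration.

```python
import numpy as np

def epochs_for_final_training(val_loss_per_fold, margin=0.10):
    """Suggest a fixed epoch count from per-fold validation loss curves.

    val_loss_per_fold: list of lists, one validation loss per epoch for each fold
    margin:            fraction to subtract as a safety buffer (0.10 = 10%)
    """
    # Epoch (1-based) at which each fold reached its lowest validation loss
    best_epochs = [int(np.argmin(losses)) + 1 for losses in val_loss_per_fold]
    suggested = int(np.floor(np.mean(best_epochs) * (1.0 - margin)))
    return max(suggested, 1)

# Illustrative curves only: minima at epochs 4, 5 and 6
curves = [
    [1.0, 0.8, 0.7, 0.6, 0.7],
    [1.0, 0.9, 0.8, 0.7, 0.6, 0.65],
    [1.2, 1.0, 0.9, 0.8, 0.7, 0.6, 0.7],
]
print(epochs_for_final_training(curves))
```

With these curves the average best epoch is 5, so after subtracting 10% and rounding down the sketch suggests 4 epochs.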

Best,
Chris

9. Brian

Hello! Great post! At the bottom when you say “Retrain the model, but this time with all the data – i.e., without making the split. Save that model, and use it for generating predictions.”

Do you mean model.fit() and just add in the training set and no validation set?

1. Brian J Ferrell

A better explanation of my question

So you do k fold cross validation and instead of saving the best model checkpoint, youre saying to retrain another model.fit but with the entire x_train dataset and LEAVE OUT the testing set. THEN go and make predictions on a set of data it has not seen before.

1. Chris

That’s indeed what I am proposing. You validated whether the model generalizes for that dataset using K folds, which means that it has been evaluated on partial test sets sampled from your full dataset. As you know it generalizes, it would be best to fully train it with the entire dataset, i.e. without holding out test data. What you would still need is a validation set in order to ensure that it’s not overfitting.

Best,
Chris

2. Chris

Hi Brian,

Thanks for your comment and your critical remark. I mean the train/test split here. I think it’s important to continue validating the training process with a validation set as you’ll want to find out when it’s overfitting.

Best,
Chris

10. Brian J Ferrell

It is not printing the validation accuracy when I run the fit model. Just accuracy and loss.. not the other two

1. Chris

That’s correct: I’m not using a validation set there. You could add one, if you like, and use that.

Best,
Chris

11. Hello Chris,

1. Chris

Hi Ahmed,

How to do this entirely depends on the structure of your dataset. Would you mind me asking what it looks like? Are they images with corresponding targets in a CSV file, or are they represented differently?

Best,
Chris

1. Natime

First, thank you for sharing your knowledge. I have custom data. The training and testing data is an image. I have two classes in the training and testing set. So, how to use this example for my problem?

1. Chris

Hi Natime,

How much data (in MB – GB) would that be?

Best,
Chris

1. Natime

It’s 16.9 MB. All the images cropped to 256×256. The total amount of images is 1026.

12. Jim

Hi Chris. Thank you for your help on kfold cross validation in Keras. I would like to do Stratified validation in LSTM. Is it possible to create a simple post or link a github with this approach? I can’t find a good tutorial on the Internet and I have some errors on my code and I can’t solve it. Thank you.

13. Laura

Hi Chris,

Thank you so much for this post, it’s very clear and so useful it includes explanations as well as code. I have a question about the final retraining without training/test split and I know there have been several questions on the topic but I would very much like to have a confirmation I’m understanding / doing this right, as it is all quite new to me. It would be greatly appreciated if you could confirm if I am going in the right direction 🙂

I know the terms test/validation are used interchangeably throughout literature so I’ll try to be as clear as possible. I have split my data into a training and test set, this test set is set aside and not used for now. I want to use the 10-fold cross validation on the training set to tune my hyperparameters. So if I’m correct in understanding, I will be satisfied with a model if it shows a good performance across all 10 folds.
Then, if I understand correctly, you suggest to train this model I’m satisfied with, with the full training set, without using the K-fold. So I would use model.fit with my training data without implementing the for loop and the kfold.split code.
Then you suggest to continue using a validation set, in order to check overfitting. Am I correct in understanding that I should use the test set I have set aside at the beginning and haven’t used since? If I’m correct, I struggle a little with using this test set. Should I use it like in option 1 or 2, or do these give the same results?

Option 1) model.fit(x_train,
y_train,
validation_data=(x_test, y_test))

Option 2) model.fit(x_train,
y_train)
loss, accuracy = model.evaluate(x_test, y_test)

Thanks so much in advance, I hope you can help me with this! Keep up the good work 🙂

Best,
Laura

1. Chris

Hi Laura,

Thanks for your compliment and question!
In practice, people use K-fold CV slightly differently, usually as a consequence of prior expertise and exposure to other ML professionals’ ways of working.
Wikipedia describes CV as follows: “One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set).”
The goal, here, is to ensure that the set you’re training with has no weird anomalies whatsoever with respect to validation data (such as an extreme amount of outliers, as a result of bad luck) – because of training across many folds, and averaging the results, you get a better idea about how your model performs.
Now, with K-fold, you’ll first train across many folds with one fold serving as TESTING data – so you’ll call model.evaluate on that.
This aligns with a definition found at ML Mastery: “The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds.” https://machinelearningmastery.com/k-fold-cross-validation/
(It’s called validation set there, but the article clearly suggests that by configuring K in a way, you get a true “train/test” set, so it’s really training/testing data).
Then averaging the model.evaluates across the K folds, you’ll have a better idea about how well the model performs.

Now, here’s why some people get confused:
How long do you train the model for?
Often, until it starts overfitting.

While K-fold CV thus splits your set into train/test data for every fold, you’ll also need to know when to stop training!
That’s why for each fold, you can use a subset (e.g. 20%) of training data as a validation set, and then call certain callbacks (such as EarlyStopping in Keras) which stop your training process once that’s necessary.

Now, to summarize:
1. You’ll use K-fold CV to have different train/test splits across K folds (K = 5 or K = 10 is common). The fact that this testing data is often called “validation data”, makes things confusing.
2. Within a fold, you can split a bit off the train set to act as true “validation data” e.g. to find when to stop the training process.
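To show what "finding when to stop" means in point 2, here is a simplified, framework-free illustration of patience-based early stopping. It is not how Keras implements `EarlyStopping` internally. The function name and the loss values are assumptions.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop.

    Training stops once validation loss has not improved for `patience`
    consecutive epochs; if that never happens, the last epoch is returned.
    """
    best_loss = float('inf')
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Illustrative validation losses: improvement stops after epoch 3
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.74, 0.9], patience=3))
```

In this example the best loss is reached at epoch 3, and training stops at epoch 6 after three epochs without improvement. In Keras the equivalent would be passing `EarlyStopping(monitor='val_loss', patience=3)` in the `callbacks` list of `model.fit`.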

Hope this helps!

Best,
Chris

1. Laura

Hi Chris,

Thanks so much, honestly, you’re my hero. Sorry for the 2 questions, I thought the first one wouldn’t come through so I posted a second one. It’s all a lot more clear to me now, I just got really confused about validation and test sets.
Could you confirm if I understand it all correctly now? So I should use K-fold CV to split my data into training/test sets. I should tune my hyperparameters, and train/evaluate the model over K folds. During the folds I should keep an eye out for overfitting, which I can do through specifying a validation_split and using callbacks. After the training over all folds is completed, I should check if each fold score does not deviate from the average very much, to check if the model is generalisable. When I find a model that performs well over all folds and have a number of epochs that doesn’t result in overfitting, I could retrain that model on the whole data set (to give it more data, thus to make it better) without the need of using a validation split, as I already know it doesn’t overfit, and without evaluating it on a test set, as I already know it generalises. Correct me if I’m wrong 🙂

Best,
Laura

1. Chris

Hi Laura,

Don’t worry about the two questions – I have an approval mechanism built in here to avoid a lot of spam bots writing comments, which happens a lot because MC is in a tech niche 🙂

That’s exactly what I meant! Do make sure to perform your final training with a fixed amount of epochs, set to the approximate number of epochs found during K fold CV before overfitting starts to occur. This way, you’ll end up with a model that (1) generalizes (determined through K-fold CV), (2) does not benefits from strange outliers (determined through the averaging and deviation checks), (3) does not overfit yet (4) makes use of the maximum amount of data available.

Best of luck,
Chris

1. Chris

By the way, sorry for the poor grammar and spelling in the above comment. Looks like I need some coffee 😂

First of all thanks for this very comprehensive guide! Everything worked perfectly except I don’t know how to add validation data here. When I run the model as you ran it there is only loss and accuracy shown during training, but no val_acc or val_loss.
I used 5-fold CV, so how do I use the 20% that are being left out of training as validation set? I was reading the questions you answered but I am not sure if I can just add validation_split to my model.fit function?
It would be really nice if you could help me there!

1. Chris