# How to use K-fold Cross Validation with Keras?

When you train supervised machine learning models, you’ll likely try multiple models in order to find out how good they are. Part of this process is answering the question: how can I compare models objectively?

Training and testing datasets have been invented for this purpose. By splitting a small part off your full dataset, you create a dataset which (1) was not yet seen by the model, and which (2) you assume to approximate the distribution of the population, i.e. the real world scenario you wish to generate a predictive model for.

Now, when generating such a split, you should ensure that your splits are relatively unbiased. In this blog post, we’ll cover one technique for doing so: K-fold Cross Validation. Firstly, we’ll show you how such splits can be made naïvely – i.e., by a simple hold out split strategy. Then, we introduce K-fold Cross Validation, show you how it works, and why it can produce better results. This is followed by an example, created with Keras and Scikit-learn’s KFold functions.

Are you ready? Let’s go! 😎

Update 11/06/2020: improved K-fold cross validation code based on reader comments.

## Evaluating and selecting models with K-fold Cross Validation

Training a supervised machine learning model involves changing model weights using a training set. Later, once training has finished, the trained model is tested with new data – the testing set – in order to find out how well it performs in real life.

When you are satisfied with the performance of the model, you train it again with the entire dataset, in order to finalize it and use it in production (Bogdanovist, n.d.).

However, when checking how well the model performs, the question of how to split the dataset emerges pretty rapidly. K-fold Cross Validation, the topic of today’s blog post, is one possible approach, which we’ll discuss next.

However, let’s first take a look at the concept of generating train/test splits in the first place. Why do you need them? Why can’t you simply train the model with all your data and then compare the results with other models? We’ll answer these questions first.

Then, we take a look at the efficient but naïve simple hold-out splits. This way, when we discuss K-fold Cross Validation, you’ll understand more easily why it can be more useful when comparing performance between models. Let’s go!

### Why use train/test splits? – On finding a model that works for you

Before we dive into the approaches for generating train/test splits, I think that it’s important to take a look at why we should split the data in the first place when evaluating model performance.

For this reason, we’ll invent a model evaluation scenario first.

#### Generating many predictions

Say that we’re training a few models to classify images of digits. We train a Support Vector Machine (SVM), a Convolutional Neural Network (CNN) and a Densely-connected Neural Network (DNN) and of course, hope that each of them predicts “5” in this scenario:

Our goal here is to use the model that performs best in production, a.k.a. “really using it” 🙂

The central question then becomes: how well does each model perform?

Based on their performance, we can select a model that can be used in real life.

However, if we wish to determine model performance, we should generate a whole bunch of predictions – preferably, thousands or even more – so that we can compute metrics like accuracy, or loss. Great!
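As a minimal illustration of what such a metric boils down to, the sketch below computes accuracy from a handful of predictions with NumPy. The label arrays are made-up values for illustration only; in practice they would hold thousands of entries.

```python
import numpy as np

# Hypothetical ground-truth digits and model predictions (illustrative values only)
targets = np.array([5, 3, 8, 5, 1, 0, 7, 5])
predictions = np.array([5, 3, 3, 5, 1, 0, 1, 5])

# Accuracy is the fraction of predictions that match the ground truth
accuracy = np.mean(predictions == targets)
print(f'Accuracy: {accuracy:.2f}')
```

Six of the eight predictions match here, so the accuracy is 0.75.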

#### Don’t be the student who checks his own homework

Now, we’ll get to the core of our point – i.e., why we need to generate splits between training and testing data when evaluating machine learning models.

We’ll require an understanding of the high-level supervised machine learning process for this purpose:

It can be read as follows:

• In the first step, all the training samples (in blue on the left) are fed forward to the machine learning model, which generates predictions (blue on the right).
• In the second step, the predictions are compared with the “ground truth” (the real targets) – which results in the computation of a loss value.
• The model can subsequently be optimized by steering it away from the error: in the backward pass, the gradient of the loss value with respect to the model’s weights is computed, and the weights are changed accordingly.
• The process then starts again. Presumably, the model performs better this time.
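The four steps above can be sketched with plain NumPy. This is a toy example under stated assumptions, not how Keras implements training: the data (targets that follow y = 3x plus noise), the single-weight model and the learning rate are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data (an assumption for illustration): targets follow y = 3x plus a little noise
x = rng.uniform(-1, 1, size=100)
y = 3 * x + rng.normal(0, 0.1, size=100)

weight = 0.0          # the single weight of our toy model
learning_rate = 0.5
losses = []

for epoch in range(50):
    predictions = weight * x                        # step 1: feed the samples forward
    loss = np.mean((predictions - y) ** 2)          # step 2: compare with the ground truth
    gradient = np.mean(2 * (predictions - y) * x)   # step 3: gradient of the loss w.r.t. the weight
    weight -= learning_rate * gradient              # ... and change the weight against it
    losses.append(loss)                             # record the loss; the loop then repeats (step 4)

print(f'First loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}, weight: {weight:.2f}')
```

The loss shrinks from iteration to iteration and the weight moves towards the value of 3 that generated the data.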

As you can imagine, the model will improve based on the loss generated by the data. This data is a sample, which means that there is always a difference between the sample distribution and the population distribution. In other words, there is always a difference between what your data tells that the patterns are and what the patterns are in the real world. This difference can be really small, but it’s there.

Now, if you let the model train for long enough, it will adapt substantially to the dataset. This also means that the impact of the difference will get larger and larger, relative to the patterns of the real-world scenario. If you’ve trained it for too long – a problem called overfitting – the model may no longer work when real-world data is fed to it.

Generating a split between training data and testing data can help you detect this issue. By training your model using the training data only, you keep the testing data available to evaluate model performance afterwards, using data that is (1) presumably representative of the real world and (2) not yet seen by the model. If the model is highly overfit, this will be clear, because it will perform very poorly during the evaluation step with the testing data.

Now, let’s take a look at how we can do this. We’ll start with simple hold-out splits 🙂
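To make this concrete, here is a small sketch with NumPy. It assumes a toy population (a linear pattern plus noise) and compares a simple and a very flexible polynomial model, standing in for a briefly and an excessively trained network. None of this comes from the CIFAR-10 model used later; it only illustrates why the testing data is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed toy population: a linear pattern plus noise
    x = rng.uniform(-1, 1, size=n)
    return x, 2 * x + rng.normal(0, 0.3, size=n)

x_train, y_train = make_data(15)    # small training sample
x_test, y_test = make_data(200)     # unseen data from the same population

results = {}
for degree in (1, 9):
    coefficients = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefficients, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefficients, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f'degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}')
```

The degree-9 model always fits the training sample at least as well as the degree-1 model, so the training error alone cannot tell you whether the extra flexibility helps. Only the error on the unseen testing data can.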

### A naïve approach: simple hold-out split

Say that you’ve got a dataset of 10.000 samples. It hasn’t been split into a training and a testing set yet. Generally speaking, an 80/20 split is acceptable. That is, 80% of your data – 8.000 samples in our case – will be used for training purposes, while 20% – 2.000 – will be used for testing.

We can thus simply draw a boundary at 8.000 samples, like this:

We call this simple hold-out split, as we simply “hold out” the last 2.000 samples (Chollet, 2017).

It can be a highly effective approach. What’s more, it’s also very inexpensive in terms of the computational power you need. However, it’s also a very naïve approach, as you’ll have to keep these edge cases in mind all the time (Chollet, 2017):

1. Data representativeness: all datasets, which are essentially samples, must represent the patterns in the population as much as possible. This becomes especially important when you generate samples from a sample (i.e., from your full dataset). For example, if the first part of your dataset has pictures of ice cream, while the latter one only represents espressos, trouble is guaranteed when you generate the split as displayed above. Random shuffling may help you solve these issues.
2. The arrow of time: if you have a time series dataset, your dataset is likely ordered chronologically, and the goal is to “[predict] the future given the past” (Chollet, 2017). If you’d shuffle randomly and then perform simple hold-out validation, samples from the future would end up in your training set. Such temporal leaks make the measured performance look better than it will be in practice.
3. Data redundancy: if some samples appear more than once, a simple hold-out split with random shuffling may introduce redundancy between training and testing datasets. That is, identical samples belong to both datasets. This is problematic too, as data used for training thus leaks into the dataset for testing implicitly.

Now, as we can see, while a simple hold-out split based approach can be effective and will be efficient in terms of computational resources, it also requires you to monitor for these edge cases continuously.
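A simple hold-out split with random shuffling could be sketched as follows. The dataset here is random stand-in data with 10.000 samples, and the variable names are chosen for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in dataset of 10,000 samples (random values, for illustration only)
inputs = rng.normal(size=(10000, 4))
targets = rng.integers(0, 2, size=10000)

# Shuffle the indices first, so that any ordering in the data does not end up in the split.
# Skip this step for time series data, as discussed above.
indices = rng.permutation(len(inputs))
boundary = int(0.8 * len(inputs))   # 80/20 boundary at sample 8,000

train_idx, test_idx = indices[:boundary], indices[boundary:]
inputs_train, targets_train = inputs[train_idx], targets[train_idx]
inputs_test, targets_test = inputs[test_idx], targets[test_idx]

print(inputs_train.shape, inputs_test.shape)
```

Shuffling indices rather than the arrays themselves keeps inputs and targets aligned. It does not remove duplicate samples, so the data redundancy issue still has to be handled separately.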


### K-fold Cross Validation

A more expensive and less naïve approach would be to perform K-fold Cross Validation. Here, you set some value for $$K$$ and (hey, what’s in a name 😋) the dataset is split into $$K$$ partitions of equal size. $$K - 1$$ are used for training, while one is used for testing. This process is repeated $$K$$ times, with a different partition used for testing each time.

For example, this would be the scenario for our dataset with $$K = 5$$ (i.e., once again the 80/20 split, but then 5 times!):

For each split, the same model is trained, and performance is displayed per fold. For evaluation purposes, you can obviously also average it across all folds. While this produces better estimates, K-fold Cross Validation also increases training cost: in the $$K = 5$$ scenario above, the model must be trained 5 times.
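To see the partitions that such a procedure generates, here is a small sketch with Scikit-learn’s KFold and $$K = 5$$. The ten stand-in samples and the fixed random_state are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten stand-in samples, identified by their index (illustrative data only)
samples = np.arange(10)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

test_counts = np.zeros(len(samples), dtype=int)
for fold_no, (train_idx, test_idx) in enumerate(kfold.split(samples), start=1):
    test_counts[test_idx] += 1
    print(f'Fold {fold_no}: train={train_idx}, test={test_idx}')

# Every sample ends up in the testing partition exactly once
print(test_counts)
```

Each fold uses 8 samples for training and 2 for testing, and over the 5 folds every sample is used for testing once.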

Let’s now extend our viewpoint with a few variations of K-fold Cross Validation 🙂

If you have no computational limitations whatsoever, you might wish to try a special case of K-fold Cross Validation, called Leave One Out Cross Validation (or LOOCV; Khandelwal, 2019). LOOCV means $$K = N$$, where $$N$$ is the number of samples in your dataset: every model is trained on all samples but one, and tested on the remaining one. As the number of models trained is maximized, every model is trained on nearly the full dataset, but the cost of training is maximized too, due to the sheer number of models that must be trained.
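Scikit-learn offers this variant as LeaveOneOut. The sketch below uses six stand-in samples, which is an assumption for illustration, to show that the number of splits equals the number of samples.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

samples = np.arange(6)  # six stand-in samples

loo = LeaveOneOut()
splits = list(loo.split(samples))
for train_idx, test_idx in splits:
    print(f'train={train_idx}, test={test_idx}')

print(f'Number of models to train: {len(splits)}')
```

With the 60.000 images of CIFAR-10, this would mean training 60.000 models, which shows why LOOCV is rarely practical for deep learning.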

If you have a classification problem, you might also wish to take a look at Stratified Cross Validation (Khandelwal, 2019). It extends K-fold Cross Validation by ensuring that the target classes are distributed over each split in roughly the same proportions as in the full dataset. This is especially useful when your classes are imbalanced, as each fold then remains representative of the class distribution. Implementations such as Scikit-learn’s StratifiedKFold support both binary and multiclass targets.
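A short sketch with Scikit-learn’s StratifiedKFold shows the effect on an imbalanced set of targets. The targets (twelve samples of class 0, six of class 1) are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced binary targets (made up): 12 x class 0, 6 x class 1
targets = np.array([0] * 12 + [1] * 6)
inputs = np.zeros((len(targets), 1))  # the features do not influence the split

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
fold_class_counts = []
for train_idx, test_idx in skf.split(inputs, targets):
    counts = np.bincount(targets[test_idx], minlength=2)
    fold_class_counts.append(counts.tolist())
    print(f'Class counts in testing fold: {counts}')
```

Every testing fold contains four samples of class 0 and two of class 1, which is the same 2:1 ratio as the full set of targets.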

Finally, if you have a time series dataset, you might wish to use Time-series Cross Validation (Khandelwal, 2019), in which each testing partition lies chronologically after the data used for training.
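One way to generate such splits is Scikit-learn’s TimeSeriesSplit. Do note that the choice for this class is mine and not taken from the cited article. The twelve ordered observations are stand-in data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve chronologically ordered observations (stand-in data)
series = np.arange(12)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(series))
for train_idx, test_idx in splits:
    print(f'train={train_idx}, test={test_idx}')
```

The training window grows with each split, while the testing samples always come after the last training sample, so nothing from the future leaks into training.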

## Creating a Keras model with K-fold Cross Validation

Now that we understand how K-fold Cross Validation works, it’s time to code an example with the Keras deep learning framework 🙂

Coding it will be a multi-stage process:

• Firstly, we’ll take a look at what we need in order to run our model successfully.
• Then, we take a look at today’s model.
• Subsequently, we add K-fold Cross Validation, train the model instances, and average performance.
• Finally, we output the performance metrics on screen.

### What we’ll need to run our model

For running the model, we’ll need to install a set of software dependencies. For today’s blog post, they are as follows:

• TensorFlow 2.0+, which includes the Keras deep learning framework;
• The most recent version of scikit-learn;
• Numpy.

That’s it, already! 🙂

### Our model: a CIFAR-10 CNN classifier

Now, today’s model.

We’ll be using a convolutional neural network that can be used to classify CIFAR-10 images into a set of 10 classes. The images are varied, as you can see here:

Now, my goal is not to replicate the process of creating the model here, as we already did that in our blog post “How to build a ConvNet for CIFAR-10 and CIFAR-100 classification with Keras?”. Take a look at that post if you wish to understand the steps that lead to the model below.

(Do note that this is a small adaptation, where we removed the third convolutional block for reasons of speed.)

Here is the full model code of the original CIFAR-10 CNN classifier, which we can use when adding K-fold Cross Validation:

from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt

# Model configuration
batch_size = 50
img_width, img_height, img_num_channels = 32, 32, 3
loss_function = sparse_categorical_crossentropy
no_classes = 10  # CIFAR-10 has 10 target classes
no_epochs = 100
optimizer = Adam()
verbosity = 1

# Load CIFAR-10 data
(input_train, target_train), (input_test, target_test) = cifar10.load_data()

# Determine shape of the data
input_shape = (img_width, img_height, img_num_channels)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize data
input_train = input_train / 255
input_test = input_test / 255

# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))

# Compile the model
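model.compile(loss=loss_function,
              optimizer=optimizer,
              metrics=['accuracy'])

# Fit data to model, holding out 20% of the training data for validation
# so that the val_loss and val_accuracy plotted below are recorded
history = model.fit(input_train, target_train,
                    batch_size=batch_size,
                    epochs=no_epochs,
                    verbose=verbosity,
                    validation_split=0.2)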

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

# Visualize history
# Plot history: Loss
plt.plot(history.history['val_loss'])
plt.title('Validation loss history')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.show()

# Plot history: Accuracy
plt.plot(history.history['val_accuracy'])
plt.title('Validation accuracy history')
plt.ylabel('Accuracy value (%)')
plt.xlabel('No. epoch')
plt.show()

### Removing obsolete code

Now, let’s slightly adapt the model in order to add K-fold Cross Validation.

Firstly, we’ll strip off some code that we no longer need:

import matplotlib.pyplot as plt

We will no longer generate the visualizations, and besides the import we thus also remove the part generating them:

# Visualize history
# Plot history: Loss
plt.plot(history.history['val_loss'])
plt.title('Validation loss history')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.show()

# Plot history: Accuracy
plt.plot(history.history['val_accuracy'])
plt.title('Validation accuracy history')
plt.ylabel('Accuracy value (%)')
plt.xlabel('No. epoch')
plt.show()

### Adding K-fold Cross Validation

Secondly, let’s add the KFold code from scikit-learn to the imports – as well as numpy:

from sklearn.model_selection import KFold
import numpy as np

Which…

Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default).

Scikit-learn (n.d.) sklearn.model_selection.KFold

Precisely what we want!

We also add a new configuration value:

num_folds = 10

This will ensure that our $$K = 10$$.

What’s more, directly after the “normalize data” step, we add two empty lists for storing the results of cross validation:

# Normalize data
input_train = input_train / 255
input_test = input_test / 255

# Define per-fold score containers <-- these are new
acc_per_fold = []
loss_per_fold = []

This is followed by a concat of our ‘training’ and ‘testing’ datasets – remember that K-fold Cross Validation makes the split!

# Merge inputs and targets
inputs = np.concatenate((input_train, input_test), axis=0)
targets = np.concatenate((target_train, target_test), axis=0)

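Based on this prior work, we can add the code for K-fold Cross Validation. We first define the cross validator and then loop over the splits it generates:

# Define the K-fold Cross Validator
kfold = KFold(n_splits=num_folds, shuffle=True)

# K-fold Cross Validation model evaluation
fold_no = 1
for train, test in kfold.split(inputs, targets):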

Ensure that all the model related steps are now wrapped inside the for loop. Also make sure to add a couple of extra print statements and to replace the inputs and targets passed to model.fit:

# K-fold Cross Validation model evaluation
fold_no = 1
for train, test in kfold.split(inputs, targets):

  # Define the model architecture
  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
  model.add(MaxPooling2D(pool_size=(2, 2)))
  model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
  model.add(MaxPooling2D(pool_size=(2, 2)))
  model.add(Flatten())
  model.add(Dense(256, activation='relu'))
  model.add(Dense(128, activation='relu'))
  model.add(Dense(no_classes, activation='softmax'))

  # Compile the model with a fresh optimizer, so that no optimizer state is shared between folds
  model.compile(loss=loss_function,
                optimizer=Adam(),
                metrics=['accuracy'])

  # Generate a print
  print('------------------------------------------------------------------------')
  print(f'Training for fold {fold_no} ...')

  # Fit data to model, using the training indices of this fold only
  history = model.fit(inputs[train], targets[train],
                      batch_size=batch_size,
                      epochs=no_epochs,
                      verbose=verbosity)

We next replace the “test loss” print with one related to what we’re doing. Also, we increase the fold_no:

  # Generate generalization metrics
  scores = model.evaluate(inputs[test], targets[test], verbose=0)
  print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')
  acc_per_fold.append(scores[1] * 100)
  loss_per_fold.append(scores[0])

  # Increase fold number
  fold_no = fold_no + 1

Here, we simply print a “score for fold X” – and add the accuracy and sparse categorical crossentropy loss values to the lists.

Now, why do we do that?

Simple: at the end, we provide an overview of all scores and the averages. This allows us to easily compare the model with others, as we can simply compare these outputs. Add this code at the end of the model, but make sure that it is not wrapped inside the for loop:

# == Provide average scores ==
print('------------------------------------------------------------------------')
print('Score per fold')
for i in range(0, len(acc_per_fold)):
  print('------------------------------------------------------------------------')
  print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - Accuracy: {acc_per_fold[i]}%')
print('------------------------------------------------------------------------')
print('Average scores for all folds:')
print(f'> Accuracy: {np.mean(acc_per_fold)} (+- {np.std(acc_per_fold)})')
print(f'> Loss: {np.mean(loss_per_fold)}')
print('------------------------------------------------------------------------')

#### Full model code

Altogether, this is the new code for your K-fold Cross Validation scenario with $$K = 10$$:

from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import KFold
import numpy as np

# Model configuration
batch_size = 50
img_width, img_height, img_num_channels = 32, 32, 3
loss_function = sparse_categorical_crossentropy
no_classes = 10  # CIFAR-10 has 10 target classes
no_epochs = 25
optimizer = Adam()
verbosity = 1
num_folds = 10

# Load CIFAR-10 data
(input_train, target_train), (input_test, target_test) = cifar10.load_data()

# Determine shape of the data
input_shape = (img_width, img_height, img_num_channels)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize data
input_train = input_train / 255
input_test = input_test / 255

# Define per-fold score containers
acc_per_fold = []
loss_per_fold = []

# Merge inputs and targets
inputs = np.concatenate((input_train, input_test), axis=0)
targets = np.concatenate((target_train, target_test), axis=0)

# Define the K-fold Cross Validator
kfold = KFold(n_splits=num_folds, shuffle=True)

# K-fold Cross Validation model evaluation
fold_no = 1
for train, test in kfold.split(inputs, targets):

  # Define the model architecture
  model = Sequential()
  model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
  model.add(MaxPooling2D(pool_size=(2, 2)))
  model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
  model.add(MaxPooling2D(pool_size=(2, 2)))
  model.add(Flatten())
  model.add(Dense(256, activation='relu'))
  model.add(Dense(128, activation='relu'))
  model.add(Dense(no_classes, activation='softmax'))

  # Compile the model with a fresh optimizer, so that no optimizer state is shared between folds
  model.compile(loss=loss_function,
                optimizer=Adam(),
                metrics=['accuracy'])

  # Generate a print
  print('------------------------------------------------------------------------')
  print(f'Training for fold {fold_no} ...')

  # Fit data to model, using the training indices of this fold only
  history = model.fit(inputs[train], targets[train],
                      batch_size=batch_size,
                      epochs=no_epochs,
                      verbose=verbosity)

  # Generate generalization metrics on the held-out indices of this fold
  scores = model.evaluate(inputs[test], targets[test], verbose=0)
  print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')
  acc_per_fold.append(scores[1] * 100)
  loss_per_fold.append(scores[0])

  # Increase fold number
  fold_no = fold_no + 1

# == Provide average scores ==
print('------------------------------------------------------------------------')
print('Score per fold')
for i in range(0, len(acc_per_fold)):
  print('------------------------------------------------------------------------')
  print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - Accuracy: {acc_per_fold[i]}%')
print('------------------------------------------------------------------------')
print('Average scores for all folds:')
print(f'> Accuracy: {np.mean(acc_per_fold)} (+- {np.std(acc_per_fold)})')
print(f'> Loss: {np.mean(loss_per_fold)}')
print('------------------------------------------------------------------------')

## Results

Now, it’s time to run the model, to see whether we can get some nice results 🙂

Say, for example, that you saved the model as k-fold-model.py in some folder. Open up your command prompt – for example, Anaconda Prompt – and cd to the folder where your file is stored. Make sure that your dependencies are installed and then run python k-fold-model.py.

If everything goes well, the model should start training for 25 epochs per fold.
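To get a feeling for the amount of work involved, the sketch below inspects the fold sizes without loading any images. It relies on the fact that CIFAR-10 ships with 50.000 training and 10.000 testing images, and the variable names mirror the configuration used above.

```python
import numpy as np
from sklearn.model_selection import KFold

num_samples = 50000 + 10000   # CIFAR-10 training and testing images, merged
num_folds = 10
no_epochs = 25

# Indices are enough to inspect the split sizes; no image data is needed here
kfold = KFold(n_splits=num_folds, shuffle=True)
train_idx, test_idx = next(kfold.split(np.arange(num_samples)))

print(f'Per fold: {len(train_idx)} training / {len(test_idx)} testing samples')
print(f'Total epochs across all folds: {num_folds * no_epochs}')
```

Each fold thus trains on 54.000 samples and tests on 6.000, and the whole run amounts to 250 epochs of training.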

### Evaluating the performance of your model

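During training, it should produce output like this for each fold. Do note that this particular log was recorded with a validation split of 20% enabled in model.fit (hence 43200 training and 10800 validation samples), which is why it also reports val_loss and val_accuracy; with the code above, only loss and accuracy are shown: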

------------------------------------------------------------------------
Training for fold 3 ...
Train on 43200 samples, validate on 10800 samples
Epoch 1/25
43200/43200 [==============================] - 9s 200us/sample - loss: 1.5628 - accuracy: 0.4281 - val_loss: 1.2300 - val_accuracy: 0.5618
Epoch 2/25
43200/43200 [==============================] - 7s 165us/sample - loss: 1.1368 - accuracy: 0.5959 - val_loss: 1.0767 - val_accuracy: 0.6187
Epoch 3/25
43200/43200 [==============================] - 7s 161us/sample - loss: 0.9737 - accuracy: 0.6557 - val_loss: 0.9869 - val_accuracy: 0.6522
Epoch 4/25
43200/43200 [==============================] - 7s 169us/sample - loss: 0.8665 - accuracy: 0.6967 - val_loss: 0.9347 - val_accuracy: 0.6772
Epoch 5/25
43200/43200 [==============================] - 8s 175us/sample - loss: 0.7792 - accuracy: 0.7281 - val_loss: 0.8909 - val_accuracy: 0.6918
Epoch 6/25
43200/43200 [==============================] - 7s 168us/sample - loss: 0.7110 - accuracy: 0.7508 - val_loss: 0.9058 - val_accuracy: 0.6917
Epoch 7/25
43200/43200 [==============================] - 7s 161us/sample - loss: 0.6460 - accuracy: 0.7745 - val_loss: 0.9357 - val_accuracy: 0.6892
Epoch 8/25
43200/43200 [==============================] - 8s 184us/sample - loss: 0.5885 - accuracy: 0.7963 - val_loss: 0.9242 - val_accuracy: 0.6962
Epoch 9/25
43200/43200 [==============================] - 7s 156us/sample - loss: 0.5293 - accuracy: 0.8134 - val_loss: 0.9631 - val_accuracy: 0.6892
Epoch 10/25
43200/43200 [==============================] - 7s 164us/sample - loss: 0.4722 - accuracy: 0.8346 - val_loss: 0.9965 - val_accuracy: 0.6931
Epoch 11/25
43200/43200 [==============================] - 7s 161us/sample - loss: 0.4168 - accuracy: 0.8530 - val_loss: 1.0481 - val_accuracy: 0.6957
Epoch 12/25
43200/43200 [==============================] - 7s 159us/sample - loss: 0.3680 - accuracy: 0.8689 - val_loss: 1.1481 - val_accuracy: 0.6938
Epoch 13/25
43200/43200 [==============================] - 7s 165us/sample - loss: 0.3279 - accuracy: 0.8850 - val_loss: 1.1438 - val_accuracy: 0.6940
Epoch 14/25
43200/43200 [==============================] - 7s 171us/sample - loss: 0.2822 - accuracy: 0.8997 - val_loss: 1.2441 - val_accuracy: 0.6832
Epoch 15/25
43200/43200 [==============================] - 7s 167us/sample - loss: 0.2415 - accuracy: 0.9149 - val_loss: 1.3760 - val_accuracy: 0.6786
Epoch 16/25
43200/43200 [==============================] - 7s 170us/sample - loss: 0.2029 - accuracy: 0.9294 - val_loss: 1.4653 - val_accuracy: 0.6820
Epoch 17/25
43200/43200 [==============================] - 7s 165us/sample - loss: 0.1858 - accuracy: 0.9339 - val_loss: 1.6131 - val_accuracy: 0.6793
Epoch 18/25
43200/43200 [==============================] - 7s 171us/sample - loss: 0.1593 - accuracy: 0.9439 - val_loss: 1.7192 - val_accuracy: 0.6703
Epoch 19/25
43200/43200 [==============================] - 7s 168us/sample - loss: 0.1271 - accuracy: 0.9565 - val_loss: 1.7989 - val_accuracy: 0.6807
Epoch 20/25
43200/43200 [==============================] - 8s 190us/sample - loss: 0.1264 - accuracy: 0.9547 - val_loss: 1.9215 - val_accuracy: 0.6743
Epoch 21/25
43200/43200 [==============================] - 9s 207us/sample - loss: 0.1148 - accuracy: 0.9587 - val_loss: 1.9823 - val_accuracy: 0.6720
Epoch 22/25
43200/43200 [==============================] - 7s 167us/sample - loss: 0.1110 - accuracy: 0.9615 - val_loss: 2.0952 - val_accuracy: 0.6681
Epoch 23/25
43200/43200 [==============================] - 7s 166us/sample - loss: 0.0984 - accuracy: 0.9653 - val_loss: 2.1623 - val_accuracy: 0.6746
Epoch 24/25
43200/43200 [==============================] - 7s 168us/sample - loss: 0.0886 - accuracy: 0.9691 - val_loss: 2.2377 - val_accuracy: 0.6772
Epoch 25/25
43200/43200 [==============================] - 7s 166us/sample - loss: 0.0855 - accuracy: 0.9697 - val_loss: 2.3857 - val_accuracy: 0.6670
Score for fold 3: loss of 2.4695983460744224; accuracy of 66.46666526794434%
------------------------------------------------------------------------

Do note the increasing validation loss, a clear sign of overfitting.

And finally, after the 10th fold, it should display the overview with results per fold and the average:

------------------------------------------------------------------------
Score per fold
------------------------------------------------------------------------
> Fold 1 - Loss: 2.4094747734069824 - Accuracy: 67.96666383743286%
------------------------------------------------------------------------
> Fold 2 - Loss: 1.768296229839325 - Accuracy: 67.03333258628845%
------------------------------------------------------------------------
> Fold 3 - Loss: 2.4695983460744224 - Accuracy: 66.46666526794434%
------------------------------------------------------------------------
> Fold 4 - Loss: 2.363724467277527 - Accuracy: 66.28333330154419%
------------------------------------------------------------------------
> Fold 5 - Loss: 2.083754387060801 - Accuracy: 65.51666855812073%
------------------------------------------------------------------------
> Fold 6 - Loss: 2.2160572570165 - Accuracy: 65.6499981880188%
------------------------------------------------------------------------
> Fold 7 - Loss: 1.7227793588638305 - Accuracy: 66.76666736602783%
------------------------------------------------------------------------
> Fold 8 - Loss: 2.357142448425293 - Accuracy: 67.25000143051147%
------------------------------------------------------------------------
> Fold 9 - Loss: 1.553109979470571 - Accuracy: 65.54999947547913%
------------------------------------------------------------------------
> Fold 10 - Loss: 2.426255855560303 - Accuracy: 66.03333353996277%
------------------------------------------------------------------------
Average scores for all folds:
> Accuracy: 66.45166635513306 (+- 0.7683473645622098)
> Loss: 2.1370193102995554
------------------------------------------------------------------------

This allows you to compare the performance across folds, and compare the averages of the folds across model types you’re evaluating 🙂

In our case, the model produces accuracies of 60-70%. This is acceptable, but there is still room for improvement. But hey, that wasn’t the scope of this blog post 🙂

### Model finalization

If you’re satisfied with the performance of your model, you can finalize it. There are two options for doing so:

• Save the best performing model instance (check “How to save and load a model with Keras?” – do note that this requires retraining because you haven’t saved models with the code above), and use it for generating predictions.
• Retrain the model, but this time with all the data – i.e., without making the split. Save that model, and use it for generating predictions.

Both options have advantages and disadvantages. The advantage of the first is that you don’t have to retrain on the full dataset, as you can simply use the best-performing model instance, provided that you saved the per-fold models during the training procedure. As retraining may be expensive, this could be an option, especially when your model is large. However, the disadvantage is that this model has never seen the part of your data that was held out for testing in its fold, while that extra data might have brought it closer to the actual patterns in the population. If that matters to you, then the second option is better.

However, that’s entirely up to you! 🙂
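If you go for the first option, you need a criterion for “best performing”. The sketch below applies two possible criteria to the per-fold results reported above (rounded for brevity). It is only meant to show the selection step, not the saving of the models themselves.

```python
import numpy as np

# Per-fold results from the run above, rounded for brevity
acc_per_fold = [67.97, 67.03, 66.47, 66.28, 65.52, 65.65, 66.77, 67.25, 65.55, 66.03]
loss_per_fold = [2.409, 1.768, 2.470, 2.364, 2.084, 2.216, 1.723, 2.357, 1.553, 2.426]

best_by_accuracy = int(np.argmax(acc_per_fold)) + 1   # fold numbers start at 1
best_by_loss = int(np.argmin(loss_per_fold)) + 1

print(f'Highest accuracy: fold {best_by_accuracy}')
print(f'Lowest loss: fold {best_by_loss}')
print(f'Mean accuracy: {np.mean(acc_per_fold):.2f} (+- {np.std(acc_per_fold):.2f})')
```

Do note that the two criteria pick different folds here (fold 1 versus fold 9), and that all folds lie within a few percentage points of each other. Differences this small may well be caused by the random split rather than by one model instance being truly better.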

## Summary

In this blog post, we looked at the concept of model evaluation: what is it? Why would we need it in the first place? And how to do so objectively? If we can’t evaluate models without introducing bias of some sort, there’s no point in evaluating at all, is there?

We introduced simple hold-out splits for this purpose, and showed that while they are efficient in terms of the required computational resources, they are also naïve. K-fold Cross Validation is $$K$$ times more expensive, but can produce significantly better estimates because it trains the model $$K$$ times, each time with a different train/test split.

To illustrate this further, we provided an example implementation for the Keras deep learning framework using TensorFlow 2.0. Using a Convolutional Neural Network for CIFAR-10 classification, we generated evaluations with accuracies in the range of 60-70%.

I hope you’ve learnt something from today’s blog post. If you did, feel free to leave a comment in the comments section! Please do the same if you have questions, if you spotted mistakes or when you have other remarks. I’ll happily answer your comments and will improve my blog if that’s the best thing to do.

Thank you for reading MachineCurve today and happy engineering! 😎


## References

Scikit-learn. (n.d.). sklearn.model_selection.KFold — scikit-learn 0.22.1 documentation. Retrieved February 17, 2020, from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

Allibhai, E. (2018, October 3). Holdout vs. Cross-validation in Machine Learning. Retrieved from https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f

Chollet, F. (2017). Deep Learning with Python. New York, NY: Manning Publications.

Khandelwal, R. (2019, January 25). K fold and other cross-validation techniques. Retrieved from https://medium.com/datadriveninvestor/k-fold-and-other-cross-validation-techniques-6c03a2563f1e

Bogdanovist. (n.d.). How to choose a predictive model after k-fold cross-validation? Retrieved from https://stats.stackexchange.com/a/52277


## 19 thoughts on “How to use K-fold Cross Validation with Keras?”

1. Devidas

great post
but How can I save the best performance among all the folds in the program itself.
Also retrain on whole data without using validation will it become robust model for unknown population samples.
please clarify and if possible need code snippest.
Thanks

1. Chris

Hi Devidas,

Thanks for your questions.

Question 1: how to save the best performing Keras model across all the folds in K-fold cross validation:
This cannot be done out of the box. However, as you can see in my code, using for ... in, I loop over the folds, and train the model again and again with the split made for that particular fold.

In those cases, you could use Keras ModelCheckpoint to save the best model per fold.

You would need to add to the imports: from tensorflow.keras.callbacks import ModelCheckpoint

Also make sure to import the os module: import os

…and subsequently add the callback to your code so that it runs during training:

    fold_no = 0
    for train, test in kfold.split(inputs, targets):

        # Define the model architecture
        model = Sequential()
        model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Flatten())
        model.add(Dense(256, activation='relu'))
        model.add(Dense(128, activation='relu'))
        model.add(Dense(no_classes, activation='softmax'))

        # Compile the model
        model.compile(loss=loss_function,
                      optimizer=optimizer,
                      metrics=['accuracy'])

        # Define callbacks
        # makedirs also creates ./some_folder and does not fail if the folder already exists
        checkpoint_path = f'./some_folder/{fold_no}'
        os.makedirs(checkpoint_path, exist_ok=True)
        keras_callbacks = [
            ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, mode='min')
        ]

        # Increase fold no
        fold_no += 1

        # Generate a print
        print('------------------------------------------------------------------------')
        print(f'Training for fold {fold_no} ...')

        # Fit data to model
        history = model.fit(inputs[train], targets[train],
                            batch_size=batch_size,
                            epochs=no_epochs,
                            verbose=verbosity,
                            validation_split=validation_split,
                            callbacks=keras_callbacks)

Now, all best instances of your model given the particular fold are saved. Based on how the folds perform (which you’ll see in your terminal after training), you can pick the saved model that works best.

However, I wouldn’t recommend this, as each fold is trained with a subset of your data – and it might in fact be bias that drives the better performance. Be careful when doing this.

Question 2: if you retrain without validation data, will it become a robust model for unknown samples?
That’s difficult to say, because it depends on the distribution from which you draw the samples. For example, if you cross-validate a ConvNet trained on the MNIST dataset with K-fold cross validation, and it performs well across all folds, you can be reasonably confident that you can train it once more with the full dataset. You might nevertheless wish to keep some validation data for detecting e.g. overfitting (also see https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/). However, if you then fed that model CIFAR10 data in production, you could expect very poor performance, because those samples come from a different distribution.

Hope this helps!

Best,
Chris

2. Josseline

Hi Chris! Thanks for this post! It helps me a lot 🙂 I have some questions and I hope you can help me.
1. Do you store checkpoints per fold when val_loss is the lowest in all epochs? I am doing my own implementation on Pytorch and I don’t have clear the criteria of ModelCheckpoint.
2. I am trying with different hyperparameters of one model and I would like to choose what is the best one. I test every config doing KFolds CV to train and validation. Should I store the parameters with the best metric per fold (no matter the epoch) and then choose the best one overall folds?

Thanks in advance 🙂

1. Chris

Hi Josseline,

Thanks for your compliment 🙂 With regards to your questions:

1. The Keras ModelCheckpoint can be configured in many ways. See https://keras.io/callbacks/#modelcheckpoint for all options. In my case, I use it to save the best-performing model instance only, by monitoring ‘val_loss’, setting the mode to ‘min’ for minimum validation loss, and setting save_best_only=True so that only the epoch with the lowest validation loss is kept. In practice, this means that after every epoch, it checks whether validation loss is lower than the best value seen so far, and if so, it saves the model. Did you know that something similar (albeit implemented differently) is available for PyTorch? https://pytorch.org/ignite/handlers.html#ignite.handlers.ModelCheckpoint It seems that you need to write your own checking logic, though.
2. If I understand you correctly, you are using different hyperparameters for every fold. I wouldn’t do it this way. Instead, I would train every fold with the same set of hyperparameters, and keep your training/validation/testing sets constant. This way, what goes in (the data) is sampled from the same distribution all the time, and (should you use the same random seed for e.g. random weight initialization) nothing much should interfere from a data point of view. Then, for every different set of hyperparameters, I would repeat K-fold Cross Validation. This way, across many K-fold cross validation instances, I can see how well one set of hyperparameters performs generally (within the folds) and how well a model performs across different sets of hyperparameters (across the folds).
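The checking logic from point 1 is framework-independent, so it can be sketched in plain Python. This is only an illustration of the "save when validation loss improves" rule under my own assumptions: `track_best_epoch` is a made-up name, the losses are invented numbers, and the place where a real training loop would write the model to disk is marked with a comment.

```python
def track_best_epoch(val_losses):
    """Return (best_epoch, best_loss, saved_epochs) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    saved_epochs = []
    for epoch, val_loss in enumerate(val_losses, start=1):
        # Only act when this epoch beats the best validation loss seen so far.
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch
            # A real training loop would overwrite the saved model here.
            saved_epochs.append(epoch)
    return best_epoch, best_loss, saved_epochs


# Invented validation losses for five epochs.
print(track_best_epoch([0.90, 0.70, 0.75, 0.60, 0.65]))  # (4, 0.6, [1, 2, 4])
```

In a PyTorch training loop, the same comparison could run after each validation pass, with the model written to disk in the branch where the loss improves.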

Hope this helps. Regards,
Chris

3. Josseline

Thanks for your reply Chris! 🙂

For every set of hyperparameters I repeat K-Folds CV to get training and validation splits, in order to get K instances for every hyperparameters config. My doubt is how can I choose a model instance for every experiment? I mean, I don’t know what is the best way to decided between hyperparameters sets if per every one I applied K-Folds CV.

I hope I had explained it better 🙂

Regards

1. Chris

Hi Josseline,

So if I understand you correctly, if you have two experiments – say, the same architecture, same hyperparameters, but with one you use the ‘Adam’ optimizer whereas with the other you use the ‘SGD’ optimizer – you repeat K-fold cross validation twice?

So, if K = 10, you effectively make 2×10 splits, train your 2 architectures 10 times each with the different splits, then average the outcome for each fold and check whether there are abnormalities within the folds?

If that’s a correct understanding, now, would my understanding of your question be correct if I’d say your question is “what hyperparameters to choose for my model?”? If not, my apologies.

If so – there are limited general answers to that question. Often, I start with Xavier or He initialization (based on whether I do not or do use ReLU activated layers), Adam optimization, some regularization (L1/L2/Dropout) and LR Range tested learning rates with decay. Then, I start experimenting, and change a few hyperparameters here and there – also based on intuition and what I see happening during the training process. Doing so, K-fold CV can help me validate the model performance across various train/test splits each time, before training the model with the full dataset.

4. Josseline

For now, my experiments are limited to variations of a base architecture, for example, trying with different amount of filters to my convolutional layers and set my learning rate. I applied K-Fold CV because I have a small dataset (with less than 2000 samples) but after it, I don’t know yet what would be my final model, I mean I would like to know what is the strategy to how to decide what of them have the best performance during K-Fold CV.

Do you store all the trained models during your experiments? I am a beginner in this area, so my apologies if I said something wrong in my questions.

1. Chris

Hi Josseline,

Don’t worry, nothing wrong in your questions, it’s the exact opposite in fact – it would be weird for me to answer your question wrongly because I read it wrongly 🙂

I do store all the models during my experiments – but only the best ones per fold (see example code in one of my comments above for implementing this with Keras).

I would consider this strategy:
1. For every variation, train with K-fold CV with the exact same dataset. Set K = 5 for example given your number of samples. Also make sure to use the same loss metric across variations and to use validation data when training.
2. After every training run ends (i.e. all the K = 5 splits have finished training), check for abnormalities in every individual fold (this could indicate an imbalanced dataset) and whether your average across the folds is acceptably high.
3. If you see no abnormalities, you can be confident that your model will generalize to data sampled from that distribution. This means that you can now train every variation again, but then with the entire dataset (i.e. no test data – you just used K-fold CV to validate that it generalizes).
4. As you trained all variations with the same dataset, an example way to choose the best final model would be to pick the trained-on-full-dataset model with lowest validation loss after training.
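Step 2 of this strategy can be sketched with Python’s standard library only. Everything in this snippet is an assumption made for illustration: the function name `summarize_folds`, the `max_deviation` threshold of 0.05 and the accuracy numbers are invented, and what counts as an "abnormal" fold is a judgment call that depends on your problem.

```python
from statistics import mean, stdev


def summarize_folds(fold_scores, max_deviation=0.05):
    """Summarize the per-fold scores of one variation and flag unusual folds."""
    avg = mean(fold_scores)
    spread = stdev(fold_scores) if len(fold_scores) > 1 else 0.0
    # 1-based numbers of folds whose score is further than max_deviation from the average.
    abnormal = [fold_no for fold_no, score in enumerate(fold_scores, start=1)
                if abs(score - avg) > max_deviation]
    return {"mean": avg, "stdev": spread, "abnormal_folds": abnormal}


# Invented accuracies for two variations, each evaluated with K = 5.
results = {
    "32_filters": [0.66, 0.68, 0.65, 0.67, 0.66],
    "64_filters": [0.70, 0.69, 0.52, 0.71, 0.70],
}

for name, scores in results.items():
    print(name, summarize_folds(scores))
```

With these invented numbers both variations have the same average, but the second one contains a fold that deviates strongly. That is the kind of abnormality worth investigating before trusting the average.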

That’s a general strategy I would follow. However, since you have very few samples (2000 is really small in deep learning terms, where ~60k is considered small if the data is complex), you might also wish to take a look at SVM based classification/regression with manual feature extraction. For example, one option for computer vision based problems would be using a clustering algorithm such as Mean Shift to derive more abstract characteristics followed by an SVM classifier. This setup would be better suited to smaller datasets, I’d say – because your neural networks will likely overfit pretty rapidly.

Here’s more information about Mean Shift and SVM classifiers:
https://www.machinecurve.com/index.php/2020/04/23/how-to-perform-mean-shift-clustering-with-python-in-scikit/
https://www.machinecurve.com/index.php/2020/05/03/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn/

Regards,
Chris

1. Josseline

Thanks for your help Chris! I am going to put into practice this strategy. I am aware my dataset is pretty small, I am thinking in use data augmentation in order to increase the samples used for training. I read SVM would be another approach, I am going to check your suggestions 🙂

1. Chris

Data augmentation would absolutely be of help in your case. Best of luck! 🙂

Chris

5. Rebeen Ali

Hi thank you very much,

have you shared your code in GitHub to see all the code together

Thank you

6. Rebeen

Hi
could you please provide this code in a github to see all the code together

thank you

7. John

Hi, the formatting of this tutorial is a bit confusing at the moment, all the embedded code appears in single lines with no line breaks. Is there some way to fix this, or perhaps a link to download and view the code ourselves?
Thank you

1. Chris

Hi John,
Thanks for your comment. I am aware of the issue and am looking for a fix. Most likely, I can spend some time on the matter tomorrow.
Regards,
Chris

2. Chris

Hi John,
Things should be normal again!
Regards,
Chris

8. Ming

Hi Chris,

Thank you for sharing a very nice tutorial. But I am just curious about ‘validation_split’ inside the cross validation.

# Fit data to model
history = model.fit(inputs[train], targets[train],
batch_size=batch_size,
epochs=no_epochs,
verbose=verbosity,
validation_split=validation_split)

This will actually reserve 0.2 of inputs[train], targets[train] in your code to be used as validation data. Why do you need validation data here since all results will be averaged after cross validation?

In wikipedia, they don’t use validation data.
https://en.wikipedia.org/wiki/Cross-validation_(statistics)

1. Chris

Hi Ming,

Thanks and I agree, I’ve adapted the article.

Regards,
Chris