The post A simple Conv3D example with Keras appeared first on MachineCurve.
The cover image is courtesy of David de la Iglesia Castro, the creator of the 3D MNIST dataset.
We all know about the computer vision applications which allow us to perform object detection, to name just one.
How these Conv2D networks work has been explained in another blog post.
For many applications, however, it’s not enough to stick to two dimensions. Rather, the height or time dimension is also important. In videos, which are essentially many images stacked together, time is this third axis. It can however also be height or number of layers, in e.g. the layered image structure of an MRI scan. In both cases, the third axis intrinsically links the two-dimensional parts together, and hence cannot be ignored.
Enter three-dimensional convolutional neural networks, or Conv3Ds. In this blog post, we’ll cover this type of CNNs. More specifically, we will first take a look at the differences between ‘normal’ convolutional neural networks (Conv2Ds) versus the three-dimensional ones (Conv3D). Subsequently, we will actually provide a Keras-based implementation of a Conv3D, with the 3D MNIST dataset available at Kaggle. We discuss all the individual parts of the implementation before arriving at the final code, which ensures that you’ll understand what happens on the fly.
All right, let’s go!
Note that the code for this blog post is also available on GitHub.
If you are familiar with convolutional neural networks, it’s likely that you understand what happens in a traditional or two-dimensional CNN:
A two-dimensional image, with multiple channels (three in the RGB input in the image above), is interpreted by a certain number (N) of kernels of some size, in our case 3x3x3 (3x3 pixels over the three channels). The actual interpretation happens because each kernel slides over the input image; literally, from left to right, then down a bit; from left to right again, and so on. By means of element-wise multiplications, it generates a feature map which is smaller than the original input and is, in fact, a more abstract summary of the original input image. Hence, by stacking multiple convolutional layers, it becomes possible to generate a very abstract representation of some input, which allows us to classify inputs into groups.
For more information, I’d really recommend my other blog post, Convolutional Neural Networks and their components for computer vision.
Now, with three-dimensional convolutional layers, things are different – but not too different. Instead of three dimensions in the input (the two image dimensions and the channels dimension), you’ll have four: the two image dimensions, the time/height dimension, and the channels dimension. As such, the feature map is also three-dimensional. This means that the filters move in three dimensions instead of two: not only from left to right and from top to bottom, but also forward and backward. Three-dimensional convolutional layers are therefore more expensive in terms of required computational resources, but allow you to retrieve much richer insights.
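To make the extra sliding axis concrete, here is a minimal single-channel 3D convolution written with plain NumPy – a sketch with random dummy data, not the Keras implementation:

```python
import numpy as np

# A tiny 6x6x6 volume and one 3x3x3 kernel (dummy data).
x = np.random.rand(6, 6, 6)
k = np.random.rand(3, 3, 3)

# 'Valid' 3D convolution: the kernel slides left/right, top/bottom
# AND forward/backward, yielding a 4x4x4 feature map.
out = np.zeros((4, 4, 4))
for i in range(4):
    for j in range(4):
        for d in range(4):
            out[i, j, d] = np.sum(x[i:i+3, j:j+3, d:d+3] * k)

print(out.shape)  # (4, 4, 4)
```

The third loop over `d` is exactly what distinguishes this from a 2D convolution, and also why Conv3D layers are computationally heavier.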
Now that we understand them intuitively, let’s see if we can build one!
…creating a machine learning model requires a dataset with which the model can be trained.
The 3D MNIST dataset that is available at Kaggle serves this purpose. It is an adaptation of the original MNIST dataset which we used to create e.g. the regular CNN. The authors of the dataset converted the two-dimensional data into 3D by means of point clouds, as follows:
Since the data is three-dimensional, we can use it to give an example of how the Keras Conv3D layers work.
Since it is relatively simple (the 2D dataset yielded accuracies of almost 100% in the 2D CNN scenario), I’m confident that we can reach decent accuracies here as well, allowing us to focus on the model architecture rather than poking into the dataset to maximize performance.
Let’s now create the model!
Before we start coding, let’s make sure that you have all the software dependencies installed that we need for successful completion:
Besides the software dependencies, you’ll also need the data itself. The dataset is available on Kaggle, which is a community of machine learning enthusiasts where competitions, question and answers and datasets are posted.
There are two ways of getting the dataset onto your host machine:

The first is via the Kaggle API: run pip install kaggle. Next, you can issue kaggle datasets download -d daavoo/3d-mnist (if you included the kaggle.json API key file in the ~/.kaggle folder – read here how to do this) and the dataset should download. We will need the file full_dataset_vectors.h5. Note that for the 3D MNIST dataset, this option is currently (as of October 2019) broken, and you will have to download the data manually.

The second is to download the data manually from the Kaggle dataset page; there, too, the file we need is full_dataset_vectors.h5.

For both scenarios, you’ll need a free Kaggle account.
Let’s move the file full_dataset_vectors.h5 into a new folder (e.g. 3d-cnn) and create a Python file such as 3d_cnn.py. Now that the data has been downloaded and the model file is created, we can start coding!
So let’s open up your code editor and on y va! (French for let’s go!)
As usual, we import the dependencies first:
'''
A simple Conv3D example with Keras
'''
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv3D, MaxPooling3D
from keras.utils import to_categorical
import h5py
import numpy as np
import matplotlib.pyplot as plt
For most of them, I already explained why we need them. However, for the Keras imports, I’ll explain them in a slightly more detailed way:

The to_categorical function: the loss function we’re using to compute how badly the model performs during training, categorical crossentropy, requires that we convert our integer target data (e.g. \(8\) when it’s an 8) into categorical vectors representing true/false values for class presence, e.g. \([0, 0, 0, 0, 0, 0, 0, 0, 1, 0]\) for class 8 over all classes 0-9. to_categorical converts the integer target data into this categorical format.

Now that we imported all dependencies, we can proceed with some model configuration variables that allow us to configure the model in an orderly fashion:
# -- Preparatory code --
# Model configuration
batch_size = 100
no_epochs = 30
learning_rate = 0.001
no_classes = 10
validation_split = 0.2
verbosity = 1
Specifically, we configure the model as follows:
Contrary to the two-dimensional CNN, we must add some helper functions:
# Convert 1D vector into 3D values, provided by the 3D MNIST authors at
# https://www.kaggle.com/daavoo/3d-mnist
def array_to_color(array, cmap="Oranges"):
s_m = plt.cm.ScalarMappable(cmap=cmap)
return s_m.to_rgba(array)[:,:-1]
# Reshape data into format that can be handled by Conv3D layers.
# Courtesy of Sam Berglin; Zheming Lian; Jiahui Jang - University of Wisconsin-Madison
# Report - https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/Report.pdf
# Code - https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/network_final_version.ipynb
def rgb_data_transform(data):
data_t = []
for i in range(data.shape[0]):
data_t.append(array_to_color(data[i]).reshape(16, 16, 16, 3))
return np.asarray(data_t, dtype=np.float32)
The first helper function, array_to_color, was provided by the authors of the 3D MNIST dataset – courtesy goes out to them. What it does is this: the imported data has one channel only. The function converts the data into RGB format, and hence into three channels. This ensures resemblance with the original 2D scenario.
Next, we use rgb_data_transform
, which was created by machine learning students Sam Berglin, Zheming Lian and Jiahui Jang at the University of Wisconsin-Madison. Under guidance of professor Sebastian Raschka, whose Mlxtend library we use quite often, they also created a 3D ConvNet for the 3D MNIST dataset, but then using PyTorch instead of Keras.
The function reshapes the data, which per sample comes in a (4096,) shape (16x16x16 pixels = 4096 pixels), i.e. a one-dimensional array. Their function reshapes the data into three-channeled, four-dimensional (16, 16, 16, 3) format, making use of array_to_color. The Conv3D layer can now handle the data.
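A quick way to convince yourself of the shapes involved is to run the helper on dummy data – hypothetical random vectors, not the real dataset:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # no display needed
import matplotlib.pyplot as plt

# Same helper as above, provided by the 3D MNIST authors.
def array_to_color(array, cmap="Oranges"):
    s_m = plt.cm.ScalarMappable(cmap=cmap)
    return s_m.to_rgba(array)[:, :-1]

# Two dummy samples in the (4096,) format of the dataset.
data = np.random.rand(2, 4096)
transformed = np.asarray(
    [array_to_color(v).reshape(16, 16, 16, 3) for v in data],
    dtype=np.float32)

print(transformed.shape)  # (2, 16, 16, 16, 3)
```

Each sample goes from a flat 4096-element vector to a 16x16x16 volume with three color channels.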
We can next import and prepare the data:
# -- Process code --
# Load the HDF5 data file
with h5py.File("./full_dataset_vectors.h5", "r") as hf:
# Split the data into training/test features/targets
X_train = hf["X_train"][:]
targets_train = hf["y_train"][:]
X_test = hf["X_test"][:]
targets_test = hf["y_test"][:]
# Determine sample shape
sample_shape = (16, 16, 16, 3)
# Reshape data into 3D format
X_train = rgb_data_transform(X_train)
X_test = rgb_data_transform(X_test)
# Convert target vectors to categorical targets
targets_train = to_categorical(targets_train).astype(np.integer)
targets_test = to_categorical(targets_test).astype(np.integer)
The first line, containing with, ensures that we open up the HDF5 file as hf, which we can subsequently use to retrieve the data we need.
Specifically, we first load the training and testing data into different variables: the Xs for the feature vectors, the targets for the… well, unsurprisingly, targets.
Next, we determine the shape of each sample, which we must supply to the Keras model later.
Next, we actually transform and reshape the data from one-channeled (4096,) format into three-channeled (16, 16, 16, 3) format. This is followed by converting the targets into categorical format, which concludes the preparatory phase.
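What the to_categorical conversion yields can be sketched in plain NumPy (the class count of 10 matches MNIST):

```python
import numpy as np

# Integer targets, e.g. the digits 8, 0 and 3.
targets = np.array([8, 0, 3])

# One-hot encode them over 10 classes, like to_categorical does.
one_hot = np.zeros((len(targets), 10))
one_hot[np.arange(len(targets)), targets] = 1.0

print(one_hot[0])  # [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.] -> class 8
```

Each integer target becomes a vector with a single 1 at the index of its class, which is exactly the format categorical crossentropy expects.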
We can now finally create the model architecture and start the training process.
First – the architecture:
# Create the model
model = Sequential()
model.add(Conv3D(32, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=sample_shape))
model.add(MaxPooling3D(pool_size=(2, 2, 2)))
model.add(Conv3D(64, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform'))
model.add(MaxPooling3D(pool_size=(2, 2, 2)))
model.add(Flatten())
model.add(Dense(256, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(no_classes, activation='softmax'))
As discussed, we use the Keras Sequential API with Conv3D, MaxPooling3D, Flatten and Dense layers.
Specifically, we use two three-dimensional convolutional layers with 3x3x3 kernels, ReLU activation functions and hence He uniform init.
3D max pooling is applied with 2x2x2 pool sizes.
Once the convolutional operations are completed, we Flatten the feature maps and feed the result to a Dense layer which also activates and initializes using the ReLU/He combination.
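For 16x16x16 inputs, the feature-map sizes can be traced by hand; a small sketch using the standard output-size rules for 'valid' convolutions and non-overlapping pooling:

```python
# Output size for a 'valid' convolution with stride 1.
def conv_out(n, k):
    return n - k + 1

# Output size for non-overlapping max pooling.
def pool_out(n, p):
    return n // p

n = 16
n = pool_out(conv_out(n, 3), 2)  # Conv3D(32, 3x3x3) + MaxPooling3D(2x2x2): 16 -> 14 -> 7
n = pool_out(conv_out(n, 3), 2)  # Conv3D(64, 3x3x3) + MaxPooling3D(2x2x2): 7 -> 5 -> 2

flattened = n ** 3 * 64          # 2*2*2 spatial positions times 64 feature maps
print(n, flattened)              # 2 512
```

So the Flatten layer feeds a 512-element vector into the Dense layers.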
Finally, we output the data into a Dense layer with no_classes (= 10) neurons and a Softmax activation function. This activation function generates a multiclass probability distribution over all possible target classes: essentially a vector with, for each class, the probability that the sample belongs to it, all values summing to 100% (or, statistically, 1).
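A minimal NumPy sketch of what the Softmax activation computes, using hypothetical logits:

```python
import numpy as np

# Hypothetical raw outputs (logits) for the 10 classes.
logits = np.array([2.0, 1.0, 0.1, 0.0, -1.0, 0.5, 0.3, 1.2, 3.0, 0.7])

# Softmax: exponentiate, then normalize so everything sums to 1.
probs = np.exp(logits) / np.sum(np.exp(logits))

print(probs.sum())     # 1.0
print(probs.argmax())  # 8 -> the class with the highest logit
```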
Second – the training procedure:
# Compile the model
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adam(lr=learning_rate),
metrics=['accuracy'])
# Fit data to model
history = model.fit(X_train, targets_train,
batch_size=batch_size,
epochs=no_epochs,
verbose=verbosity,
validation_split=validation_split)
We first compile the model, which essentially configures the architecture according to the hyperparameters that we set in the configuration section. Next, we fit the data to the model, using the other configuration settings set before. Fitting the data starts the training process. The output of this training process is stored in the history object, which we can use for visualization purposes.
Finally, we can add some code for evaluating model performance:
# Generate generalization metrics
score = model.evaluate(X_test, targets_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
# Plot history: Categorical crossentropy & Accuracy
plt.plot(history.history['loss'], label='Categorical crossentropy (training data)')
plt.plot(history.history['val_loss'], label='Categorical crossentropy (validation data)')
plt.plot(history.history['accuracy'], label='Accuracy (training data)')
plt.plot(history.history['val_accuracy'], label='Accuracy (validation data)')
plt.title('Model performance for 3D MNIST Keras Conv3D example')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
The above code simply evaluates the model by means of the testing data, printing the output to the console, as well as generating a plot displaying categorical crossentropy & accuracy over the training epochs.
Altogether, we arrive at this model code:
'''
A simple Conv3D example with Keras
'''
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv3D, MaxPooling3D
from keras.utils import to_categorical
import h5py
import numpy as np
import matplotlib.pyplot as plt
# -- Preparatory code --
# Model configuration
batch_size = 100
no_epochs = 30
learning_rate = 0.001
no_classes = 10
validation_split = 0.2
verbosity = 1
# Convert 1D vector into 3D values, provided by the 3D MNIST authors at
# https://www.kaggle.com/daavoo/3d-mnist
def array_to_color(array, cmap="Oranges"):
s_m = plt.cm.ScalarMappable(cmap=cmap)
return s_m.to_rgba(array)[:,:-1]
# Reshape data into format that can be handled by Conv3D layers.
# Courtesy of Sam Berglin; Zheming Lian; Jiahui Jang - University of Wisconsin-Madison
# Report - https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/Report.pdf
# Code - https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/network_final_version.ipynb
def rgb_data_transform(data):
data_t = []
for i in range(data.shape[0]):
data_t.append(array_to_color(data[i]).reshape(16, 16, 16, 3))
return np.asarray(data_t, dtype=np.float32)
# -- Process code --
# Load the HDF5 data file
with h5py.File("./full_dataset_vectors.h5", "r") as hf:
# Split the data into training/test features/targets
X_train = hf["X_train"][:]
targets_train = hf["y_train"][:]
X_test = hf["X_test"][:]
targets_test = hf["y_test"][:]
# Determine sample shape
sample_shape = (16, 16, 16, 3)
# Reshape data into 3D format
X_train = rgb_data_transform(X_train)
X_test = rgb_data_transform(X_test)
# Convert target vectors to categorical targets
targets_train = to_categorical(targets_train).astype(np.integer)
targets_test = to_categorical(targets_test).astype(np.integer)
# Create the model
model = Sequential()
model.add(Conv3D(32, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=sample_shape))
model.add(MaxPooling3D(pool_size=(2, 2, 2)))
model.add(Conv3D(64, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform'))
model.add(MaxPooling3D(pool_size=(2, 2, 2)))
model.add(Flatten())
model.add(Dense(256, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(no_classes, activation='softmax'))
# Compile the model
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adam(lr=learning_rate),
metrics=['accuracy'])
# Fit data to model
history = model.fit(X_train, targets_train,
batch_size=batch_size,
epochs=no_epochs,
verbose=verbosity,
validation_split=validation_split)
# Generate generalization metrics
score = model.evaluate(X_test, targets_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
# Plot history: Categorical crossentropy & Accuracy
plt.plot(history.history['loss'], label='Categorical crossentropy (training data)')
plt.plot(history.history['val_loss'], label='Categorical crossentropy (validation data)')
plt.plot(history.history['accuracy'], label='Accuracy (training data)')
plt.plot(history.history['val_accuracy'], label='Accuracy (validation data)')
plt.title('Model performance for 3D MNIST Keras Conv3D example')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
Running the model produces mediocre performance – a test accuracy of approximately 65.6%, contrary to the 99%+ of the 2D model:
Train on 8000 samples, validate on 2000 samples
Epoch 1/30
2019-10-18 14:49:16.626766: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2019-10-18 14:49:17.253904: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
8000/8000 [==============================] - 5s 643us/step - loss: 2.1907 - accuracy: 0.2256 - val_loss: 1.8527 - val_accuracy: 0.3580
Epoch 2/30
8000/8000 [==============================] - 2s 305us/step - loss: 1.6607 - accuracy: 0.4305 - val_loss: 1.4618 - val_accuracy: 0.5090
Epoch 3/30
8000/8000 [==============================] - 2s 308us/step - loss: 1.3590 - accuracy: 0.5337 - val_loss: 1.2485 - val_accuracy: 0.5760
Epoch 4/30
8000/8000 [==============================] - 2s 309us/step - loss: 1.2173 - accuracy: 0.5807 - val_loss: 1.2304 - val_accuracy: 0.5620
Epoch 5/30
8000/8000 [==============================] - 2s 306us/step - loss: 1.1320 - accuracy: 0.6084 - val_loss: 1.1913 - val_accuracy: 0.5795
Epoch 6/30
8000/8000 [==============================] - 2s 305us/step - loss: 1.0423 - accuracy: 0.6376 - val_loss: 1.1136 - val_accuracy: 0.6140
Epoch 7/30
8000/8000 [==============================] - 2s 310us/step - loss: 0.9899 - accuracy: 0.6572 - val_loss: 1.0940 - val_accuracy: 0.6255
Epoch 8/30
8000/8000 [==============================] - 2s 304us/step - loss: 0.9365 - accuracy: 0.6730 - val_loss: 1.0905 - val_accuracy: 0.6310
Epoch 9/30
8000/8000 [==============================] - 2s 305us/step - loss: 0.8850 - accuracy: 0.6975 - val_loss: 1.0407 - val_accuracy: 0.6425
Epoch 10/30
8000/8000 [==============================] - 2s 309us/step - loss: 0.8458 - accuracy: 0.7115 - val_loss: 1.0667 - val_accuracy: 0.6315
Epoch 11/30
8000/8000 [==============================] - 3s 320us/step - loss: 0.7971 - accuracy: 0.7284 - val_loss: 1.0328 - val_accuracy: 0.6420
Epoch 12/30
8000/8000 [==============================] - 3s 328us/step - loss: 0.7661 - accuracy: 0.7411 - val_loss: 1.0596 - val_accuracy: 0.6365
Epoch 13/30
8000/8000 [==============================] - 3s 324us/step - loss: 0.7151 - accuracy: 0.7592 - val_loss: 1.0463 - val_accuracy: 0.6470
Epoch 14/30
8000/8000 [==============================] - 3s 334us/step - loss: 0.6850 - accuracy: 0.7676 - val_loss: 1.0592 - val_accuracy: 0.6355
Epoch 15/30
8000/8000 [==============================] - 3s 341us/step - loss: 0.6359 - accuracy: 0.7839 - val_loss: 1.0492 - val_accuracy: 0.6555
Epoch 16/30
8000/8000 [==============================] - 3s 334us/step - loss: 0.6136 - accuracy: 0.7960 - val_loss: 1.0399 - val_accuracy: 0.6570
Epoch 17/30
8000/8000 [==============================] - 3s 327us/step - loss: 0.5794 - accuracy: 0.8039 - val_loss: 1.0548 - val_accuracy: 0.6545
Epoch 18/30
8000/8000 [==============================] - 3s 330us/step - loss: 0.5398 - accuracy: 0.8169 - val_loss: 1.0807 - val_accuracy: 0.6550
Epoch 19/30
8000/8000 [==============================] - 3s 351us/step - loss: 0.5199 - accuracy: 0.8236 - val_loss: 1.0881 - val_accuracy: 0.6570
Epoch 20/30
8000/8000 [==============================] - 3s 332us/step - loss: 0.4850 - accuracy: 0.8350 - val_loss: 1.0920 - val_accuracy: 0.6485
Epoch 21/30
8000/8000 [==============================] - 3s 330us/step - loss: 0.4452 - accuracy: 0.8549 - val_loss: 1.1540 - val_accuracy: 0.6510
Epoch 22/30
8000/8000 [==============================] - 3s 332us/step - loss: 0.4051 - accuracy: 0.8696 - val_loss: 1.1422 - val_accuracy: 0.6570
Epoch 23/30
8000/8000 [==============================] - 3s 347us/step - loss: 0.3743 - accuracy: 0.8811 - val_loss: 1.1720 - val_accuracy: 0.6610
Epoch 24/30
8000/8000 [==============================] - 3s 349us/step - loss: 0.3575 - accuracy: 0.8816 - val_loss: 1.2174 - val_accuracy: 0.6580
Epoch 25/30
8000/8000 [==============================] - 3s 349us/step - loss: 0.3223 - accuracy: 0.8981 - val_loss: 1.2345 - val_accuracy: 0.6525
Epoch 26/30
8000/8000 [==============================] - 3s 351us/step - loss: 0.2859 - accuracy: 0.9134 - val_loss: 1.2514 - val_accuracy: 0.6555
Epoch 27/30
8000/8000 [==============================] - 3s 347us/step - loss: 0.2598 - accuracy: 0.9218 - val_loss: 1.2969 - val_accuracy: 0.6595
Epoch 28/30
8000/8000 [==============================] - 3s 350us/step - loss: 0.2377 - accuracy: 0.9291 - val_loss: 1.3296 - val_accuracy: 0.6625
Epoch 29/30
8000/8000 [==============================] - 3s 349us/step - loss: 0.2119 - accuracy: 0.9362 - val_loss: 1.3784 - val_accuracy: 0.6550
Epoch 30/30
8000/8000 [==============================] - 3s 350us/step - loss: 0.1987 - accuracy: 0.9429 - val_loss: 1.4143 - val_accuracy: 0.6515
Test loss: 1.4300630502700806 / Test accuracy: 0.656000018119812
We can derive a little bit more information from the diagram that we generated based on the history object:
The first and most clear warning signal is the orange line, the categorical crossentropy loss on the validation data. It’s increasing, which means that the model is overfitting – or adapting too much to the training data. The blue line illustrates this even further: training loss is decreasing rapidly, while the validation loss gets worse and worse.
This deviation also becomes visible in the accuracy plot, albeit less significantly.
Now – we got a working Conv3D model with the 3D MNIST dataset, but can we improve on the 65.6% accuracy by doing something about the overfitting?
Adding Dropout to the model architecture allows us to ‘drop’ random elements from the feature maps during training. Although this confuses the model, it prevents it from adapting too much to the training data:
# Create the model
model = Sequential()
model.add(Conv3D(32, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=sample_shape))
model.add(MaxPooling3D(pool_size=(2, 2, 2)))
model.add(Dropout(0.5))
model.add(Conv3D(64, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform'))
model.add(MaxPooling3D(pool_size=(2, 2, 2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(256, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(no_classes, activation='softmax'))
Don’t forget to add it as an extra import:
from keras.layers import Dense, Flatten, Conv3D, MaxPooling3D, Dropout
With Dropout, overfitting can be reduced:
However, testing accuracy remains mediocre. This suggests that the model cannot improve much further because the quantity of data is too low. Perhaps, if more data were added, or if a process called data augmentation were used, we could improve performance even further. However, that’s for another time!
In this blog post, we’ve seen how Conv3D layers differ from Conv2D ones, but more importantly, we’ve seen a Keras-based implementation of a convolutional neural network that can handle three-dimensional input data. I hope you’ve learnt something from this blog – and if you did, I would appreciate a comment below!
Thanks for reading, and happy engineering!
Note that the code for this blog post is also available on GitHub.
GitHub. (n.d.). daavoo – Overview. Retrieved from https://github.com/daavoo
Berglin, S., Lian, Z., & Jiang, J. (2019). 3D Convolutional Neural Networks. Retrieved from https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/Report.pdf
Kaggle. (n.d.). 3D MNIST. Retrieved from https://www.kaggle.com/daavoo/3d-mnist
GitHub. (2019, September 19). Kaggle/kaggle-api. Retrieved from https://github.com/Kaggle/kaggle-api
MachineCurve. (2019, May 30). Convolutional Neural Networks and their components for computer vision. Retrieved from https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/
MachineCurve. (2019, September 23). Understanding separable convolutions. Retrieved from https://www.machinecurve.com/index.php/2019/09/23/understanding-separable-convolutions/
About loss and loss functions – MachineCurve. (2019, October 15). Retrieved from https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/
Avoid wasting resources with EarlyStopping and ModelCheckpoint in Keras – MachineCurve. (2019, June 3). Retrieved from https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/
The post How to use categorical / multiclass hinge with Keras? appeared first on MachineCurve.
This resulted in blog posts that e.g. covered Huber loss and hinge & squared hinge loss. Today, in this blog post, we’ll extend the latter to multiclass classification: we cover categorical hinge loss, or multiclass hinge loss. How can categorical hinge / multiclass hinge be implemented with Keras? That’s what we’ll find out today.
Let’s go!
Note that the full code for the models we create in this blog post is also available through my Keras Loss Functions repository on GitHub.
In that previous blog, we looked at hinge loss and squared hinge loss – which actually helped us to generate a decision boundary between two classes and hence a classifier, but yep – two classes only.
Hinge loss and squared hinge loss can be used for binary classification problems.
Unfortunately, many of today’s problems aren’t binary, but rather, multiclass: the number of possible target classes is \(> 2\).
And hinge and squared hinge do not accommodate for this.
But categorical hinge loss, or multiclass hinge loss, does – and it is available in Keras!
Multiclass hinge was introduced by researchers Weston and Watkins (Wikipedia, 2011):
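Their formulation can be written as follows (reconstructed here for readability, since the original formula was shown as an image; \(s_y(x)\) denotes the model’s score for class \(y\), and \(t\) the true class):

```latex
\ell(x, t) = \sum_{y \neq t} \max\left(0, 1 + s_y(x) - s_t(x)\right)
```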
What this means in plain English is this:
For a prediction, take all classes \(y\) unequal to the target \(t\), compute the hinge loss for each, and sum these together to find the multiclass hinge loss.
The name categorical hinge loss, which is also used in place of multiclass hinge loss, already implies what’s happening here:
We first convert our regular targets into categorical data. That is, if we have three possible target classes {0, 1, 2}, an arbitrary target (e.g. 2) would be converted into categorical format (in that case, \([0, 0, 1]\)).
Next, for any sample, our DL model generates a multiclass probability distribution over all possible target classes. That is, for the total probability of 100% (or, statistically, \(1\)), it generates the probability that each of the possible categorical classes is the actual target class (in the scenario above, e.g. \([0.25, 0.25, 0.50]\) – which would mean class two, but with some uncertainty).
Computing the loss – the difference between actual target and predicted targets – is then equal to computing the hinge loss for taking the prediction for all the computed classes, except for the target class, since loss is always 0 there. The hinge loss computation itself is similar to the traditional hinge loss.
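As a small worked example with the numbers from above – note that the Keras categorical_hinge implementation takes the maximum over the wrong-class scores rather than the Weston-Watkins sum:

```python
import numpy as np

# True class is 2, in categorical format.
y_true = np.array([0.0, 0.0, 1.0])
# Hypothetical model output from the example above.
y_pred = np.array([0.25, 0.25, 0.50])

pos = np.sum(y_true * y_pred)          # score for the true class: 0.50
neg = np.max((1.0 - y_true) * y_pred)  # best wrong-class score: 0.25

# Keras-style categorical hinge.
loss = max(0.0, neg - pos + 1.0)
print(loss)  # 0.75
```

The loss is nonzero because the margin between the true class and the best wrong class (0.25) is smaller than 1.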
Categorical hinge loss can be optimized as well and hence used for generating decision boundaries in multiclass machine learning problems. Let’s now see how we can implement it with Keras.
…which requires defining a dataset first
In our post covering traditional hinge loss, we generated data ourselves because this increases simplicity.
We’ll do so as well in today’s blog. Specifically, we create a dataset with three separable clusters that looks as follows:
How? Let’s find out.
First, open some folder and create a Python file where you’ll write your code – e.g. multiclass-hinge.py. Next, open a development environment as well as the file, and you can start coding.
First, we add the imports:
'''
Keras model discussing Categorical (multiclass) Hinge loss.
'''
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from mlxtend.plotting import plot_decision_regions
We need Keras since we build the model by means of its APIs and functionalities. From Keras, we need:
We also need Matplotlib for generating visualizations of our dataset, Numpy for basic number processing, Scikit-learn for generating the dataset and Mlxtend for visualizing the decision boundary of our model.
We next add some configuration options:
# Configuration options
num_samples_total = 3000
training_split = 1000
num_classes = 3
feature_vector_length = len(X_training[0]) # note: X_training only exists once the data has been generated below
input_shape = (feature_vector_length,)
loss_function_used = 'categorical_hinge'
learning_rate_used = 0.03
optimizer_used = keras.optimizers.Adam(lr=learning_rate_used)
additional_metrics = ['accuracy']
num_epochs = 30
batch_size = 5
validation_split = 0.2 # 20%
The three clusters contain 3000 samples in total, divided over three classes, as we saw in the image above. The training_split value is 1000, which means that 1000 samples are split off to serve as testing data.
The length of the feature vector is described by the data: specifically, from the training feature vectors (X_training) we take the first (any arbitrary value will do, but the first is a more common approach) and compute its length – note that this only works once the data has actually been generated, in the step below. Next, we determine the shape of the model input by means of this length: we have a one-dimensional array, and hence define it as (feature_vector_length,).
Next, we specify the hyperparameters. Obviously, we’ll use categorical hinge loss. We set the learning rate to 0.03, since traditional hinge required a more aggressive value than 0.001, which is the default in Keras. We use the Adam optimizer and configure it to use this learning rate, which is very common today since Adam is the de facto standard optimizer in DL projects.
As an additional metric, we specify accuracy, as we have done before in many of our blog posts. Accuracy is more intuitively understandable to humans.
The model will train for 30 epochs with a batch size of 5 samples per forward pass, and 20% of the training data (20% of the 2000 training samples, hence 400 samples) will be used for validating each epoch as validation data.
Next, we can generate the data:
# Generate data
X, targets = make_blobs(n_samples = num_samples_total, centers = [(0,0), (15,15), (0,15)], n_features = num_classes, center_box=(0, 1), cluster_std = 1.5)
categorical_targets = to_categorical(targets)
X_training = X[training_split:, :]
X_testing = X[:training_split, :]
Targets_training = categorical_targets[training_split:]
Targets_testing = categorical_targets[:training_split].astype(np.integer)
We use Scikit-learn’s make_blobs function to generate the data. It simply does as its name suggests: it generates blobs, or clusters, of data where you specify them to be. Specifically, it generates num_samples_total (3000, see the model configuration section) samples in our case, split across three clusters centered at \({ (0, 0), (15, 15), (0, 15) }\). The standard deviation within a cluster is approximately 1.5, to ensure that the clusters are actually separable.
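A quick check of what make_blobs returns, with a smaller sample count for brevity:

```python
from sklearn.datasets import make_blobs

# Three clusters at the same centers as in the post, 30 samples in total.
X, targets = make_blobs(n_samples=30,
                        centers=[(0, 0), (15, 15), (0, 15)],
                        cluster_std=1.5)

print(X.shape)        # (30, 2): two-dimensional feature vectors
print(targets.shape)  # (30,): one integer class label per sample
```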
Next, we must convert our target values (which are one of \({ 0, 1, 2 }\)) into categorical format, since our categorical hinge loss requires categorical format (and hence no integer targets such as \(2\), but categorical vectors like \([0, 0, 1]\)).
Subsequently, we can split our feature vectors and target vectors according to the training_split we configured in our model configuration. Note that we add .astype(np.integer) to the testing targets. We do this because, when visualizing categorical data, the Mlxtend library requires the vector contents to be integers (instead of floating-point numbers).
We can finally visualize the data we generated:
# Generate scatter plot for training data
plt.scatter(X_training[:,0], X_training[:,1])
plt.title('Three clusters ')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
…which, as illustrated before, looks like this:
We can work with this!
If you wish to run this model on your machine, you’ll need to install some dependencies to make the code work. First of all, you need Keras, the deep learning framework with which this model is built. It’s the most essential dependency.
Secondly, you’ll need a numbers processing framework on top of which Keras runs. This is likely Tensorflow, but Theano and CNTK are also supported.
Additionally, you’ll need the de facto standard Python libraries Matplotlib, Numpy and Scikit-learn – they can be installed with pip
quite easily.
Another package, which can also be installed with pip
, is Sebastian Raschka's Mlxtend. We use it to visualize the decision boundary of our model.
Creating the model architecture
We will create a very simple model today, a four-layered (two hidden layers, one input layer and one output layer) MLP:
# Create the model
model = Sequential()
model.add(Dense(4, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(num_classes, activation='tanh'))
More specifically, we use the Keras Sequential API which allows us to stack multiple layers on top of each other. We subsequently add
the Dense or densely-connected layers; the first having four neurons, the second two, and the last num_classes
, or three in our case. The hidden layers activate by means of the ReLU activation function and hence are initialized with He uniform init. The last layer activates with tanh.
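Since the tanh outputs land in (-1, +1) per class, categorical hinge compares the score of the correct class against the highest score among the wrong classes. A numpy sketch of this computation, following Keras's definition of categorical_hinge (pos is the true-class score, neg the highest masked wrong-class score):

```python
import numpy as np

def categorical_hinge(y_true, y_pred):
    """Per-sample categorical hinge: max(0, 1 + neg - pos),
    where pos is the true-class score and neg the highest score
    among the other slots (the true-class slot is zeroed out)."""
    pos = np.sum(y_true * y_pred, axis=-1)
    neg = np.max((1.0 - y_true) * y_pred, axis=-1)
    return np.maximum(0.0, neg - pos + 1.0)

y_true = np.array([[0.0, 0.0, 1.0]])        # true class: 2
confident = np.array([[-1.0, -1.0, 1.0]])   # correct with a large margin
uncertain = np.array([[0.5, -1.0, 0.2]])    # a wrong class scores higher

print(categorical_hinge(y_true, confident))  # zero loss
print(categorical_hinge(y_true, uncertain))  # positive loss (approximately 1.3)
```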
Next, we configure the model and start the training process:
# Configure the model and start training
model.compile(loss=loss_function_used, optimizer=optimizer_used, metrics=additional_metrics)
history = model.fit(X_training, Targets_training, epochs=num_epochs, batch_size=batch_size, verbose=1, validation_split=validation_split)
It’s as simple as calling model.compile
with the settings that we configured under model configuration, followed by model.fit
which fits the training data to the model architecture specified above. The training history is saved in the history
object which we can use for visualization purposes.
Next, we must add some more code for testing the model’s ability to generalize to data it hasn’t seen before.
In order to test model performance, we add some code that evaluates the model with the testing set:
# Test the model after training
test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
What it will do is this: it takes the testing data (both features and targets) and feeds them through the model, comparing the predicted targets with the actual targets. Since the model has never seen this data before, it tells us something about the degree of overfitting that occurred during training. When the model performs well during validation and also during testing, it is likely to be useful in practice.
Visualizing the decision boundaries of the model (remember, we have a three-class classification problem!) is the next step.
I must admit, I had a little help from Dr. Sebastian Raschka here, the creator of Mlxtend (also see https://github.com/rasbt/mlxtend/issues/607). As noted before, we had to convert our targets into categorical format, e.g. \(target = 2\) into \(target = [0, 0, 1]\). Mlxtend does not natively support this, but fortunately, Raschka helped out by providing a helper class that wraps the model and converts its predictions back into non-categorical (integer) format. This looks as follows:
'''
The Onehot2Int class is used to adapt the model so that it generates non-categorical data.
This is required by the `plot_decision_regions` function.
The code is courtesy of dr. Sebastian Raschka at https://github.com/rasbt/mlxtend/issues/607.
Copyright (c) 2014-2016, Sebastian Raschka. All rights reserved. Mlxtend is licensed as https://github.com/rasbt/mlxtend/blob/master/LICENSE-BSD3.txt.
Thanks!
'''
# No hot encoding version
class Onehot2Int(object):

  def __init__(self, model):
    self.model = model

  def predict(self, X):
    y_pred = self.model.predict(X)
    return np.argmax(y_pred, axis=1)
# fit keras_model
keras_model_no_ohe = Onehot2Int(model)
# Plot decision boundary
plot_decision_regions(X_testing, np.argmax(Targets_testing, axis=1), clf=keras_model_no_ohe, legend=3)
plt.show()
'''
Finish plotting the decision boundary.
'''
Finally, we can visualize the training process itself by adding some extra code – which essentially plots the Keras history
object with Matplotlib:
# Visualize training process
plt.plot(history.history['loss'], label='Categorical Hinge loss (training data)')
plt.plot(history.history['val_loss'], label='Categorical Hinge loss (validation data)')
plt.title('Categorical Hinge loss for three clusters')
plt.ylabel('Categorical Hinge loss value')
plt.yscale('log')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
Now that we’ve completed our code, we can actually run the model!
Open up a terminal where you have access to the software dependencies required to run the code, cd
to the directory where your file is located, and execute e.g. python multiclass-hinge.py
.
After the visualization of your dataset (with the three clusters), you’ll see the training process run and complete – as well as model evaluation with the testing set:
Epoch 1/30
2019-10-16 19:39:12.492536: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
1600/1600 [==============================] - 1s 906us/step - loss: 0.5006 - accuracy: 0.6950 - val_loss: 0.3591 - val_accuracy: 0.6600
Epoch 2/30
1600/1600 [==============================] - 1s 603us/step - loss: 0.3397 - accuracy: 0.6681 - val_loss: 0.3528 - val_accuracy: 0.6500
Epoch 3/30
1600/1600 [==============================] - 1s 615us/step - loss: 0.3398 - accuracy: 0.6681 - val_loss: 0.3721 - val_accuracy: 0.7425
Epoch 4/30
1600/1600 [==============================] - 1s 617us/step - loss: 0.3379 - accuracy: 0.8119 - val_loss: 0.3512 - val_accuracy: 0.8500
Epoch 5/30
1600/1600 [==============================] - 1s 625us/step - loss: 0.3368 - accuracy: 0.8869 - val_loss: 0.3515 - val_accuracy: 0.8600
Epoch 6/30
1600/1600 [==============================] - 1s 608us/step - loss: 0.3358 - accuracy: 0.8906 - val_loss: 0.3506 - val_accuracy: 0.9325
Epoch 7/30
1600/1600 [==============================] - 1s 606us/step - loss: 0.3367 - accuracy: 0.9344 - val_loss: 0.3532 - val_accuracy: 0.9375
Epoch 8/30
1600/1600 [==============================] - 1s 606us/step - loss: 0.3365 - accuracy: 0.9375 - val_loss: 0.3530 - val_accuracy: 0.9425
Epoch 9/30
1600/1600 [==============================] - 1s 625us/step - loss: 0.3364 - accuracy: 0.9419 - val_loss: 0.3528 - val_accuracy: 0.9475
Epoch 10/30
1600/1600 [==============================] - 1s 627us/step - loss: 0.3364 - accuracy: 0.9450 - val_loss: 0.3527 - val_accuracy: 0.9500
Epoch 11/30
1600/1600 [==============================] - 1s 606us/step - loss: 0.3363 - accuracy: 0.9506 - val_loss: 0.3525 - val_accuracy: 0.9525
Epoch 12/30
1600/1600 [==============================] - 1s 642us/step - loss: 0.3366 - accuracy: 0.9425 - val_loss: 0.3589 - val_accuracy: 0.6475
Epoch 13/30
1600/1600 [==============================] - 1s 704us/step - loss: 0.3526 - accuracy: 0.8606 - val_loss: 0.3506 - val_accuracy: 0.9850
Epoch 14/30
1600/1600 [==============================] - 1s 699us/step - loss: 0.3364 - accuracy: 0.9925 - val_loss: 0.3502 - val_accuracy: 0.9875
Epoch 15/30
1600/1600 [==============================] - 1s 627us/step - loss: 0.3363 - accuracy: 0.9944 - val_loss: 0.3502 - val_accuracy: 0.9875
Epoch 16/30
1600/1600 [==============================] - 1s 670us/step - loss: 0.3363 - accuracy: 0.9937 - val_loss: 0.3502 - val_accuracy: 0.9875
Epoch 17/30
1600/1600 [==============================] - 1s 637us/step - loss: 0.3362 - accuracy: 0.9694 - val_loss: 0.3530 - val_accuracy: 0.9400
Epoch 18/30
1600/1600 [==============================] - 1s 637us/step - loss: 0.3456 - accuracy: 0.9744 - val_loss: 0.3537 - val_accuracy: 0.9825
Epoch 19/30
1600/1600 [==============================] - 1s 635us/step - loss: 0.3347 - accuracy: 0.9975 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 20/30
1600/1600 [==============================] - 1s 644us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 21/30
1600/1600 [==============================] - 1s 655us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 22/30
1600/1600 [==============================] - 1s 636us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 23/30
1600/1600 [==============================] - 1s 648us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 24/30
1600/1600 [==============================] - 1s 655us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 25/30
1600/1600 [==============================] - 1s 656us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 26/30
1600/1600 [==============================] - 1s 641us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 27/30
1600/1600 [==============================] - 1s 644us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3500 - val_accuracy: 0.9950
Epoch 28/30
1600/1600 [==============================] - 1s 666us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3500 - val_accuracy: 0.9950
Epoch 29/30
1600/1600 [==============================] - 1s 645us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3500 - val_accuracy: 0.9950
Epoch 30/30
1600/1600 [==============================] - 1s 669us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3500 - val_accuracy: 0.9950
1000/1000 [==============================] - 0s 46us/step
Test results - Loss: 0.3260095896720886 - Accuracy: 99.80000257492065%
In my case, the model was able to achieve very high accuracy – 99.5% on the validation set and 99.8% on the testing set! Indeed, the decision boundaries allow us to classify the majority of samples correctly:
…and the training process looks like this:
Just after the first epoch, model performance pretty much maxed out.
…which is unsurprising given the fact that our dataset is quite separable by nature or, perhaps, by design. The relative ease with which the dataset can be separated allows us to focus on the topic of this blog post: the categorical hinge loss.
All in all, we’ve got a working model using categorical hinge in Keras!
When merging all code together, we get this:
'''
Keras model discussing Categorical (multiclass) Hinge loss.
'''
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from mlxtend.plotting import plot_decision_regions
# Configuration options
num_samples_total = 3000
training_split = 1000
num_classes = 3
feature_vector_length = 2 # X1 and X2 coordinates; X_training is only generated below, so we cannot derive this from the data here
input_shape = (feature_vector_length,)
loss_function_used = 'categorical_hinge'
learning_rate_used = 0.03
optimizer_used = keras.optimizers.adam(lr=learning_rate_used)
additional_metrics = ['accuracy']
num_epochs = 30
batch_size = 5
validation_split = 0.2 # 20%
# Generate data
X, targets = make_blobs(n_samples = num_samples_total, centers = [(0,0), (15,15), (0,15)], n_features = num_classes, center_box=(0, 1), cluster_std = 1.5)
categorical_targets = to_categorical(targets)
X_training = X[training_split:, :]
X_testing = X[:training_split, :]
Targets_training = categorical_targets[training_split:]
Targets_testing = categorical_targets[:training_split].astype(np.integer)
# Generate scatter plot for training data
plt.scatter(X_training[:,0], X_training[:,1])
plt.title('Three clusters ')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
# Create the model
model = Sequential()
model.add(Dense(4, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(num_classes, activation='tanh'))
# Configure the model and start training
model.compile(loss=loss_function_used, optimizer=optimizer_used, metrics=additional_metrics)
history = model.fit(X_training, Targets_training, epochs=num_epochs, batch_size=batch_size, verbose=1, validation_split=validation_split)
# Test the model after training
test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
'''
The Onehot2Int class is used to adapt the model so that it generates non-categorical data.
This is required by the `plot_decision_regions` function.
The code is courtesy of dr. Sebastian Raschka at https://github.com/rasbt/mlxtend/issues/607.
Copyright (c) 2014-2016, Sebastian Raschka. All rights reserved. Mlxtend is licensed as https://github.com/rasbt/mlxtend/blob/master/LICENSE-BSD3.txt.
Thanks!
'''
# No hot encoding version
class Onehot2Int(object):

  def __init__(self, model):
    self.model = model

  def predict(self, X):
    y_pred = self.model.predict(X)
    return np.argmax(y_pred, axis=1)
# fit keras_model
keras_model_no_ohe = Onehot2Int(model)
# Plot decision boundary
plot_decision_regions(X_testing, np.argmax(Targets_testing, axis=1), clf=keras_model_no_ohe, legend=3)
plt.show()
'''
Finish plotting the decision boundary.
'''
# Visualize training process
plt.plot(history.history['loss'], label='Categorical Hinge loss (training data)')
plt.plot(history.history['val_loss'], label='Categorical Hinge loss (validation data)')
plt.title('Categorical Hinge loss for three clusters')
plt.ylabel('Categorical Hinge loss value')
plt.yscale('log')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
In this blog post, we’ve seen how categorical hinge extends binary (normal) hinge loss and squared hinge loss to multiclass classification problems. We considered the loss mathematically, but also built up an example with Keras that allows us to use categorical hinge with a real dataset, generating visualizations of the training process and decision boundaries as well. This concludes today’s post.
I hope you’ve learnt something here. If you did, I’d appreciate it if you let me know! You can do so by leaving a comment below Thanks a lot – and happy engineering!
Note that the full code for the models we created in this blog post is also available through my Keras Loss Functions repository on GitHub.
Wikipedia. (2011, September 16). Hinge loss. Retrieved from https://en.wikipedia.org/wiki/Hinge_loss
Raschka, S. (n.d.). Home – mlxtend. Retrieved from http://rasbt.github.io/mlxtend/
Raschka, S. (2018). MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. Journal of Open Source Software, 3(24), 638. doi:10.21105/joss.00638
About loss and loss functions – MachineCurve. (2019, October 15). Retrieved from https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/
Keras. (n.d.). Losses. Retrieved from http://keras.io/losses
The post How to use categorical / multiclass hinge with Keras? appeared first on MachineCurve.
The post How to use hinge & squared hinge loss with Keras? appeared first on MachineCurve.
Today, we'll cover two closely related loss functions that can be used in neural networks – and hence in Keras – that behave similarly to how a Support Vector Machine generates a decision boundary for classification: the hinge loss and squared hinge loss.
In this blog, you’ll first find a brief introduction to the two loss functions, in order to ensure that you intuitively understand the maths before we move on to implementing one.
Next, we introduce today’s dataset, which we ourselves generate. Subsequently, we implement both hinge loss functions with Keras, and discuss the implementation so that you understand what happens. Before wrapping up, we’ll also show model performance.
Let’s go!
Note that the full code for the models we create in this blog post is also available through my Keras Loss Functions repository on GitHub.
In our blog post on loss functions, we defined the hinge loss as follows (Wikipedia, 2011):
Maths can look very frightening, but the explanation of the above formula is actually really easy.
When you're training a machine learning model, you effectively feed forward your data, generating predictions, which you then compare with the actual targets to generate some cost value – that's the loss value. When using the hinge loss formula for generating this value, you multiply the prediction (\(y\)) by the actual target for the prediction (\(t\)), subtract this product from 1, and subsequently compute the maximum of 0 and the result of that subtraction.
For every sample, our target variable \(t\) is either +1 or -1.
This means that:
This looks as follows if the target is \(+1\): for all predictions >= 1, loss is zero (the prediction is correct or even overly correct), whereas loss increases as the predictions become more incorrect.
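As a quick numeric illustration of the formula \(max(0, 1 - t \times y)\) (a minimal sketch, separate from the post's model code):

```python
import numpy as np

def hinge(y_pred, t):
    """Binary hinge loss: max(0, 1 - t * y), with targets t in {-1, +1}."""
    return np.maximum(0.0, 1.0 - t * y_pred)

t = 1.0  # actual target: +1
for y in (2.0, 1.0, 0.5, 0.0, -1.0):
    print(y, hinge(y, t))
# predictions >= 1 give zero loss; loss then grows linearly as predictions worsen:
# hinge(0.5) = 0.5, hinge(0.0) = 1.0, hinge(-1.0) = 2.0
```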
What effectively happens is that hinge loss will attempt to maximize the decision boundary between the two groups that must be discriminated in your machine learning problem. In that way, it looks somewhat like how Support Vector Machines work, but it’s also kind of different (e.g., with hinge loss in Keras there is no such thing as support vectors).
Suppose that you need to draw a very fine decision boundary. In that case, you wish to punish larger errors more significantly than smaller errors. Squared hinge loss may then be what you are looking for, especially when you already considered the hinge loss function for your machine learning problem.
Squared hinge loss is nothing else but a square of the output of the hinge’s \(max(…)\) function. It generates a loss function as illustrated above, compared to regular hinge loss.
As you can see, larger errors are punished more significantly than with traditional hinge, whereas smaller errors are punished slightly less heavily.
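This effect is easy to verify numerically (again a minimal sketch, with the same +1 target as before):

```python
import numpy as np

def hinge(y_pred, t):
    """Binary hinge loss: max(0, 1 - t * y)."""
    return np.maximum(0.0, 1.0 - t * y_pred)

def squared_hinge(y_pred, t):
    """Square of hinge: margin violations cost disproportionately more."""
    return hinge(y_pred, t) ** 2.0

t = 1.0
small_error, large_error = 0.5, -2.0
print(hinge(small_error, t), squared_hinge(small_error, t))  # 0.5 vs 0.25: small error punished less
print(hinge(large_error, t), squared_hinge(large_error, t))  # 3.0 vs 9.0: large error punished more
```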
Additionally, especially around \(target = +1.0\) in the situation above (if your target were \(-1.0\), it would apply there too), the loss function of traditional hinge loss behaves in a relatively non-smooth way, like the ReLU activation function does around \(x = 0\). Although it is very unlikely, this might impact how your model optimizes, since the loss landscape is not smooth. With squared hinge, the function is smooth – but it is more sensitive to larger errors (outliers).
Therefore, choose carefully!
Now that we know about what hinge loss and squared hinge loss are, we can start our actual implementation. We’ll have to first implement & discuss our dataset in order to be able to create a model.
Before you start, it’s a good idea to create a file (e.g. hinge-loss.py
) in some folder on your machine. Then, you can start off by adding the necessary software dependencies:
'''
Keras model discussing Hinge loss.
'''
import keras
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles
from mlxtend.plotting import plot_decision_regions
First, and foremost, you need the Keras deep learning framework, which allows you to create neural network architectures relatively easily. From Keras, you’ll import the Sequential API and the Dense layer (representing densely-connected layers, or the MLP-like layers you always see when people use neural networks in their presentations).
You’ll subsequently import the PyPlot API from Matplotlib for visualization, Numpy for number processing, make_circles
from Scikit-learn to generate today’s dataset and Mlxtend for visualizing the decision boundary of your model.
Hence, to run today's code you need Keras, a backend such as TensorFlow, Matplotlib, Numpy, Scikit-learn and Mlxtend – preferably installed in an Anaconda environment so that your packages run isolated from other Python ones.
As indicated, we can now generate the data that we use to demonstrate how hinge loss and squared hinge loss works. We generate data today because it allows us to entirely focus on the loss functions rather than cleaning the data. Of course, you can also apply the insights from this blog posts to other, real datasets.
We first specify some configuration options:
# Configuration options
num_samples_total = 1000
training_split = 250
Put very simply, these specify how many samples are generated in total and how many are split off the training set to form the testing set. With this configuration, we generate 1000 samples, of which 750 are training data and 250 are testing data. You’ll later see that the 750 training samples are subsequently split into true training data and validation data.
Next, we actually generate the data:
# Generate data
X, targets = make_circles(n_samples = num_samples_total, factor=0.1)
targets[np.where(targets == 0)] = -1
X_training = X[training_split:, :]
X_testing = X[:training_split, :]
Targets_training = targets[training_split:]
Targets_testing = targets[:training_split]
We first call make_circles
to generate num_samples_total
(1000 as configured) for our machine learning problem. make_circles
does what it suggests: it generates two circles, a larger one and a smaller one, which are separable – and hence perfect for machine learning blog posts. The factor
parameter, which should satisfy \(0 < factor < 1\), determines how close the circles are to each other. The lower the value, the farther the circles are positioned from each other.
We next convert all zero targets into -1. Why? Very simple: make_circles
generates targets that are either 0 or 1, which is very common in those scenarios. Zero or one would in plain English be ‘the larger circle’ or ‘the smaller circle’, but since targets are numeric in Keras they are 0 and 1.
Hinge loss doesn’t work with zeroes and ones. Instead, targets must be either +1 or -1. Hence, we’ll have to convert all zero targets into -1 in order to support Hinge loss.
Finally, we split the data into training and testing data, for both the feature vectors (the \(X\) variables) and the targets.
We can now also visualize the data, to get a feel for what we just did:
# Generate scatter plot for training data
plt.scatter(X_training[:,0], X_training[:,1])
plt.title('Nonlinear data')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
This looks as follows:
As you can see, we have generated two circles that are composed of individual data points: a large one and a smaller one. These are perfectly separable, although not linearly.
(With traditional SVMs one would have to perform the kernel trick in order to make data linearly separable in kernel space. With neural networks, this is less of a problem, since the layers activate nonlinearly.)
Now that we have a feel for the dataset, we can actually implement a Keras model that makes use of hinge loss and, in another run, squared hinge loss, in order to demonstrate how both loss functions behave.
As usual, we first define some variables for model configuration by adding this to our code:
# Set the input shape
feature_vector_shape = len(X_training[0])
input_shape = (feature_vector_shape,)
loss_function_used = 'hinge'
print(f'Feature shape: {input_shape}')
We set the shape of our feature vector to the length of the first sample from our training set. If this sample is of length 3, this means that there are three features in the feature vector. Since the array is only one-dimensional, the shape would be a one-dimensional vector of length 3. Since our training set contains X and Y values for the data points, our input_shape
is (2,).
Obviously, we use hinge
as our loss function. Using squared hinge loss is possible too by simply changing hinge
into squared_hinge
. That’s up to you!
Next, we define the architecture for our model:
# Create the model
model = Sequential()
model.add(Dense(4, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='tanh'))
We use the Keras Sequential API, which allows us to stack multiple layers easily. Contrary to other blog posts, e.g. ones where we created a MLP for classification or regression, I decided to add three layers instead of two. This was done for the reason that the dataset is slightly more complex: the decision boundary cannot be represented as a line, but must be a circle separating the smaller one from the larger one. Hence, I thought, a little bit more capacity for processing data would be useful.
The layers activate with Rectified Linear Unit or ReLU, except for the last one, which activates by means of Tanh. I chose ReLU because it is the de facto standard activation function and requires fewest computational resources without compromising in predictive performance. I chose Tanh because of the way the predictions must be generated: they should end up in the range [-1, +1], given the way Hinge loss works (remember why we had to convert our generated targets from zero to minus one?).
Tanh indeed precisely does this — converting a linear value to a range close to [-1, +1], namely (-1, +1) – the actual ones are not included here, but this doesn’t matter much. It looks like this:
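In lieu of the plot, a quick numerical check confirms tanh's output range (a minimal sketch):

```python
import numpy as np

# tanh squashes any real input into the open interval (-1, +1);
# even extreme inputs stay strictly between -1 and +1
for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, np.tanh(x))
```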
The kernels of the ReLU-activating layers are initialized with He uniform init instead of Glorot init, because He initialization is mathematically better suited to ReLU-like activation functions.
Information is eventually converted into one prediction: the target. Hence, the final layer has one neuron. The intermediate ones have fewer neurons, in order to stimulate the model to generate more abstract representations of the information during the feedforward procedure.
Now that we know what architecture we’ll use, we can perform hyperparameter configuration. We can also actually start training our model.
However, first, the hyperparameters:
# Configure the model and start training
model.compile(loss=loss_function_used, optimizer=keras.optimizers.adam(lr=0.03), metrics=['accuracy'])
The loss function used is, indeed, hinge
loss. We use Adam for optimization and manually configure the learning rate to 0.03 since initial experiments showed that the default learning rate is insufficient to learn the decision boundary many times. In your case, it may be that you have to shuffle with the learning rate as well; you can configure it there. As an additional metric, we included accuracy, since it can be interpreted by humans slightly better.
Now the actual training process:
history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2)
We fit the training data (X_training
and Targets_training
) to the model architecture and allow it to optimize for 30 epochs, or iterations. Each batch that is fed forward through the network during an epoch contains five samples, which allows to benefit from accurate gradients without losing too much time and / or resources which increase with decreasing batch size. Verbosity mode is set to 1 (‘True’) in order to output everything during the training process, which helps your understanding. As highlighted before, we split the training data into true training data and validation data: 20% of the training data is used for validation.
Hence, from the 1000 samples that were generated, 250 are used for testing, 600 are used for training and 150 are used for validation (600 + 150 + 250 = 1000).
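The arithmetic of this split can be sketched as a quick sanity check (the numbers follow directly from the configuration used in this post):

```python
num_samples_total = 1000
training_split = 250     # samples reserved for testing
validation_split = 0.2   # fraction of the remainder used for validation

num_testing = training_split
num_remaining = num_samples_total - training_split       # 750 samples left
num_validation = int(num_remaining * validation_split)   # 150 for validation
num_training = num_remaining - num_validation            # 600 for true training

assert num_training + num_validation + num_testing == num_samples_total
print(num_training, num_validation, num_testing)  # 600 150 250
```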
We store the results of the fitting (training) procedure into a history
object, which allows us the actually visualize model performance across epochs. But first, we add code for testing the model for its generalization power:
# Test the model after training
test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
Then a plot of the decision boundary based on the testing data:
# Plot decision boundary
plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2)
plt.show()
And eventually, the visualization for the training process:
# Visualize training process
plt.plot(history.history['loss'], label='Hinge loss (training data)')
plt.plot(history.history['val_loss'], label='Hinge loss (validation data)')
plt.title('Hinge loss for circles')
plt.ylabel('Hinge loss value')
plt.yscale('log')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
(A logarithmic scale is used because loss drops significantly during the first epoch, distorting the image if scaled linearly.)
Now, if you followed the process until now, you have a file called hinge-loss.py
. Open up the terminal which can access your setup (e.g. Anaconda Prompt or a regular terminal), cd
to the folder where your .py
is stored and execute python hinge-loss.py
. The training process should then start.
These are the results.
For hinge loss, we quite unsurprisingly found that validation accuracy went to 100% immediately. This is indeed unsurprising because the dataset is quite well separable (the distance between circles is large), the model was made quite capable of interpreting relatively complex data, and a relatively aggressive learning rate was set. This is the visualization of the training process using a logarithmic scale:
The decision boundary:
Or in plain text:
Epoch 1/30
600/600 [==============================] - 1s 1ms/step - loss: 0.4317 - accuracy: 0.6083 - val_loss: 0.0584 - val_accuracy: 1.0000
Epoch 2/30
600/600 [==============================] - 0s 682us/step - loss: 0.0281 - accuracy: 1.0000 - val_loss: 0.0124 - val_accuracy: 1.0000
Epoch 3/30
600/600 [==============================] - 0s 688us/step - loss: 0.0097 - accuracy: 1.0000 - val_loss: 0.0062 - val_accuracy: 1.0000
Epoch 4/30
600/600 [==============================] - 0s 693us/step - loss: 0.0054 - accuracy: 1.0000 - val_loss: 0.0038 - val_accuracy: 1.0000
Epoch 5/30
600/600 [==============================] - 0s 707us/step - loss: 0.0036 - accuracy: 1.0000 - val_loss: 0.0027 - val_accuracy: 1.0000
Epoch 6/30
600/600 [==============================] - 0s 692us/step - loss: 0.0026 - accuracy: 1.0000 - val_loss: 0.0020 - val_accuracy: 1.0000
Epoch 7/30
600/600 [==============================] - 0s 747us/step - loss: 0.0019 - accuracy: 1.0000 - val_loss: 0.0015 - val_accuracy: 1.0000
Epoch 8/30
600/600 [==============================] - 0s 717us/step - loss: 0.0015 - accuracy: 1.0000 - val_loss: 0.0012 - val_accuracy: 1.0000
Epoch 9/30
600/600 [==============================] - 0s 735us/step - loss: 0.0012 - accuracy: 1.0000 - val_loss: 0.0010 - val_accuracy: 1.0000
Epoch 10/30
600/600 [==============================] - 0s 737us/step - loss: 0.0010 - accuracy: 1.0000 - val_loss: 8.4231e-04 - val_accuracy: 1.0000
Epoch 11/30
600/600 [==============================] - 0s 720us/step - loss: 8.6515e-04 - accuracy: 1.0000 - val_loss: 7.1493e-04 - val_accuracy: 1.0000
Epoch 12/30
600/600 [==============================] - 0s 786us/step - loss: 7.3818e-04 - accuracy: 1.0000 - val_loss: 6.1438e-04 - val_accuracy: 1.0000
Epoch 13/30
600/600 [==============================] - 0s 732us/step - loss: 6.3710e-04 - accuracy: 1.0000 - val_loss: 5.3248e-04 - val_accuracy: 1.0000
Epoch 14/30
600/600 [==============================] - 0s 703us/step - loss: 5.5483e-04 - accuracy: 1.0000 - val_loss: 4.6540e-04 - val_accuracy: 1.0000
Epoch 15/30
600/600 [==============================] - 0s 728us/step - loss: 4.8701e-04 - accuracy: 1.0000 - val_loss: 4.1065e-04 - val_accuracy: 1.0000
Epoch 16/30
600/600 [==============================] - 0s 732us/step - loss: 4.3043e-04 - accuracy: 1.0000 - val_loss: 3.6310e-04 - val_accuracy: 1.0000
Epoch 17/30
600/600 [==============================] - 0s 733us/step - loss: 3.8266e-04 - accuracy: 1.0000 - val_loss: 3.2392e-04 - val_accuracy: 1.0000
Epoch 18/30
600/600 [==============================] - 0s 782us/step - loss: 3.4199e-04 - accuracy: 1.0000 - val_loss: 2.9011e-04 - val_accuracy: 1.0000
Epoch 19/30
600/600 [==============================] - 0s 755us/step - loss: 3.0694e-04 - accuracy: 1.0000 - val_loss: 2.6136e-04 - val_accuracy: 1.0000
Epoch 20/30
600/600 [==============================] - 0s 768us/step - loss: 2.7671e-04 - accuracy: 1.0000 - val_loss: 2.3608e-04 - val_accuracy: 1.0000
Epoch 21/30
600/600 [==============================] - 0s 778us/step - loss: 2.5032e-04 - accuracy: 1.0000 - val_loss: 2.1384e-04 - val_accuracy: 1.0000
Epoch 22/30
600/600 [==============================] - 0s 725us/step - loss: 2.2715e-04 - accuracy: 1.0000 - val_loss: 1.9442e-04 - val_accuracy: 1.0000
Epoch 23/30
600/600 [==============================] - 0s 728us/step - loss: 2.0676e-04 - accuracy: 1.0000 - val_loss: 1.7737e-04 - val_accuracy: 1.0000
Epoch 24/30
600/600 [==============================] - 0s 680us/step - loss: 1.8870e-04 - accuracy: 1.0000 - val_loss: 1.6208e-04 - val_accuracy: 1.0000
Epoch 25/30
600/600 [==============================] - 0s 738us/step - loss: 1.7264e-04 - accuracy: 1.0000 - val_loss: 1.4832e-04 - val_accuracy: 1.0000
Epoch 26/30
600/600 [==============================] - 0s 702us/step - loss: 1.5826e-04 - accuracy: 1.0000 - val_loss: 1.3628e-04 - val_accuracy: 1.0000
Epoch 27/30
600/600 [==============================] - 0s 802us/step - loss: 1.4534e-04 - accuracy: 1.0000 - val_loss: 1.2523e-04 - val_accuracy: 1.0000
Epoch 28/30
600/600 [==============================] - 0s 738us/step - loss: 1.3374e-04 - accuracy: 1.0000 - val_loss: 1.1538e-04 - val_accuracy: 1.0000
Epoch 29/30
600/600 [==============================] - 0s 762us/step - loss: 1.2326e-04 - accuracy: 1.0000 - val_loss: 1.0645e-04 - val_accuracy: 1.0000
Epoch 30/30
600/600 [==============================] - 0s 742us/step - loss: 1.1379e-04 - accuracy: 1.0000 - val_loss: 9.8244e-05 - val_accuracy: 1.0000
250/250 [==============================] - 0s 52us/step
Test results - Loss: 0.0001128034592838958 - Accuracy: 100.0%
We can see that validation loss is still decreasing together with training loss, so the model is not overfitting yet.
Reason why? Simple. My thesis is that this occurs because the data, both in the training and validation set, is perfectly separable. The decision boundary is crystal clear.
By changing loss_function_used into squared_hinge, we can now show you results for squared hinge:
loss_function_used = 'squared_hinge'
Visually, it looks as follows:
And once again plain text:
Epoch 1/30
600/600 [==============================] - 1s 1ms/step - loss: 0.2361 - accuracy: 0.7117 - val_loss: 0.0158 - val_accuracy: 1.0000
Epoch 2/30
600/600 [==============================] - 0s 718us/step - loss: 0.0087 - accuracy: 1.0000 - val_loss: 0.0050 - val_accuracy: 1.0000
Epoch 3/30
600/600 [==============================] - 0s 727us/step - loss: 0.0036 - accuracy: 1.0000 - val_loss: 0.0026 - val_accuracy: 1.0000
Epoch 4/30
600/600 [==============================] - 0s 723us/step - loss: 0.0020 - accuracy: 1.0000 - val_loss: 0.0016 - val_accuracy: 1.0000
Epoch 5/30
600/600 [==============================] - 0s 723us/step - loss: 0.0014 - accuracy: 1.0000 - val_loss: 0.0011 - val_accuracy: 1.0000
Epoch 6/30
600/600 [==============================] - 0s 713us/step - loss: 9.7200e-04 - accuracy: 1.0000 - val_loss: 8.3221e-04 - val_accuracy: 1.0000
Epoch 7/30
600/600 [==============================] - 0s 697us/step - loss: 7.3653e-04 - accuracy: 1.0000 - val_loss: 6.4083e-04 - val_accuracy: 1.0000
Epoch 8/30
600/600 [==============================] - 0s 688us/step - loss: 5.7907e-04 - accuracy: 1.0000 - val_loss: 5.1182e-04 - val_accuracy: 1.0000
Epoch 9/30
600/600 [==============================] - 0s 712us/step - loss: 4.6838e-04 - accuracy: 1.0000 - val_loss: 4.1928e-04 - val_accuracy: 1.0000
Epoch 10/30
600/600 [==============================] - 0s 698us/step - loss: 3.8692e-04 - accuracy: 1.0000 - val_loss: 3.4947e-04 - val_accuracy: 1.0000
Epoch 11/30
600/600 [==============================] - 0s 723us/step - loss: 3.2525e-04 - accuracy: 1.0000 - val_loss: 2.9533e-04 - val_accuracy: 1.0000
Epoch 12/30
600/600 [==============================] - 0s 735us/step - loss: 2.7692e-04 - accuracy: 1.0000 - val_loss: 2.5270e-04 - val_accuracy: 1.0000
Epoch 13/30
600/600 [==============================] - 0s 710us/step - loss: 2.3846e-04 - accuracy: 1.0000 - val_loss: 2.1917e-04 - val_accuracy: 1.0000
Epoch 14/30
600/600 [==============================] - 0s 773us/step - loss: 2.0745e-04 - accuracy: 1.0000 - val_loss: 1.9093e-04 - val_accuracy: 1.0000
Epoch 15/30
600/600 [==============================] - 0s 718us/step - loss: 1.8180e-04 - accuracy: 1.0000 - val_loss: 1.6780e-04 - val_accuracy: 1.0000
Epoch 16/30
600/600 [==============================] - 0s 730us/step - loss: 1.6039e-04 - accuracy: 1.0000 - val_loss: 1.4876e-04 - val_accuracy: 1.0000
Epoch 17/30
600/600 [==============================] - 0s 698us/step - loss: 1.4249e-04 - accuracy: 1.0000 - val_loss: 1.3220e-04 - val_accuracy: 1.0000
Epoch 18/30
600/600 [==============================] - 0s 807us/step - loss: 1.2717e-04 - accuracy: 1.0000 - val_loss: 1.1842e-04 - val_accuracy: 1.0000
Epoch 19/30
600/600 [==============================] - 0s 722us/step - loss: 1.1404e-04 - accuracy: 1.0000 - val_loss: 1.0641e-04 - val_accuracy: 1.0000
Epoch 20/30
600/600 [==============================] - 1s 860us/step - loss: 1.0269e-04 - accuracy: 1.0000 - val_loss: 9.5853e-05 - val_accuracy: 1.0000
Epoch 21/30
600/600 [==============================] - 0s 768us/step - loss: 9.2805e-05 - accuracy: 1.0000 - val_loss: 8.6761e-05 - val_accuracy: 1.0000
Epoch 22/30
600/600 [==============================] - 0s 753us/step - loss: 8.4169e-05 - accuracy: 1.0000 - val_loss: 7.8690e-05 - val_accuracy: 1.0000
Epoch 23/30
600/600 [==============================] - 0s 727us/step - loss: 7.6554e-05 - accuracy: 1.0000 - val_loss: 7.1713e-05 - val_accuracy: 1.0000
Epoch 24/30
600/600 [==============================] - 0s 720us/step - loss: 6.9799e-05 - accuracy: 1.0000 - val_loss: 6.5581e-05 - val_accuracy: 1.0000
Epoch 25/30
600/600 [==============================] - 0s 715us/step - loss: 6.3808e-05 - accuracy: 1.0000 - val_loss: 5.9929e-05 - val_accuracy: 1.0000
Epoch 26/30
600/600 [==============================] - 0s 695us/step - loss: 5.8448e-05 - accuracy: 1.0000 - val_loss: 5.4957e-05 - val_accuracy: 1.0000
Epoch 27/30
600/600 [==============================] - 0s 730us/step - loss: 5.3656e-05 - accuracy: 1.0000 - val_loss: 5.0587e-05 - val_accuracy: 1.0000
Epoch 28/30
600/600 [==============================] - 0s 760us/step - loss: 4.9353e-05 - accuracy: 1.0000 - val_loss: 4.6493e-05 - val_accuracy: 1.0000
Epoch 29/30
600/600 [==============================] - 0s 750us/step - loss: 4.5461e-05 - accuracy: 1.0000 - val_loss: 4.2852e-05 - val_accuracy: 1.0000
Epoch 30/30
600/600 [==============================] - 0s 753us/step - loss: 4.1936e-05 - accuracy: 1.0000 - val_loss: 3.9584e-05 - val_accuracy: 1.0000
250/250 [==============================] - 0s 56us/step
Test results - Loss: 4.163062170846388e-05 - Accuracy: 100.0%
As you can see, squared hinge works as well. Comparing the two decision boundaries –
…it seems to be the case that the decision boundary for squared hinge is closer, or tighter. Perhaps due to the smoothness of the loss landscape? However, this cannot be said for sure.
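That tighter fit is consistent with how the two losses penalize margin violations. A quick NumPy sketch (my own helper functions, not the Keras implementation) makes the difference concrete:

```python
import numpy as np

def hinge(y_true, y_pred):
    # Zero loss once the prediction is on the correct side with margin >= 1
    return np.maximum(0.0, 1.0 - y_true * y_pred)

def squared_hinge(y_true, y_pred):
    # Squaring penalizes larger margin violations more heavily
    return hinge(y_true, y_pred) ** 2
```

For a target of +1 predicted at 0.5, hinge yields 0.5 while squared hinge yields 0.25; for a prediction of -1, hinge yields 2.0 but squared hinge 4.0. Small violations are penalized less, large ones much more, which could push the boundary tighter around the clusters.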
In this blog post, we’ve seen how to create a machine learning model with Keras by means of the hinge loss and the squared hinge loss cost functions. We introduced hinge loss and squared hinge intuitively from a mathematical point of view, then swiftly moved on to an actual implementation. Results demonstrate that hinge loss and squared hinge loss can be successfully used in nonlinear classification scenarios, but they are relatively sensitive to the separability of your dataset (whether it’s linear or nonlinear does not matter). Perhaps binary crossentropy is less sensitive – we’ll take a look at this in a future blog post.
For now, it remains to thank you for reading this post – I hope you’ve been able to derive some new insights from it! Please let me know what you think by writing a comment below , I’d really appreciate it! Thanks and happy engineering!
Note that the full code for the models we created in this blog post is also available through my Keras Loss Functions repository on GitHub.
Wikipedia. (2011, September 16). Hinge loss. Retrieved from https://en.wikipedia.org/wiki/Hinge_loss
About loss and loss functions – MachineCurve. (2019, October 15). Retrieved from https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/
Intuitively understanding SVM and SVR – MachineCurve. (2019, September 20). Retrieved from https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/
Mastering Keras – MachineCurve. (2019, July 21). Retrieved from https://www.machinecurve.com/index.php/mastering-keras/
How to create a basic MLP classifier with the Keras Sequential API – MachineCurve. (2019, July 27). Retrieved from https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/
How to visualize the decision boundary for your Keras model? – MachineCurve. (2019, October 11). Retrieved from https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/
The post How to use hinge & squared hinge loss with Keras? appeared first on MachineCurve.
The post Leaky ReLU: improving traditional ReLU appeared first on MachineCurve.
But how is it an improvement? How does Leaky ReLU work? In this blog, we’ll take a look. We identify what ReLU does and why this may be problematic in some cases. We then introduce Leaky ReLU and argue why its design can help reduce the impact of the problems of traditional ReLU. Subsequently, we briefly look into whether it is actually better and why traditional ReLU is still in favor today.
Rectified Linear Unit, or ReLU, is one of the most common activation functions used in neural networks today. It is added to layers in neural networks to add nonlinearity, which is required to handle today’s ever more complex and nonlinear datasets.
Each neuron computes a dot product and adds a bias value before the value is output to the neurons in the subsequent layer. These mathematical operations are linear in nature. This would not be a problem if we were training the model against a dataset that is linearly separable (in the case of classification) or where a line needs to be estimated (when regressing).
However, if data is nonlinear, we face problems. Linear neuron outputs ensure that the system as a whole, thus the entire neural network, behaves linearly. By consequence, it cannot handle such data, which is very common today: the MNIST dataset, which we used for showing how to build classifiers in Keras, is nonlinear – and it is one of the simpler ones!
Activation functions come to the rescue by adding nonlinearity. They’re placed directly after the neural outputs and do nothing else but converting some input to some output. Because the mathematical functions used are nonlinear, the output is nonlinear – which is exactly what we want, since now the system behaves nonlinearly and nonlinear data is supported!
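The need for nonlinearity can be demonstrated directly: without an activation function, stacked layers collapse into a single linear map. A small NumPy sketch (weights are arbitrary random values, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two "layers" without activation: y = W2 (W1 x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# ...which collapses into one linear map: y = (W2 W1) x + (W2 b1 + b2)
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
assert np.allclose(two_layer, collapsed)
```

However many linear layers you stack, the network as a whole remains a single linear function, which is exactly why the nonlinear activation in between matters.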
Note that although activation functions are pretty much nonlinear all the time, it’s of course also possible to use the identity function \(f(x) = x\) as an activation function. It would be pointless, but it can be done.
Now ReLU. It can be expressed as follows:
\begin{equation} f(x) = \begin{cases} 0, & \text{if}\ x < 0 \\ x, & \text{otherwise} \\ \end{cases} \end{equation}

And visualized in this way:
For all values \(\geq 0\), it behaves linearly, but essentially behaves nonlinearly by outputting zeroes for all negative inputs.
Hence, it can be used as a nonlinear activation function.
It’s grown very popular and may be the most popular activation used today – it is more popular than the older Sigmoid and Tanh activation functions – for the reason that it can be computed relatively inexpensively. Computing ReLU is equal to computing \(ReLU(x) = max(0, x)\), which is much less expensive than the exponents or trigonometric operations necessary otherwise.
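That inexpensiveness shows in code: ReLU is essentially a one-liner. A minimal NumPy sketch:

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)
```

For example, relu applied to [-2, 0, 3] yields [0, 0, 3]: all negative inputs are clipped to zero, positive inputs pass through unchanged.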
However, it’s not a silver bullet: every now and then, you’ll run into trouble when using ReLU. This doesn’t happen often – which makes ReLU highly generalizable across machine learning domains and problems – but you may run into some issues.
Firstly, ReLU is not continuously differentiable. At \(x = 0\), the breaking point between \(x\) and 0, the gradient cannot be computed. This is not too problematic, but can slightly impact training performance.
Secondly, and more gravely, ReLU sets all values < 0 to zero. This is beneficial in terms of sparsity, as the network will adapt to ensure that the most important neurons have values of > 0. However, this is a problem as well, since the gradient of 0 is 0 and hence neurons arriving at large negative values cannot recover from being stuck at 0. The neuron effectively dies and hence the problem is known as the dying ReLU problem. You’re especially vulnerable to it when your neurons are not initialized properly or when your data is not normalized very well, causing significant weight swings during the first phases of optimizing your model. The impact of this problem may be that your network essentially stops learning and underperforms.
What if you caused a slight but significant information leak in the left part of ReLU, i.e. the part where the output is always 0?
This is the premise behind Leaky ReLU, one of the possible newer activation functions that attempts to minimize one’s sensitivity to the dying ReLU problem.
Mathematically, it is defined as follows (Maas et al., 2013):
\begin{equation} f(x) = \begin{cases} 0.01x, & \text{if}\ x < 0 \\ x, & \text{otherwise} \\ \end{cases} \end{equation}

Leaky ReLU can be visualized as follows:
If you compare this with the image for traditional ReLU above, you’ll see that for all inputs \(< 0\), the outputs are slightly descending. The thesis is that these small negative outputs reduce the death of ReLU-activated neurons. This way, you’ll have to worry less about the initialization of your neural network and the normalization of your data. Although these topics remain important, they are slightly less critical.
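The definition above translates directly into code. A minimal NumPy sketch (the 0.01 slope follows Maas et al., 2013; Keras exposes it as a configurable alpha):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for x >= 0; alpha * x otherwise, so negative inputs keep a small gradient
    return np.where(x >= 0, x, alpha * x)
```

A strongly negative input such as -10 now yields -0.1 instead of 0, so the gradient in the negative regime is alpha rather than zero – which is the whole point of the leak.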
Next, the question: does Leaky ReLU really work? That is, does it really reduce the likelihood that your ReLU activating network dies off?
Let’s try and find out.
Nouroz Rahman isn’t convinced:
However, I personally don’t think Leaky ReLU provides any advantage over ReLU, holistically, considering both training and accuracy although some papers claimed to achieve that. That’s why Leaky ReLU is trivial in deep learning and honestly speaking, I have never used it or thought of the necessity of using it.
Nouroz Rahman
In a 2018 study, Pedamonti argues that Leaky ReLU and ReLU perform similarly on the MNIST dataset. Even though the dying ReLU problem may now be solved theoretically, it can be the case that it simply doesn’t happen very often – and that in those cases, normal ReLU works just as well. “It’s simple, it’s fast, it’s standard” – someone argued. And I tend to agree.
In this blog post, we’ve seen which problems challenge ReLU-activated neural networks. We also introduced the Leaky ReLU, which attempts to resolve issues with traditional ReLU that are related to dying neurons. We can conclude that in many cases traditional / normal ReLU remains adequate, and that Leaky ReLU benefits you in those cases where you suspect your neurons are dying. I’d say: use ReLU if you can, and other linear rectifiers if you need to.
Happy engineering!
Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. Retrieved from https://www.semanticscholar.org/paper/Rectifier-Nonlinearities-Improve-Neural-Network-Maas/367f2c63a6f6a10b3b64b8729d601e69337ee3cc
What are the advantages of using Leaky Rectified Linear Units (Leaky ReLU) over normal ReLU in deep learning? (n.d.). Retrieved from https://www.quora.com/What-are-the-advantages-of-using-Leaky-Rectified-Linear-Units-Leaky-ReLU-over-normal-ReLU-in-deep-learning
Pedamonti, D. (2018). Comparison of non-linear activation functions for deep neural networks on MNIST classification task. arXiv preprint arXiv:1804.02763.
The post Using Huber loss in Keras appeared first on MachineCurve.
But how to implement this loss function in Keras?
That’s what we will find out in this blog.
We first briefly recap the concept of a loss function and introduce Huber loss. Next, we present a Keras example implementation that uses the Boston Housing Prices Dataset to generate a regression model. Let’s go!
Note that the full code is also available on GitHub, in my Keras loss functions repository.
When you train machine learning models, you feed data to the network, generate predictions, compare them with the actual values (the targets) and then compute what is known as a loss. This loss essentially tells you something about the performance of the network: the higher it is, the worse your network performs overall.
There are many ways for computing the loss value. Huber loss is one of them. It essentially combines the Mean Absolute Error and the Mean Squared Error depending on some delta parameter, or 𝛿. This parameter must be configured by the machine learning engineer up front and is dependent on your data.
Huber loss looks like this:
As you can see, for target = 0, the loss increases when the error increases. However, the speed with which it increases depends on this 𝛿 value. In fact, Grover (2019) writes about this as follows: Huber loss approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers.)
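Concretely, Huber loss is quadratic (MSE-like) for errors within 𝛿 of zero and linear (MAE-like) beyond. A minimal NumPy sketch – not the Keras implementation itself, but the same formula:

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond: delta * (|error| - delta/2)
    error = np.abs(y_true - y_pred)
    quadratic = np.minimum(error, delta)
    linear = error - quadratic
    return 0.5 * quadratic ** 2 + delta * linear
```

With delta = 1, an error of 0.5 falls in the quadratic regime (loss 0.125, like half the squared error), whereas an error of 3 falls in the linear regime (loss 2.5), growing far more slowly than the squared error of 9 would – which is exactly what makes Huber loss more robust to outliers.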
When you compare this statement with the benefits and drawbacks of both the MAE and the MSE, you’ll gain some insights about how to adapt this delta parameter.
Let’s now see if we can complete a regression problem with Huber loss!
Next, we show you how to use Huber loss with Keras to create a regression model. We’ll use the Boston housing price regression dataset which comes with Keras by default – that’ll make the example easier to follow. Obviously, you can always use your own data instead!
Since we need to know how to configure 𝛿, we must inspect the data at first. Do the target values contain many outliers? Some statistical analysis would be useful here.
Only then, we create the model and configure 𝛿 to an estimate that seems adequate. Finally, we run the model, check performance, and see whether we can improve 𝛿 any further.
Keras comes with several datasets included in the framework: they are stored on an Amazon AWS server, and when you load the data, Keras automatically downloads it for you and stores it in user-defined variables. This allows you to experiment with deep learning and the framework easily. This way, you can get a feel for DL practice and neural networks without getting lost in the complexity of loading, preprocessing and structuring your data.
The Boston housing price regression dataset is one of these datasets. It is taken by Keras from the Carnegie Mellon University StatLib library that contains many datasets for training ML models. It is described as follows:
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, ‘Regression diagnostics …’, Wiley, 1980.
StatLib Datasets Archive
It contains a number of variables, which are described in detail on the StatLib website.
In total, one sample contains 13 features (CRIM to LSTAT) which together approximate the median value of the owner-occupied homes or MEDV. The structure of this dataset, mapping some variables to a real-valued number, allows us to perform regression.
Let’s now take a look at the dataset itself, and particularly its target values.
The number of outliers helps us tell something about the value for 𝛿 that we have to choose. When thinking back to my Introduction to Statistics class at university, I remember that box plots can help visually identify outliers in a statistical sample:
Examination of the data for unusual observations that are far removed from the mass of data. These points are often referred to as outliers. Two graphical techniques for identifying outliers, scatter plots and box plots, (…)
Engineering Statistics Handbook
The sample, in our case, is the Boston housing dataset: it contains some mappings between feature variables and target prices, but obviously doesn’t represent all homes in Boston, which would be the statistical population. Nevertheless, we can write some code to generate a box plot based on this dataset:
'''
Generate a BoxPlot image to determine how many outliers are within the Boston Housing Pricing Dataset.
'''
import keras
from keras.datasets import boston_housing
import numpy as np
import matplotlib.pyplot as plt
# Load the data
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()
# We only need the targets, but do need to consider all of them
y = np.concatenate((y_train, y_test))
# Generate box plot
plt.boxplot(y)
plt.title('Boston housing price regression dataset - boxplot')
plt.show()
And next run it, to find this box plot:
Note that we concatenated the training data and the testing data for this box plot. Although the plot hints at the fact that many outliers exist, and primarily at the high end of the statistical spectrum (which does make sense after all, since in life extremely high house prices are quite common whereas extremely low ones are not), we cannot yet conclude that the MSE may not be a good idea. We’ll need to inspect the individual datasets too.
We can do that by simply adapting our code to:
y = y_train
or
y = y_test
This results in the following box plots:
Although the number of outliers is more extreme in the training data, they are present in the testing dataset as well.
Their structure is also quite similar: most of them, if not all, are present in the high end segment of the housing market.
Do note, however, that the median value for the testing dataset and the training dataset are slightly different. This means that patterns underlying housing prices present in the testing data may not be captured fully during the training process, because the statistical sample is slightly different. However, there is only one way to find out – by actually creating a regression model!
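The visual impression from the box plots can also be complemented with a numeric check. A small sketch using Tukey’s 1.5 × IQR rule, the same rule the box-plot whiskers are based on (the helper function is my own):

```python
import numpy as np

def count_outliers(y):
    # Tukey's rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return int(np.sum((y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)))
```

Applied to y_train and y_test from the Boston dataset, this would give a rough count of how many targets sit beyond the whiskers, which can inform the choice of 𝛿.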
Let’s now create the model. Create a file called huber_loss.py in some folder and open the file in a development environment. We’re then ready to add some code! However, let’s first analyze what you’ll need to use Huber loss in Keras.
The primary dependency that you’ll need is Keras, the deep learning framework for Python. However, not any version of Keras works – I quite soon ran into trouble with respect to a (relatively) outdated Keras version, with errors like huber_loss not found.
I had to upgrade Keras to the newest version, as apparently Huber loss was added quite recently – but this also meant that I had to upgrade Tensorflow, the processing engine on top of which my Keras runs. Since on my machine Tensorflow runs on GPU, I also had to upgrade CUDA to support the newest Tensorflow version. Some insights:
Since for installing CUDA you’ll also need CuDNN, I refer you to another blogpost which perfectly explains how to install Tensorflow GPU and CUDA. However, you’ll need to consider the requirements listed above or, even better, the official Tensorflow GPU requirements! When you install them correctly, you’ll be able to run Huber loss in Keras.
…cost me an afternoon to fix this, though
Now that we can start coding, let’s import the Python dependencies that we need first:
'''
Keras model demonstrating Huber loss
'''
from keras.datasets import boston_housing
from keras.models import Sequential
from keras.layers import Dense
from keras.losses import huber_loss
import numpy as np
import matplotlib.pyplot as plt
Obviously, we need the boston_housing dataset from the available Keras datasets. Additionally, we import Sequential as we will build our model using the Keras Sequential API. We’re creating a very simple model, a multilayer perceptron, with which we’ll attempt to regress a function that correctly estimates the median values of Boston homes. For this reason, we import Dense layers, or densely-connected ones.
We also need huber_loss, since that’s the loss function we use. Numpy is used for number processing and we use Matplotlib to visualize the end result.
Next, we’ll have to perform a pretty weird thing to make Huber loss usable in Keras.
Even though Keras apparently natively supports Huber loss by providing huber_loss as a String value during model configuration, there’s no point in this, since the delta value discussed before cannot be configured that way. Hence, we need to think differently.
…but there was no way to include Huber loss directly into Keras, it seemed, until I came across an answer on Stackoverflow! It defines a custom Huber loss Keras function which can be successfully used. I slightly adapted it, and we’ll add it next:
# Define the Huber loss so that it can be used with Keras
def huber_loss_wrapper(**huber_loss_kwargs):
def huber_loss_wrapped_function(y_true, y_pred):
return huber_loss(y_true, y_pred, **huber_loss_kwargs)
return huber_loss_wrapped_function
We next load the data by calling the Keras load_data() function on the housing dataset and prepare the input layer shape, which we can add to the initial hidden layer later:
# Load data
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()
# Set the input shape
shape_dimension = len(x_train[0])
input_shape = (shape_dimension,)
print(f'Feature shape: {input_shape}')
Next, we do actually provide the model architecture and configuration:
# Create the model
model = Sequential()
model.add(Dense(16, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(8, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
# Configure the model and start training
model.compile(loss=huber_loss_wrapper(delta=1.5), optimizer='adam', metrics=['mean_absolute_error'])
history = model.fit(x_train, y_train, epochs=250, batch_size=1, verbose=1, validation_split=0.2)
As discussed, we use the Sequential API; here, we use two densely-connected hidden layers and one output layer. The hidden ones activate by means of ReLU and for this reason require He uniform initialization. The final layer activates linearly, because it regresses the actual value.
Compiling the model requires specifying the delta value, which we set to 1.5, given our estimate that we don’t want true MAE, but that, given the outliers identified earlier, full MSE resemblance is not smart either. We’ll optimize by means of Adam and also define the MAE as an extra error metric. This way, we can have an estimate about what the true error is in terms of thousands of dollars: the MAE keeps its domain understanding whereas Huber loss does not.
Subsequently, we fit the training data to the model, complete 250 epochs with a batch size of 1 (true SGD-like optimization, albeit with Adam), use 20% of the data as validation data and ensure that the entire training process is output to standard output.
Finally, we add some code for performance testing and visualization:
# Test the model after training
test_results = model.evaluate(x_test, y_test, verbose=1)
print(f'Test results - Loss: {test_results[0]} - MAE: {test_results[1]}')
# Plot history: Huber loss and MAE
plt.plot(history.history['loss'], label='Huber loss (training data)')
plt.plot(history.history['val_loss'], label='Huber loss (validation data)')
plt.title('Boston Housing Price Dataset regression model - Huber loss')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
plt.title('Boston Housing Price Dataset regression model - MAE')
plt.plot(history.history['mean_absolute_error'], label='MAE (training data)')
plt.plot(history.history['val_mean_absolute_error'], label='MAE (validation data)')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
Let’s now take a look at how the model has optimized over the epochs with the Huber loss:
And with the MAE:
We can see that overall, the model was still improving at the 250th epoch, although progress was stalling – which is perfectly normal in such a training process. The mean absolute error was approximately 3.64, i.e. some $3,639, since the targets are expressed in thousands of dollars.
Test results - Loss: 4.502029736836751 - MAE: 3.6392388343811035
In this blog post, we’ve seen how the Huber loss can be used to balance between MAE and MSE in machine learning regression problems. By means of the delta parameter, or 𝛿, you can configure which one it should resemble most, benefiting from the fact that you can check the number of outliers in your dataset a priori. I hope you’ve enjoyed this blog and learnt something from it – please let me know in the comments if you have any questions or remarks. Thanks and happy engineering!
Note that the full code is also available on GitHub, in my Keras loss functions repository.
Grover, P. (2019, September 25). 5 Regression Loss Functions All Machine Learners Should Know. Retrieved from https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
StatLib—Datasets Archive. (n.d.). Retrieved from http://lib.stat.cmu.edu/datasets/
Keras. (n.d.). Datasets. Retrieved from https://keras.io/datasets/
Keras. (n.d.). Boston housing price regression dataset. Retrieved from https://keras.io/datasets/#boston-housing-price-regression-dataset
Carnegie Mellon University StatLib. (n.d.). Boston house-price data. Retrieved from http://lib.stat.cmu.edu/datasets/boston
Engineering Statistics Handbook. (n.d.). 7.1.6. What are outliers in the data? Retrieved from https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
Using Tensorflow Huber loss in Keras. (n.d.). Retrieved from https://stackoverflow.com/questions/47840527/using-tensorflow-huber-loss-in-keras
The post How to visualize the decision boundary for your Keras model? appeared first on MachineCurve.
Different algorithms take different approaches to generating such decision boundaries. Neural networks learn them differently, dependent on the optimizer, activation function(s) and loss function used in your training setting. They support multiclass classification quite natively in many cases.
Support Vector Machines learn them by finding a maximum-margin boundary between the two (!) classes in your ML problem. Indeed, a single SVM does not work for more than two classes, and many SVMs have to be trained and merged to support multiclass classification.
Linear classifiers generate a linear decision boundary, which can happen in a multitude of ways – whether with SVMs, neural networks or more traditional techniques such as just fitting a line.
And so on.
But how do we visualize such a decision boundary? Especially: how do I visualize the decision boundary for my Keras classifier? That’s what we’ll answer in this blog post today. By means of the library Mlxtend created by Raschka (2018), we show you by means of example code how to visualize the decision boundaries of classifiers for both linearly separable and nonlinear data.
Are you ready?
Let’s go!
Note that code is also available on GitHub, in my Keras Visualizations repository.
Now that we know what a decision boundary is, we can try to visualize some of them for our Keras models. Here, we’ll provide an example for visualizing the decision boundary with linearly separable data.
Thus, data which can be separated by drawing a line in between the clusters. Typically, this is seen with classifiers and particularly Support Vector Machines (which maximize the margin between the line and the two clusters), but also with neural networks.
Let’s start. Perhaps, create a file in some folder called decision_boundary_linear_data.py, in which you’ll add the following code.
We first import the required dependencies:
# Imports
import keras
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from mlxtend.plotting import plot_decision_regions
We use Keras for training our machine learning model, which on my machine is configured to run on top of TensorFlow. Additionally, we’ll import Matplotlib, which we need to visualize our dataset. Numpy is imported for preprocessing the data, Scikit-learn‘s function make_blobs
is imported for generating the linearly separable clusters of data and Mlxtend is used for visualizing the decision boundary.
Next, we set some configuration options:
# Configuration options
num_samples_total = 1000
training_split = 250
The number of samples used in our visualization experiment is 1000 – to keep the training process fast, while still being able to show the predictive power of our model.
We use 250 samples of them as testing data by splitting them off the total dataset.
Let’s now generate data for the experiment.
With the help of the Scikit-learn library we generate data using the make_blobs
function. It generates n_samples
data points at the centers (0, 0) and (15, 15). The n_features
is two: our samples have an (x, y) value in a 2D space. The standard deviation of our clusters is set at 2.5, which adds some spread without losing linear separability.
# Generate data
X, targets = make_blobs(n_samples = num_samples_total, centers = [(0,0), (15,15)], n_features = 2, center_box=(0, 1), cluster_std = 2.5)
targets[np.where(targets == 0)] = -1
X_training = X[training_split:, :]
X_testing = X[:training_split, :]
Targets_training = targets[training_split:]
Targets_testing = targets[:training_split]
Scikit-learn’s make_blobs
generates numbers as targets, starting at 0. However, we will use Hinge loss in an attempt to maximize the decision boundary between our clusters. This should be possible given its separability. Hinge loss does not understand a target value of 0; rather, targets must be -1 or +1. Hence, we next convert all zero targets into minus one.
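To see why a target of 0 is problematic here, recall the squared hinge formula: max(0, 1 − t·y)², which only reaches zero when the prediction y sits on the correct side of the margin for a target t ∈ {−1, +1}. A minimal sketch (the helper below is our own, not a Keras API):

```python
def squared_hinge(target, prediction):
    # Squared hinge loss for a single sample: max(0, 1 - t * y)^2.
    return max(0.0, 1.0 - target * prediction) ** 2

print(squared_hinge(-1, -1.0))  # 0.0: confident, correct prediction
print(squared_hinge(1, 0.2))    # ~0.64: correct side, but inside the margin
print(squared_hinge(0, 5.0))    # 1.0: a target of 0 can never reach zero loss
```

With t = 0 the term t·y vanishes, so the loss is stuck at 1 no matter what the model predicts — which is exactly why we remap the zero targets to −1.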
We finally split between training and testing data given the number of splitoff values that we configured earlier.
We next visualize our data:
# Generate scatter plot for training data
plt.scatter(X_training[:,0], X_training[:,1])
plt.title('Linearly separable data')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
Put simply, we generate a scatter plot with Matplotlib, which clearly shows linear separability for our dataset:
We next add the (relatively basic) Keras model used today:
# Set the input shape
feature_vector_shape = len(X_training[0])
input_shape = (feature_vector_shape,)
print(f'Feature shape: {input_shape}')
# Create the model
model = Sequential()
model.add(Dense(50, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='tanh'))
# Configure the model and start training
model.compile(loss='squared_hinge', optimizer='adam', metrics=['accuracy'])
model.fit(X_training, Targets_training, epochs=50, batch_size=25, verbose=1, validation_split=0.2)
We configure the input shape and next define the model architecture – we use Keras’s Sequential API and let the data pass through two densely-connected layers. Two such layers should be sufficient for generating a successful decision boundary since our data is relatively simple – and in fact, linearly separable.
Do note that since we use the ReLU activation function, Glorot uniform initialization – the default choice in Keras – is not ideal, as it can hamper convergence with ReLU. Rather, we use He initialization, and choose to do so with a uniform distribution.
Next, we compile the model, using squared hinge as our loss function, Adam as our optimizer (it’s the de facto standard one used today) and accuracy as an additional metric – pretty much the choices I always make when creating models with Keras.
Next, we fit the training data to the model, perform 50 iterations (or epochs) with batch sizes of 25, and use 20% of our 750 training samples for validating the outcomes of the training process after every epoch. Verbosity is set to 1 to show what happens during training.
Subsequently, we evaluate the final model against the test set once training completes – to also show its power to generalize to data the model has not seen before.
# Test the model after training
test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
Finally, we add code for visualizing the model’s decision boundary. We use Mlxtend for this purpose, which is “a Python library of useful tools for the day-to-day data science tasks”. Great!
What’s even better is that we can visualize the decision boundary of our Keras model with only two lines of code:
# Plot decision boundary
plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2)
plt.show()
Note that we use our testing data for this rather than our training data, that we input the instance of our Keras model and that we display a legend.
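One caveat: depending on your mlxtend version, plot_decision_regions may insist on a classifier whose predict method returns integer class labels, while our Keras model’s predict returns raw tanh values in (-1, 1). If you hit an error there, a thin wrapper fixes it — this is a hypothetical helper of our own, not part of either library:

```python
import numpy as np

class KerasBoundaryWrapper:
    """Expose the integer-label predict() that plot_decision_regions expects."""

    def __init__(self, model):
        self.model = model

    def predict(self, X):
        # Threshold the tanh output in (-1, 1) to hard -1/+1 class labels.
        raw = np.asarray(self.model.predict(X)).ravel()
        return np.where(raw >= 0, 1, -1)

# Usage sketch:
# plot_decision_regions(X_testing, Targets_testing,
#                       clf=KerasBoundaryWrapper(model), legend=2)
```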
Altogether, this is the code for the entire experiment:
# Imports
import keras
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from mlxtend.plotting import plot_decision_regions
# Configuration options
num_samples_total = 1000
training_split = 250
# Generate data
X, targets = make_blobs(n_samples = num_samples_total, centers = [(0,0), (15,15)], n_features = 2, center_box=(0, 1), cluster_std = 2.5)
targets[np.where(targets == 0)] = -1
X_training = X[training_split:, :]
X_testing = X[:training_split, :]
Targets_training = targets[training_split:]
Targets_testing = targets[:training_split]
# Generate scatter plot for training data
plt.scatter(X_training[:,0], X_training[:,1])
plt.title('Linearly separable data')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
# Set the input shape
feature_vector_shape = len(X_training[0])
input_shape = (feature_vector_shape,)
print(f'Feature shape: {input_shape}')
# Create the model
model = Sequential()
model.add(Dense(50, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='tanh'))
# Configure the model and start training
model.compile(loss='squared_hinge', optimizer='adam', metrics=['accuracy'])
model.fit(X_training, Targets_training, epochs=50, batch_size=25, verbose=1, validation_split=0.2)
# Test the model after training
test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
# Plot decision boundary
plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2)
plt.show()
Running the code requires that you have installed all dependencies mentioned earlier, preferably in an Anaconda environment to keep them isolated. Next, open up a terminal, navigate to the folder your file is located in and run e.g. python decision_boundary_linear_data.py
. You will see Keras start training the model, after which both the scatter plot above and the decision boundary visualization are generated for you.
Epoch 1/50
600/600 [==============================] - 5s 8ms/step - loss: 1.4986 - acc: 0.4917 - val_loss: 1.0199 - val_acc: 0.6333
Epoch 2/50
600/600 [==============================] - 0s 107us/step - loss: 0.7973 - acc: 0.6933 - val_loss: 0.6743 - val_acc: 0.7400
Epoch 3/50
600/600 [==============================] - 0s 102us/step - loss: 0.6273 - acc: 0.7467 - val_loss: 0.6020 - val_acc: 0.7800
Epoch 4/50
600/600 [==============================] - 0s 102us/step - loss: 0.5472 - acc: 0.7750 - val_loss: 0.5241 - val_acc: 0.8200
Epoch 5/50
600/600 [==============================] - 0s 93us/step - loss: 0.4313 - acc: 0.8000 - val_loss: 0.4170 - val_acc: 0.8467
Epoch 6/50
600/600 [==============================] - 0s 97us/step - loss: 0.2492 - acc: 0.8283 - val_loss: 0.1900 - val_acc: 0.8800
Epoch 7/50
600/600 [==============================] - 0s 107us/step - loss: 0.1199 - acc: 0.8850 - val_loss: 0.1109 - val_acc: 0.9133
Epoch 8/50
600/600 [==============================] - 0s 98us/step - loss: 0.0917 - acc: 0.9000 - val_loss: 0.0797 - val_acc: 0.9200
Epoch 9/50
600/600 [==============================] - 0s 96us/step - loss: 0.0738 - acc: 0.9183 - val_loss: 0.0603 - val_acc: 0.9200
Epoch 10/50
600/600 [==============================] - 0s 98us/step - loss: 0.0686 - acc: 0.9200 - val_loss: 0.0610 - val_acc: 0.9200
Epoch 11/50
600/600 [==============================] - 0s 101us/step - loss: 0.0629 - acc: 0.9367 - val_loss: 0.0486 - val_acc: 0.9333
Epoch 12/50
600/600 [==============================] - 0s 108us/step - loss: 0.0574 - acc: 0.9367 - val_loss: 0.0487 - val_acc: 0.9267
Epoch 13/50
600/600 [==============================] - 0s 102us/step - loss: 0.0508 - acc: 0.9400 - val_loss: 0.0382 - val_acc: 0.9467
Epoch 14/50
600/600 [==============================] - 0s 109us/step - loss: 0.0467 - acc: 0.9483 - val_loss: 0.0348 - val_acc: 0.9533
Epoch 15/50
600/600 [==============================] - 0s 108us/step - loss: 0.0446 - acc: 0.9467 - val_loss: 0.0348 - val_acc: 0.9467
Epoch 16/50
600/600 [==============================] - 0s 109us/step - loss: 0.0385 - acc: 0.9583 - val_loss: 0.0280 - val_acc: 0.9533
Epoch 17/50
600/600 [==============================] - 0s 100us/step - loss: 0.0366 - acc: 0.9583 - val_loss: 0.0288 - val_acc: 0.9467
Epoch 18/50
600/600 [==============================] - 0s 105us/step - loss: 0.0320 - acc: 0.9633 - val_loss: 0.0227 - val_acc: 0.9733
Epoch 19/50
600/600 [==============================] - 0s 100us/step - loss: 0.0289 - acc: 0.9633 - val_loss: 0.0224 - val_acc: 0.9733
Epoch 20/50
600/600 [==============================] - 0s 107us/step - loss: 0.0264 - acc: 0.9683 - val_loss: 0.0202 - val_acc: 0.9733
Epoch 21/50
600/600 [==============================] - 0s 99us/step - loss: 0.0251 - acc: 0.9767 - val_loss: 0.0227 - val_acc: 0.9667
Epoch 22/50
600/600 [==============================] - 0s 95us/step - loss: 0.0247 - acc: 0.9750 - val_loss: 0.0170 - val_acc: 0.9800
Epoch 23/50
600/600 [==============================] - 0s 101us/step - loss: 0.0210 - acc: 0.9833 - val_loss: 0.0170 - val_acc: 0.9800
Epoch 24/50
600/600 [==============================] - 0s 104us/step - loss: 0.0192 - acc: 0.9833 - val_loss: 0.0148 - val_acc: 0.9933
Epoch 25/50
600/600 [==============================] - 0s 105us/step - loss: 0.0191 - acc: 0.9833 - val_loss: 0.0138 - val_acc: 0.9867
Epoch 26/50
600/600 [==============================] - 0s 103us/step - loss: 0.0169 - acc: 0.9867 - val_loss: 0.0128 - val_acc: 0.9933
Epoch 27/50
600/600 [==============================] - 0s 105us/step - loss: 0.0157 - acc: 0.9867 - val_loss: 0.0121 - val_acc: 1.0000
Epoch 28/50
600/600 [==============================] - 0s 103us/step - loss: 0.0150 - acc: 0.9883 - val_loss: 0.0118 - val_acc: 0.9933
Epoch 29/50
600/600 [==============================] - 0s 106us/step - loss: 0.0140 - acc: 0.9883 - val_loss: 0.0112 - val_acc: 1.0000
Epoch 30/50
600/600 [==============================] - 0s 105us/step - loss: 0.0131 - acc: 0.9917 - val_loss: 0.0101 - val_acc: 1.0000
Epoch 31/50
600/600 [==============================] - 0s 110us/step - loss: 0.0123 - acc: 0.9917 - val_loss: 0.0099 - val_acc: 1.0000
Epoch 32/50
600/600 [==============================] - 0s 111us/step - loss: 0.0119 - acc: 0.9917 - val_loss: 0.0102 - val_acc: 0.9933
Epoch 33/50
600/600 [==============================] - 0s 116us/step - loss: 0.0116 - acc: 0.9933 - val_loss: 0.0093 - val_acc: 1.0000
Epoch 34/50
600/600 [==============================] - 0s 108us/step - loss: 0.0107 - acc: 0.9933 - val_loss: 0.0085 - val_acc: 1.0000
Epoch 35/50
600/600 [==============================] - 0s 102us/step - loss: 0.0100 - acc: 0.9933 - val_loss: 0.0081 - val_acc: 1.0000
Epoch 36/50
600/600 [==============================] - 0s 103us/step - loss: 0.0095 - acc: 0.9917 - val_loss: 0.0078 - val_acc: 1.0000
Epoch 37/50
600/600 [==============================] - 0s 105us/step - loss: 0.0093 - acc: 0.9967 - val_loss: 0.0079 - val_acc: 1.0000
Epoch 38/50
600/600 [==============================] - 0s 104us/step - loss: 0.0088 - acc: 0.9950 - val_loss: 0.0072 - val_acc: 1.0000
Epoch 39/50
600/600 [==============================] - 0s 98us/step - loss: 0.0085 - acc: 0.9967 - val_loss: 0.0069 - val_acc: 1.0000
Epoch 40/50
600/600 [==============================] - 0s 103us/step - loss: 0.0079 - acc: 0.9983 - val_loss: 0.0066 - val_acc: 1.0000
Epoch 41/50
600/600 [==============================] - 0s 103us/step - loss: 0.0075 - acc: 0.9967 - val_loss: 0.0065 - val_acc: 1.0000
Epoch 42/50
600/600 [==============================] - 0s 101us/step - loss: 0.0074 - acc: 0.9950 - val_loss: 0.0060 - val_acc: 1.0000
Epoch 43/50
600/600 [==============================] - 0s 101us/step - loss: 0.0072 - acc: 0.9967 - val_loss: 0.0057 - val_acc: 1.0000
Epoch 44/50
600/600 [==============================] - 0s 105us/step - loss: 0.0071 - acc: 0.9950 - val_loss: 0.0056 - val_acc: 1.0000
Epoch 45/50
600/600 [==============================] - 0s 105us/step - loss: 0.0065 - acc: 0.9983 - val_loss: 0.0054 - val_acc: 1.0000
Epoch 46/50
600/600 [==============================] - 0s 110us/step - loss: 0.0062 - acc: 0.9983 - val_loss: 0.0056 - val_acc: 1.0000
Epoch 47/50
600/600 [==============================] - 0s 105us/step - loss: 0.0059 - acc: 0.9983 - val_loss: 0.0051 - val_acc: 1.0000
Epoch 48/50
600/600 [==============================] - 0s 103us/step - loss: 0.0057 - acc: 0.9983 - val_loss: 0.0049 - val_acc: 1.0000
Epoch 49/50
600/600 [==============================] - 0s 101us/step - loss: 0.0056 - acc: 0.9983 - val_loss: 0.0047 - val_acc: 1.0000
Epoch 50/50
600/600 [==============================] - 0s 105us/step - loss: 0.0054 - acc: 0.9983 - val_loss: 0.0050 - val_acc: 1.0000
250/250 [==============================] - 0s 28us/step
Test results - Loss: 0.007074932985007763 - Accuracy: 99.2%
As you can see, during training validation accuracy goes to 1 or 100%. Testing the model with the testing dataset yields an accuracy of 99.2%. That’s quite good news!
And the visualized decision boundary?
Let’s now take a look at an example with nonlinear data.
Now what if we have nonlinear data? We can do the same!
We’ll have to change a few lines in our code, though. Let’s first replace the make_blobs import with make_moons:
from sklearn.datasets import make_moons
Next, also replace the call to this function under Generate data to this:
X, targets = make_moons(n_samples = num_samples_total)
What happens? Well, unlike the linearly separable data, two shapes resembling half moons are generated; they cannot be linearly separated, at least in regular feature space:
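Under the hood, make_moons draws two interleaving semicircles. A rough re-implementation of that idea — our own sketch; scikit-learn’s actual function offers more options, such as a noise parameter — makes it clear why no straight line can split the two classes:

```python
import numpy as np

def half_moons(n_samples=200, noise=0.05, seed=None):
    # Two interleaving semicircles: the upper arc of one circle and the
    # lower arc of a shifted circle, so the arcs hook into each other.
    rng = np.random.default_rng(seed)
    n = n_samples // 2
    t = np.linspace(0, np.pi, n)
    upper = np.column_stack([np.cos(t), np.sin(t)])
    lower = np.column_stack([1 - np.cos(t), 0.5 - np.sin(t)])
    X = np.vstack([upper, lower])
    X += rng.normal(scale=noise, size=X.shape)
    # Hinge-ready targets: -1 for one moon, +1 for the other.
    y = np.hstack([-np.ones(n), np.ones(n)]).astype(int)
    return X, y
```

Any line that clears one arc inevitably cuts through the other, which is what forces the model to learn a nonlinear boundary.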
Running the code with these adaptations (the full code can be retrieved next) shows that the Keras model is actually able to perform hinge-loss based nonlinear separation pretty successfully:
Epoch 50/50
600/600 [==============================] - 0s 107us/step - loss: 0.0748 - acc: 0.9233 - val_loss: 0.0714 - val_acc: 0.9400
250/250 [==============================] - 0s 26us/step
Test results - Loss: 0.07214225435256957 - Accuracy: 91.59999976158142%
And it looks as follows:
# Imports
import keras
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from mlxtend.plotting import plot_decision_regions
# Configuration options
num_samples_total = 1000
training_split = 250
# Generate data
X, targets = make_moons(n_samples = num_samples_total)
targets[np.where(targets == 0)] = -1
X_training = X[training_split:, :]
X_testing = X[:training_split, :]
Targets_training = targets[training_split:]
Targets_testing = targets[:training_split]
# Generate scatter plot for training data
plt.scatter(X_training[:,0], X_training[:,1])
plt.title('Nonlinear data')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
# Set the input shape
feature_vector_shape = len(X_training[0])
input_shape = (feature_vector_shape,)
print(f'Feature shape: {input_shape}')
# Create the model
model = Sequential()
model.add(Dense(50, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='tanh'))
# Configure the model and start training
model.compile(loss='squared_hinge', optimizer='adam', metrics=['accuracy'])
model.fit(X_training, Targets_training, epochs=50, batch_size=25, verbose=1, validation_split=0.2)
# Test the model after training
test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
# Plot decision boundary
plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2)
plt.show()
In this blog, we’ve seen how to visualize the decision boundary of your Keras model by means of Mlxtend, a Python library that extends the toolkit of today’s data scientists. We saw that we only need two lines of code to provide for a basic visualization which clearly demonstrates the presence of the decision boundary.
I hope you’ve learnt something from this blog! Please let me know in a comment if you have any questions, any remarks or when you have comments. I’ll happily improve my blog if I made mistakes or forgot crucial information – and I’m also very eager to hear what you’ve done with the information! Thanks and happy engineering!
Note that code is also available on GitHub, in my Keras Visualizations repository.
Raschka, S. (n.d.). Home – mlxtend. Retrieved from http://rasbt.github.io/mlxtend/
Raschka, S. (2018). MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. Journal of Open Source Software, 3(24), 638. doi:10.21105/joss.00638
Intuitively understanding SVM and SVR – MachineCurve. (2019, September 20). Retrieved from https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/
About loss and loss functions: Hinge loss – MachineCurve. (2019, October 4). Retrieved from https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#hinge
About loss and loss functions: Squared hinge loss – MachineCurve. (2019, October 4). Retrieved from https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#squared-hinge
ReLU, Sigmoid and Tanh: today’s most used activation functions – MachineCurve. (2019, September 4). Retrieved from https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/
He/Xavier initialization & activation functions: choose wisely – MachineCurve. (2019, September 18). Retrieved from https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/
The post How to visualize the training process in Keras? appeared first on MachineCurve.
One way of achieving that is by exporting all the loss values and accuracies manually, adding them to an Excel sheet – before generating a chart.
Like I did a while ago
It goes without saying that there are smarter ways of doing that. In today’s blog, we’ll cover how to visualize the training process in Keras – just like above, but with a little piece of extra code. This blog covers precisely what you need to generate such plots: it discusses the Keras History
object, which contains the data you’ll need, and presents the visualization code.
Let’s go!
Note that model code is also available on GitHub.
Since we’re creating some actual code, you’ll likely wish to run it on your machine. For this to work, you need to install certain software dependencies. Specifically:
Preferably, you run these in an Anaconda environment that isolates these packages from your other development environments. It saves you a lot of struggle as packages could otherwise interfere with each other.
In this blog we want to visualize the training process of a Keras model. This requires that we’ll work with an actual model. We use this simple one today:
# Load dependencies
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
# Load data
dataset = np.loadtxt('./chennai_reservoir_levels.csv', delimiter='|', skiprows=1, usecols=(1,2,3,4))
# Shuffle dataset
np.random.shuffle(dataset)
# Separate features and targets
X = dataset[:, 0:3]
Y = dataset[:, 3]
# Set the input shape
input_shape = (3,)
print(f'Feature shape: {input_shape}')
# Create the model
model = Sequential()
model.add(Dense(16, input_shape=input_shape, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='linear'))
# Configure the model and start training
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_squared_error'])
model.fit(X, Y, epochs=25, batch_size=1, verbose=1, validation_split=0.2)
Why such a simple one? Well – it’s not about the model today, so we should keep most complexity out of here. The regular reader will recognize that this is the regression MLP that we created earlier. It loads water reservoir levels for Chennai, India, and attempts to predict the level in one reservoir given the levels in the other three. It does so by means of the Keras Sequential API and densely-connected layers, with MAE as the regression loss function and MSE as an additional metric. It performs training in 25 epochs.
Let’s create a file called history_visualization.py
and paste the above code into it.
The History object
When running this model, Keras maintains a so-called History object in the background. This object keeps all loss values and other metric values in memory so that they can be used in e.g. TensorBoard, in Excel reports or indeed for our own custom visualizations.
The history object is the output of the fit
operation. Hence, it can be accessed in your Python script by slightly adapting that row in the above code to:
history = model.fit(X, Y, epochs=25, batch_size=1, verbose=1, validation_split=0.2)
In the Keras docs, we find:
The History.history attribute is a dictionary recording training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
Keras docs on model visualization
Also add print(history.history)
so that we can inspect the history before we visualize it, to get a feel for its structure.
It indeed outputs the model history (note that for simplicity we trained with only 5 epochs):
{'val_loss': [281.05517045470464, 281.0461930366744, 282.3450624835175, 283.21272195725317, 278.22250578392925], 'val_mean_squared_error': [131946.00690089026, 131610.73269158995, 132186.26299269326, 133621.92045977595, 131213.40662287443], 'loss': [319.1303724563634, 279.54961594772305, 277.2224043372698, 276.19018290098035, 276.37119589065435], 'mean_squared_error': [210561.46019607811, 132310.933269216, 131070.35584168187, 131204.38709398077, 131249.8484192732]}
Or, when nicely formatted:
{
"val_loss":[
281.05517045470464,
281.0461930366744,
282.3450624835175,
283.21272195725317,
278.22250578392925
],
"val_mean_squared_error":[
131946.00690089026,
131610.73269158995,
132186.26299269326,
133621.92045977595,
131213.40662287443
],
"loss":[
319.1303724563634,
279.54961594772305,
277.2224043372698,
276.19018290098035,
276.37119589065435
],
"mean_squared_error":[
210561.46019607811,
132310.933269216,
131070.35584168187,
131204.38709398077,
131249.8484192732
]
}
It nicely displays all the metrics that we defined: MAE (“loss” and “val_loss”, i.e. for both training and validation data) and MSE as an additional metric.
Since this is a simple Python dictionary structure, we can easily use it for visualization purposes.
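Because History.history is just a dictionary, pairing every training metric with its “val_” counterpart takes only a few lines – handy when you want to loop over all metrics instead of writing one plotting block per metric. A small sketch (the helper name is our own):

```python
def metric_pairs(history_dict):
    # Map each training metric to a (training, validation) tuple;
    # the validation entry is None if no validation split was used.
    return {
        key: (values, history_dict.get('val_' + key))
        for key, values in history_dict.items()
        if not key.startswith('val_')
    }

history_dict = {
    'loss': [319.1, 279.5], 'val_loss': [281.1, 281.0],
    'mean_squared_error': [210561.5, 132310.9],
    'val_mean_squared_error': [131946.0, 131610.7],
}
for name, (train, val) in metric_pairs(history_dict).items():
    print(name, train, val)
```

Each (training, validation) pair can then be fed to the same Matplotlib plotting routine, such as the ones we add next.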
Let’s now add an extra import – for Matplotlib, our visualization library:
import matplotlib.pyplot as plt
Next, ensure that the number of epochs is at 25 again.
Let’s now add a piece of code that visualizes the MAE:
# Plot history: MAE
plt.plot(history.history['loss'], label='MAE (training data)')
plt.plot(history.history['val_loss'], label='MAE (validation data)')
plt.title('MAE for Chennai Reservoir Levels')
plt.ylabel('MAE value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
Note that since you defined MAE to be the official loss value (loss='mean_absolute_error'
), you’ll have to use loss
and val_loss
in the History object. Above, we additionally add labels, a title and a legend which eventually arrives at this:
Similarly, we can add a visualization of our MSE value – but here, we’ll have to use mean_squared_error
and val_mean_squared_error
instead, because they are an additional metric (metrics=['mean_squared_error']
).
# Plot history: MSE
plt.plot(history.history['mean_squared_error'], label='MSE (training data)')
plt.plot(history.history['val_mean_squared_error'], label='MSE (validation data)')
plt.title('MSE for Chennai Reservoir Levels')
plt.ylabel('MSE value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
This is the output for our training process:
What can we observe from the training process?
As you can see, visualizing the training process of your Keras model can help you understand how the model performs. While you can do this manually with e.g. Excel, we’ve seen in this blog that you can also use built-in Keras utils (namely, the History
object) to generate an overview of your training process. With Matplotlib, this history can subsequently be visualized.
I hope you’ve learnt something today – if so, please let me know in the comments; I’d appreciate your remarks! Feel free to leave a comment as well if you have any questions or when you think this blog can be improved. I’ll happily edit the text. Happy engineering!
Note that model code is also available on GitHub.
Keras. (n.d.). Visualization. Retrieved from https://keras.io/visualization/#model-visualization
Creating an MLP for regression with Keras – MachineCurve. (2019, July 30). Retrieved from https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/
How to visualize a model with Keras? – MachineCurve. (2019, October 7). Retrieved from https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/
TensorBoard: Visualizing Learning. (n.d.). Retrieved from https://www.tensorflow.org/tensorboard/r1/summaries
The post How to visualize a model with Keras? appeared first on MachineCurve.
…but creating such models is often a hassle when you have to do it manually. Solutions like www.draw.io are used quite often in those cases, because they are (relatively) quick and dirty, allowing you to create models fast.
However, there’s a better solution: the built-in plot_model
facility within Keras. It allows you to create a visualization of your model architecture. In this blog, I’ll show you how to create such a visualization. Specifically, I focus on the model itself, discussing its architecture so that you fully understand what happens. Subsequently, I’ll list some software dependencies that you’ll need – including a highlight about a bug in Keras that results in a weird error related to pydot
and GraphViz, which are used for visualization. Finally, I present you the code used for visualization and the end result.
Note that model code is also available on GitHub.
To show you how to visualize a Keras model, I think it’s best if we discussed one first.
Today, we will visualize the Convolutional Neural Network that we created earlier to demonstrate the benefits of using CNNs over densely-connected ones.
This is the code of that model:
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
# Model configuration
img_width, img_height = 28, 28
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1
# Load MNIST dataset
(input_train, target_train), (input_test, target_test) = mnist.load_data()
# Reshape data based on channels first / channels last strategy.
# This is dependent on whether you use TF, Theano or CNTK as backend.
# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
if K.image_data_format() == 'channels_first':
input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height)
input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height)
input_shape = (1, img_width, img_height)
else:
input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
input_shape = (img_width, img_height, 1)
# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')
# Scale the grayscale pixel values into the [0, 1] range.
input_train = input_train / 255
input_test = input_test / 255
# Convert target vectors to categorical targets
target_train = keras.utils.to_categorical(target_train, no_classes)
target_test = keras.utils.to_categorical(target_test, no_classes)
# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))
# Compile the model
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adam(),
metrics=['accuracy'])
# Fit data to model
model.fit(input_train, target_train,
batch_size=batch_size,
epochs=no_epochs,
verbose=verbosity,
validation_split=validation_split)
# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
What does it do?
I’d suggest that you read the post if you wish to understand it very deeply, but I’ll briefly cover it here.
It simply classifies the MNIST dataset. This dataset contains 28 x 28 pixel images of digits, or numbers between 0 and 9, and our CNN classifies them with a staggering 99% accuracy. It does so by combining two convolutional blocks (which consist of a two-dimensional convolutional layer, two-dimensional max pooling and dropout) with densely-connected layers. It’s the best of both worlds in terms of interpreting the image and generating final predictions.
But how to visualize this model’s architecture? Let’s find out.
The plot_model util
Utilities. I love them, because they make my life easier. They’re often relatively simple functions that can be called upon to perform some relatively simple actions. Don’t be fooled, however, because these actions often benefit one’s efficiency greatly – in this case, by not having to visualize a model architecture yourself in tools like draw.io.
I’m talking about the plot_model util, which comes delivered with Keras.
It allows you to create a visualization of your Keras neural network.
More specifically, the Keras docs define it as follows:
from keras.utils import plot_model
plot_model(model, to_file='model.png')
From the Keras utilities, one needs to import the function, after which it can be used with very minimal parameters: the model instance itself, and the to_file
parameter, which essentially specifies a location on disk where the model visualization is stored.
If you wish, you can supply some additional parameters as well: show_shapes (False
by default) controls whether the shapes of the layer outputs are shown in the graph, which is beneficial if besides the architecture you also need to understand how the model transforms data; show_layer_names (True
by default) determines whether the names of the layers are displayed; and expand_nested (False
by default) controls how nested models are displayed.
However, for a simple visualization, you likely don’t need them. Let’s now take a look at what we would need if we were to create such a visualization.
If you wish to run the code presented in this blog successfully, you'll need to install certain software dependencies first.

Preferably, you'll run this from an Anaconda environment, which allows you to run these packages in an isolated fashion. Note that many people report that a pip-based installation of Graphviz doesn't work; rather, you'll have to install it separately into your host OS from their website. Bummer!
pydot failed to call GraphViz

When trying to visualize my Keras neural network with plot_model, I ran into this error:

OSError: `pydot` failed to call GraphViz. Please install GraphViz (https://www.graphviz.org/) and ensure that its executables are in the $PATH.
…which essentially made sense at first, because I didn’t have Graphviz installed.
…but which didn’t after I installed it, because the error kept reappearing, even after restarting the Anaconda terminal.
Fortunately, the internet comes to the rescue in those cases:
Or you can install pydot 1.2.3 by pip. – XifengGuo

pip install pydot==1.2.3
Although downgrading packages is not likely to be the best long-term solution, it did certainly work in this case. The error was resolved and I could generate model visualizations. Let’s therefore now take a look at the visualization code.
When adapting the code from my original CNN, scrapping away the elements I don’t need for visualizing the model architecture, I end up with this:
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.utils.vis_utils import plot_model
from keras import backend as K

# Model configuration
img_width, img_height = 28, 28
no_classes = 10
# Load MNIST dataset
(input_train, target_train), (input_test, target_test) = mnist.load_data()
# Reshape data based on channels first / channels last strategy.
# This is dependent on whether you use TF, Theano or CNTK as backend.
# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
if K.image_data_format() == 'channels_first':
input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height)
input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height)
input_shape = (1, img_width, img_height)
else:
input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
input_shape = (img_width, img_height, 1)
# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))
plot_model(model, to_file='model.png')
You’ll first perform the imports that you still need in order to successfully run the Python code. Specifically, you’ll import the Keras library, the Sequential API and certain layers – this is obviously dependent on what you want. Do you want to use the Functional API? That’s perfectly fine. Other layers? Fine too. I just used them since the CNN is exemplary.
Note that I also imported plot_model
with from keras.utils.vis_utils import plot_model
.
Subsequently, I kept the mnist
-specific reshaping based on the channels first / channels last approach of the framework. Although this might not be necessary in your model, I had to keep it in because the first Conv2D
layer’s input shape is dependent on input_shape
, which itself is generated by the reshaping process. For the sake of simplicity, I thus kept it MNIST-specific. If you don’t use MNIST, however, you can just keep this out, because it’s the architecture that actually matters.
Speaking about architecture: that’s what I finally kept in. Based on the Keras Sequential API, I apply the two convolutional blocks as discussed previously, before flattening their output and feeding it to the densely-connected layers generating the final prediction.
However, in this case, no such prediction is generated. Rather, the model
instance is used by plot_model
to generate a model visualization stored at disk as model.png
. Likely, you’ll add hyperparameter tuning and data fitting later on – but hey, that’s not the purpose of this blog.
And your final end result looks like this:
In this blog, you’ve seen how to create a Keras model visualization based on the plot_model
util provided by the library. I hope you found it useful – let me know in the comments section, I’d appreciate it! If not, let me know as well, so I can improve. For now: happy engineering!
Note that model code is also available on GitHub.
How to create a CNN classifier with Keras? – MachineCurve. (2019, September 24). Retrieved from https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/
Keras. (n.d.). Visualization. Retrieved from https://keras.io/visualization/
Avoid wasting resources with EarlyStopping and ModelCheckpoint in Keras – MachineCurve. (2019, June 3). Retrieved from https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/
pydot issue · Issue #7 · XifengGuo/CapsNet-Keras. (n.d.). Retrieved from https://github.com/XifengGuo/CapsNet-Keras/issues/7#issuecomment-536100376
The post How to visualize a model with Keras? appeared first on MachineCurve.
The post How to use sparse categorical crossentropy in Keras? appeared first on MachineCurve.

However, traditional categorical crossentropy requires that your data is one-hot encoded and hence converted into categorical format. Often, this is not what your dataset looks like when you'll start creating your models. Rather, you likely have feature vectors with integer targets – such as 0 to 9 for the numbers 0 to 9.
This means that you’ll have to convert these targets first. In Keras, this can be done with to_categorical
, which essentially applies one-hot encoding to your training set’s targets. When applied, you can start using categorical crossentropy.
But did you know that there exists another type of loss – sparse categorical crossentropy – with which you can leave the integers as they are, yet benefit from crossentropy loss? I didn’t when I just started with Keras, simply because pretty much every article I read performs one-hot encoding before applying regular categorical crossentropy loss.
In this blog, we’ll figure out how to build a convolutional neural network with sparse categorical crossentropy loss.
We’ll create an actual CNN with Keras. It’ll be a simple one – an extension of a CNN that we created before, with the MNIST dataset. However, doing that allows us to compare the model in terms of its performance – to actually see whether sparse categorical crossentropy does as good a job as the regular one.
Let’s go!
Note that model code is also available on GitHub.
Have you also seen lines of code like these in your Keras projects?
target_train = keras.utils.to_categorical(target_train, no_classes)
target_test = keras.utils.to_categorical(target_test, no_classes)
Most likely, you have – because many blogs explaining how to create multiclass classifiers with Keras apply categorical crossentropy, which requires you to one-hot encode your target vectors.
Now you may wonder: what is one-hot encoding?
Suppose that you have a classification problem where you have four target classes: { 0, 1, 2, 3 }.
Your dataset likely comes in this flavor: { feature vector } -> target
, where your target is an integer value from { 0, 1, 2, 3 }.
However, as we saw in another blog on categorical crossentropy, its mathematical structure doesn’t allow us to feed it integers directly.
We’ll have to convert it into categorical format first – with one-hot encoding, or to_categorical
in Keras.
You’ll effectively transform your targets into this:
Note that when you have more classes, the trick goes on and on – you simply create n-dimensional vectors, where n equals the unique number of classes in your dataset.
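In code, the transformation can be sketched like this – a hand-rolled, plain-Python version mirroring what to_categorical does, using the four-class example from above:

```python
# One-hot encode integer targets from { 0, 1, 2, 3 }:
# each target becomes a vector with a 1 at its class index.
def one_hot(targets, num_classes):
    encoded = []
    for t in targets:
        vec = [0] * num_classes
        vec[t] = 1
        encoded.append(vec)
    return encoded

print(one_hot([0, 2, 3], 4))
# [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```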
When converted into categorical data, you can apply categorical crossentropy:
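In formula form (reconstructed here, with C the number of classes, y the one-hot target vector and ŷ the prediction):

```latex
L = - \sum_{c=1}^{C} y_c \cdot \log(\hat{y}_c)
```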
Don’t worry – it’s a human pitfall to always think defensively when we see maths.
It’s not so difficult at all, to be frank, so make sure to read on!
What you see is obviously the categorical crossentropy formula. What it does is actually really simple: it iterates over all the possible classes C
predicted by the machine learning model during the forward pass of your training process.
For each class, it takes a look at the target observation of the class – i.e., whether the actual class matching the prediction in your training set is 0 or 1. Additionally, it computes the (natural) logarithm of the prediction for the observation (the predicted probability that it belongs to that class). From this, it follows that only one such value is relevant – the actual target. For this, it simply computes the natural log value, which increases significantly when the prediction is further away from 1:
Now, it could be the case that your dataset is not categorical at first … and possibly, that it is too large in order to use to_categorical
. In that case, it would be rather difficult to use categorical crossentropy, since it is dependent on categorical data.
However, when you have integer targets instead of categorical vectors as targets, you can use sparse categorical crossentropy. It’s an integer-based version of the categorical crossentropy loss function, which means that we don’t have to convert the targets into categorical format anymore.
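A quick sketch of the difference in plain Python (hand-rolled for illustration, not the actual Keras implementation):

```python
import math

# Categorical crossentropy needs a one-hot target vector...
def categorical_crossentropy(one_hot_target, predictions):
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, predictions))

# ...while the sparse variant takes the integer class index directly,
# so no conversion of the targets is needed.
def sparse_categorical_crossentropy(target_index, predictions):
    return -math.log(predictions[target_index])

predictions = [0.1, 0.7, 0.2]  # softmax output over three classes
print(categorical_crossentropy([0, 1, 0], predictions))
print(sparse_categorical_crossentropy(1, predictions))  # same value
```

Both calls produce the same loss; only the format of the target differs.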
Let’s now create a CNN with Keras that uses sparse categorical crossentropy. In some folder, create a file called model.py
and open it in some code editor.
As usual, like in our previous blog on creating a (regular) CNN with Keras, we use the MNIST dataset. This dataset, which contains thousands of 28×28 pixel handwritten digits (individual numbers from 0-9), is one of the standard datasets in machine learning training programs because it’s a very easy and normalized one. The images are also relatively small and high in quantity, which benefits the predictive and generalization power of your model when trained properly. This way, one can really focus on the machine learning aspects of an exercise, rather than the data related issues.
Let’s go!
If we wish to run the sparse categorical crossentropy Keras CNN, it’s necessary to install a few software tools:
Preferably, you run your model in an Anaconda environment. This way, you will be able to install your packages in a unique environment with which other packages do not interfere. Mingling Python packages is often a tedious job, which often leads to trouble. Anaconda resolves this by allowing you to use environments or isolated sandboxes in which your code can run. Really recommended!
This will be our model for today:
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
# Model configuration
img_width, img_height = 28, 28
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1
# Load MNIST dataset
(input_train, target_train), (input_test, target_test) = mnist.load_data()
# Reshape data based on channels first / channels last strategy.
# This is dependent on whether you use TF, Theano or CNTK as backend.
# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
if K.image_data_format() == 'channels_first':
input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height)
input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height)
input_shape = (1, img_width, img_height)
else:
input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
input_shape = (img_width, img_height, 1)
# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')
# Normalize the data into the [0, 1] range.
input_train = input_train / 255
input_test = input_test / 255
# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))
# Compile the model
model.compile(loss=keras.losses.sparse_categorical_crossentropy,
optimizer=keras.optimizers.Adam(),
metrics=['accuracy'])
# Fit data to model
model.fit(input_train, target_train,
batch_size=batch_size,
epochs=no_epochs,
verbose=verbosity,
validation_split=validation_split)
# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
Let’s break creating the model apart.
First, we add our imports – packages and functions that we’ll need for our model to work as intended.
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
More specifically, we…
Next up, model configuration:
# Model configuration
img_width, img_height = 28, 28
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1
We specify image width and image height, which are 28 for both given the images in the MNIST dataset. We specify a batch size of 250, which means that 250 images at once will be processed during training. When all images are processed, one completes an epoch, of which we will have 25 in total during the training of our model. Additionally, we specify the number of classes in advance – 10, the numbers 0 to 9. 20% of our training set will be set apart for validating the model after every epoch, and for educational purposes we set model verbosity to True (1) – which means that all possible output is actually displayed on screen.
Next, we load and prepare the MNIST data:
# Load MNIST dataset
(input_train, target_train), (input_test, target_test) = mnist.load_data()
# Reshape data based on channels first / channels last strategy.
# This is dependent on whether you use TF, Theano or CNTK as backend.
# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
if K.image_data_format() == 'channels_first':
input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height)
input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height)
input_shape = (1, img_width, img_height)
else:
input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
input_shape = (img_width, img_height, 1)
What we do is simple – we use mnist.load_data()
to load the MNIST data into four Python variables, representing inputs and targets for both the training and testing datasets.
Additionally, we reshape based on the image processing approach of the framework that underlies our model on your system – either TensorFlow, Theano or CNTK. Some put the channel (remember, RGB images have 3 channels) first, i.e. specify it as the first dimension. Others put it last. Depending on what your framework does, we need to reshape the data so that it can be processed correctly.
Additionally, we perform some other preparations which concern the data instead of how it is handled by your system:
# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')
# Normalize data
input_train = input_train / 255
input_test = input_test / 255
We first parse the numbers as floats. This benefits the optimization step of the training process.
Additionally, we normalize the data, which benefits the training process as well.
We then create the architecture of our model:
# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))
To be frank: the architecture of our model doesn’t really matter for showing that sparse categorical crossentropy really works. In fact, you can use the architecture you think is best for your machine learning problem. However, we put up the architecture above because it is very generic and hence works well in many simple classification scenarios:
The final Dense layer has no_classes output neurons, which in the case of the MNIST dataset is 10: each neuron generates the probability (summing to one considering all neurons together) that the input belongs to one of the 10 classes in the MNIST scenario.

We next compile the model, which involves configuring it by means of hyperparameter tuning:
# Compile the model
model.compile(loss=keras.losses.sparse_categorical_crossentropy,
optimizer=keras.optimizers.Adam(),
metrics=['accuracy'])
We specify the loss function used – sparse categorical crossentropy! We use it together with the Adam optimizer, which is one of the standard ones used today in very generic scenarios, and use accuracy as an additional metric, since it is more intuitive to humans.
Next, we fit the data following the specification created in the model configuration step and specify evaluation metrics that test the trained model with the testing data:
# Fit data to model
model.fit(input_train, target_train,
batch_size=batch_size,
epochs=no_epochs,
verbose=verbosity,
validation_split=validation_split)
# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
Now, we can start the training process. Open a command prompt – possibly the Anaconda one, activating your environment by means of conda activate <env_name> – and navigate to the folder storing model.py by means of the cd command.
Next, start the training process with Python: python model.py.
You should then see something like this:
Epoch 1/25
48000/48000 [==============================] - 21s 431us/step - loss: 0.3725 - acc: 0.8881 - val_loss: 0.0941 - val_acc: 0.9732
Epoch 2/25
48000/48000 [==============================] - 6s 124us/step - loss: 0.0974 - acc: 0.9698 - val_loss: 0.0609 - val_acc: 0.9821
Epoch 3/25
48000/48000 [==============================] - 6s 122us/step - loss: 0.0702 - acc: 0.9779 - val_loss: 0.0569 - val_acc: 0.9832
Epoch 4/25
48000/48000 [==============================] - 6s 124us/step - loss: 0.0548 - acc: 0.9832 - val_loss: 0.0405 - val_acc: 0.9877
Epoch 5/25
48000/48000 [==============================] - 6s 122us/step - loss: 0.0450 - acc: 0.9861 - val_loss: 0.0384 - val_acc: 0.9873
Epoch 6/25
48000/48000 [==============================] - 6s 122us/step - loss: 0.0384 - acc: 0.9877 - val_loss: 0.0366 - val_acc: 0.9886
Epoch 7/25
48000/48000 [==============================] - 5s 100us/step - loss: 0.0342 - acc: 0.9892 - val_loss: 0.0321 - val_acc: 0.9907
Epoch 8/25
48000/48000 [==============================] - 5s 94us/step - loss: 0.0301 - acc: 0.9899 - val_loss: 0.0323 - val_acc: 0.9898
Epoch 9/25
48000/48000 [==============================] - 4s 76us/step - loss: 0.0257 - acc: 0.9916 - val_loss: 0.0317 - val_acc: 0.9907
Epoch 10/25
48000/48000 [==============================] - 4s 76us/step - loss: 0.0238 - acc: 0.9922 - val_loss: 0.0318 - val_acc: 0.9910
Epoch 11/25
48000/48000 [==============================] - 4s 82us/step - loss: 0.0214 - acc: 0.9928 - val_loss: 0.0324 - val_acc: 0.9905
Epoch 12/25
48000/48000 [==============================] - 4s 85us/step - loss: 0.0201 - acc: 0.9934 - val_loss: 0.0296 - val_acc: 0.9907
Epoch 13/25
48000/48000 [==============================] - 4s 88us/step - loss: 0.0173 - acc: 0.9940 - val_loss: 0.0302 - val_acc: 0.9914
Epoch 14/25
48000/48000 [==============================] - 4s 79us/step - loss: 0.0157 - acc: 0.9948 - val_loss: 0.0306 - val_acc: 0.9912
Epoch 15/25
48000/48000 [==============================] - 4s 85us/step - loss: 0.0154 - acc: 0.9949 - val_loss: 0.0308 - val_acc: 0.9910
Epoch 16/25
48000/48000 [==============================] - 4s 84us/step - loss: 0.0146 - acc: 0.9950 - val_loss: 0.0278 - val_acc: 0.9918
Epoch 17/25
48000/48000 [==============================] - 4s 84us/step - loss: 0.0134 - acc: 0.9954 - val_loss: 0.0302 - val_acc: 0.9911
Epoch 18/25
48000/48000 [==============================] - 4s 79us/step - loss: 0.0129 - acc: 0.9956 - val_loss: 0.0280 - val_acc: 0.9922
Epoch 19/25
48000/48000 [==============================] - 4s 80us/step - loss: 0.0096 - acc: 0.9968 - val_loss: 0.0358 - val_acc: 0.9908
Epoch 20/25
48000/48000 [==============================] - 4s 79us/step - loss: 0.0114 - acc: 0.9960 - val_loss: 0.0310 - val_acc: 0.9899
Epoch 21/25
48000/48000 [==============================] - 4s 86us/step - loss: 0.0086 - acc: 0.9970 - val_loss: 0.0300 - val_acc: 0.9922
Epoch 22/25
48000/48000 [==============================] - 4s 88us/step - loss: 0.0088 - acc: 0.9970 - val_loss: 0.0320 - val_acc: 0.9915
Epoch 23/25
48000/48000 [==============================] - 4s 87us/step - loss: 0.0080 - acc: 0.9971 - val_loss: 0.0320 - val_acc: 0.9919
Epoch 24/25
48000/48000 [==============================] - 4s 87us/step - loss: 0.0083 - acc: 0.9969 - val_loss: 0.0416 - val_acc: 0.9887
Epoch 25/25
48000/48000 [==============================] - 4s 86us/step - loss: 0.0083 - acc: 0.9969 - val_loss: 0.0334 - val_acc: 0.9917
Test loss: 0.02523074444185986 / Test accuracy: 0.9932
25 epochs as configured, with impressive scores in both the validation and testing phases. It pretty much works as well as the classifier created with categorical crossentropy – and I actually think the difference can be attributed to the relative randomness of the model optimization process:
Epoch 25/25
48000/48000 [==============================] - 4s 85us/step - loss: 0.0072 - acc: 0.9975 - val_loss: 0.0319 - val_acc: 0.9925
Test loss: 0.02579820747410522 / Test accuracy: 0.9926
Well, today, we’ve seen how to create a Convolutional Neural Network (and by consequence, any model) with sparse categorical crossentropy in Keras. If you have integer targets in your dataset, which happens in many cases, you usually perform to_categorical
in order to use multiclass crossentropy loss. With sparse categorical crossentropy, this is no longer necessary. This blog demonstrated this by means of an example Keras implementation of a CNN that classifies the MNIST dataset.
Model code is also available on GitHub, if it benefits you.
I hope this blog helped you – if it did, or if you have any questions, let me know in the comments section! If not, let me know as well, so I can improve. I'm happy to answer any questions you may have. Thanks and enjoy coding!
Chollet, F. (2017). Deep Learning with Python. New York, NY: Manning Publications.
Keras. (n.d.). Losses. Retrieved from https://keras.io/losses/
How to create a CNN classifier with Keras? – MachineCurve. (2019, September 24). Retrieved from https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras
About loss and loss functions – MachineCurve. (2019, October 4). Retrieved from https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/
The post How to use sparse categorical crossentropy in Keras? appeared first on MachineCurve.
The post About loss and loss functions appeared first on MachineCurve.

The term cost function is also used equivalently.
But what is loss? And what is a loss function?
I’ll answer these two questions in this blog, which focuses on this optimization aspect of machine learning. We’ll first cover the high-level supervised learning process, to set the stage. This includes the role of training, validation and testing data when training supervised models.
Once we’re up to speed with those, we’ll introduce loss. We answer the question what is loss? However, we don’t forget what is a loss function? We’ll even look into some commonly used loss functions.
Let’s go!
Before we can actually introduce the concept of loss, we’ll have to take a look at the high-level supervised machine learning process. All supervised training approaches fall under this process, which means that it is equal for deep neural networks such as MLPs or ConvNets, but also for SVMs.
Let’s take a look at this training process, which is cyclical in nature.
We start with our features and targets, which are also called your dataset. This dataset is split into three parts before the training process starts: training data, validation data and testing data. The training data is used during the training process; more specificially, to generate predictions during the forward pass. However, after each training cycle, the predictive performance of the model must be tested. This is what the validation data is used for – it helps during model optimization.
Then there is testing data left. Assume that the validation data, which is essentially a statistical sample, does not fully match the population it describes in statistical terms. That is, the sample does not represent it fully and by consequence the mean and variance of the sample are (hopefully) slightly different than the actual population mean and variance. Hence, a little bias is introduced into the model every time you’ll optimize it with your validation data. While it may thus still work very well in terms of predictive power, it may be the case that it will lose its power to generalize. In that case, it would no longer work for data it has never seen before, e.g. data from a different sample. The testing data is used to test the model once the entire training process has finished (i.e., only after the last cycle), and allows us to tell something about the generalization power of our machine learning model.
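A minimal sketch of such a three-way split in plain Python – the 80/10/10 proportions here are an assumption for illustration, not a prescription:

```python
# Split a dataset into training, validation and testing parts.
# Training data feeds the forward pass, validation data guides
# optimization after each cycle, testing data is held out until the end.
def split_dataset(samples, train_frac=0.8, val_frac=0.1):
    n = len(samples)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```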
The training data is fed into the machine learning model in what is called the forward pass. The origin of this name is really easy: the data is simply fed to the network, which means that it passes through it in a forward fashion. The end result is a set of predictions, one per sample. This means that when my training set consists of 1000 feature vectors (or rows with features) that are accompanied by 1000 targets, I will have 1000 predictions after my forward pass.
You do however want to know how well the model performs with respect to the targets originally set. A well-performing model would be interesting for production usage, whereas an ill-performing model must be optimized before it can be actually used.
This is where the concept of loss enters the equation.
Most generally speaking, the loss allows us to compare between some actual targets and predicted targets. It does so by imposing a “cost” (or, using a different term, a “loss”) on each prediction if it deviates from the actual targets.
It’s relatively easy to compute the loss conceptually: we agree on some cost for our machine learning predictions, compare the 1000 targets with the 1000 predictions and compute the 1000 costs, then add everything together and present the global loss.
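Conceptually, that computation looks like this – with a toy cost of 1 per deviating prediction and five samples instead of 1000, purely for illustration:

```python
# Impose a "cost" on each prediction that deviates from the actual
# target, then add everything together into a global loss.
def global_loss(targets, predictions):
    return sum(1 for t, p in zip(targets, predictions) if t != p)

targets = [0, 1, 1, 2, 0]
predictions = [0, 1, 2, 2, 1]
print(global_loss(targets, predictions))  # two predictions deviate -> loss 2
```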
Our goal when training a machine learning model?
To minimize the loss.
The reason why is simple: the lower the loss, the more the set of targets and the set of predictions resemble each other.
And the more they resemble each other, the better the machine learning model performs.
As you can see in the machine learning process depicted above, arrows are flowing backwards towards the machine learning model. Their goal: to optimize the internals of your model only slightly, so that it will perform better during the next cycle (or iteration, or epoch, as they are also called).
When loss is computed, the model must be improved. This is done by propagating the error backwards to the model structure, such as the model's weights. This closes the learning cycle between feeding data forward, generating predictions, and improving the model – by adapting the weights, the model likely improves (sometimes much, sometimes slightly) and hence learning takes place.
Depending on the model type used, there are many ways for optimizing the model, i.e. propagating the error backwards. In neural networks, often, a combination of gradient descent based methods and backpropagation is used: gradient descent like optimizers for computing the gradient or the direction in which to optimize, backpropagation for the actual error propagation.
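As a minimal illustration of the gradient descent part – a single weight, a squared-error loss and hypothetical numbers, nothing more:

```python
# One gradient descent step on a single weight w for the loss
# L(w) = (w*x - y)^2, whose gradient is dL/dw = 2*(w*x - y)*x.
def gradient_descent_step(w, x, y, learning_rate):
    gradient = 2 * (w * x - y) * x
    return w - learning_rate * gradient

w = 0.0
for _ in range(100):
    w = gradient_descent_step(w, x=2.0, y=4.0, learning_rate=0.01)

print(round(w, 4))  # converges towards w = 2.0, where w*x matches y
```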
In other model types, such as Support Vector Machines, we do not actually propagate the error backward, strictly speaking. However, we use methods such as quadratic optimization to find the mathematical optimum, which given linear separability of your data (whether in regular space or kernel space) must exist. However, visualizing it as “adapting the weights by computing some error” benefits understanding. Next up – the loss functions we can actually use for computing the error!
Here, we’ll cover a wide array of loss functions: some of them for regression, others for classification.
There are two main types of supervised learning problems: classification and regression. In the first, your aim is to classify a sample into the correct bucket, e.g. into one of the buckets ‘diabetes’ or ‘no diabetes’. In the latter case, however, you don’t classify but rather estimate some real valued number. What you’re trying to do is regress a mathematical function from some input data, and hence it’s called regression. For regression problems, there are many loss functions available.
Mean Absolute Error (MAE) is one of them. This is what it looks like:
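Reconstructed in LaTeX, with E_i the error for sample i and n the number of samples, matching the intuitive walkthrough that follows:

```latex
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |E_i|
```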
Don’t worry about the maths, we’ll introduce the MAE intuitively now.
That weird E-like sign you see in the formula is what is called a Sigma sign, and it sums up what’s behind it: |Ei
|, in our case, where Ei
is the error (the difference between prediction and actual value) and the | signs mean that you’re taking the absolute value, or convert -3 into 3 and 3 remains 3.
The summation, in this case, means that we sum all the errors, for all the n
samples that were used for training the model. We therefore, after doing so, end up with a very large number. We divide this number by n
, or the number of samples used, to find the mean, or the average Absolute Error: the Mean Absolute Error or MAE.
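The formula can be sketched in a few lines of plain Python (toy numbers, purely for illustration):

```python
# Sum the absolute errors |E_i| over all n samples, then divide by n.
def mean_absolute_error(targets, predictions):
    n = len(targets)
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / n

# Three samples with errors -0.5, 0.0 and 1.0 -> MAE = (0.5 + 0 + 1) / 3 = 0.5
print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```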
It’s very well possible to use the MAE in a multitude of regression scenarios (Rich, n.d.). However, if your average error is very small, it may be better to use the Mean Squared Error that we will introduce next.
What’s more, and this is important: when you use the MAE in optimizations that use gradient descent, you’ll face the fact that the gradients are continuously large (Grover, 2019). Since this also occurs when the loss is low (and hence, you would only need to move a tiny bit), this is bad for learning – it’s easy to overshoot the minimum continuously, finding a suboptimal model. Consider Huber loss (more below) if you face this problem. If you face larger errors and don’t care (yet?) about this issue with gradients, or if you’re here to learn, let’s move on to Mean Squared Error!
Another loss function used often in regression is Mean Squared Error (MSE). It sounds really difficult, especially when you look at the formula (Binieli, 2018):

\(MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - Y'_i)^2\)
… but fear not. It’s actually really easy to understand what MSE is and what it does!
We’ll break the formula above into three parts, which allows us to understand each element and subsequently how they work together to produce the MSE.
The primary part of the MSE is the middle part, being the Sigma symbol or the summation sign. What it does is really simple: it counts from \(i\) to \(n\), and on every count executes what's written behind it. In this case, that's the third part – the square of the difference, \((Y_i - Y'_i)^2\).
In our case, \(i\) starts at 1 and \(n\) is not yet defined. Rather, \(n\) is the number of samples in our training set and hence the number of predictions that have been made. In the scenario sketched above, \(n\) would be 1000.
Then, the third part. It's actually mathematical notation for what we already learnt intuitively earlier: it's the difference between the actual target for the sample (\(Y_i\)) and the predicted target (\(Y'_i\)), the latter of which is subtracted from the first.
With one minor difference: the end result of this computation is squared. This property introduces some mathematical benefits during optimization (Rich, n.d.). Particularly, the MSE is continuously differentiable whereas the MAE is not (at x = 0). This means that optimizing the MSE is easier than optimizing the MAE.
Additionally, large errors introduce a much larger cost than smaller errors (because the differences are squared and larger errors produce much larger squares than smaller errors). This is both good and bad at the same time (Rich, n.d.). This is a good property when your errors are small, because optimization is then advanced (Quora, n.d.). However, using MSE rather than e.g. MAE will open your ML model up to outliers, which will severely disturb training (by means of introducing large errors).
Although the conclusion may be rather unsatisfactory, choosing between MAE and MSE is thus often heavily dependent on the dataset you’re using, introducing the need for some a priori inspection before starting your training process.
Finally, when we have the sum of the squared errors, we divide it by n – producing the mean squared error.
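As a sketch in plain Python (again with illustrative names), the three parts – difference, square, and mean – come together like this:

```python
def mean_squared_error(y_true, y_pred):
    """Square each error (Yi - Y'i), sum the squares, divide by n."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# A single large error dominates the MSE much more than it would the MAE:
print(mean_squared_error([10, 20, 30], [12, 18, 33]))  # (4 + 4 + 9) / 3
print(mean_squared_error([10, 20, 30], [12, 18, 60]))  # one outlier explodes the loss
```

The second call illustrates the outlier sensitivity discussed above: a single error of 30 contributes 900 to the sum of squares.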
The Mean Absolute Percentage Error, or MAPE, really looks like the MAE, even though the formula looks somewhat different:

\(MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{Y_i - Y'_i}{Y_i} \right|\)
When using the MAPE, we don't compute the absolute error, but rather, the mean error percentage with respect to the actual values. That is, suppose that my prediction is 12 while the actual target is 10: the absolute error ratio for this prediction is \(|(10 - 12) / 10| = 0.2\).
Similar to the MAE, we sum the error over all the samples, but subsequently face a different computation: \(100\% / n\). This looks difficult, but we can once again separate this computation into more easily understandable parts. More specifically, we can write it as a multiplication of \(100\%\) and \(1 / n\) instead. When multiplying the latter with the sum, you'll find the same result as dividing it by \(n\), which we did with the MAE. That's great.
The only thing left now is multiplying the whole with 100%. Why do we do that? Simple: because our computed error is a ratio and not a percentage. Like the example above, in which our error was 0.2, we don't want to find the ratio, but the percentage instead. \(0.2 \times 100\%\) is … unsurprisingly … \(20\%\)! Hence, we multiply the mean error ratio by 100% to find the MAPE!
Why use MAPE if you can also use MAE?
Very good question.
Firstly, it is a very intuitive value. Contrary to the absolute error, we have a sense of how well-performing the model is or how bad it performs when we can express the error in terms of a percentage. An error of 100 may seem large, but if the actual target is 1000000 while the estimate is 1000100, well, you get the point.
Secondly, it allows us to compare the performance of regression models on different datasets (Watson, 2019). Suppose that our goal is to train a regression model on the NASDAQ ETF and the Dutch AEX ETF. Since their absolute values are quite different, using MAE won’t help us much in comparing the performance of our model. MAPE, on the other hand, demonstrates the error in terms of a percentage – and a percentage is a percentage, whether you apply it to NASDAQ or to AEX. This way, it’s possible to compare model performance across statistically varying datasets.
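The two steps – the per-sample error ratio and the multiplication with 100% – can be sketched as follows (names are illustrative):

```python
def mean_absolute_percentage_error(y_true, y_pred):
    """Mean of |(Yi - Y'i) / Yi| over all samples, expressed as a percentage."""
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)

# Prediction 12 for target 10: ratio 0.2, i.e. an error of 20%.
print(mean_absolute_percentage_error([10], [12]))
```

Note that this computation breaks down when an actual target is 0, since we divide by \(Y_i\) – something to keep in mind when choosing MAPE.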
Remember the MSE?
There’s also something called the RMSE, or the Root Mean Squared Error or Root Mean Squared Deviation (RMSD). It goes like this:
Simple, isn't it? It's just the square root of the MSE.
How does this help us?
The errors of the MSE are squared – hey, what's in a name.
The RMSE or RMSD takes the square root of these squared errors – and is hence back at the scale of the original targets (Drakos, 2018). This gives you much better intuition for the error in terms of the targets.
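That "back at the scale of the targets" property is easy to see in a sketch (illustrative names):

```python
import math

def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE: the result lives on the scale of the targets."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# A constant error of 3 gives an RMSE of exactly 3 - back on the target scale,
# whereas the MSE for the same data would be 9.
print(root_mean_squared_error([10, 20], [13, 23]))
```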
“Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.” (Grover, 2019).
Well, how’s that for a starter.
This is the mathematical formula:

\(L = \sum_{i=1}^{n} \log(\cosh(Y'_i - Y_i))\)
And this is the plot:
Okay, now let’s introduce some intuitive explanation.
The TensorFlow docs write this about Logcosh loss: "`log(cosh(x))` is approximately equal to `(x ** 2) / 2` for small `x` and to `abs(x) - log(2)` for large `x`. This means that 'logcosh' works mostly like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction."
Well, that’s great. It seems to be an improvement over MSE, or L2 loss. Recall that MSE is an improvement over MAE (L1 Loss) if your data set contains quite large errors, as it captures these better. However, this also means that it is much more sensitive to errors than the MAE. Logcosh helps against this problem:
log
.Hence: indeed, if you have both larger errors that must be detected as well as outliers, which you perhaps cannot remove from your dataset, consider using Logcosh! It’s available in many frameworks like TensorFlow as we saw above, but also in Keras.
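The two regimes from the TensorFlow quote can be made visible with a short sketch (illustrative names; the averaging over samples mirrors what frameworks typically do):

```python
import math

def logcosh_loss(y_true, y_pred):
    """Mean of log(cosh(prediction error)) over all samples."""
    return sum(math.log(math.cosh(p - t))
               for t, p in zip(y_true, y_pred)) / len(y_true)

# Small error: behaves like (x ** 2) / 2, i.e. like half the squared error.
print(math.log(math.cosh(0.01)), 0.01 ** 2 / 2)
# Large error: behaves like abs(x) - log(2), i.e. linearly, like the MAE.
print(math.log(math.cosh(10.0)), 10.0 - math.log(2))
```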
Let’s move on to Huber loss, which we already hinted about in the section about the MAE:
Or, visually:
When interpreting the formula, we see two parts:
What is the effect of all this mathematical juggling?
Look at the visualization above.
For relatively small deltas (in our case, with \(\delta = 0.25\)), you'll see that the loss function becomes relatively flat. It takes quite a long time before loss increases, even when predictions are getting larger and larger.
For larger deltas, the slope of the function increases. As you can see, the larger the delta, the slower the increase of this slope: eventually, for really large \(\delta\) the slope of the loss tends to converge to some maximum.
If you look closely, you'll notice the following: for small values of \(\delta\), the loss behaves much like the MAE, while for large values it behaves much like the MSE.
Hey, haven’t we seen that before?
Yep: in our discussions about the MAE (insensitivity to larger errors) and the MSE (fixes this, but facing sensitivity to outliers).
Grover (2019) writes about this nicely:
Huber loss approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers.)
That’s what this \(\delta\) is for! You are now in control about the ‘degree’ of MAE vs MSE-ness you’ll introduce in your loss function. When you face large errors due to outliers, you can try again with a lower \(\delta\); if your errors are too small to be picked up by your Huber loss, you can increase the delta instead.
And there’s another thing, which we also mentioned when discussing the MAE: it produces large gradients when you optimize your model by means of gradient descent, even when your errors are small (Grover, 2019). This is bad for model performance, as you will likely overshoot the mathematical optimum for your model. You don’t face this problem with MSE, as it tends to decrease towards the actual minimum (Grover, 2019). If you switch to Huber loss from MAE, you might find it to be an additional benefit.
Here’s why: Huber loss, like MSE, decreases as well when it approaches the mathematical optimum (Grover, 2019). This means that you can combine the best of both worlds: the insensitivity to larger errors from MAE with the sensitivity of the MSE and its suitability for gradient descent. Hooray for Huber loss! And like always, it’s also available when you train models with Keras.
Then why isn’t this the perfect loss function?
Because the benefit of the \(\delta\) is also becoming your bottleneck (Grover, 2019). As you have to configure it manually (or perhaps using some automated tooling), you'll have to spend time and resources on finding the optimal \(\delta\) for your dataset. This is an iterative problem that, in the extreme case, may become impractical at best and costly at worst. However, in most cases, it's best just to experiment – perhaps, you'll find better results!
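The two-part definition is short to sketch in plain Python; the function and parameter names here are illustrative (Keras ships its own Huber implementation):

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic (MSE-like) for errors up to delta, linear (MAE-like) beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        error = abs(t - p)
        if error <= delta:
            total += 0.5 * error ** 2               # MSE-like region
        else:
            total += delta * (error - 0.5 * delta)  # MAE-like region
    return total / len(y_true)

# An outlier of size 100 contributes roughly delta * 100 = 100 to the loss,
# instead of the 5000 that the quadratic branch would have produced:
print(huber_loss([0.0], [100.0], delta=1.0))
```

Increasing `delta` widens the quadratic region, moving the behavior toward the MSE; decreasing it moves the behavior toward the MAE – exactly the trade-off described above.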
Loss functions are also applied in classifiers. I already discussed in another post what classification is all about, so I’m going to repeat it here:
Suppose that you work in the field of separating non-ripe tomatoes from the ripe ones. It’s an important job, one can argue, because we don’t want to sell customers tomatoes they can’t process into dinner. It’s the perfect job to illustrate what a human classifier would do.
Source: How to create a CNN classifier with Keras?
Humans have a perfect eye to spot tomatoes that are not ripe or that have any other defect, such as being rotten. They derive certain characteristics for those tomatoes, e.g. based on color, smell and shape:
– If it’s green, it’s likely to be unripe (or: not sellable);
– If it smells, it is likely to be unsellable;
– The same goes for when it’s white or when fungus is visible on top of it.
If none of those occur, it’s likely that the tomato can be sold. We now have two classes: sellable tomatoes and non-sellable tomatoes. Human classifiers decide about which class an object (a tomato) belongs to.
The same principle occurs again in machine learning and deep learning.
Only then, we replace the human with a machine learning model. We’re then using machine learning for classification, or for deciding about some “model input” to “which class” it belongs.
We’ll now cover loss functions that are used for classification.
The hinge loss is defined as follows (Wikipedia, 2011):

\(\ell(y) = max(0, 1 - t \times y)\)
It simply takes the maximum of either 0 or the computation \(1 - t \times y\), where \(t\) is the true target (-1 or +1) and \(y\) is the machine learning output value (between -1 and +1).
When the target equals the prediction, the computation \(t \times y\) is always one: \(1 \times 1 = -1 \times -1 = 1\). Essentially, because then \(1 - t \times y = 1 - 1 = 0\), the \(max\) function takes the maximum \(max(0, 0)\), which of course is 0.
That is: when the actual target meets the prediction, the loss is zero. Negative loss doesn’t exist. When the target != the prediction, the loss value increases.
For \(t = 1\), i.e. when \(1\) is your target, hinge loss looks like this:
Let’s now consider three scenarios which can occur, given our target \(t = 1\) (Kompella, 2017; Wikipedia, 2011):
In the first case, e.g. when \(y = 1.2\), the output of \(1 - t \times y\) will be \(1 - (1 \times 1.2) = 1 - 1.2 = -0.2\). Loss, then, will be \(max(0, -0.2) = 0\). Hence, for all correct predictions – even if they are too correct – loss is zero. In the too correct situation, the classifier is simply very sure that the prediction is correct (Peltarion, n.d.).
In the second case, e.g. when \(y = -0.5\), the output of the loss equation will be \(1 - (1 \times -0.5) = 1 - (-0.5) = 1.5\), and hence the loss will be \(max(0, 1.5) = 1.5\). Very wrong predictions are hence penalized significantly by the hinge loss function.
In the third case, e.g. when \(y = 0.9\), loss output function will be \(1 – (1 \times 0.9) = 1 – 0.9 = 0.1\). Loss will be \(max(0, 0.1) = 0.1\). We’re getting there – and that’s also indicated by the small but nonzero loss.
What this essentially sketches is a margin that you try to maximize: when the prediction is correct or even too correct, it doesn't matter much, but when it's not, we're trying to correct. The correction process keeps going until the prediction is fully correct (or until the human tells the improvement to stop). We're thus finding the optimal decision boundary and are hence performing a maximum-margin operation.
It is therefore not surprising that hinge loss is one of the most commonly used loss functions in Support Vector Machines (Kompella, 2017). What's more, hinge loss itself cannot be used with gradient-descent-like optimizers, those with which (deep) neural networks are trained. This occurs due to the fact that it's not continuously differentiable, more precisely at the 'boundary' between no loss / minimum loss. Fortunately, a subgradient of the hinge loss function can be optimized, so it can (albeit in a different form) still be used in today's deep learning models (Wikipedia, 2011). For example, hinge loss is available as a loss function in Keras.
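The three scenarios above can be reproduced with a one-line sketch (illustrative names):

```python
def hinge_loss(t, y):
    """t is the true target (-1 or +1), y the model output."""
    return max(0.0, 1 - t * y)

# The three scenarios for target t = 1:
print(hinge_loss(1, 1.2))   # too correct: loss is 0
print(hinge_loss(1, -0.5))  # very wrong: loss is 1.5
print(hinge_loss(1, 0.9))   # nearly correct: small but nonzero loss
```

Note the flat region for correct predictions and the kink at \(t \times y = 1\) – the point where the function stops being differentiable, which is why subgradients are needed.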
The squared hinge loss is like the hinge formula displayed above, but then the \(max()\) function output is squared.
This helps achieve two things: firstly, the squared function is smoother around the boundary, which makes it easier to optimize (Tay, n.d.); secondly, larger errors are punished more significantly than with normal hinge loss, while smaller errors are punished slightly less.
Both normal hinge and squared hinge loss work only for binary classification problems in which the actual target value is either +1 or -1. Although that’s perfectly fine for when you have such problems (e.g. the diabetes yes/no problem that we looked at previously), there are many other problems which cannot be solved in a binary fashion.
(Note that one approach to create a multiclass classifier, especially with SVMs, is to create many binary ones, feeding the data to each of them and counting classes, eventually taking the most-chosen class as output – it goes without saying that this is not very efficient.)
However, in neural networks and hence gradient based optimization problems, we're not interested in doing that. It would mean that we have to train many networks, which significantly impacts the time performance of our ML training problem. Instead, we can use the multiclass hinge that has been introduced by researchers Weston and Watkins (Wikipedia, 2011):

\(\ell(y) = \sum_{c \neq t} max(0, 1 + y_c - y_t)\)
What this means in plain English is this:
For all \(y\) (output) values unequal to \(t\), compute the loss. Eventually, sum them together to find the multiclass hinge loss.
Note that this does not mean that you sum over all possible values for y (which would be all real-valued numbers except \(t\)), but instead, you compute the sum over all the outputs generated by your ML model during the forward pass. That is, all the predictions. Only for those where \(y \neq t\), you compute the loss. This is obvious from an efficiency point of view: where \(y = t\), loss is always zero, so no \(max\) operation needs to be computed to find zero after all.
Keras implements the multiclass hinge loss as categorical hinge loss, which requires you to change your targets into categorical (one-hot encoded) format first by means of to_categorical.
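Both the one-hot encoding and the Weston-Watkins sum described above can be sketched in plain Python. The names are illustrative, and note that this is the sum formulation from the text – Keras's own categorical hinge implementation may differ in details:

```python
def to_one_hot(target, num_classes):
    """Turn an integer target into a one-hot vector, e.g. 1 -> [0, 1, 0]."""
    return [1 if c == target else 0 for c in range(num_classes)]

def multiclass_hinge(target, y_scores):
    """Weston-Watkins hinge: sum of max(0, 1 + y_c - y_t) over all classes c != t."""
    y_t = y_scores[target]
    return sum(max(0.0, 1 + y_c - y_t)
               for c, y_c in enumerate(y_scores) if c != target)

print(to_one_hot(1, 3))
print(multiclass_hinge(1, [0.2, 0.8, 0.1]))  # the true class leads, but not by a full margin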
A loss function that’s used quite often in today’s neural networks is binary crossentropy. As you can guess, it’s a loss function for binary classification problems, i.e. where there exist two classes. Primarily, it can be used where the output of the neural network is somewhere between 0 and 1, e.g. by means of the Sigmoid layer.
This is its formula:

\(L = -[\, t \log(p) + (1 - t) \log(1 - p) \,]\)
It can be visualized in this way:
And, like before, let’s now explain it in more intuitive ways.
The \(t\) in the formula is the target (0 or 1) and the \(p\) is the prediction (a real-valued number between 0 and 1, for example 0.12326).
When you input both into the formula, loss will be computed related to the target and the prediction. In the visualization above, where the target is 1, it becomes clear that loss is 0. However, when moving to the left, loss tends to increase (ML Cheatsheet documentation, n.d.). What’s more, it increases increasingly fast. Hence, it not only tends to punish wrong predictions, but also wrong predictions that are extremely confident (i.e., if the model is very confident that it’s 0 while it’s 1, it gets punished much harder than when it thinks it’s somewhere in between, e.g. 0.5). This latter property makes the binary cross entropy a valued loss function in classification problems.
When the target is 0, you can see that the loss is mirrored – which is exactly what we want:
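The behavior – small loss when close to the target, exploding loss when confidently wrong, and the mirrored case for target 0 – can be checked with a quick sketch (illustrative names):

```python
import math

def binary_crossentropy(t, p):
    """t is the target (0 or 1), p the predicted probability in (0, 1)."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

print(binary_crossentropy(1, 0.9))  # close to the target: small loss
print(binary_crossentropy(1, 0.1))  # confidently wrong: much larger loss
print(binary_crossentropy(0, 0.1))  # mirrored case for target 0
```

In practice, predictions are clipped away from exactly 0 and 1 before this computation, since \(\log(0)\) is undefined.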
Now what if you have no binary classification problem, but instead a multiclass one?
Thus: one where your output can belong to one of > 2 classes.
The CNN that we created with Keras using the MNIST dataset is a good example of this problem. As you can find in the blog (see the link), we used a different loss function there – categorical crossentropy. It’s still crossentropy, but then adapted to multiclass problems.
This is the formula with which we compute categorical crossentropy:

\(L = -\sum_{c=1}^{C} t_c \log(p_c)\)

Put very simply, we sum over all the \(C\) classes that we have in our system, and for each class multiply the target of the observation with the natural log of the prediction of the observation.
It took me some time to understand what was meant by an observation, though, but thanks to Peltarion (n.d.), I got it.
The answer lies in the fact that the crossentropy is categorical and that hence categorical data is used, with one-hot encoding.
Suppose that we have a dataset that presents what the odds are of getting diabetes after five years, just like the Pima Indians dataset we used before. However, this time another class is added, being "Possibly diabetic", giving us three classes for one's condition after five years given current measurements:
That dataset would look like this:
| Features | Target |
| --- | --- |
| { … } | 1 |
| { … } | 2 |
| { … } | 0 |
| { … } | 0 |
| { … } | 2 |
| …and so on | …and so on |
However, categorical crossentropy cannot simply use integers as targets, because its formula doesn't support this. Instead, we must apply one-hot encoding, which transforms the integer targets into categorical vectors – vectors that display all categories together with a value indicating whether the sample belongs to that class or not:
That’s what we always do with to_categorical
in Keras.
Our dataset then looks as follows:
| Features | Target |
| --- | --- |
| { … } | \([0, 1, 0]\) |
| { … } | \([0, 0, 1]\) |
| { … } | \([1, 0, 0]\) |
| { … } | \([1, 0, 0]\) |
| { … } | \([0, 0, 1]\) |
| …and so on | …and so on |
Now, we can explain what is meant by an observation.
Let’s look at the formula again and recall that we iterate over all the possible output classes – once for every prediction made, with some true target:
Now suppose that our trained model outputs, for the set of features \({ … }\) (or a very similar one) that has target \([0, 1, 0]\), a probability distribution of \([0.25, 0.50, 0.25]\) – that's what these models do: they don't pick one class, but instead compute the probability that the sample belongs to each particular class in the categorical vector.
Computing the loss for \(c = 1\), what is the target value? It's 0: in \(\textbf{t} = [0, 1, 0]\), the target value for the first class is 0.
What is the prediction? Well, following the same logic, the prediction is 0.25.
These two values – the target and the prediction for one particular class – are what we call an observation with respect to the total prediction. By looking at all observations and merging them together, we can find the loss value for the entire prediction.
We multiply the target value with the log of the prediction. But wait! We multiply the log with 0 – so the loss value for this target is 0.
It doesn’t surprise you that this happens for all targets except for one – where the target value is 1: in the prediction above, that would be for the second one.
Note that when the sum is complete, you’ll multiply it with -1 to find the true categorical crossentropy loss.
Hence, loss is driven by the actual target observation of your sample instead of all the non-targets. The structure of the formula however allows us to perform multiclass machine learning training with crossentropy. There we go, we learnt another loss function!
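The worked example above – target \([0, 1, 0]\), prediction \([0.25, 0.50, 0.25]\) – can be verified in a few lines (illustrative names):

```python
import math

def categorical_crossentropy(targets, predictions):
    """-1 times the sum over classes of target * log(prediction)."""
    return -sum(t * math.log(p) for t, p in zip(targets, predictions))

# Only the observation where the target is 1 contributes to the loss:
loss = categorical_crossentropy([0, 1, 0], [0.25, 0.50, 0.25])
print(loss)  # -log(0.5), roughly 0.693
```

As discussed, the observations for the first and third classes multiply their logs with 0 and vanish, leaving only \(-\log(0.5)\).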
But what if we don’t want to convert our integer targets into categorical format? We can use sparse categorical crossentropy instead (Lin, 2019).
It performs in pretty much similar ways to regular categorical crossentropy loss, but instead allows you to use integer targets! That’s nice.
| Features | Target |
| --- | --- |
| { … } | 1 |
| { … } | 2 |
| { … } | 0 |
| { … } | 0 |
| { … } | 2 |
| …and so on | …and so on |
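With an integer target, the same loss can be computed by picking the predicted probability of the true class directly – this numeric equivalence with the one-hot version is what sparse categorical crossentropy exploits. A sketch with illustrative names:

```python
import math

def sparse_categorical_crossentropy(target_index, predictions):
    """Integer target: take -log of the predicted probability of the true class."""
    return -math.log(predictions[target_index])

# Same prediction as in the one-hot example; integer target 1 replaces [0, 1, 0]:
print(sparse_categorical_crossentropy(1, [0.25, 0.50, 0.25]))  # roughly 0.693
```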
In this blog, we’ve looked at the concept of loss functions, also known as cost functions. We showed why they are necessary by means of illustrating the high-level machine learning process and (at a high level) what happens during optimization. Additionally, we covered a wide range of loss functions, some of them for classification, others for regression. Although we introduced some maths, we also tried to explain them intuitively.
I hope you’ve learnt something from my blog! If you have any questions, remarks, comments or other forms of feedback, please feel free to leave a comment below! I’d also appreciate a comment telling me if you learnt something and if so, what you learnt. I’ll gladly improve my blog if mistakes are made. Thanks and happy engineering!
Chollet, F. (2017). Deep Learning with Python. New York, NY: Manning Publications.
Keras. (n.d.). Losses. Retrieved from https://keras.io/losses/
Binieli, M. (2018, October 8). Machine learning: an introduction to mean squared error and regression lines. Retrieved from https://www.freecodecamp.org/news/machine-learning-mean-squared-error-regression-line-c7dde9a26b93/
Rich. (n.d.). Why square the difference instead of taking the absolute value in standard deviation? Retrieved from https://stats.stackexchange.com/a/121
Quora. (n.d.). What is the difference between squared error and absolute error? Retrieved from https://www.quora.com/What-is-the-difference-between-squared-error-and-absolute-error
Watson, N. (2019, June 14). Using Mean Absolute Error to Forecast Accuracy. Retrieved from https://canworksmart.com/using-mean-absolute-error-forecast-accuracy/
Drakos, G. (2018, December 5). How to select the Right Evaluation Metric for Machine Learning Models: Part 1 Regression Metrics. Retrieved from https://towardsdatascience.com/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0
Wikipedia. (2011, September 16). Hinge loss. Retrieved from https://en.wikipedia.org/wiki/Hinge_loss
Kompella, R. (2017, October 19). Support vector machines (intuitive understanding) – Part #1. Retrieved from https://towardsdatascience.com/support-vector-machines-intuitive-understanding-part-1-3fb049df4ba1
Peltarion. (n.d.). Squared hinge. Retrieved from https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/squared-hinge
Tay, J. (n.d.). Why is squared hinge loss differentiable? Retrieved from https://www.quora.com/Why-is-squared-hinge-loss-differentiable
Rakhlin, A. (n.d.). Online Methods in Machine Learning. Retrieved from http://www.mit.edu/~rakhlin/6.883/lectures/lecture05.pdf
Grover, P. (2019, September 25). 5 Regression Loss Functions All Machine Learners Should Know. Retrieved from https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
TensorFlow. (n.d.). tf.keras.losses.logcosh. Retrieved from https://www.tensorflow.org/api_docs/python/tf/keras/losses/logcosh
ML Cheatsheet documentation. (n.d.). Loss Functions. Retrieved from https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
Peltarion. (n.d.). Categorical crossentropy. Retrieved from https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/categorical-crossentropy
Lin, J. (2019, September 17). categorical_crossentropy VS. sparse_categorical_crossentropy. Retrieved from https://jovianlin.io/cat-crossentropy-vs-sparse-cat-crossentropy/
The post About loss and loss functions appeared first on MachineCurve.