Using Huber loss in Keras

The Huber loss function can be used to balance between the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). It is therefore a good loss function when you have varied data or only a few outliers.

But how to implement this loss function in Keras?

That’s what we will find out in this blog.

We first briefly recap the concept of a loss function and introduce Huber loss. Next, we present a Keras example implementation that uses the Boston Housing Prices Dataset to generate a regression model. Let’s go!

Note that the full code is also available on GitHub, in my Keras loss functions repository.

About loss functions and Huber loss

When you train machine learning models, you feed data to the network, generate predictions, compare them with the actual values (the targets), and then compute what is known as a loss. This loss essentially tells you something about the performance of the network: the higher it is, the worse your network performs overall.

There are many ways to compute the loss value. Huber loss is one of them. It essentially combines the Mean Absolute Error and the Mean Squared Error depending on some delta parameter, or 𝛿. This parameter must be configured by the machine learning engineer up front and depends on your data.
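
For completeness, this is the standard definition of the Huber loss, where $a = y - \hat{y}$ is the prediction error:

$$ L_\delta(a) = \begin{cases} \frac{1}{2}a^2 & \text{for } |a| \leq \delta \\ \delta \left( |a| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases} $$

In other words: the loss is quadratic (MSE-like) for errors no larger than 𝛿, and linear (MAE-like) beyond that point.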

Huber loss looks like this:

As you can see, for target = 0, the loss increases as the error increases. However, the speed with which it increases depends on this 𝛿 value. In fact, Grover (2019) writes about this as follows: Huber loss approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers).

When you compare this statement with the benefits and drawbacks of both the MAE and the MSE, you’ll gain some insight into how to adapt this delta parameter:

  • If your dataset contains large outliers, it’s likely that your model will not be able to predict them correctly at once. In fact, it might take quite some time for it to recognize these, if it can do so at all. This results in large errors between predicted values and actual targets, because they’re outliers. Since MSE squares errors, large outliers will distort your loss value significantly. If outliers are present, you likely don’t want to use MSE. Huber loss will still be useful, but you’ll have to use small values for 𝛿.
  • If it does not contain many outliers, it’s likely that it will generate quite accurate predictions from the start – or at least, from some epochs after starting the training process. In this case, you may observe that the errors are very small overall. Then, one can argue, it may be worthwhile to let the largest small errors contribute more significantly to the error than the smaller ones. In this case, MSE is actually useful; hence, with Huber loss, you’ll likely want to use quite large values for 𝛿.
  • If you don’t know, you can always start somewhere in between – for example, in the plot above, 𝛿 = 1 resembled MAE quite closely, while 𝛿 = 3 already tends towards MSE. What if you used 𝛿 = 1.5 instead? You might benefit from both worlds; see the quick numerical sketch right after this list.
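
To make this concrete, here is a minimal standalone NumPy sketch – separate from the Keras model code later in this post – that evaluates the Huber loss for a few error magnitudes at different 𝛿 values:

import numpy as np

def huber(error, delta):
    # Quadratic for |error| <= delta, linear beyond it
    error = np.asarray(error, dtype=float)
    return np.where(np.abs(error) <= delta,
                    0.5 * error ** 2,
                    delta * (np.abs(error) - 0.5 * delta))

errors = np.array([0.5, 1.0, 3.0, 10.0])
for delta in (0.5, 1.5, 3.0):
    print(f'delta={delta}: {huber(errors, delta)}')

For the largest error (10.0), the loss grows from 4.875 at 𝛿 = 0.5 to 25.5 at 𝛿 = 3.0: the larger 𝛿, the harder outliers weigh in, just as with MSE.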

Let’s now see if we can complete a regression problem with Huber loss!

Keras Huber loss example

Next, we show you how to use Huber loss with Keras to create a regression model. We’ll use the Boston housing price regression dataset which comes with Keras by default – that’ll make the example easier to follow. Obviously, you can always use your own data instead!

Since we need to know how to configure 𝛿, we must first inspect the data. Do the target values contain many outliers? Some statistical analysis will be useful here.

Only then do we create the model and configure 𝛿 to an estimate that seems adequate. Finally, we run the model, check its performance, and see whether we can improve 𝛿 any further.

Regression dataset: Boston housing price regression

Keras ships with several datasets out of the box: they are stored on an Amazon AWS server, and when you load one, Keras automatically downloads it for you and stores it in user-defined variables. This allows you to experiment with deep learning and the framework easily. This way, you can get a feel for DL practice and neural networks without getting lost in the complexity of loading, preprocessing and structuring your data.

The Boston housing price regression dataset is one of these datasets. Keras retrieves it from the Carnegie Mellon University StatLib library, which contains many datasets for training ML models. It is described as follows:

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, ‘Regression diagnostics …’, Wiley, 1980.

StatLib Datasets Archive

According to the StatLib website, it contains these variables:

  • CRIM per capita crime rate by town
  • ZN proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS proportion of non-retail business acres per town
  • CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX nitric oxides concentration (parts per 10 million)
  • RM average number of rooms per dwelling
  • AGE proportion of owner-occupied units built prior to 1940
  • DIS weighted distances to five Boston employment centres
  • RAD index of accessibility to radial highways
  • TAX full-value property-tax rate per $10,000
  • PTRATIO pupil-teacher ratio by town
  • B 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT % lower status of the population
  • MEDV Median value of owner-occupied homes in $1000’s

In total, one sample contains 13 features (CRIM to LSTAT) which together approximate the median value of the owner-occupied homes or MEDV. The structure of this dataset, mapping some variables to a real-valued number, allows us to perform regression.
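
A quick way to verify this structure is to load the dataset and print the array shapes; the sample counts shown in the comments below are what Keras’ default 80/20 train/test split should produce:

from keras.datasets import boston_housing

# Download (on first use) and load the dataset
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

# 13 features per sample, one real-valued MEDV target per sample
print(x_train.shape, y_train.shape)  # (404, 13) (404,)
print(x_test.shape, y_test.shape)    # (102, 13) (102,)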

Let’s now take a look at the dataset itself, and particularly its target values.

Does the dataset have many outliers?

The number of outliers tells us something about the value for 𝛿 that we have to choose. Thinking back to my Introduction to Statistics class at university, I remember that box plots can help visually identify outliers in a statistical sample:

Examination of the data for unusual observations that are far removed from the mass of data. These points are often referred to as outliers. Two graphical techniques for identifying outliers, scatter plots and box plots, (…)

Engineering Statistics Handbook

The sample, in our case, is the Boston housing dataset: it contains some mappings between feature variables and target prices, but obviously doesn’t represent all homes in Boston, which would be the statistical population. Nevertheless, we can write some code to generate a box plot based on this dataset:

'''
  Generate a box plot to determine how many outliers are within the Boston Housing Price Dataset.
'''
from keras.datasets import boston_housing
import numpy as np
import matplotlib.pyplot as plt

# Load the data
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

# We only need the targets, but do need to consider all of them
y = np.concatenate((y_train, y_test))

# Generate box plot
plt.boxplot(y)
plt.title('Boston housing price regression dataset - boxplot')
plt.show()

Next, we run it to find this box plot:

Note that we concatenated the training and testing data for this box plot. Although the plot hints that many outliers exist, primarily at the high end of the spectrum (which makes sense: in real life, extremely high house prices occur, whereas extremely low ones do not), we cannot yet conclude that MSE is a bad idea. We’ll need to inspect the individual datasets too.

We can do that by simply adapting our code to:

y = y_train

or

y = y_test

This results in the following box plots:

Although the outliers are more extreme in the training data, they are present in the testing dataset as well.

Their structure is also quite similar: most of them, if not all, sit in the high-end segment of the housing market.
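
If you prefer numbers over eyeballing box plots, a small sketch like the following counts the outliers in each subset with the common 1.5 × IQR rule that box plot whiskers are typically based on:

import numpy as np
from keras.datasets import boston_housing

(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

def count_outliers(y):
    # Standard box plot rule: outliers fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return int(np.sum((y < q1 - 1.5 * iqr) | (y > q3 + 1.5 * iqr)))

print(f'Training outliers: {count_outliers(y_train)} / {len(y_train)}')
print(f'Testing outliers: {count_outliers(y_test)} / {len(y_test)}')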

Do note, however, that the median values for the testing and training datasets are slightly different. This means that patterns underlying housing prices in the testing data may not be captured fully during the training process, because the statistical sample is slightly different. However, there is only one way to find out – by actually creating a regression model!

Creating the model

Let’s now create the model. Create a file called huber_loss.py in some folder and open it in a development environment. We’re then ready to add some code! However, let’s first analyze what you’ll need to use Huber loss in Keras.

What you’ll need to use Huber loss in Keras

The primary dependency that you’ll need is Keras, the deep learning framework for Python. However, not every version of Keras works – I quite soon ran into trouble with a (relatively) outdated Keras version, with errors like huber_loss not found.

I had to upgrade Keras to the newest version, as Huber loss was apparently added quite recently – but this also meant that I had to upgrade Tensorflow, the processing engine on top of which my Keras installation runs. Since Tensorflow runs on the GPU on my machine, I also had to upgrade CUDA to support the newest Tensorflow version. Some insights:

  • Huber loss works with Keras version 2.3.1+
  • This Keras version requires Tensorflow 2.0.0+
  • Tensorflow 2.0.0+ requires CUDA 10.0 when you run it on GPU, contrary to previous versions, which ran on CUDA 9.0. Note that 10.0 is not the newest CUDA version available! Save yourself time and do not install 10.1, but 10.0 instead.

Since installing CUDA also requires CuDNN, I refer you to another blog post which perfectly explains how to install Tensorflow GPU and CUDA. However, you’ll need to consider the requirements listed above – or, even better, the official Tensorflow GPU requirements! When you install them correctly, you’ll be able to run Huber loss in Keras 😊

…cost me an afternoon to fix this, though 🙈

Model imports

Now that we can start coding, let’s import the Python dependencies that we need first:

'''
  Keras model demonstrating Huber loss
'''
from keras.datasets import boston_housing
from keras.models import Sequential
from keras.layers import Dense
from keras.losses import huber_loss
import numpy as np
import matplotlib.pyplot as plt

Obviously, we need the boston_housing dataset from the available Keras datasets. Additionally, we import Sequential, as we will build our model using the Keras Sequential API. We’re creating a very simple model, a multilayer perceptron, with which we’ll attempt to regress a function that correctly estimates the median values of Boston homes. For this reason, we import Dense, i.e. densely-connected, layers.

We also need huber_loss, since that’s the loss function we use. NumPy is used for number processing and we use Matplotlib to visualize the end result.

Defining Huber loss yourself to make it usable

Next, we’ll have to do something pretty weird to make Huber loss usable in Keras.

Even though Keras apparently supports Huber loss natively, by providing huber_loss as a string value during model configuration, there’s little point in doing so, since the delta value discussed before cannot be configured that way. Hence, we need to think differently.
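
For illustration, a string-based configuration would look like the sketch below (assuming a Keras version that recognizes the huber_loss string identifier, as suggested above). It compiles, but offers no way to set 𝛿, which stays at the function’s default (1.0, as far as I can tell):

from keras.models import Sequential
from keras.layers import Dense

# Minimal illustration only: the string-based loss cannot carry a delta argument
model = Sequential([Dense(1, input_shape=(13,))])
model.compile(loss='huber_loss', optimizer='adam')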

…but there seemed to be no way to include Huber loss directly into Keras, until I came across an answer on Stack Overflow! It defines a custom Huber loss Keras function which can be used successfully. I slightly adapted it, and we’ll add it next:

# Define the Huber loss so that it can be used with Keras, with a configurable delta
def huber_loss_wrapper(**huber_loss_kwargs):
    def huber_loss_wrapped_function(y_true, y_pred):
        # Forward the captured keyword arguments (e.g. delta) to huber_loss
        return huber_loss(y_true, y_pred, **huber_loss_kwargs)
    return huber_loss_wrapped_function
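
The wrapper is a simple closure: calling huber_loss_wrapper(delta=1.5) returns a function with the (y_true, y_pred) signature that Keras expects, while the keyword arguments – including delta – are captured and forwarded to huber_loss on every call. The returned function is what we’ll pass to model.compile in a minute.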

Loading the dataset

We next load the data by calling the Keras load_data() function on the housing dataset and prepare the input layer shape, which we can add to the initial hidden layer later:

# Load data
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

# Set the input shape
shape_dimension = len(x_train[0])
input_shape = (shape_dimension,)
print(f'Feature shape: {input_shape}')

Preparing the model: architecture & configuration

Next, we provide the actual model architecture and configuration:

# Create the model
model = Sequential()
model.add(Dense(16, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(8, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))

# Configure the model and start training
model.compile(loss=huber_loss_wrapper(delta=1.5), optimizer='adam', metrics=['mean_absolute_error'])
history = model.fit(x_train, y_train, epochs=250, batch_size=1, verbose=1, validation_split=0.2)

As discussed, we use the Sequential API; here, we use two densely-connected hidden layers and one output layer. The hidden layers activate by means of ReLU and are therefore initialized with He uniform initialization. The final layer activates linearly, because it regresses the actual value.

Compiling the model requires specifying the delta value, which we set to 1.5: given the outliers identified earlier, full MSE resemblance is not smart, but we don’t want true MAE either. We’ll optimize by means of Adam and also define the MAE as an extra error metric. This way, we can estimate what the true error is in terms of thousands of dollars: the MAE keeps its meaning in the problem domain, whereas the Huber loss value does not.

Subsequently, we fit the training data to the model, complete 250 epochs with a batch size of 1 (true SGD-like optimization, albeit with Adam), use 20% of the data as validation data and ensure that the entire training process is output to standard output.

Performance testing & visualization

Finally, we add some code for performance testing and visualization:

# Test the model after training
test_results = model.evaluate(x_test, y_test, verbose=1)
print(f'Test results - Loss: {test_results[0]} - MAE: {test_results[1]}')

# Plot history: Huber loss and MAE
plt.plot(history.history['loss'], label='Huber loss (training data)')
plt.plot(history.history['val_loss'], label='Huber loss (validation data)')
plt.title('Boston Housing Price Dataset regression model - Huber loss')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()

plt.title('Boston Housing Price Dataset regression model - MAE')
plt.plot(history.history['mean_absolute_error'], label='MAE (training data)')
plt.plot(history.history['val_mean_absolute_error'], label='MAE (validation data)')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()

Model performance for 𝛿 = 1.5

Let’s now take a look at how the model has optimized over the epochs with the Huber loss:

And with the MAE:

We can see that overall, the model was still improving at the 250th epoch, although progress was stalling – which is perfectly normal in such a training process. The mean absolute error was approximately 3.64; since MEDV is expressed in thousands of dollars, that amounts to an average error of roughly $3,640.

Test results - Loss: 4.502029736836751 - MAE: 3.6392388343811035

Summary

In this blog post, we’ve seen how the Huber loss can be used to balance between MAE and MSE in machine learning regression problems. By means of the delta parameter, or 𝛿, you can configure which one it should resemble most, benefiting from the fact that you can check the number of outliers in your dataset a priori. I hope you’ve enjoyed this blog and learnt something from it – please let me know in the comments if you have any questions or remarks. Thanks and happy engineering! 😊

Note that the full code is also available on GitHub, in my Keras loss functions repository.

References

Grover, P. (2019, September 25). 5 Regression Loss Functions All Machine Learners Should Know. Retrieved from https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0

StatLib—Datasets Archive. (n.d.). Retrieved from http://lib.stat.cmu.edu/datasets/

Keras. (n.d.). Datasets. Retrieved from https://keras.io/datasets/

Keras. (n.d.). Boston housing price regression dataset. Retrieved from https://keras.io/datasets/#boston-housing-price-regression-dataset

Carnegie Mellon University StatLib. (n.d.). Boston house-price data. Retrieved from http://lib.stat.cmu.edu/datasets/boston

Engineering Statistics Handbook. (n.d.). 7.1.6. What are outliers in the data? Retrieved from https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

Using Tensorflow Huber loss in Keras. (n.d.). Retrieved from https://stackoverflow.com/questions/47840527/using-tensorflow-huber-loss-in-keras


6 thoughts on “Using Huber loss in Keras”

  1. may

    How do I choose the delta value?

    1. Chris

      Hi May,
      Often, it’s a matter of trial and error. I suggest you run a statistical analysis on your dataset first to find whether there are many outliers. You can then adapt the delta so that Huber looks more like MAE or MSE.
      Regards,
      Chris

  2. festo

    I failed to install TensorFlow; it gives me a ‘not found’ error. I tried searching using several links.

    1. Chris

      Hi Festo,
      That could be many things:
      – You are using the wrong version of Python (32 bit instead of 64 bit)
      – You have multiple Python versions installed
      – You have installed it into the wrong version of Python
      – Anything else

      It’s best to follow the official TF guide for installing: https://www.tensorflow.org/install

      Regards,
      Christian

  3. festo

    I’m getting this error:

    (base) C:\Users\MSIGWA FC>activate PythonGPU

    (PythonGPU) C:\Users\MSIGWA FC>conda install -c anaconda keras-gpu
    Collecting package metadata (current_repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.
    Collecting package metadata (repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.

    PackagesNotFoundError: The following packages are not available from current channels:

    – keras-gpu

    Current channels:

    https://conda.anaconda.org/anaconda/win-32
    https://conda.anaconda.org/anaconda/noarch
    https://repo.anaconda.com/pkgs/main/win-32
    https://repo.anaconda.com/pkgs/main/noarch
    https://repo.anaconda.com/pkgs/r/win-32
    https://repo.anaconda.com/pkgs/r/noarch
    https://repo.anaconda.com/pkgs/msys2/win-32
    https://repo.anaconda.com/pkgs/msys2/noarch

    To search for alternate channels that may provide the conda package you’re
    looking for, navigate to

    https://anaconda.org

    and use the search bar at the top of the page.

    1. Chris

      Today, the newest versions of Keras are included in TensorFlow 2.x. That’s why it’s best to install tensorflow-gpu via https://anaconda.org/anaconda/tensorflow-gpu i.e. conda install -c anaconda tensorflow-gpu. You’ve tried to install the ‘old’ Keras – which has no tensorflow attached by default.

      Regards,
      Chris
