The post Creating an MLP for regression with Keras appeared first on Machine Curve.

In a previous blog we showed that Multilayer Perceptrons (MLPs) can be used successfully for classification, albeit that state-of-the-art methods may yield better performance for some datasets.

But MLPs can also be used for a regression problem. And that’s exactly what we will demonstrate in today’s blog.

We’ll create an MLP for a (relatively simple) regression problem. To do so, we’ll use the Chennai Water Management Dataset, which describes the water levels and daily amounts of rainfall for four water reservoirs near Chennai. It was uploaded during the Chennai Water Crisis of 2019, in which the reservoirs literally dried up. Despite our quest for a simple regression problem, the ‘business’ problem behind the data isn’t simple at all.

The code for this blog is also available at GitHub.

Let’s go.

If you wish to run the code that you’ll create during this tutorial, you need a working setup. What you’ll need is:

- A running Python installation, preferably 3.6+
- A working installation of Keras: `pip install keras`
- A working installation of TensorFlow: `pip install tensorflow`
- A working NumPy package: `pip install numpy`

Preferably, install these in an environment with Anaconda. See here how you can do that.

We created a Multilayer Perceptron for classifying data (MNIST data, to be specific) in another blog. As we’ll discover in this blog, MLPs can also be applied to regression. However, I must stress that there are a few differences that we must take into account before we proceed.

Firstly, the final activation function. For classification MLPs, we used the `Softmax` activation function for the multiclass classification problem that we intended to solve. This does not work for regression MLPs. While you want to compute the probability that a sample belongs to any of the predetermined classes during classification (i.e., what Softmax does), you want something different during regression: to predict a real-valued number, like ‘24.05’. You therefore cannot use Softmax during regression. You’ll simply use the linear activation function for the final layer instead.

(For the same reason, you don’t convert your targets with `to_categorical` during regression.)

Secondly, the loss function that you’ll define is different. For multiclass classification problems, categorical crossentropy was your loss function of preference (Chollet, 2017). Binary crossentropy would be the one for binary classification. However, once again, you’re regressing this time – and you cannot use crossentropy, which essentially attempts to compare probability distributions (or, by the analogy from our previous blog, purple elephants) and see how much they are alike. Instead, you’ll use the mean absolute error or mean squared error, or similar loss functions. These simply compute the difference between the prediction and the expected value and perform some operations to make the outcome better for optimization. We’ll cover them in more detail later.

Thirdly, while for Softmax based output layers the number of neurons had to be equal to the number of classes you wish to predict for, in the case of regression, you’ll simply use 1 output neuron – unless you wish to regress multiple values at the same time, but that’s not for now.
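These differences in the output layer can be made concrete with a small NumPy sketch (illustrative code, not part of the model we’ll build later):

```python
import numpy as np

def softmax(z):
    # Classification head: turn raw outputs into class probabilities
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def linear(z):
    # Regression head: the identity function, f(x) = x
    return z

class_scores = softmax(np.array([2.0, 1.0, 0.5]))
prediction = linear(np.array([24.05]))
print(class_scores)  # sums to (numerically) 1: a probability distribution
print(prediction)    # a real-valued prediction, passed through unchanged
```

Softmax squashes its inputs into a distribution over classes, while the linear activation leaves the real-valued output untouched – which is exactly why the latter fits regression.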

Let’s first get acquainted with our dataset.

In this blog, we use the Chennai Water Management Dataset. It is a CC0 Public Domain dataset that is available at Kaggle. It is about the city of Chennai in India and especially its water management. Particularly:

Chennai also known as Madras is the capital of the Indian state of Tamil Nadu. Located on the Coromandel Coast off the Bay of Bengal, it is the biggest cultural, economic and educational centre of south India.

Being my second home, the city is facing an acute water shortage now (June 2019). Chennai is entirely dependent on ground water resources to meet its water needs. There are four reservoirs in the city, namely, Red Hills, Cholavaram, Poondi and Chembarambakkam, with a combined capacity of 11,057 mcft. These are the major sources of fresh water for the city.

Source: Sudalai Rajkumar, the author of the dataset

It was uploaded with the goal of inspiring people to come up with solutions that will help Chennai face its water shortage.

Can you imagine, a city with 7+ million people without solid access to water? It’s extreme.

Although we might not exactly aim for resolving Chennai’s water problem today, it’s still nice to use this dataset in order to make the problem more known to the world. Water shortage is an increasing problem given climate change and more and more cities throughout the world will face it in the years to come. Public awareness is the first step then, I’d say!

So let’s see if we can get a better idea about the water crisis that Chennai is facing right now.

The dataset provides daily rain and water levels for four reservoirs in the vicinity of Chennai: the Poondi Reservoir, the Cholavaram Reservoir, the Red Hills Reservoir and the Chembarambakkam Reservoir. They are some of the primary sources for water in Chennai, because the rivers are polluted with sewage (Wikipedia, 2013).

The lakes are located here:

For each of the four sites, the dataset provides two types of data. Firstly, it provides the daily amount of rain in millimeters (mm):

Secondly, it provides the daily water levels in the reservoirs in millions of cubic feet (mcft). Every million cubic feet is about 28.3 million litres, if that makes this chart more intuitive:
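As a quick sanity check on that conversion (a small sketch; 28.317 litres per cubic foot is the standard conversion factor):

```python
def mcft_to_million_litres(mcft):
    # 1 cubic foot is about 28.317 litres, so 1 million cubic feet
    # is about 28.317 million litres
    return mcft * 28.317

# The combined capacity of the four reservoirs, 11,057 mcft, in millions of litres:
print(round(mcft_to_million_litres(11057)))  # 313101, i.e. roughly 313 billion litres
```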

Poondi Reservoir is the most important water reservoir for Chennai (Wikipedia, 2015). Rather unfortunately, if you inspect the water levels for this reservoir and add a trend line, you’ll see that they indeed decrease over the years:

The same can be observed for the other reservoirs:

Except for 2015, when there were heavy floods due to large amounts of rainfall, the reservoirs have been emptier than in the years before 2012. One of the primary reasons for this is that the monsoons have become less predictable over the last couple of years (NASA, 2019). By consequence, refilling those reservoirs becomes a challenging task, with real trouble starting this year.

This was Puzhal Lake (also known as the Red Hills Lake) on May 31, 2018:

This was the situation in June 2019:

As you can see, the Red Hills lake dried up entirely.

That’s bad – and it is the perfect example of what is known as the Chennai Water Crisis of 2019.

This is also perfectly visible in the data. As you can see, the lakes had been filled only marginally after the 2018 Monsoons and were empty by June:

Now that we have a feel for the dataset and the real problem that it presents, we could think of certain ways in which machine learning could potentially help the Chennai residents.

In this blog, we specifically tailor this quest towards the MLP we intend to create, but obviously, there’s much more imaginable.

The first question that popped into my mind was this one: *what if we can predict the water level at one particular reservoir given the current levels in the other three?* In that case, we might be able to accurately estimate the water contents in the case measurements at some lake are not possible.

Intuitively, that might make sense, because from the charts it indeed seems that the water levels fluctuate up and down together. Obviously, we would need to do correlation analyses if we wish to know for sure, but I’ll skip these for the sake of simplicity… we’re creating an MLP for regression today, and the dataset is -despite the severity of the problem- the means to an end.

Similarly, much more useful means of applying ML can be thought of with regards to this problem, such as timeseries based prediction, but we’ll keep it easy in order to focus on what we intend to create … an MLP.

As usual, we’ll start by creating a folder, say `keras-mlp-regression`, and a model file inside it named `model.py`.

We then add our imports:

```
# Load dependencies
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
```

We use the Sequential API and the densely-connected layer type for creating the particular structure of the MLP. We’ll use NumPy for importing our data.

That’s what we do next, we load our dataset (it is available from Kaggle):

```
# Load data
dataset = np.loadtxt('./chennai_reservoir_levels.csv', delimiter='|', skiprows=1, usecols=(1,2,3,4))
```

We use NumPy’s `loadtxt` definition for loading the data from the CSV file. It works nicely with textual data, of which CSV data is a good example. Since the data is delimited by a `|`, we configure that above. Additionally, we skip the first row (which contains the column names) and only use columns 1-4, representing the actual data.

We next split the data into feature vectors and targets:

```
# Separate features and targets
X = dataset[:, 0:3]
Y = dataset[:, 3]
```

The assumption that I make here is that the water levels at one reservoir can be predicted from the other three. Specifically, I use the first three (`0:3`, a.k.a. zero up to but excluding three) columns in the dataset as predictor variables, while I use the fourth (column `3`) as the predicted variable.

In plain English, this means that I’m trying to predict the water levels at the Chembarambakkam reservoir based on the Red Hills, Poondi and Cholavaram reservoirs.

If you’re from the region and know in advance that this is a false assumption – my apologies. Despite some research, I am not entirely sure about the assumption either, and since I’m not from the region, I cannot know for sure. However, it is still possible to fit an MLP to this data – and to show you how to create one. And that’s what we’ll do next.

We set the input shape as our next step:

```
# Set the input shape
input_shape = (3,)
print(f'Feature shape: {input_shape}')
```

This time, the input shape is a one-dimensional vector of three features. The features are the water levels at the Red Hills, Poondi and Cholavaram reservoirs on one particular date, while the Chembarambakkam level is to be predicted.

Next, we create our MLP:

```
# Create the model
model = Sequential()
model.add(Dense(16, input_shape=input_shape, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='linear'))
```

Similar to the MLP for classification, we’re using the Keras Sequential API since it makes our life easier given the simplicity of our model.

We then specify three densely-connected layers of neurons: one with 16 outputs, one with 8 outputs and one with 1 output. This way, the neural network will be allowed to ‘think’ wider first, before converging to the actual prediction.

The input layer is specified by the input shape and therefore contains 3 neurons; one per input feature.

Note that we’re using ReLU based activation because it is one of the standard activation functions used today. However, note as well that for the final layer we’re no longer using `Softmax`, as with the MLP classifier. Instead, we’re using the identity function, \(f(x) = x\), for generating the prediction. Using the linear function allows us to generate a real-valued or numeric prediction, which is exactly what we need.

We finally configure the model and start the training process:

```
# Configure the model and start training
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_squared_error'])
model.fit(X, Y, epochs=10, batch_size=10, verbose=1, validation_split=0.2)
```

Contrary to the MLP based classifier, in which we used categorical crossentropy as our loss function, we do not wish to compare certain classes (or as I called them, elephants).

Instead, we want to generate a real-valued or numeric prediction and see how much it deviates from the actual outcome.

Some loss functions are available for this, which are based on the error \(E = \text{prediction} - \text{actual outcome}\) (Grover, 2019). Those include:

- The **mean squared error** (MSE), which computes the squared error (\(error^2\)) for all predictions made, and subsequently averages them by dividing by the number of predictions.
- The **mean absolute error** (MAE), which instead of computing the squared error computes the absolute error (\(|error|\)) for all predictions made and subsequently averages them in the same way.

To illustrate how they work, we’ll use an example: if there are two errors, e.g. \(-4\) and \(4\), the MSE will produce 16 twice, while the MAE produces 4 twice.

They both have their benefits and drawbacks, but generally, the MAE is used in situations in which outliers can be present (Grover, 2019).
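The example above can be verified in a couple of lines (a sketch of the two loss functions, not Keras’s internal implementation):

```python
import numpy as np

def mse(y_true, y_pred):
    # Square each error, then average over all predictions
    return np.mean((y_pred - y_true) ** 2)

def mae(y_true, y_pred):
    # Take the absolute value of each error, then average
    return np.mean(np.abs(y_pred - y_true))

# Two predictions with errors of -4 and +4
y_true = np.array([10.0, 10.0])
y_pred = np.array([6.0, 14.0])
print(mse(y_true, y_pred))  # 16.0
print(mae(y_true, y_pred))  # 4.0
```

Note how the squaring in MSE punishes the size of an error quadratically, which is why a single large outlier can dominate it – and why MAE is often the safer choice when outliers are present.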

We’ll train our MLP with both, adding the other as a support variable in the `metrics` attribute.

Since the Adam optimizer is pretty much the standard optimizer used today, we use it in this example (Chollet, 2017). Adam extends traditional stochastic gradient descent with momentum and per-parameter adaptive learning rates. I’ll cover the details in another blog later.

We use 10 epochs, a batch size of 10, a validation split of 20% and verbosity mode 1. This way, training finishes fairly quickly, while the relatively small batches still let us estimate the gradient reasonably accurately during optimization.

Next, let’s start the training process and see what happens.

These are the results from our first attempt:

```
Epoch 1/10
4517/4517 [==============================] - 14s 3ms/step - loss: 332.6803 - mean_squared_error: 246576.6700 - val_loss: 294.8595 - val_mean_squared_error: 151995.6923
Epoch 2/10
4517/4517 [==============================] - 13s 3ms/step - loss: 276.1181 - mean_squared_error: 126065.0225 - val_loss: 305.3823 - val_mean_squared_error: 160556.6063
Epoch 3/10
4517/4517 [==============================] - 13s 3ms/step - loss: 274.3100 - mean_squared_error: 125171.9773 - val_loss: 322.0316 - val_mean_squared_error: 174732.2345
Epoch 4/10
4517/4517 [==============================] - 14s 3ms/step - loss: 273.0496 - mean_squared_error: 124494.1493 - val_loss: 304.1849 - val_mean_squared_error: 158879.7165
Epoch 5/10
4517/4517 [==============================] - 14s 3ms/step - loss: 273.0190 - mean_squared_error: 124420.8973 - val_loss: 326.6588 - val_mean_squared_error: 179274.0880
Epoch 6/10
4517/4517 [==============================] - 14s 3ms/step - loss: 272.5061 - mean_squared_error: 124192.4299 - val_loss: 305.9678 - val_mean_squared_error: 160826.3846
Epoch 7/10
4517/4517 [==============================] - 15s 3ms/step - loss: 271.1735 - mean_squared_error: 124102.1444 - val_loss: 302.8888 - val_mean_squared_error: 153143.9235
Epoch 8/10
4517/4517 [==============================] - 15s 3ms/step - loss: 270.2527 - mean_squared_error: 123426.2535 - val_loss: 304.5966 - val_mean_squared_error: 154317.4158
Epoch 9/10
4517/4517 [==============================] - 14s 3ms/step - loss: 270.5909 - mean_squared_error: 123033.3367 - val_loss: 316.0911 - val_mean_squared_error: 165068.8407
Epoch 10/10
4517/4517 [==============================] - 14s 3ms/step - loss: 268.9381 - mean_squared_error: 121666.2221 - val_loss: 320.5413 - val_mean_squared_error: 166442.5935
```

Our validation loss seems to be in the range of 290-320. That’s relatively bad; we’re off by a couple of hundred million cubic feet of water, on average.

And that’s far more than a single droplet.

Second attempt with MSE as the loss function:

```
Epoch 1/10
4517/4517 [==============================] - 15s 3ms/step - loss: 255334.5861 - mean_absolute_error: 333.2326 - val_loss: 158943.3863 - val_mean_absolute_error: 304.4497
Epoch 2/10
4517/4517 [==============================] - 13s 3ms/step - loss: 129793.7640 - mean_absolute_error: 286.0301 - val_loss: 160327.8901 - val_mean_absolute_error: 308.0849
Epoch 3/10
4517/4517 [==============================] - 14s 3ms/step - loss: 125248.8358 - mean_absolute_error: 280.8977 - val_loss: 170016.9162 - val_mean_absolute_error: 318.3974
Epoch 4/10
4517/4517 [==============================] - 14s 3ms/step - loss: 124579.2617 - mean_absolute_error: 278.7398 - val_loss: 159538.5700 - val_mean_absolute_error: 310.0963
Epoch 5/10
4517/4517 [==============================] - 14s 3ms/step - loss: 123096.8864 - mean_absolute_error: 277.0384 - val_loss: 166921.0205 - val_mean_absolute_error: 315.9326
Epoch 6/10
4517/4517 [==============================] - 14s 3ms/step - loss: 122259.9060 - mean_absolute_error: 274.9807 - val_loss: 166284.8314 - val_mean_absolute_error: 315.1071
Epoch 7/10
4517/4517 [==============================] - 16s 4ms/step - loss: 121631.5276 - mean_absolute_error: 274.2378 - val_loss: 171566.1304 - val_mean_absolute_error: 323.3036
Epoch 8/10
4517/4517 [==============================] - 17s 4ms/step - loss: 120780.4943 - mean_absolute_error: 272.7180 - val_loss: 157775.8531 - val_mean_absolute_error: 305.2346
Epoch 9/10
4517/4517 [==============================] - 15s 3ms/step - loss: 120394.1161 - mean_absolute_error: 272.3696 - val_loss: 171933.4463 - val_mean_absolute_error: 319.7063
Epoch 10/10
4517/4517 [==============================] - 16s 4ms/step - loss: 119243.6368 - mean_absolute_error: 270.3955 - val_loss: 176639.7063 - val_mean_absolute_error: 322.7455
```

Again, far more than a single droplet.

However, what immediately came to mind is what I once read in François Chollet’s book Deep Learning with Python: that you should be especially careful with your data splits when you’re using timeseries data (Chollet, 2017).

It crossed my mind that we’re indeed using timeseries data, albeit not in a timeseries way.

However, precisely that may still be problematic. We split the data into training and validation data – and this is how Keras splits the data:

The validation data is selected from the last samples in the x and y data provided, before shuffling.

Source: Keras (n.d.)

Ah, okay. That’s like taking the last 20 percent off this graph for validation while training with the rest:

The point is that most of that 20% covers the situation with a lack of water, while much of the first 80% stems from the situation in which water levels were relatively okay. This way, we train our model on very different idiosyncrasies in the training versus the validation data:

- The monsoons became less predictable during the years with water shortages. By consequence, so did the water levels. This is a difference from the early years.
- Water management in Chennai could have changed, especially since it is described as one of the major causes for the water crisis (Wikipedia, 2019).
- Perhaps, rainfall has changed due to factors we cannot explain – cycles in the weather that we may not know about.
- Perhaps, the demand for water has increased, reducing the lifecycle time of water in the reservoirs.
- And so on.

By consequence, we must take into account time as much as we can.

And strangely, we could do so by randomly shuffling the data, I believe.

Our MLP does not take time into account by design (i.e., although the data is a timeseries, our MLP is not a timeseries model. Perhaps naïvely, it simply attempts to predict the level at one lake based on the current levels in the other three).

Yet, it took time into account anyway, as a consequence of how we split our data.

Randomly shuffling the data before training may yield a balance between training and validation data.

For this, we add two lines between `Load data` and `Separate features and targets`, as follows:

```
# Load data
dataset = np.loadtxt('./chennai_reservoir_levels.csv', delimiter='|', skiprows=1, usecols=(1,2,3,4))
# Shuffle dataset
np.random.shuffle(dataset)
# Separate features and targets
X = dataset[:, 0:3]
Y = dataset[:, 3]
```

Those are the results when we run the training process again:

```
4517/4517 [==============================] - 16s 3ms/step - loss: 296.1796 - mean_squared_error: 156532.2806 - val_loss: 290.2458 - val_mean_squared_error: 141232.8286
Epoch 2/10
4517/4517 [==============================] - 14s 3ms/step - loss: 282.1418 - mean_squared_error: 133645.8504 - val_loss: 280.9738 - val_mean_squared_error: 134865.3968
Epoch 3/10
4517/4517 [==============================] - 15s 3ms/step - loss: 279.2078 - mean_squared_error: 132291.1732 - val_loss: 281.8184 - val_mean_squared_error: 135522.1895
Epoch 4/10
4517/4517 [==============================] - 15s 3ms/step - loss: 277.4232 - mean_squared_error: 130418.7432 - val_loss: 279.9939 - val_mean_squared_error: 131684.8306
Epoch 5/10
4517/4517 [==============================] - 14s 3ms/step - loss: 275.6177 - mean_squared_error: 130715.3942 - val_loss: 280.5357 - val_mean_squared_error: 130576.4042
Epoch 6/10
4517/4517 [==============================] - 15s 3ms/step - loss: 273.3028 - mean_squared_error: 128172.1251 - val_loss: 272.0446 - val_mean_squared_error: 126942.4550
Epoch 7/10
4517/4517 [==============================] - 16s 4ms/step - loss: 271.7314 - mean_squared_error: 126806.0373 - val_loss: 273.5686 - val_mean_squared_error: 127348.5214
Epoch 8/10
4517/4517 [==============================] - 15s 3ms/step - loss: 270.4174 - mean_squared_error: 125443.8001 - val_loss: 269.9208 - val_mean_squared_error: 125395.7469
Epoch 9/10
4517/4517 [==============================] - 17s 4ms/step - loss: 270.0084 - mean_squared_error: 125520.7887 - val_loss: 274.6282 - val_mean_squared_error: 129173.8515
Epoch 10/10
4517/4517 [==============================] - 17s 4ms/step - loss: 268.4413 - mean_squared_error: 124098.9995 - val_loss: 268.5992 - val_mean_squared_error: 125443.7568
```

They are better indeed – but they aren’t good yet.

Training the model for 250 epochs instead of 10 got me to a validation loss of approximately 240 million cubic feet, but that’s still too much.

Here’s why I think that the relatively poor performance occurs:

- **Unknown factors** interfering with the data. I expect that water levels cannot be predicted by water levels alone and that, given the relatively large distances between the lakes, certain idiosyncratic factors between those sites influence the water levels as well. Primarily, this may be the case because – if I’m not wrong – certain lakes seem to be river-fed as well. This makes the water levels at those lakes dependent on rain conditions upstream, while this may not be the case for all the lakes. Perhaps, taking this into account may make our model better – e.g. by removing the river-fed lakes (although you may wonder, what would remain?). If I’m wrong with this assumption, please let me know in the comments!

- We didn’t take **time** into account. We simply predicted the water level at Chembarambakkam based on the levels in the three other lakes. The movements in water levels over the past few days, perhaps weeks, may be important predictors instead. Perhaps, making it a true timeseries model may make it better.
- We didn’t take **human activity** into account. The numbers do not say anything about human activity; perhaps, water levels changed due to certain water management activities. If this pattern does not occur in all the lakes, it would directly reduce the model’s predictive power. I read here that activities were undertaken in 2008-2009 to reduce the effects of evaporation. This might influence the data.
- Finally, we also did not take **weather conditions** into account. The weather is chaotic and may therefore reduce balance within the data. This is particularly the case because we only have rain data – and no data about, say, sunshine, and by consequence the degree of evaporation. It may be the case that we can improve the performance of the model if we simply add more weather data to it.

And to be frank, one can think about many better approaches to this problem than an MLP – approaches that would make the prediction much more aware of (primarily the temporal) context. For the sake of simplicity, I won’t cover them all, but creating timeseries based models with e.g. CNNs could be an option.

Nevertheless, we have been successful in creating a Multilayer Perceptron in Keras for regression – contrary to the classification one that we created before.

And despite the major crisis that Chennai is currently facing, that was the goal of our post today.

I do still hope that you’ll also be a little bit more aware now of the challenges that our planet is facing with respect to climate over the years to come. It’s remarkable what simply visualizing data for a Keras tutorial can do, isn’t it?

The code for this blog is available at GitHub.

Thank you once again for reading my blog. If you have any comments, questions or remarks, or if you have suggestions for improvement, please feel free to leave a comment below. I’ll try to review and respond to them as soon as I can. Particularly, I’m interested in your suggestions for the Chennai Water Management dataset – what can we do with it to make the world a slightly better place? Let creativity loose. Thanks again!

Chollet, F. (2017). *Deep Learning with Python*. New York, NY: Manning Publications.

Grover, P. (2019, May 24). 5 Regression Loss Functions All Machine Learners Should Know. Retrieved from https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0

NASA. (2019, June 27). Water Shortages in India. Retrieved from https://earthobservatory.nasa.gov/images/145242/water-shortages-in-india

Keras. (n.d.). Sequential. Retrieved from https://keras.io/models/sequential/

Rajkumar, S. (2019). Chennai Water Management. Retrieved from https://www.kaggle.com/sudalairajkumar/chennai-water-management/version/3

Wikipedia. (2013, July 14). Water management in Chennai. Retrieved from https://en.wikipedia.org/wiki/Water_management_in_Chennai#Primary_water_sources

Wikipedia. (2015, May 7). Poondi reservoir. Retrieved from https://en.wikipedia.org/wiki/Poondi_reservoir


The post How to create a basic MLP classifier with the Keras Sequential API appeared first on Machine Curve.

One class of algorithms that stands out relatively often is the class of so-called Multilayer Perceptrons. I often like to call them *basic neural networks*, since they have the shape that people usually come up with when they talk about neural nets. They aren’t complex, really, while they are much more powerful than single-neuron ones.

In this blog, I’ll show you how to create a basic MLP classifier with the Keras Sequential API. But before we can do that, we must do one thing. First, we shall cover a little bit of history about MLPs. I always think it’s important to place learning in a historical context, and that’s why I always include brief histories in my blogs.

And then, we’ll code it in Keras and test it with a real dataset. If you’re feeling lucky today, you might also be interested in finding the code on GitHub.

Let’s go!

The Rosenblatt perceptron triggered a fairly big controversy in the field of AI. But before I can proceed with this, we must go back to the 1940s and the 1950s first. It was the age of cybernetics. In this field, although it is possibly better described as a movement than a scientific field, people attempted to study how human beings and machines could work together to advance the world.

As with any fairly new field of science or practice, the cybernetics movement was rather hype-saturated. Although prominent figures such as Alan Turing participated in cybernetic research, dreams often went beyond what was realistic at the time (Rid, 2016). However, that can be said about many things in retrospect…

Two main streams of thought emerged in the 1950s for making the cybernetic dreams a reality (Olazaran, 1996). The first was the *neural net* stream. This stream, in which Frank Rosenblatt played a prominent role, was about automated learning in a network-like fashion: by attempting to mimic the human brain through artificial neural networks, they argued, learning could be automated.

The other stream of thought had a radically different point of view. In this stream, the symbolic one, “symbolic expressions stand for words, propositions and other conceptual entities” (Olazaran, 1996). By manipulating these propositions, possibly linking them together, knowledge about the world could be captured and manipulated – and by consequence, intelligent machines could emerge. One of the most prominent thought leaders in the field of symbolic AI was Marvin Minsky (Olazaran, 1996).

When Rosenblatt demonstrated his perceptron in the late 1950s, he made it quite clear what he thought it would be capable of in many years:

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

A summary of Rosenblatt’s remarks (The New York Times, 1958).

Minsky and other people who thought symbolic AI was the way forward got furious about these claims. With strong rhetoric, they argued that Rosenblatt only introduced a hype and did not stress upon the limitations of the Perceptron enough (Olazaran, 1996).

In fact, they essentially thought that “(…) Frank Rosenblatt’s work was a waste of time” (Olazaran, 1996). And they set out to show it … in the work *Perceptrons*, which was published in the late 1960s.

In this work, they showed that perceptrons had fundamental problems which made learning as envisioned by Rosenblatt impossible, and claimed that no further research should be undertaken in the neural net niche. The main problem was that a single-layer perceptron could not successfully represent the XOR function. Mathematically, this was possible with perceptrons that were stacked into multiple layers, but optimization of those would be way too heavy in terms of computational costs.

The consequences of this attack were large: much funding for neural net projects was withdrawn and no new funding was approved by many organizations. As a result, many people working on neural nets were transferred to other fields of study or entirely abandoned their field in favor of symbolic AI.

This is what is known as the first AI winter. The focus of AI research eventually shifted entirely towards symbolic AI.

However, when symbolic AI was *institutionalized*, as Olazaran calls it, many problems also came to light with the symbolic approach (Olazaran, 1996). That is, when much research attraction was drawn and many paths in which symbolic AI could be applied were explored, various problems were found with the symbolic approach. One of the primary ones was that the relatively fuzzy context in which humans often operate cannot be captured by machines that fully operate on the rules of logic.

The consequence? The same as for neural net research in the 1960s … enter the second AI winter.

Fortunately, the field of neural net research was not abandoned entirely. Particularly, certain scholars invented what is called the *backpropagation algorithm*. By slightly altering the way a perceptron operates, e.g. by having it use a continuous rather than a discontinuous function, much progress could be made. Particularly, researchers were since able to optimize it by using a descending-down-the-hill approach, computing the error backwards throughout the layers. They were now especially able to *train perceptrons that were stacked in multiple layers*, or **multilayer perceptrons**. Finally! One of the primary problems of the 1950s-1960s was overcome.

Minsky and folks were quick to respond with the notion that this revival did not mean that e.g. their remarks about computational costs were no longer accurate. Indeed, they were still right about this, but machine learning by means of neural nets remained here to stay. In the years since, we’ve seen many incremental improvements and a fair share of breakthroughs, of which the deep learning hype is the latest development.

Now that we know a thing or two about how the AI field has moved from single-layer perceptrons to deep learning (albeit on a high level), we can focus on the multilayer perceptron (MLP) and actually code one.

We’ll use Keras for that in this post. Keras is a very nice API for creating neural networks in Python. It runs as an abstraction layer on top of frameworks like TensorFlow, Theano and CNTK and makes creating neural networks very easy.

Under the condition that you know what you’re doing, obviously.

Because now, everyone can mix together some neural network building blocks and create a neural network. Optimizing is however a different story.

All right. Let’s first describe the dataset that we’ll use for creating our MLP.

We use the MNIST database, which stands for Modified National Institute of Standards and Technology (LeCun et al., 1998). It is one of the standard datasets that is used throughout the machine learning community, often for educational purposes.

In simple English, it’s just a database of handwritten numbers that are 28 by 28 pixels. They’ve been used in the early days of neural networks in one of the first practical applications of AI, being a digit recognizer for handwritten numbers. More information on MNIST is available here.

And this is what these numbers look like:

Okay, let’s start work on our MLP in Keras. We must first create a Python file in which we’ll work. As your first step, create a file called `model.py` and open it in a text or code editor.

Also make sure that your machine is ready to run Keras and TensorFlow. Make sure that it has Python installed as well, preferably 3.6+. You’ll need this to actually run your code.

If you wish to visualize your data, you also need Matplotlib. This is however not mandatory for your model.

Let’s now import the essential Python packages:

```
# Imports
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
```

Why we import the `keras` package should make sense by now. The same applies to the import of the `mnist` dataset. For the others, let’s quickly look into why we import them.

First, the `Sequential` model. It’s one of the two APIs that Keras supports (the other being the `Functional` API). The Sequential one is often used by beginning ML engineers. It offers less flexibility but makes creating neural networks easier. Especially for educational purposes, like this blog, the Sequential API is a very good choice.

Then, the `Dense` layer. Keras supports a wide range of layers, such as convolutional ones if one aims to build a Convolutional Neural Network. However, we don’t: our goal is to build a Multilayer Perceptron. Those aren’t built of spectacular layers; rather, it’s simply a stack of so-called densely-connected ones. That means that an arbitrary neuron is connected to all neurons in the subsequent layer. It looks as follows:

Next is the `to_categorical` util. We don’t need it immediately, but will require it later. It has to do with the structure of the MNIST dataset, specifically the number of target classes. Contrary to the single-layer perceptron that we created, which solved a binary classification problem, we’re dealing with a multiclass classification problem this time – simply because we have 10 classes, the numbers 0-9.

For those problems, we need a loss function that is called *categorical crossentropy*. In plain English, I always compare it with a purple elephant.

Suppose that the relationships in the real world (which are captured by your training data) together compose a purple elephant (a.k.a. distribution). We next train a machine learning model that attempts to be as accurate as the original data; hence attempting to classify data as that purple elephant. How well the model is capable of doing that is what is called a *loss*, and the loss function allows one to compare one distribution (elephant) with the other (hopefully the same elephant). Cross entropy allows one to compare those. We can’t use the binary variant (it only compares two elephants), but need the *categorical* one (which can compare multiple elephants). This however requires us to ‘lock’ the set of elephants first, to avoid that another one is added somehow. This is called *categorical data*: it belongs to a fixed set of categories (Chollet, 2017).

However, the MNIST targets, which are just numbers (*and numbers can take any value!*), are not categorical. With `to_categorical`, we can turn the numbers into categorical data. For example, if we have a trinary classification problem with the possible classes being \(\{ 0, 1, 2 \}\), the numbers 0, 1 or 2 are encoded into categorical vectors. One categorical vector looks as follows:

…or in plain English:

- Class 0: false.
- Class 1: true.
- Class 2: false.

*Categorical data is fixed with respect to the possible outcomes; categorical crossentropy therefore requires your data to be fixed (categorical)*.

And `to_categorical` serves this purpose.
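To make this concrete, here is a minimal NumPy sketch of what such a categorical (one-hot) encoding does – a hypothetical `one_hot` helper, not the actual Keras implementation:

```python
import numpy as np

def one_hot(targets, num_classes):
    """Encode integer class labels as one-hot (categorical) vectors,
    mirroring what Keras' to_categorical produces."""
    encoded = np.zeros((len(targets), num_classes))
    encoded[np.arange(len(targets)), targets] = 1.0
    return encoded

# The trinary example from the text: possible classes {0, 1, 2}
print(one_hot([0, 1, 2], 3))
# Class 1 becomes [0, 1, 0]: class 0 false, class 1 true, class 2 false
```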

Next, we can assign some configuration variables:

```
# Configuration options
feature_vector_length = 784
num_classes = 10
```

One MNIST sample is an image of 28 by 28 pixels. An interesting observation that I made a while ago is that MLPs don’t support multidimensional data like images natively. What you’ll have to do is *flatten* the image, in the sense that you concatenate all its rows into one massive row. Since 28 times 28 is 784, our feature vector (which for the Pima dataset SLP contained only 8 features) will contain 784 features (pixels).
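The flattening step can be sketched with NumPy, using random arrays as stand-ins for MNIST images:

```python
import numpy as np

# Two random 28x28 arrays stand in for MNIST samples
images = np.random.rand(2, 28, 28)

# Flatten each image: every 28x28 grid becomes one row of 784 features
feature_vector_length = 28 * 28
flattened = images.reshape(images.shape[0], feature_vector_length)

print(flattened.shape)  # (2, 784)
```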

The MNIST dataset distinguishes 10 classes – the digits 0 up to 9. Hence, `num_classes` is 10. Don’t confuse this with the 60,000 images in the training set: that is the number of *samples*, not the number of classes.

Finally, we can load the data:

```
# Load the data
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
# Reshape the data - MLPs do not understand such things as '2D'.
# Reshape to 28 x 28 pixels = 784 features
X_train = X_train.reshape(X_train.shape[0], feature_vector_length)
X_test = X_test.reshape(X_test.shape[0], feature_vector_length)
# Scale the pixel values to the [0, 1] range
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
# Convert target classes to categorical ones
Y_train = to_categorical(Y_train, num_classes)
Y_test = to_categorical(Y_test, num_classes)
```

We’ll use the Keras-provided `mnist.load_data()` to load the MNIST dataset relatively easily. The function returns two tuples: one with training data; the other with testing data. The `X` elements represent the feature vectors (which at that point in time are still 28×28 pixels); the `Y` elements represent the targets (at that point still being numbers, i.e. 0-9).

The next step is to `reshape` the data: we argued that the 28×28 pixels must be converted into 784 features to be suitable for MLPs. That’s what we do there – we reshape the features to `feature_vector_length` for both the training and testing features.

Next, we cast the data to `float32` and scale the pixel values from the [0, 255] range to [0, 1]. The MNIST images are already greyscale; this step simply normalizes the features, which generally helps neural networks train faster and more stably.

Finally, we’ll do what we discussed before – convert the targets into categorical format by means of the `to_categorical` function. Rather than being *scalars*, such as \(0\) or \(4\), one target *vector* will subsequently look as follows:

Obviously, the target here is 5.

Perhaps you are willing to visualize your features first in order to get a better feeling for them. You can do that by means of `matplotlib`. If you execute `imshow` on either a testing or training sample *before* you convert it into MLP-ready data, you can see the data you’ll be working with.

Code:

```
# Imports
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
# Configuration options
feature_vector_length = 784
num_classes = 10
# Load the data
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
# Visualize one sample
import matplotlib.pyplot as plt
plt.imshow(X_train[0], cmap='Greys')
plt.show()
```

Result:

All right, let’s continue … the next step is actually creating the MLP in your code:

```
# Set the input shape
input_shape = (feature_vector_length,)
print(f'Feature shape: {input_shape}')
# Create the model
model = Sequential()
model.add(Dense(350, input_shape=input_shape, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
```

Question: have you got any idea about the shape of the data that we’ll feed into the MLP once we fit the data?

\((784, )\).

We’ll feed it a one-dimensional feature vector that contains 784 features.

That’s why we assign `feature_vector_length`, converted into tuple format, to `input_shape` and use it later in the `model`.

As discussed before, the Keras Sequential API is used for creating the model. We’ll next stack three `Dense` layers – two hidden layers and an output layer:

- The first has 350 output neurons and takes the input of 784 input neurons, which are represented by an input layer specified by the `input_shape` argument. We activate using the Rectified Linear Unit (ReLU), which is one of the standard activation functions used today. Below, you’ll see how it activates.
- The second has 50 output neurons and activates by means of ReLU. You’ll by now notice that we funnel the information into an ever denser format. This way, the model will be capable of learning the most important patterns, which helps it generalize to new data.
- Finally, there’s an output layer, which has `num_classes` output neurons and activates by means of `Softmax`. The number of neurons equals the number of scalars in your output vector. Since that data must be categorical for categorical crossentropy, and thus the number of scalar values in your target vector equals the number of classes, it makes sense why `num_classes` is used. Softmax, the activation function, generates a so-called multiclass probability distribution: for each class, it computes the probability that a certain feature vector belongs to it.
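To build intuition for that last point, here is a minimal NumPy sketch of the Softmax computation – not the Keras implementation, and with a standard max-shift added for numerical stability:

```python
import numpy as np

def softmax(z):
    """Turn raw class scores into a probability distribution."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # roughly [0.659, 0.242, 0.099]
print(probs.sum())  # the probabilities sum to 1
```

Note how the largest raw score ends up with the largest probability, which is exactly what lets us read the output layer as class confidences.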

Ok, we just configured the model *architecture*… but we didn’t cover yet *how it learns*.

We can configure precisely that by means of the model’s hyperparameters:

```
# Configure the model and start training
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=10, batch_size=250, verbose=1, validation_split=0.2)
```

As discussed before, we use categorical crossentropy as our loss function (Chollet, 2017). We use the `Adam` optimizer for optimizing our model. It combines various improvements to traditional stochastic gradient descent (Kingma and Ba, 2014; Ruder, 2016). Adam is the standard optimizer used today (Chollet, 2017).

Accuracy is highly intuitive to humans so we’ll use that alongside our categorical crossentropy loss.

Next, we fit the training data to our model. We choose 10 epochs (the number of full passes over the training data before training stops), a batch size of 250, verbosity mode 1 and a validation split of 20%. The latter splits the 60,000 training samples into 48,000 used for training and 12,000 used for validation.
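A quick sanity check of that arithmetic, assuming these exact settings:

```python
# Sanity-checking the fit() settings used above
train_samples = 60000
validation_split = 0.2
batch_size = 250

validation_samples = int(train_samples * validation_split)    # 12000
effective_train_samples = train_samples - validation_samples  # 48000
batches_per_epoch = effective_train_samples // batch_size     # 192

print(validation_samples, effective_train_samples, batches_per_epoch)
```

This also explains the `48000/48000` counter you’ll see in the training log below: only the 48,000 effective training samples pass through the network each epoch.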

All right, let’s go.

Execute your code in Python, in an environment where TensorFlow and Keras are installed:

`python model.py`

It then starts training, which should be similar to this:

```
2019-07-27 20:35:33.356042: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3026 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
48000/48000 [==============================] - 54s 1ms/step - loss: 1.8697 - acc: 0.5851 - val_loss: 0.4227 - val_acc: 0.8801
Epoch 2/10
48000/48000 [==============================] - 72s 1ms/step - loss: 0.3691 - acc: 0.8939 - val_loss: 0.3069 - val_acc: 0.9122
Epoch 3/10
48000/48000 [==============================] - 73s 2ms/step - loss: 0.2737 - acc: 0.9222 - val_loss: 0.2296 - val_acc: 0.9360
Epoch 4/10
48000/48000 [==============================] - 62s 1ms/step - loss: 0.2141 - acc: 0.9385 - val_loss: 0.1864 - val_acc: 0.9477
Epoch 5/10
48000/48000 [==============================] - 61s 1ms/step - loss: 0.1785 - acc: 0.9482 - val_loss: 0.1736 - val_acc: 0.9495
Epoch 6/10
48000/48000 [==============================] - 75s 2ms/step - loss: 0.1525 - acc: 0.9549 - val_loss: 0.1554 - val_acc: 0.9577
Epoch 7/10
48000/48000 [==============================] - 79s 2ms/step - loss: 0.1304 - acc: 0.9620 - val_loss: 0.1387 - val_acc: 0.9597
Epoch 8/10
48000/48000 [==============================] - 94s 2ms/step - loss: 0.1118 - acc: 0.9677 - val_loss: 0.1290 - val_acc: 0.9622
Epoch 9/10
48000/48000 [==============================] - 55s 1ms/step - loss: 0.0988 - acc: 0.9705 - val_loss: 0.1232 - val_acc: 0.9645
Epoch 10/10
48000/48000 [==============================] - 55s 1ms/step - loss: 0.0862 - acc: 0.9743 - val_loss: 0.1169 - val_acc: 0.9676
10000/10000 [==============================] - 21s 2ms/step
Test results - Loss: 0.1073538348050788 - Accuracy: 0.9686%
```

Or, visually:

As you can see, training loss decreases rapidly. This is perfectly normal, as the model always learns most during the early stages of optimization. Accuracy climbs steeply within the first epoch and still improves during the 10th, albeit only slightly.

Validation loss is also still decreasing during the 10th epoch. This means that although the model already performs well (accuracies of 96.8%!), it can still improve further without losing its power to generalize to data it has never seen. In other words, our model is still underfit… perhaps increasing the number of `epochs` until validation loss starts to rise again might yield us an even better model.

However, this was all observed from validation data. What’s best is to test it with the actual testing data that was generated earlier:

```
# Test the model after training
test_results = model.evaluate(X_test, Y_test, verbose=1)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]}%')
```

Testing against the testing data will ensure that you’ve got a reliable metric for testing the model’s power for generalization. This is because every time, during optimization which is done based on validation data, information about the validation data leaks into the model. Since the validation data is a statistical sample which also deviates slightly from the actual population in terms of, say, mean and variance, you get into trouble when you rely on it too much.

However, for our attempt, the test results are positive:

`Test results - Loss: 0.1073538348050788 - Accuracy: 0.9686%`

Similar – almost 97%! (Keras reports accuracy as a fraction here, so 0.9686 means 96.86%.) That’s great.

All right. We were successful in creating a multilayer perceptron that classifies the MNIST dataset with an extremely high accuracy: we achieved a success rate of about 97% on 10,000 images. That’s pretty cool, isn’t it?

Yep.

But…

…we can do better.

MLPs were very popular years back (say, in the 2000s), but when it comes to image data, they have been overtaken in popularity and effectiveness by Convolutional Neural Networks (CNNs). If you wish to create an image classifier, I’d suggest looking at them, perhaps combining them with MLPs in some kind of ensemble classifier. Don’t use MLPs only.

- I trained CNNs before. In my experience, they train a lot faster on the MNIST dataset than the MLP we just built. This is rather easy to explain: the deeper you get into a CNN’s layers, the more abstract the data becomes. This speeds up the training process. Compare this to MLPs, which learn from the entire feature vector; the funneling approach may be effective, but isn’t as effective as the sparsity of CNNs. Another reason to look at CNNs!
- Another observation is that when you wish to use MLPs, image-like data must be flattened into a one-dimensional feature vector first. Otherwise, you simply cannot use them for image data. CNNs come with multidimensional convolutional layers, like the `Conv2D` and `Conv3D` ones in Keras. CNNs therefore save you preprocessing time and *computational costs* if you deal with a lot of data.
- As we noted before, when you use Softmax and – by consequence – categorical crossentropy, the number of neurons in your final layer must equal the number of target classes present in your dataset. This has to do with the fact that you’re converting your data into categorical format first, which effectively converts your target scalar into a target vector with `num_classes` scalars (of the values 0 and 1).

I hope you enjoyed this post and have learnt more about MLPs, creating them in Keras, the history of moving from perceptrons to modern algorithms and, finally, why you’d better use CNNs for image-like data. If you’ve got any remaining questions, or if you have any remarks whatsoever, please feel free to leave a comment below. I’m happy to receive your remarks so that we can together improve this post. Questions will be answered as soon as I can.

Thank you… and happy engineering!

*The code for this work is also available on **GitHub**.*

Chollet, F. (2017). *Deep Learning with Python*. New York, NY: Manning Publications.

Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. Retrieved from https://arxiv.org/abs/1412.6980

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, *86*(11), 2278-2324. doi:10.1109/5.726791

Olazaran, M. (1996). A Sociological Study of the Official History of the Perceptrons Controversy. *Social Studies of Science*, *26*(3), 611-659. doi:10.1177/030631296026003005

Rid, T. (2016). *Rise of the Machines: the lost history of cybernetics*. Scribe Publications.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. Retrieved from https://arxiv.org/abs/1609.04747

The New York Times. (1958, July 8). NEW NAVY DEVICE LEARNS BY DOING; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser. Retrieved from https://www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html

The post How to create a basic MLP classifier with the Keras Sequential API appeared first on Machine Curve.

The post Why you can’t truly create Rosenblatt’s Perceptron with Keras appeared first on Machine Curve.

It was January 1957 when a report was released by Cornell Aeronautical Laboratory. It was written by Frank Rosenblatt and titled *The Perceptron – a Perceiving and Recognizing Automaton*, which aimed to “formulate a brain analogue useful in analysis” (Rosenblatt, 1957).

In his work, he presented the perceptron – a one-neuron neural network that would eventually lie at the basis of many further developments in this field.

Since I’m currently investigating historical algorithms *and* because I use Keras on a daily basis for creating deep neural networks, I was interested in combining both – especially since I saw some blogs on the internet that had applied it too.

Rather unfortunately, I ran into trouble relatively quickly. And it all had to do with the fact that Keras to me seems unsuitable for creating the Perceptron – you can get close to it, but you cannot replicate it exactly.

Why?

That’s what I will cover in this blog. First, I’m going to take a look at the internals of a perceptron. I cover how data is propagated through it and how this finally yields a (binary) output with respect to the preferred class. Subsequently, I’ll try to replicate it in Keras … until the point that you’ll see me fail. I will then introduce the Perceptron Learning Rule that is used for optimizing the weights of the perceptron, based on one of my previous posts. Based on how deep neural networks are optimized, i.e. through Stochastic Gradient Descent (SGD) or a SGD-like optimizer, I will then show you why Keras cannot be used for single-layer perceptrons.

Finally, I will try to *get close* to replication – to see what the performance of single-neuron networks *could* be for a real-world dataset, being the Pima Indians Diabetes Database.

Let’s hope we won’t be disappointed!

Mathematically, a Rosenblatt perceptron can be defined as follows:

\begin{equation} f(x) = \begin{cases} 1, & \text{if}\ \textbf{w}\cdot\textbf{x}+b > 0 \\ 0, & \text{otherwise} \\ \end{cases} \end{equation}

However, mathematics is useless unless you understand it – which in my opinion cannot be done without building *intuition* and *visualization*. Only when you can visualize an equation, and thoroughly understand how it works, can you finally enjoy its beauty.

Therefore, let’s precisely do that. This is a generic sketch of the perceptron as it is defined above:

In the maths above, you noticed a weights vector **w** and an input vector **x** that are multiplied. Finally, a bias `b` is added. The class is one if this output is larger than zero. Otherwise, it picks the other class.

Let’s cover the first part – multiplying the vectors – first. When you do that, it’s called a *dot product*. Computing one is actually really simple: the dot product is the sum of the multiplications of the individual vector elements. Visualized, that’s `x1` multiplied by `w1`; `x2` and `w2`, et cetera – mathematically:

All these individual outputs are summated, as you can see. Subsequently, the bias value is added and the value is passed along to the ‘gateway’ (real name: unit step) function that assigns it either class 0 or class 1. The output passed to the unit step function looks as follows:

The step function:

\begin{equation} f(x) = \begin{cases} 1, & \text{if}\ \textbf{w}\cdot\textbf{x}+b > 0 \\ 0, & \text{otherwise} \\ \end{cases} \end{equation}

It is therefore one of the simplest examples of a binary classifier.
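Putting the dot product, bias and unit step together, the perceptron’s forward pass can be sketched in a few lines of NumPy – with hand-picked toy weights, since we haven’t discussed training yet:

```python
import numpy as np

def perceptron(x, w, b):
    """Rosenblatt perceptron: class 1 if w . x + b > 0, else class 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hand-picked toy weights (not trained ones), just to show the mechanics
w = np.array([0.5, -0.5])
b = 0.0

print(perceptron(np.array([2.0, 1.0]), w, b))  # 0.5*2 - 0.5*1 = 0.5 > 0, so class 1
print(perceptron(np.array([1.0, 2.0]), w, b))  # 0.5*1 - 0.5*2 = -0.5, so class 0
```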

All right, let’s see if we can code one. First ensure that you have all necessary dependencies installed, preferably in an Anaconda environment. Those dependencies are as follows:

- A clean Python installation, preferably 3.6+: https://www.python.org/downloads
- Keras: `pip install keras`
- By consequence, TensorFlow: `pip install tensorflow` (go here if you wish to install the GPU version on Windows).
  - You may also wish to run it on Theano or CNTK, which are supported by Keras, but I only tested it with TF as a backend.
- Numpy: `pip install numpy`.
- Scipy: `pip install scipy`.

Create a new folder somewhere on your machine called `simple-perceptron`:

Open the folder and create one file: `model.py`.

We’ll use the Pima Indians Diabetes Database as our dataset. It’s a CC0 (or public domain) dataset that is freely available at Kaggle. It can be described as follows:

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Source: Kaggle

All right, the first step would be to download the dataset, so let’s do that first. Download the dataset to the same folder as `model.py` and call it `pima_dataset.csv`.

Now open `model.py` in a text editor or an IDE. First add the dependencies that you’ll need:

```
# Load dependencies
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
```

Then load your data:

```
# Load data
dataset = np.loadtxt('./pima_dataset.csv', delimiter=',')
# Separate train and test data
X = dataset[:, 0:8]
Y = dataset[:, 8]
```

What you do above is less difficult than it looks. First, you use the `numpy` library to load the Pima dataset, which is delimited (i.e. the columns are separated) by commas. Indeed, when you open the CSV file, you’ll see this:

```
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
...........and so on
```

Let’s take the first row.

`6,148,72,35,0,33.6,0.627,50,1`

The numbers \(\{6, 148, …, 50\}\) represent the feature vector \(\mathbf{x_0} = \{6, 148, 72, 35, 0, 33.6, 0.627, 50\}\). This feature vector is part of your training set which is the Pima dataset – or \(\mathbf{x_0} \in X\).

There is however one value left: \(1\). This is actually the *desired outcome*, or the class to which this feature vector belongs. The total number of desired outcomes is 2, as the set is \(Y = \{ 0, 1 \}\) or, in plainer English: \(Y = \{ \text{no diabetes}, \text{diabetes} \}\). Recall why this is the case: the objective of the Pima dataset is to “to diagnostically predict whether or not a patient has diabetes”.

This also explains why you’ll do this:

```
# Separate train and test data
X = dataset[:, 0:8]
Y = dataset[:, 8]
```

In Python, what you’re writing for \(X\) is this: for the entire `dataset`, take all rows (`:`) as well as columns 0 up to 8 (excluding 8). Assign the output to `X`. By consequence, `X` – or your set of feature vectors – therefore contains the *actual features*, excluding the targets (which are in column 8).

Obviously, it’s now easy to understand what happens for the desired outcomes or target set `Y`: you’ll take the 8th column for all rows.
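A toy demonstration of that slicing, using two hard-coded rows shaped like the Pima data:

```python
import numpy as np

# Two toy rows shaped like the Pima data: 8 feature columns plus a target
dataset = np.array([
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1,  85, 66, 29, 0, 26.6, 0.351, 31, 0],
])

X = dataset[:, 0:8]  # all rows, columns 0-7: the features
Y = dataset[:, 8]    # all rows, column 8: the target class

print(X.shape)  # (2, 8)
print(Y)        # [1. 0.]
```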

Next, create the model and add your Perceptron, which is a Dense layer:

```
# Create the Perceptron
model = Sequential()
model.add(Dense())
```

I now got confused. The Keras docs wanted me to specify an *activation function* and an *initializer*.

So I started looking around for clues, and then I found this:

Based on that, gradient descent can’t be used for perceptrons but can be used for conventional neurons that uses the sigmoid activation function (since the gradient is not zero for all x).

Source: Yahia Zakaria, StackOverflow, or (Zakaria, 2016).

Today’s neural networks, which are supported by Keras, apparently use an entirely different method for optimization, I found. Whereas the Rosenblatt Perceptron updates the weights by pushing them slightly into the right direction (i.e. the Perceptron Learning Rule), today’s neural networks don’t do that. Instead, they compute the loss with a so-called loss function, which is differentiable. By computing the gradient of this loss and repeatedly taking small steps against it, the algorithms find their way to the best-performing model. We call this (Stochastic) Gradient Descent. Instead of pushing the weights into the right direction, it’s like descending a mountainous path, where your goal is to reach the valley – changing the model weights as you go.

The next question is then: the Perceptron step function outputs class 0 for all values \(\leq 0\) and 1 for the rest. Why cannot this be used as a loss function, then?

Very simple – because the derivative is always zero, except for \(x = 0\). Consider one of the classes as the output of a function, say for class = 1, and you will get:

\begin{equation} \begin{split} f(x) &= 1 \end{split} \end{equation}Since \(x^0\) is 1, we can rewrite \(f(x)\) as:

\begin{equation} \begin{split} f(x) &= 1 \cdot x^0 \\ &= 1 \cdot 1 \\ &= 1 \\ \end{split} \end{equation}And you will see that the derivative is 0:

\begin{equation} \begin{split} f'(x) &= \frac{df}{dx}(1) \\ &= \frac{df}{dx}(1 \cdot x^0) \\ &= 0 \cdot (1 \cdot x^\text{-1}) \\ &= 0 \end{split} \end{equation}What’s even worse is that the derivative is *undefined* for \(x = 0\). This is the case because a function must be differentiable. Since it ‘steps’ from 0 to 1 at \(x = 0\), the function is not differentiable at this point, rendering the derivative to be undefined. This can be visualized as follows, but obviously then for \(x = 0\):

Crap. There goes my plan of creating a Rosenblatt Perceptron with Keras. What to do?

Mathematically, it is impossible to use gradient descent with Rosenblatt’s perceptron – and by consequence, that’s true for Keras too.

But what if we found a function that *is actually differentiable* and highly resembles the step function used in the Rosenblatt perceptron?

We might then be able to pull it off, while accepting *a slight difference compared to the Rosenblatt perceptron*.

But to me, that’s okay.

The first candidate is the Sigmoid function, which can be mathematically defined as:

\begin{equation} sig(t) = \frac{1}{1 + e^{-t}} \end{equation}

And visualized as follows:

Across a small interval around \(x = 0\), the Sigmoid function transitions from 0 to 1. It’s a differentiable function and is therefore suitable for this Perceptron.

But can we find an even better one?

Yes.

It’s the *hard Sigmoid* function. It retains the properties of Sigmoid but transitions less quickly.

And fortunately, Keras supports it: `keras.activations.hard_sigmoid(x)`!
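A NumPy sketch of the function – assuming the piecewise-linear definition Keras documents, which clips \(0.2x + 0.5\) into \([0, 1]\) (treat the exact slope and offset as an assumption here, not gospel):

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear approximation of sigmoid (assumed Keras definition):
    clip 0.2 * x + 0.5 into the [0, 1] range."""
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

print(hard_sigmoid(np.array([-5.0, -2.5, 0.0, 2.5, 5.0])))
# saturates at 0 for x <= -2.5, at 1 for x >= 2.5, linear in between
```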

Note that so far, we have this:

```
# Load dependencies
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
# Load data
dataset = np.loadtxt('./pima_dataset.csv', delimiter=',')
# Separate train and test data
X = dataset[:, 0:8]
Y = dataset[:, 8]
# Create the Perceptron
model = Sequential()
model.add(Dense())
```

We can now add the activation function and the initializer. Since *zero initialization* (which is what one can do with the real Rosenblatt Perceptron) is **not a good idea with SGD** (I’ll cover this in another post), I’ll initialize them with the default Keras initializer, being `glorot_uniform` (or Xavier uniform).

Let’s add the `hard_sigmoid` activation function to the imports:

`from keras.activations import hard_sigmoid`

Also define it, together with the initializer and the input shape (remember, 8 columns so 8 features):

`model.add(Dense(1, input_shape=(8,), activation=hard_sigmoid, kernel_initializer='glorot_uniform'))`

We next compile the model, i.e. initialize it. We do this as follows:

`model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])`

Binary crossentropy is the de facto standard loss function for binary classification problems, so we use it too (Chollet, 2018). Why this is a good choice will be covered in another blog. The same goes for the Adam optimizer, which is an extension of default gradient descent that resolves some of its challenges. And we track accuracy because it is more intuitive than loss.

We’ll next fit the data to our pseudo Rosenblatt Perceptron. This essentially tells Keras to start the training process:

`model.fit(X, Y, epochs=225, batch_size=25, verbose=1, validation_split=0.2)`

Note that we’ve had to configure some options:

- The **number of epochs**, or the number of passes through the entire dataset (with subsequent optimization) before the model stops the training process.
- The **batch size**, or the number of samples used per optimization step within an epoch.
- The **verbosity mode**, set so that we’ll see what happens during execution.
- The **validation split**: I’m splitting off 20% of the dataset as validation data, which helps us monitor for overfitting.

For this last reason, we’ll have to clearly inspect the dataset first. If, say, all non-diabetes cases (class 0) came first, followed by the diabetes class (1), we’d have a problem:

The validation data is selected from the last samples in the `x` and `y` data provided, before shuffling.

Source: Keras docs

… a.k.a. our validation data would only have diabetic cases in that case, rendering it highly unreliable.
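One common precaution – sketched here with toy arrays rather than the actual Pima data – is to shuffle the features and targets with the same permutation before fitting, so that the tail of the dataset (and thus the validation split) is a random sample:

```python
import numpy as np

# Toy stand-ins for X and Y; in the real script these come from the CSV.
# Row i of X is [2i, 2i+1] and its target Y[i] is i, so we can verify
# that rows and targets stay aligned after shuffling.
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

rng = np.random.default_rng(42)
permutation = rng.permutation(len(X))

# Apply the SAME permutation to features and targets
X, Y = X[permutation], Y[permutation]

# Every feature row still matches its own label
print(all(X[i, 0] // 2 == Y[i] for i in range(10)))  # True
```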

However, inspecting the data at random ensures that the dataset seems to be distributed rather randomly with respect to target class:

```
....
2,71,70,27,0,28.0,0.586,22,0
7,103,66,32,0,39.1,0.344,31,1
7,105,0,0,0,0.0,0.305,24,0
1,103,80,11,82,19.4,0.491,22,0
1,101,50,15,36,24.2,0.526,26,0
5,88,66,21,23,24.4,0.342,30,0
8,176,90,34,300,33.7,0.467,58,1
7,150,66,42,342,34.7,0.718,42,0
1,73,50,10,0,23.0,0.248,21,0
7,187,68,39,304,37.7,0.254,41,1
0,100,88,60,110,46.8,0.962,31,0
0,146,82,0,0,40.5,1.781,44,0
0,105,64,41,142,41.5,0.173,22,0
2,84,0,0,0,0.0,0.304,21,0
8,133,72,0,0,32.9,0.270,39,1
....
```
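To make this ordering check concrete, here is a quick sketch – plain NumPy, with the target column of the fifteen rows shown above hard-coded – that verifies both classes appear in the slice Keras would carve off for validation:

```python
import numpy as np

# Last column (the target class) of the fifteen rows shown above
Y = np.array([0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1])

# validation_split=0.2 takes the *last* 20% of the data, before shuffling
split_index = int(len(Y) * 0.8)
validation_targets = Y[split_index:]

# If the data were sorted by class, this slice would contain only one class
classes_in_validation = np.unique(validation_targets)
```

For this sample, both classes show up in the tail of the dataset, so the validation split is usable.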

All right, let’s go! This is our code (it is also available on GitHub):

```
# Load dependencies
from keras.models import Sequential
from keras.layers import Dense
from keras.activations import hard_sigmoid
import numpy as np
# Load data
dataset = np.loadtxt('./pima_dataset.csv', delimiter=',')
# Separate train and test data
X = dataset[:, 0:8]
Y = dataset[:, 8]
# Create the Perceptron
model = Sequential()
model.add(Dense(1, input_shape=(8,), activation=hard_sigmoid, kernel_initializer='glorot_uniform'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the Perceptron
model.fit(X, Y, epochs=225, batch_size=25, verbose=1, validation_split=0.2)
```

Does it work?

`2019-07-24 18:44:55.155520: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3026 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)`

Yes.

Does it work well?

```
Epoch 225/225
614/614 [==============================] - 0s 111us/step - loss: 5.5915 - acc: 0.6531 - val_loss: 5.7565 - val_acc: 0.6429
```

…and with a different Glorot initialization…

```
Epoch 225/225
614/614 [==============================] - 0s 103us/step - loss: 5.2812 - acc: 0.6596 - val_loss: 6.6020 - val_acc: 0.5844
```

…yes, only slightly. On the validation dataset, the accuracy is only ~60%.

Why is this the case?

It’s the complexity of the dataset! It’s difficult to cram an 8-dimensional feature vector into only one neuron, that’s for sure. However, I’m still impressed with the results! *Think about creating shallow networks first before starting with deep ones* may be the actual lesson learnt here.

Altogether, today, you’ve seen how to use Keras to create a perceptron that mimics Rosenblatt’s one. You also saw why a true perceptron cannot be created with Keras because it learns differently. Finally, we showed that this is actually the case and saw our development fail. I hope you’ve learnt something, and please – I am happy to receive your remarks, whether positive or negative – let me know below and I’ll improve!

Happy coding!

Ariosa, R. (2018, April 27). MrRobb/keras-zoo. Retrieved from https://github.com/MrRobb/keras-zoo/blob/master/P%20(Perceptron)/readme.md

Chollet, F. (2017). *Deep Learning with Python*. New York, NY: Manning Publications.

Rosenblatt, F. (1957). *The Perceptron – a Perceiving and Recognizing Automaton*. Retrieved from UMass website: https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf

Zakaria, Y. (2016, November 23). Non-smooth and non-differentiable customized loss function tensorflow. Retrieved from https://stackoverflow.com/a/40758135

The post Why you can’t truly create Rosenblatt’s Perceptron with Keras appeared first on Machine Curve.

]]>The post Linking maths and intuition: Rosenblatt’s Perceptron in Python appeared first on Machine Curve.

]]>And notable, he is.

Rosenblatt is the inventor of the so-called Rosenblatt Perceptron, which is one of the first algorithms for supervised learning, invented in 1958 at the Cornell Aeronautical Laboratory.

The blogs I write on Machine Curve are educational in two ways. First, I use them to structure my thoughts on certain ML related topics. Second, if they help me, they could help others too. This blog is one of the best examples: it emerged from my struggle to identify why it is difficult to implement Rosenblatt’s Perceptron with modern machine learning frameworks.

Turns out that has to do with the means of optimizing one’s model – a.k.a. the Perceptron Learning Rule vs Stochastic Gradient Descent. I’m planning to dive into this question in detail in another blog. This blog describes the work I performed *before* being able to answer it – or, programming a Perceptron myself, understanding how it attempts to find the best decision boundary.

I will first introduce the Perceptron in detail by discussing some of its history as well as its mathematical foundations. Subsequently, I will move on to the Perceptron Learning Rule, demonstrating how it improves over time. This is followed by a Python based Perceptron implementation that is finally demonstrated with a real dataset.

If you run into questions during the read, or if you have any comments, please feel free to write a comment in the comment box near the bottom. I’m happy to provide my thoughts and improve this post whenever I’m wrong. I hope to hear from you!

A Perceptron is a binary classifier that was invented by Frank Rosenblatt in 1958, working on a research project for Cornell Aeronautical Laboratory that was US government funded. It was based on recent advances with respect to mimicking the human brain, in particular the MCP architecture that had just been invented by McCulloch and Pitts.

This architecture attempted to mimic the way neurons operate in the brain: given certain inputs, they fire, and their firing behavior can change over time. By allowing the same to happen in an artificial neuron, researchers at the time argued, machines could become capable of approximating human intelligence.

…well, that was a slight overestimation, I’d say. Nevertheless, the Perceptron lies at the basis of where we’ve come today. It’s therefore a very interesting topic to study deeper. Next, I will therefore scrutinize its mathematical building blocks, before moving on to implementing one in Python.

When you train a supervised machine learning model, it must somehow capture the information that you’re giving it. The Perceptron does this by means of a *weights vector* **w**, which determines the exact position of the decision boundary and is learnt from the data.

If you input new data, say in an *input vector* **x**, you’ll simply have to pinpoint this vector with respect to the learnt weights, to decide on the class.

Mathematically, this is represented as follows:

\begin{equation} f(x) = \begin{cases} 1, & \text{if}\ \textbf{w}\cdot\textbf{x}+b > 0 \\ 0, & \text{otherwise} \\ \end{cases} \end{equation}

Here, you can see why it is a binary classifier: it simply determines the data to be part of class ‘0’ or class ‘1’. This is done based on the output of the multiplication of the weights and input vectors, with a bias value added.

When you multiply two vectors, you’re computing what is called a dot product. A dot product is the sum of the multiplications of the individual scalars in the vectors, pair-wise. This means that e.g. \(w_1x_1\) is computed and summated together with \(w_2x_2\), \(w_3x_3\) and so on … until \(w_nx_n\). Mathematically:

\begin{equation} \begin{split} z &= \sum_{i=1}^{n} w_ix_i + b \\ &= w_1x_1 + … + w_nx_n + b \\ \end{split} \end{equation}

When this output value is larger than 0, it’s class 1; otherwise, it’s class 0.
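As a quick numerical illustration – the weights, input vector and bias below are made up – the pair-wise sum and NumPy’s built-in dot product yield the same \(z\):

```python
import numpy as np

# Hypothetical weights, input vector and bias for illustration
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])
b = 0.25

# Pair-wise multiplication and summation: w_1*x_1 + w_2*x_2 + w_3*x_3 + b
z_manual = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

# The same value, computed with NumPy's dot product
z_dot = np.dot(w, x) + b

# Class 1 if z > 0, class 0 otherwise
prediction = 1 if z_dot > 0 else 0
```

Here \(z = 0.5 - 2.0 + 1.0 + 0.25 = -0.25\), so the sample is assigned class 0.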

Visually, this looks as follows:

All right – we now have a mathematical structure for automatically deciding about the class. Weights vector **w** and bias value *b* are used for setting the decision boundary. We did however not yet cover how the Perceptron is updated. Let’s find out now!

Rosenblatt did not only provide the model of the perceptron, but also the method for optimizing it.

This however requires that we first move the bias value into the weights vector.

This sounds strange, but it is actually a very elegant way of making the equation simpler.

As you recall, this is how the Perceptron can be defined mathematically:

\begin{equation} f(x) = \begin{cases} 1, & \text{if}\ \textbf{w}\cdot\textbf{x}+b > 0 \\ 0, & \text{otherwise} \\ \end{cases} \end{equation}

Of which \(\textbf{w}\cdot\textbf{x}+b\) could be written as:

\begin{equation} \begin{split} z &= \sum_{i=1}^{n} w_ix_i + b \\ &= w_1x_1 + … + w_nx_n + b \\ \end{split} \end{equation}

We now add the bias to the weights vector as \(w_0\) and choose \(x_0 = 1\). This looks as follows:

This allows us to rewrite \(z\) as follows – especially recall that \(w_0 = b\) and \(x_0 = 1\):

\begin{equation} \begin{split} z &= \sum_{i=0}^{n} w_ix_i \\ &= w_0x_0 + w_1x_1 + … + w_nx_n \\ &= 1 \cdot b + w_1x_1 + … + w_nx_n \\ &= w_1x_1 + … + w_nx_n + b \end{split} \end{equation}

As you can see, it is still equal to the original way of writing it:

\begin{equation} \begin{split} z &= \sum_{i=1}^{n} w_ix_i + b \\ &= w_1x_1 + … + w_nx_n + b \\ \end{split} \end{equation}

This way, we got rid of the bias \(b\) in our main equation, which will greatly help us with what we’ll do now: updating the weights in order to optimize the model.
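A quick numerical check – with made-up weight, input and bias values – confirms that absorbing the bias into the weights vector leaves \(z\) unchanged:

```python
import numpy as np

# Hypothetical weights, input vector and bias for illustration
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])
b = 0.25

# Original formulation: z = w . x + b
z_original = np.dot(w, x) + b

# Bias trick: prepend w_0 = b to the weights and x_0 = 1 to the input
w_augmented = np.concatenate(([b], w))
x_augmented = np.concatenate(([1.0], x))
z_augmented = np.dot(w_augmented, x_augmented)
```

Both formulations produce exactly the same output value.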

We’ll use what is called the *Perceptron Learning Rule* for that purpose. But first, we need to show you how the model is actually trained – by showing the pseudocode for the entire training process.

We’ll have to make a couple assumptions at first:

- There is the weights vector **w** which, at the beginning, is uninitialized.
which, at the beginning, is uninitialized. - You have a set of training values, such as \(T = \{ (x_1, d_1), (x_2, d_2), …, (x_n, d_n) \}\). Here, \(x_n\) is a specific feature vector, while \(d_n\) is the corresponding target value.
- We ensure that \(w_0 = b\) and \(x_0 = 1\).
- We will have to configure a *learning rate* \(r\), which determines by how much the model weights improve per update. This is a number between 0 and 1. We use \(r = 0.1\) in the Python code that follows next.

This is the pseudocode:

- Initialize the weights vector **w** to zeroes or random numbers.
- For every \((x_n, d_n)\) in \(T\):
- Compute the output value \(d’_n = f(x_n)\) for the input vector \(x_n\), with \(f\) as defined above.
- Compare the output value \(d’_n\) with target value \(d_n\).
- Update the weights according to the Perceptron Learning Rule: \(w_{n,i}(t+1) = w_{n,i}(t) + r \cdot (d_n - d’_n) \cdot x_{n,i}\) for all features (scalars) \(0 \leq i \leq |w_n|\).

Or, in plain English:

- First initialize the weights randomly or to zeroes.
- Iterate over every sample in the data set.
- Compute the output value.
- Compare if it matches, and ‘push’ the weights into the right direction (i.e. the \(d_n - d’_n\) part) slightly with respect to \(x_{n,i}\), as much as the learning rate \(r\) allows.

This means that the weights are updated for every sample from the dataset.

This process may be repeated until some criterion is reached, such as a specific number of errors, or – if you are adventurous – full convergence (i.e., the number of errors is 0).
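A minimal sketch of one such weight update – the sample, target and learning rate below are made up, and the bias trick from earlier is applied by keeping a 1 in the first position of the sample – shows the rule in action:

```python
import numpy as np

r = 0.1                                # learning rate
w = np.zeros(3)                        # weights, with w[0] acting as the bias
x = np.array([1.0, 2.0, 1.5])          # augmented sample: x[0] = 1 for the bias
d = 1                                  # desired target for this sample

# Initial prediction: z = 0, so the output is class 0 -> wrong
prediction = 1 if np.dot(w, x) > 0 else 0

# Perceptron Learning Rule: w(t+1) = w(t) + r * (d - d') * x
w = w + r * (d - prediction) * x

# After the update, this sample is classified correctly
new_prediction = 1 if np.dot(w, x) > 0 else 0
```

The initially wrong prediction pushes the weights towards the target by \(r \cdot (d - d’) \cdot x\); the sample is then classified correctly.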

Now let’s see if we can code a Perceptron in Python. Create a new folder and add a file named `p.py`. In it, let’s first import numpy, which we’ll need for some number crunching:

`import numpy as np`

We’ll create a class that is named `RBPerceptron`, or Rosenblatt’s Perceptron. Classes in Python have a specific structure: they must be defined as such (by using `class`) and can contain Python definitions which must be coupled to the class through `self`. Additionally, a class may have a constructor definition, which in Python is called `__init__`.

So let’s code the class:

```
# Basic Rosenblatt Perceptron implementation
class RBPerceptron:
```

Next, we want to allow the engineer using our Perceptron to configure it before he or she starts the training process. We would like them to be able to configure two variables:

- The number of epochs, or rounds, before the model stops the training process.
- The learning rate \(r\), i.e. the determinant for the size of the weight updates.

We’ll do that as follows:

```
# Constructor
def __init__(self, number_of_epochs = 100, learning_rate = 0.1):
self.number_of_epochs = number_of_epochs
self.learning_rate = learning_rate
```

The `__init__` definition nicely has a self reference, but also two attributes: `number_of_epochs` and `learning_rate`. These are preconfigured, which means that if those values are not supplied, they serve as default ones. By default, the model therefore trains for 100 epochs with a learning rate of 0.1.

However, since the user can manually provide those, they must also be set. We need to use them globally: the number of epochs and the learning rate are important for the training process. By consequence, we cannot simply keep them in the context of our Python definition. Rather, we must add them to the instance variables of the class. This can be done by assigning them to the class through `self`.

All right, the next part – the training definition:

```
# Train perceptron
def train(self, X, D):
# Initialize weights vector with zeroes
num_features = X.shape[1]
self.w = np.zeros(num_features + 1)
# Perform the epochs
for i in range(self.number_of_epochs):
# For every combination of (X_i, D_i)
for sample, desired_outcome in zip(X, D):
# Generate prediction and compare with desired outcome
prediction = self.predict(sample)
difference = (desired_outcome - prediction)
# Compute weight update via Perceptron Learning Rule
weight_update = self.learning_rate * difference
self.w[1:] += weight_update * sample
self.w[0] += weight_update
return self
```

The definition itself must once again have a `self` reference, which is provided. However, it also requires the engineer to pass two attributes: `X`, or the set of input samples \(x_1 … x_n\), as well as `D`, which are their corresponding targets.

Within the definition, we first initialize the weights vector as discussed above. That is, we assign it with zeroes, and it is `num_features + 1` long. This way, it can both capture the features \(x_1 … x_n\) as well as the bias \(b\), which was assigned to \(x_0\).

Next, the training process. This starts by creating a `for` statement that simply ensures that the program iterates over the `number_of_epochs` that were configured by the user.

During one iteration, or epoch, every combination of \((x_i, d_i)\) is iterated over. In line with the pseudocode algorithm, a prediction is generated, the difference is computed, and the weights are updated accordingly.

After the training process has finished, the model itself is returned. This is not necessary, but is relatively convenient for later use by the ML engineer.

Finally, the model must also be capable of generating predictions, i.e. computing the dot product \(\textbf{w}\cdot\textbf{x}\) (where \(b\) is included as \(w_0\)).

We do this relatively elegantly, thanks to another example of the Perceptron algorithm provided by Sebastian Raschka: we first compute the dot product for all weights except \(w_0\), and subsequently add this one as the bias weight. Most elegant, however, is how the prediction is generated: with `np.where`. This allows an engineer to generate predictions for a batch of samples \(x_i\) at once. It looks as follows:

```
# Generate prediction
def predict(self, sample):
outcome = np.dot(sample, self.w[1:]) + self.w[0]
return np.where(outcome > 0, 1, 0)
```
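To get a feel for what `np.where` does with a batch, here is a tiny standalone sketch with made-up dot-product outcomes:

```python
import numpy as np

# Hypothetical dot-product outcomes for a batch of four samples
outcomes = np.array([-0.3, 0.7, 0.0, 2.1])

# One prediction per sample: class 1 where the outcome exceeds 0, else class 0
predictions = np.where(outcomes > 0, 1, 0)  # -> array([0, 1, 0, 1])
```

Note that an outcome of exactly 0 falls into class 0, matching the `> 0` condition in the mathematical definition above.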

All right – when integrated, this is our final code.

You can also check it out on GitHub.

```
import numpy as np
# Basic Rosenblatt Perceptron implementation
class RBPerceptron:
# Constructor
def __init__(self, number_of_epochs = 100, learning_rate = 0.1):
self.number_of_epochs = number_of_epochs
self.learning_rate = learning_rate
# Train perceptron
def train(self, X, D):
# Initialize weights vector with zeroes
num_features = X.shape[1]
self.w = np.zeros(num_features + 1)
# Perform the epochs
for i in range(self.number_of_epochs):
# For every combination of (X_i, D_i)
for sample, desired_outcome in zip(X, D):
# Generate prediction and compare with desired outcome
prediction = self.predict(sample)
difference = (desired_outcome - prediction)
# Compute weight update via Perceptron Learning Rule
weight_update = self.learning_rate * difference
self.w[1:] += weight_update * sample
self.w[0] += weight_update
return self
# Generate prediction
def predict(self, sample):
outcome = np.dot(sample, self.w[1:]) + self.w[0]
return np.where(outcome > 0, 1, 0)
```

All right, let’s now test our implementation of the Perceptron. For that, we’ll need a dataset first. Let’s generate one with Python. Go to the same folder as `p.py` and create a new file, e.g. `dataset.py`. Use this file for the next steps.

We’ll first import `numpy` and generate 50 zeros and 50 ones. We then combine them into the `targets` list, which is now 100 elements long. We’ll then use the normal distribution to generate two non-overlapping clusters of 50 samples each. Finally, we concatenate the samples into the list of input vectors `X` and set the desired targets `D` to the targets generated before.

```
# Import libraries
import numpy as np
# Generate target classes {0, 1}
zeros = np.zeros(50)
ones = zeros + 1
targets = np.concatenate((zeros, ones))
# Generate data
small = np.random.normal(5, 0.25, (50,2))
large = np.random.normal(6.5, 0.25, (50,2))
# Prepare input data
X = np.concatenate((small,large))
D = targets
```

It’s always nice to get a feeling for the data you’re working with, so let’s first visualize the dataset:

```
import matplotlib.pyplot as plt
plt.scatter(small[:,0], small[:,1], color='blue')
plt.scatter(large[:,0], large[:,1], color='red')
plt.show()
```

It should look like this:

Let’s next train our Perceptron with the entire training set `X` and the corresponding desired targets `D`.

We must first initialize our Perceptron for this purpose:

```
from p import RBPerceptron
rbp = RBPerceptron(600, 0.1)
```

Note that we use 600 epochs and set a learning rate of 0.1. Let’s now train our model:

`trained_model = rbp.train(X, D)`

The training process should be completed relatively quickly. We can now visualize the Perceptron and its decision boundary with a library called mlxtend – once again the credits for using this library go out to Sebastian Raschka.

If you don’t have it already, install it first by means of `pip install mlxtend`.

Subsequently, add this code:

```
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X, D.astype(int), clf=trained_model)
plt.title('Perceptron')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
```

You should now see the same data with the Perceptron decision boundary successfully separating the two classes:

There you go, Rosenblatt’s Perceptron in Python!

Bernard (2018, December). Align equation left. Retrieved from https://tex.stackexchange.com/questions/145657/align-equation-left

Raschka, S. (2015, March 24). Single-Layer Neural Networks and Gradient Descent. Retrieved from https://sebastianraschka.com/Articles/2015_singlelayer_neurons.html#artificial-neurons-and-the-mcculloch-pitts-model

Perceptron. (2003, January 22). Retrieved from https://en.wikipedia.org/wiki/Perceptron#Learning_algorithm


]]>The post Commoditizing AI? The state of automated machine learning. appeared first on Machine Curve.

]]>However, unbeknownst to many, are we on the verge of a radical transformation in machine learning and its subset practice of deep learning?

A transformation in the sense that we are moving towards automated machine learning – making hardcore ML jobs obsolete?

Perhaps so, as recent research reports indicate that research into automated ML tools is intensifying (Tuggener et al., 2019). It triggered me: could I lose my job as an ML engineer even *before* the field has stopped being hot?

Let’s find out. In this blog, I’ll take a brief look into so-called *AutoML* tools as well as their developments. I’ll first take a theoretical path and list the main areas of research into automating ML. I’ll then identify a few practical tools that I think are most promising today. Finally, I’ll discuss how this may, in my opinion, affect our jobs as ML engineers.

Data scientists have the sexiest job of the 21st Century, at least that’s what they wrote some years back. However, the job is really complex, especially when it comes to training machine learning models. It encompasses many things…

The first step is getting to know your data. What are its idiosyncrasies? What is important in the dataset? Which features do you think are most discriminative with respect to the machine learning problem at hand? Those are questions that must be answered by data scientists before one can even think about training a ML model.

Then, next question – which type of model must be used? Should we use Support Vector Machines with some kernel function that allows us to train SVMs for non-linear datasets? Or should we use neural networks instead? If so, what type of neural network?

Ok, suppose that we chose a certain class of neural networks, say Convolutional Neural Networks. You’ll then have to decide about the network architecture. How many layers should be used? Which activation functions must be added to these layers? What kind of regularization do I apply? How many densely-connected layers must accompany my convolutional ones? All kinds of questions that must be answered by the engineer.

Suppose that you have chosen both a *model class* and an *architecture*. You’ll then move on and select a set of hyperparameters, or configuration options. Example ones are the degree with which a model is optimized every iteration, also known as the learning rate. Similarly, you choose the optimizer, and the loss function to be used during training. And there are other ones.

All right, but how do I even start when I already feel overwhelmed right now?

Quite easy. Very likely, you do not know the answers to all these questions in advance. Often, you therefore use the experience you have to guide you towards an intuitively suitable algorithm. Subsequently, you experiment with various architectures and hyperparameters – slightly guided by what is found in the literature, perhaps, but often based on common sense.

And worry not: it’s not strange that difficult jobs are made easier. In fact, this is very common. In the 1990s and later, the World Wide Web caused a large increase in access to information. This made difficult jobs, such as collecting insights on highly specific topics, much easier and – often – obsolete. This process can now also be observed in the field of machine learning.

Will AI become a commodity? Let’s see where we stand now, both in theory and in practice.

What becomes clear from the paper written by Tuggener et al. (2019) is that much research is currently being performed into automating “various blocks of the machine learning pipeline”, i.e. from the beginning to the end. They suggest that these developments can be grouped into these distinct categories:

- Automating feature engineering.
- Meta-learning.
- Architecture search.
- Hyperparameter optimization.
- Combined Model Selection and Hyperparameter Optimization (CASH).

The first category is automated **feature engineering**. Every model harnesses feature vectors and, together with their respective targets in the case of supervised learning, attempts to identify patterns in the data set.

However, not every feature in a feature vector is, so to say, *discriminative* enough.

That is, such a feature blinds the model to relevant patterns rather than making those patterns clearer.

It’s often best to remove these features. This is often a tedious job, since an engineer must predict which ones must be removed, before retraining the models to see whether his or her prediction is right.

Various approaches towards automating this problem exist today. For example, it can be considered to be a reinforcement learning problem, where an intelligent agent learns to recognize good and bad features. Other techniques combine features before feeding them into the model, assessing their effectiveness. Another approach attempts to compute the information gain for scenarios where features are varied. Their goal is to maximize this gain. However, they all have in common that they *only focus on the feature engineering aspects*. That’s however only one aspect of the ML pipeline.
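To make the information gain idea concrete, here is a toy sketch – not taken from any of the cited works, with made-up helper functions and data – that scores two hypothetical binary features against a binary target. A feature that perfectly predicts the target yields maximal gain; a pure-noise feature yields none and is a removal candidate:

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, y):
    """Reduction in entropy of y obtained by splitting on a discrete feature."""
    conditional = sum(
        (feature == v).mean() * entropy(y[feature == v])
        for v in np.unique(feature)
    )
    return entropy(y) - conditional

# Toy data: feature_a perfectly predicts y, feature_b is pure noise
y         = np.array([0, 0, 1, 1])
feature_a = np.array([0, 0, 1, 1])
feature_b = np.array([0, 1, 0, 1])

gain_a = information_gain(feature_a, y)  # 1.0 bit: keep this feature
gain_b = information_gain(feature_b, y)  # 0.0 bits: a removal candidate
```

Automated approaches essentially search for feature subsets or transformations that maximize scores of this kind, instead of an engineer doing it by hand.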

In another approach, named **meta-learning**, the features are not altered. Rather, a meta-model is trained that has learnt from previous training processes. Such models can take as input e.g. the number and type of features as well as the algorithms and then generate a prediction with respect to what optimization is necessary.

As Tuggener et al. (2019) demonstrate, many such algorithms are under active development today. The same observation is made by Elshawi et al. (2019).

Similarly under active research scrutiny these days is what Tuggener et al. (2019) call **architecture search**. In essence, finding the best-performing model can be considered to be a search problem with the goal of finding the right model architecture. It’s therefore perhaps one of the most widely used means for automating ML these days.

Within this category, many sub approaches to searching the most optimal architecture can be observed today (Elshawi et al., 2019). At a very high level, they are as follows:

- Searching randomly. It’s a naïve approach, but apparently especially this fact benefits finding model architectures.
- Reinforcement learning, or training a dumb agent by means of “losses” and “rewards” to recognize good paths towards improvement, is an approach that is used today as well.
- By optimizing the gradient of the *search problem*, one can essentially consider finding the architecture to be a meta problem.
- Evolutionary algorithms that add genetic optimization can be used for finding well-performing architectures.
- Bayesian optimization, or selecting a path to improvement from a Gaussian distribution, is also used in certain works.
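As a toy illustration of the first approach – random search – the sketch below samples configurations from a small search space and keeps the best one. Everything here is made up: the `evaluate` function is a stand-in for "train the candidate and report its validation score" (which would be far too slow for a blog example), and it simply scores closeness to a fictitious optimum so the example is self-contained:

```python
import random

# Made-up search space for a hypothetical neural network
search_space = {
    'layers':        [1, 2, 3, 4],
    'units':         [16, 32, 64, 128],
    'learning_rate': [0.1, 0.01, 0.001],
}

def sample_architecture(rng):
    """Draw one random configuration from the search space."""
    return {name: rng.choice(options) for name, options in search_space.items()}

def evaluate(arch):
    """Stand-in for training: scores closeness to a fictitious optimum."""
    score = 1.0 - abs(arch['layers'] - 2) / 4
    score += 1.0 - abs(arch['units'] - 64) / 128
    score += 1.0 if arch['learning_rate'] == 0.01 else 0.0
    return score

rng = random.Random(42)
best_arch, best_score = None, float('-inf')
for _ in range(50):
    candidate = sample_architecture(rng)
    score = evaluate(candidate)
    if score > best_score:
        best_arch, best_score = candidate, score
```

The other approaches in the list differ mainly in *how* the next candidate is chosen – by an agent, a gradient, a genetic operator, or a probabilistic surrogate model – rather than uniformly at random.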

I refer to the original work (Elshawi et al., 2019 – see the references list below) for a more detailed review.

Suppose that you have chosen a particular model type, say a Convolutional Neural Network. As you’ve read before, you then face the choice of hyperparameter selection – or, selecting the model’s configuration elements.

It includes, as we recall, picking a suitable optimizer, setting the learning rate, et cetera.

This is essentially a large search problem to be solved.

If **hyperparameter optimization** is used for automating machine learning, it’s essentially this last part of the ML training process that is optimized.

But is it enough? Let’s introduce CASH.

If you combine the approaches discussed previously, you come to what is known as the CASH approach: combining model selection and hyperparameter optimization.

Suppose that you have a dataset for which you wish to train a machine learning model, but you haven’t decided yet about an architecture.

Solving the CASH problem would essentially mean that you find the optimum data pipeline for the dataset (Tuggener et al., 2019) – including:

- Cleaning your data.
- Feature selection and construction, where necessary.
- Model selection (SVMs? Neural networks? CNNs? RNNs? Eh, who knows?)
- Hyperparameter optimization.
- Perhaps, even ensemble learning, combining the models into a better-performing ensemble.

According to Tuggener et al. (2019) this would save a massive amount of time for data scientists. They argued that a problem which their data scientists worked hard on for weeks could be solved by automated tooling in 30 minutes. Man, that’s progress.

All these theoretical contributions are nice, but I am way more curious about how they are applied in practice.

What systems for automating machine learning are in use today?

Let’s see if we can find some and compare them.

The first system I found is called Cloud AutoML and is provided as a service by Google. It suggests that it uses Google’s *Neural Architecture Search*. This yields the insight that it therefore specifically targets neural networks and attempts to find the best architecture with respect to the dataset. It focuses on computer vision, natural language processing and tabular data.

Cloud AutoML is however rather pricy as it apparently costs $20 per hour (Seif, 2019). Fortunately, for those who have experience with Keras, there is now a library out there called AutoKeras – take a look at it here. It essentially turns the Keras based way of working into an AutoML problem: it performs an architecture search by means of Bayesian optimization and network morphism. Back to plain English now, but if you really wish to understand it deeper – take a look at (Jin et al., 2018).

I do – and will dive deeper into it ASAP. Remind me of this, please!

A post by Oleksii Kharkovyna at Medium/TowardsDataScience suggests that there are various other approaches to automated ML in use today. Check it out here.

The field of machine learning seems to be democratizing rapidly. Whereas deep knowledge on algorithms, particularly deep neural networks these days, was required in the past, that seems to be less and less the case.

Does this mean that no ML engineers are required anymore?

No. Not in my view.

However, what I’m trying to suggest here is a number of things:

- Do not stare yourself blind at becoming an expert in model optimization. It’s essentially a large search problem that is bound to be democratized and, consequently, automated away.
- Take notice of the wide array of automated machine learning tools and get experience with them. You may be asked to use them in the future. It would be nice if you already had some experience – it would set you apart from the rest.
- Become creative! These automated machine learning solutions are simply solvers of large search problems. However, translating business problems into a machine learning task still requires creativity and tactical and/or strategic awareness. This is still a bridge too far for these kinds of technologies.

Data science may still be the sexiest job of the 21st Century, but be prepared for some change. Would you agree with me? Or do you disagree entirely? I would be glad to know. Leave your comments in the comment section below; I’ll respond with my thoughts as soon as I can.

Elshawi, R., Maher, M., & Sakr, S. (2019, June). Automated Machine Learning: State-of-The-Art and Open Challenges. Retrieved from https://arxiv.org/abs/1906.02287

Kharkovyna, O. (2019, May 22). Top 10 Data Science & ML Tools for Non-Programmers – Towards Data Science. Retrieved from https://towardsdatascience.com/top-10-data-science-ml-tools-for-non-programmers-d12ce6dcccc

Jin, H., Song, Q., & Hu, X. (2018, June). Auto-Keras: An Efficient Neural Architecture Search System. Retrieved from https://arxiv.org/abs/1806.10282

Seif, G. (2019, February 23). AutoKeras: The Killer of Google’s AutoML – Towards Data Science. Retrieved from https://towardsdatascience.com/autokeras-the-killer-of-googles-automl-9e84c552a319

Tuggener, L., Amirian, M., Rombach, K., LΓΆrwald, S., Varlet, A., Westermann, C., & Stadelmann, T. (2019, July). Automated Machine Learning in Practice: State of the Art and Recent Results. Retrieved from https://arxiv.org/abs/1907.08392


]]>The post CNNs and feature extraction? The curse of data sparsity. appeared first on Machine Curve.

]]>Specifically, since utility mapping harnesses a geophysical technique called Ground Penetrating Radar, which produces image-like data, I investigated the effectiveness of Convolutional Neural Networks for this purpose. Since utility mapping is effectively a classification problem with respect to utility material type, CNNs seemed a worthwhile candidate.

Later more on my thesis work, but today I want to share a peculiar observation with you: **that I have the feeling that feature compression deteriorates model performance when you’re using CNNs.**

Since deep learning practitioners such as Chollet claim “to input data into CNNs as raw as possible”, you may wonder why this blog is written in the first place.

So let’s look backwards for a bit before we’ll try to explain the behavior I observed during my research.

Primarily, approaches harnessing machine learning for improving the utility mapping process have used Support Vector Machines for this purpose. SVMs, which were popular in the years before deep learning took off, had one big shortcoming: they could not handle high dimensionality well. That is, if you had an image, you had to substantially downsample it prior to feeding it to the model. Otherwise, it wouldn’t work.

By consequence, many feature extraction approaches were investigated for utility mapping which all aimed to reduce this *curse of dimensionality*. Examples are signal histograms (reducing dimensionality because many signal backscatters could be grouped into histogram bins) or the Discrete Cosine Transform (which essentially transforms the input data into the frequency spectrum, making it usable for signal compression, such as in the JPEG format).

…so I thought: let’s try and see if they also work with CNNs, and I trained CNNs with histograms, DCTs and raw data.

Fun fact: the first two didn’t work, with accuracies averaging 50-60%. The last one *did* work and achieved ~80% accuracy with only 2500 data points.

I think I have been able to intuitively derive the reasons for this problem based on logical reasoning, but let’s first see if we can reproduce this behavior once more.

Do we remember that fancy numbers dataset?

Indeed, it’s the MNIST dataset: “a training set of 60,000 examples, and a test set of 10,000 examples”. It contains handwritten digits, thus numbers from 0 to 9.

To give you a baseline of what a CNN can do with such a dataset, you will next see the result of training a CNN based on a default Keras example script:

```
Epoch 1/12
60000/60000 [==============================] - 24s 404us/step - loss: 0.2616 - acc: 0.9201 - val_loss: 0.0745 - val_acc: 0.9779
Epoch 2/12
60000/60000 [==============================] - 15s 250us/step - loss: 0.0888 - acc: 0.9731 - val_loss: 0.0427 - val_acc: 0.9864
Epoch 3/12
60000/60000 [==============================] - 15s 244us/step - loss: 0.0667 - acc: 0.9797 - val_loss: 0.0356 - val_acc: 0.9878
Epoch 4/12
60000/60000 [==============================] - 14s 239us/step - loss: 0.0559 - acc: 0.9835 - val_loss: 0.0308 - val_acc: 0.9901
Epoch 5/12
60000/60000 [==============================] - 14s 238us/step - loss: 0.0478 - acc: 0.9858 - val_loss: 0.0318 - val_acc: 0.9901
Epoch 6/12
60000/60000 [==============================] - 13s 212us/step - loss: 0.0434 - acc: 0.9870 - val_loss: 0.0288 - val_acc: 0.9908
Epoch 7/12
60000/60000 [==============================] - 13s 218us/step - loss: 0.0392 - acc: 0.9877 - val_loss: 0.0312 - val_acc: 0.9904
Epoch 8/12
60000/60000 [==============================] - 14s 236us/step - loss: 0.0350 - acc: 0.9891 - val_loss: 0.0277 - val_acc: 0.9909
Epoch 9/12
60000/60000 [==============================] - 14s 232us/step - loss: 0.0331 - acc: 0.9897 - val_loss: 0.0276 - val_acc: 0.9906
Epoch 10/12
60000/60000 [==============================] - 15s 243us/step - loss: 0.0318 - acc: 0.9901 - val_loss: 0.0269 - val_acc: 0.9913
Epoch 11/12
60000/60000 [==============================] - 13s 219us/step - loss: 0.0284 - acc: 0.9914 - val_loss: 0.0296 - val_acc: 0.9899
Epoch 12/12
60000/60000 [==============================] - 12s 200us/step - loss: 0.0263 - acc: 0.9918 - val_loss: 0.0315 - val_acc: 0.9903
Test loss: 0.03145747215508682
Test accuracy: 0.9903
```

That’s pretty good performance: it was right in approximately 99% of cases on the test set in only 12 epochs, or rounds of training. Could be worse… although it’s a very simple computer vision problem indeed.

In order to demonstrate what I mean with *worse performance when your data is sparser*, I’m going to convert the MNIST samples into a sparsened version. I’ll use the Discrete Cosine Transform for this, also called the DCT.

The DCT is a signal compression technique which, according to Wikipedia, “expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies”.

I’m specifically using the `scipy.fftpack` DCT, type 2, which is the de facto default DCT in the scientific community. It can be written as follows:

```
N-1
y[k] = 2* sum x[n]*cos(pi*k*(2n+1)/(2*N)), 0 <= k < N.
n=0
```

This is what the numbers subsequently look like visually:

You see that they can still be distinguished, but that the signal is more compact now (or diluted). This property, called *signal compaction*, allows one to literally downsample the DCT without losing predictive power.

Now let’s see what happens if you average the matrices across one of the axes:

We have substantially sparser feature vectors now: in fact, every number is now represented by 28 instead of 784 features.
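Sketched in code, the whole sparsening step looks as follows. Note that the random 28x28 array is only a stand-in for a single MNIST digit – with real data you would load it via `keras.datasets.mnist` instead:

```python
import numpy as np
from scipy.fftpack import dct

image = np.random.rand(28, 28)  # stand-in for a single 28x28 MNIST digit

# Type-2 DCT along the rows - scipy's default DCT type is 2
transformed = dct(image, type=2, axis=-1)
print(transformed.shape)        # (28, 28): same size, but a compacted signal

# Average across one axis: every digit drops from 784 to 28 features
features = transformed.mean(axis=0)
print(features.shape)           # (28,)
```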

Let’s redo the experiment. Note that this time, I had to change all references to 2D image data, e.g. the `Conv2D` and `MaxPooling2D` layers, into their 1D variants – since we removed one dimension from the data, the 2D variants simply don’t work anymore.
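What such a 1D variant could look like is sketched next. This is not the exact script I used – the filter counts, kernel sizes and dense layers are assumptions modeled on the default Keras MNIST example:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Input, Conv1D, MaxPooling1D, Flatten, Dense, Dropout

# Each digit is now a vector of 28 averaged DCT features instead of a 28x28 image
model = Sequential([
    Input(shape=(28, 1)),
    Conv1D(32, kernel_size=3, activation='relu'),
    Conv1D(64, kernel_size=3, activation='relu'),
    MaxPooling1D(pool_size=2),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Shape check on a dummy batch of five sparsened 'digits'
dummy = np.random.rand(5, 28, 1)
print(model.predict(dummy, verbose=0).shape)  # (5, 10)
```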

The convolution operation with learnable filters itself, however, remains similar. This is the result:

```
Epoch 1/12
60000/60000 [==============================] - 23s 380us/step - loss: 2.5680 - acc: 0.1103 - val_loss: 2.3011 - val_acc: 0.1135
Epoch 2/12
60000/60000 [==============================] - 11s 183us/step - loss: 2.3026 - acc: 0.1123 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 3/12
60000/60000 [==============================] - 12s 196us/step - loss: 2.3021 - acc: 0.1126 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 4/12
60000/60000 [==============================] - 11s 190us/step - loss: 2.3015 - acc: 0.1123 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 5/12
60000/60000 [==============================] - 10s 174us/step - loss: 2.3016 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 6/12
60000/60000 [==============================] - 11s 186us/step - loss: 2.3014 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 7/12
60000/60000 [==============================] - 11s 185us/step - loss: 2.3013 - acc: 0.1123 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 8/12
60000/60000 [==============================] - 11s 192us/step - loss: 2.3013 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 9/12
60000/60000 [==============================] - 11s 184us/step - loss: 2.3013 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 10/12
60000/60000 [==============================] - 10s 163us/step - loss: 2.3015 - acc: 0.1125 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 11/12
60000/60000 [==============================] - 10s 166us/step - loss: 2.3013 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135
Epoch 12/12
60000/60000 [==============================] - 11s 191us/step - loss: 2.3014 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135
Test loss: 2.3010036102294924
Test accuracy: 0.1135
```

Absolutely terrible performance. Unworthy of CNNs!

And this is indeed what I also experienced during my research.

In my research, I drew this conclusion with respect to the loss of performance when using the DCT:

*I think you blind the convolutional filters to the idiosyncrasies of the data.*

Or, in layman’s terms, you make the CNN blind to the unique aspects represented by the numbers… despite the fact that they are already *in there*.

**Why is this the case?**

In my opinion, this can be explained by looking at the internals of a convolutional layer. It works as follows. You specify a number of filters which, during training, learn to recognize unique aspects of the image-like data. They can then be used to classify new samples – quite accurately, as we have seen with raw MNIST data. This means that the convolutional layer *already makes your data representation sparser*. What’s more, this effect gets even stronger when layers like Max Pooling are applied – which is precisely what I did above.

But when you downsample the data first by e.g. applying the DCT, *you thus effectively apply sparsening twice.* My only conclusion can thus be that by consequence, the convolutional filters can no longer learn the unique aspects within the image-like data, as they are hidden in the data set made compact. Only then did I truly understand why people always suggest inputting image data into CNNs as untransformed as possible.

**Then why did this work with SVMs?**

Previous scientific works on supporting utility mapping with machine learning achieved promising results when applying dimensionality reduction techniques like the DCT before training their models, such as SVMs.

Yet, it didn’t work with CNNs.

Besides the architectural differences between them, one must also conclude that *CNNs essentially make data sparser while SVMs do not*. Consequently, for the latter you actually needed to apply those compression techniques for them to work in the first place, while for the former they make the models perform worse.

An interesting insight – and a reminder to always set an average- to well-performing baseline first before you start training variations.

Did you run into this problem too? I’m eager to know. Please feel free to leave a comment – I’m happy to respond. Thanks for reading!

The post CNNs and feature extraction? The curse of data sparsity. appeared first on Machine Curve.

]]>The post Can neural networks approximate mathematical functions? appeared first on Machine Curve.

]]>…and it made the authors wonder about what neural networks can achieve, since pretty much anything can be translated into models and by consequence mathematical formulae.

When reading the paper, I felt like experimenting a little with this property of neural networks, and to try and find out whether, with sufficient data, functions such as \(x^2\), \(sin(x)\) and \(1/x\) can be approximated.

Let’s see if we can!

For the experiment, I used the following code for approximating \(x^2\):

```
# Imports
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
# Load training data
x = -50 + np.random.random((25000,1))*100
y = x**2
# Define model
model = Sequential()
model.add(Dense(40, input_dim=1, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x, y, epochs=15, batch_size=50)
predictions = model.predict(np.array([[10], [5], [200], [13]]))
print(predictions) # Approximately 100, 25, 40000, 169
```

Let’s take the code above apart first, before we move on to the results.

First, I’m importing the Python packages that I need for successfully running the experiment. The first is `numpy`, the numerical processing package that is the de facto standard in data science today.

Second, I’m using `keras`, which is a deep learning framework for Python and runs on TensorFlow, Theano and CNTK. It simply abstracts much of the pain away and allows one to create a deep learning model in only a few lines of code.

And it runs on GPU, which is very nice.

Specifically, for Keras, I’m importing the `Sequential` model type and the `Dense` layer type. The Sequential model type requires the engineer to ‘stack’ the individual layers on top of each other (as you will see next), while the Dense or densely-connected layer means that each individual neuron is connected to all neurons in the following layer.

Next, I load the training data. Rather simply, I’m generating 25,000 numbers in the range [-50, 50). Subsequently, I’m also generating the targets for the individual numbers by applying `x**2`, or \(x^2\).

Then, I define the model – it’s a Sequential one with three hidden layers: all of them are Dense with 40, 20 and 10 neurons, respectively. The input layer simply has one neuron (every `x` is just a number) and the output layer has only one as well (since we regress to `y`, which is also just a number). Note that all layers use `ReLU` as an activation function except for the last one, which uses the linear activation that is standard for regression.

Mean squared error is used as a loss function, as well as Adam for optimization, all pretty much standard options for deep neural networks today.

Next, we fit the data in 15 epochs and generate predictions for 4 values. Let’s see what it outputs under ‘The results’.

I used the same code for \(sin(x)\) and \(1/x\); however, I did change the assignment of \(y\) as follows, together with the expected values for the predictions:

- **sin(x):** `y = np.sin(x)`; expected values approximately -0.544, -0.959, -0.873 and 0.420.
- **1/x:** `y = 1/x`; expected values approximately 0.10, 0.20, 0.005 and 0.077.
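Concretely, only the target assignment changes between the three experiments – a minimal sketch:

```python
import numpy as np

x = -50 + np.random.random((25000, 1)) * 100  # inputs in [-50, 50), as before

y_square = x ** 2    # targets for the x^2 experiment
y_sine = np.sin(x)   # targets for the sin(x) experiment
y_inverse = 1 / x    # targets for the 1/x experiment

print(y_square.shape, y_sine.shape, y_inverse.shape)  # each (25000, 1)
```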

For \(x^2\), these were the expected results: `100, 25, 40000, 169`.

Those are the actual results:

```
[[ 101.38112 ]
[ 25.741158]
[11169.604 ]
[ 167.91489 ]]
```

Pretty close for most ones. Only for `40000` did the model generate a wholly wrong prediction. That’s not strange, though: the training data was generated in the interval [-50, 50]; apparently, 100, 25 and 169 are close enough to be properly regressed, while 40000 is not. That makes intuitive sense.

Let’s now generate predictions for all the `x`s when the model finishes and plot the results:

```
import matplotlib.pyplot as plt
plt.subplot(2, 1, 1)
plt.scatter(x, y, s = 1)
plt.title('y = $x^2$')
plt.ylabel('Real y')
plt.subplot(2, 1, 2)
plt.scatter(x, predictions, s = 1)
plt.xlabel('x')
plt.ylabel('Approximated y')
plt.show()
```

When you plot the functions, you get pretty decent results for \(x^2\):

For \(sin(x)\), results are worse:

What you see is that it approximates the sine function quite appropriately for a *very small domain*, e.g. [-5, +3], but then loses track. We might improve the estimation by feeding it *more* samples, so we increase the number of random samples to 100,000, still in the interval [-50, 50]:

That’s already much better, but still insufficient. Perhaps, the cause is different – e.g. we may achieve better results if we used something like sin(x) as an activation function. However, that’s something for a next blog.

And finally, this is what \(1/x\) looks like:

That one’s getting closer again, but you can see that it is not yet *highly accurate.*

The experiment was quite interesting, actually.

First, I noticed that you need more training data than I expected. For example, with only 1000 samples in my training set, the approximation gets substantially worse:

Second, not all the functions could be approximated properly. Particularly, the sine function was difficult to approximate.

Third, I did not account for overfitting whatsoever. I just let the models run, possibly introducing severe overfitting to the function at hand. But – to some extent – that was precisely what we wanted.

Fourth, perhaps as a result of (3), the models seem to perform quite well *around* the domain of the training data (i.e. the [-50, +50] interval), but generalization remains difficult. On the other hand, that could be expected; the input behind the expected `40000` for the first \(x^2\) experiment was anything but within \(-50 < x < 50\).

Altogether, this was a nice experiment for during the evening, showing that you can use neural networks for approximating mathematical functions – if you take into account that it’s slightly more complex than you imagine at first, it can be done.

The post Can neural networks approximate mathematical functions? appeared first on Machine Curve.

]]>The post This Person Does Not Exist – how does it work? appeared first on Machine Curve.

]]>In this tech blog, we dive into the deep to find out how this is possible. You will see that we will be covering a machine learning technique known as a GAN – a generative adversarial network. We’ll look into the relatively short history of this way of thinking about machine learning. In doing so, we take a short side step towards game theory. Finally, we will look at the specific case of *This Person Does Not Exist* and the building blocks that together compose the machine learning aspects of the website.

Sounds difficult? Not too difficult, if you take some time to read this blog. And don’t worry, I will do my best to discuss GANs in layman’s terms. Please let me know in the comments whether I succeeded in that – I can only learn from your responses.

Imagine that you play a game in which only one reward can be shared among all participants. Playing chess and playing tennis are perfect examples of such a game: one person wins, which means that the other one loses. Or, in the case of chess, it can be a tie. If you would note the scores for all players in those situations and subtract them from one another, you would get the following:

- **1-0:** Player 1 (+1 win), player 2 (-1 win) = together 0 wins;
- **0-1:** Player 1 (-1 win), player 2 (+1 win) = together 0 wins;
- **Tie:** Player 1 (0 wins), player 2 (0 wins) = together 0 wins.

In all cases, this type of game yields a *sum of zero* with respect to the distribution of scores. It therefore won’t surprise you that such a game is called a zero-sum game. It’s one of the most important concepts from a mathematical field known as game theory, because besides games like chess it can also be applied to more complex systems. Unfortunately, war, to give just one example, is often also a zero-sum game.

All right, let’s continue with the core topic of this blog: the website This Person Does Not Exist. We’ve seen what a zero-sum game is, but now we will have to apply it in the area of machine learning. The website was made by using a technique known as a *generative adversarial network*, also known as GAN. We’ll have to break that term into its distinct parts if we would like to know what it means:

- **Generative:** it makes something;
- **Adversarial:** two parties battle each other in some kind of game;
- **Network:** two neural networks, in this case.

In short: a GAN is composed of two neural networks which, by playing against each other and trying to let each other lose, make something.

And ‘something’ is pretty varied these days. After modern applications of GANs emerged in 2014, networks have been developed which can produce pictures of interiors, shoes, bags and clothing. But related networks are now also capable of *predicting video ahead of time*: upload ten seconds of video, and the model predicts the next two. Another one: in 2017, a work was published discussing the development of GANs that can make the faces in pictures look older. Its application can be extended to missing children, who have possibly grown older but whose case was never resolved.

GANs are thus a new technique in the arsenal of a machine learning engineer which spawns a wide range of new applications. Not only *predictive power*, like with other models, but also some *creative power!*

But then, how does a GAN work exactly?

Schematically, you see its inner working next.

It all starts with what we call a *noise vector*. A vector is an abstract representation of what you can also consider to be some sort of list. In machine learning, data is converted into numbers in nine out of ten cases. A noise vector could therefore also be seen as a list of random numbers. The vector, or the list, is input to the first neural network, which is called the *generator network.* This generator is capable of converting a large amount of noise into a larger and more accurate picture, layer after layer. But it’s a *fake* one, though!

The fake picture is fed into the second neural network, which is also known as the *discriminator*. This network, which has been trained with real pictures, is capable of doing the opposite: breaking the image down into individual components to determine the category to which the picture belongs. In the case of a GAN, the categories are *fake* and *real*. In a way, you can thus see the generator as the criminal and the discriminator as the cop, who has to catch the criminal.

How well catching the criminal works is what we learn when we finish one epoch – a round of training. For every sample from the validation set, for which a target (fake or real) is available, it is determined how much the predicted value differs from the real one. We call this the *loss*.

Just as with any other neural network, this loss value can be used to optimize the model. Optimizing the model is too complex for this blog, but with a very elegant mathematical technique one simply calculates the shortest path from the mountain top (the worst loss value) towards the valley (the best loss value). Based on one training epoch, the model is capable of adapting both the *generator* and the *discriminator* for improvement, after which a new epoch can start.

Perhaps, you can imagine that whatever the generator produces is dependent on the discriminator. With every machine learning model, the goal is to maximize the gain; which also means minimizing the loss. When the discriminator becomes better and better at predicting whether an image is real or fake (consequently driving the generator’s loss up), the generator must improve time after time to get away with its attempt to fool the cop (making the loss lower). The discriminator, however, gets better and better at recognizing *real pictures*, which we fed to this neural network. Consequently, if the generator wants to keep up with the discriminator, it must make itself better and better at generating images that look like the real ones the discriminator was trained on.
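One adversarial round of this cop-and-criminal game can be sketched in a few lines of Keras. The dense layers, noise dimension and 28x28 output here are illustrative assumptions only – they have nothing to do with the StyleGAN architecture behind the actual website:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Input, Dense, Flatten, Reshape

noise_dim, batch = 32, 16

# Generator ('the criminal'): noise vector -> fake 28x28 picture
generator = Sequential([
    Input(shape=(noise_dim,)),
    Dense(128, activation='relu'),
    Dense(28 * 28, activation='sigmoid'),
    Reshape((28, 28)),
])

# Discriminator ('the cop'): picture -> probability that it is real
discriminator = Sequential([
    Input(shape=(28, 28)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid'),
])
discriminator.compile(loss='binary_crossentropy', optimizer='adam')

# Step 1: train the cop on a batch of real and fake pictures
real = np.random.rand(batch, 28, 28)              # stand-in for real data
noise = np.random.normal(size=(batch, noise_dim))
fake = generator.predict(noise, verbose=0)
discriminator.train_on_batch(real, np.ones((batch, 1)))
discriminator.train_on_batch(fake, np.zeros((batch, 1)))

# Step 2: freeze the cop and train the criminal through the combined model,
# rewarding it when the frozen cop labels its fakes as 'real'
discriminator.trainable = False
gan = Sequential([generator, discriminator])
gan.compile(loss='binary_crossentropy', optimizer='adam')
g_loss = gan.train_on_batch(noise, np.ones((batch, 1)))
```

In a real training run you would repeat both steps for every batch, with actual pictures instead of random arrays as `real`.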

And a recent consequence of those developments within GANs is the set of pictures on ThisPersonDoesNotExist. It also explains why we’re speaking about an *adversarial network*, in which two neural networks play a zero-sum game against each other… what one wins in terms of loss is what the other loses.

Yet, the story does not end there. Generative adversarial networks work in some kind of cop-and-criminal relationship in order to produce very interesting results. But *This Person Does Not Exist* had a different goal: showing that very accurate but also very large (1024 x 1024 pixels and larger) pictures can be generated at some speed.

That’s exactly what the bottleneck of GANs was at the time. Early GANs worked quite well, but were not too accurate (resulting in vague pictures) or could only make smaller images. In 2018, NVIDIA’s AI research team proposed a solution: the ProGAN network, which composes the generator in a very specific way. It is different in the sense that it builds the picture layer after layer, where the layers get bigger and more accurate. For example, the first layer is 4 by 4 pixels, the second 8 by 8, and so on. The interesting part of this way of working is that every new layer can benefit from the less granular results of the previous ones. In fact, it does not have to find out everything on its own. As we all know, *extending something that already exists* is much easier than starting out of the blue. ProGAN was thus a small breakthrough in the field of generative adversarial networks.

But that still doesn’t end the story. The GAN that is built into This Person Does Not Exist is named StyleGAN, and is an upgrade of ProGAN. NVIDIA’s AI team added various new elements, which allows practitioners to control more aspects of the network. For example, they can better separate the generator and the discriminator, which ensures less dependence of the generator on the training set. This allows one to, for example, reduce discrimination in the generated pictures. Nevertheless, separating those remains a challenge, which spawns a wide array of research opportunities for generative adversarial networks for the coming years!

All in all, we saw that GANs allow the introduction of creativity into machine learning. That’s simply a totally new approach to machine learning. I am very curious about the new application areas that we will see over the next period. I’ll keep you up to date…

The post This Person Does Not Exist – how does it work? appeared first on Machine Curve.

]]>The post Why you shouldn’t use a linear activation function appeared first on Machine Curve.

]]>While there exist other activation functions such as Swish, it has been hard over the years for them to catch up with both the *improvements in predictive power* required and the *generalization over training sets*. Whereas the high performance of ReLU, for example, generalizes well over various machine learning problems, this hasn’t been the case with many other activation functions.

And there’s another question people are asking a lot: **why can’t I use a linear activation function when I’m training a deep neural network?** We’ll take a look at this question in this blog, specifically inspecting the optimization process of deep neural networks. The answer is relatively simple **– using a linear activation function means that your model will behave as if it is linear**. And that means that it can no longer handle the complex, non-linear data for which those deep neural nets have boosted performance these last couple of years.

When you’re building a deep neural network, there are three terms that you’ll often hear:

- A gradient;
- Backpropagation, and finally…
- Gradient descent, often the stochastic version (SGD) – or SGD-like optimizers.

Let’s take a look at the training process of a neural network, so that we’ll understand the necessity of those three before we move on to studying the behavior of linear activation functions.

As you know, training a deep neural network goes iteratively, in epochs. This means that small batches of training data are input into the network, after which the error is computed and the model is optimized. Once all the training data has been input, an epoch has passed and the same process starts again – for the second, third, fourth, and so on, epochs.

Suppose that we’re at epoch 0 (or 1, if you like). The weights of the model have been initialized randomly, or pseudo-randomly. You input your first batch of training data into the model. Obviously, it will perform very poorly, and the loss – the difference between the actual targets and the predictions for this training data – will be huge. It needs to be improved if we want to use it in real life.

One way of doing so is by using *gradients* and *backpropagation*, the latter of which stands for “backwards propagation of errors”. While the data has been propagated forwards, the error can be computed backwards. This is done as follows:

- We know which loss function we used and how it is instantiated. For this function, we can compute its derivative. That is, we can compute its *gradient*, i.e. how much it changes at some particular location. If we do that for our current spot on the loss curve, we can estimate where to move to in order to improve that particular weight.
- Backpropagation allows us to descend the gradient with respect to *all the weights*. By chaining the gradients found, it can compute the gradient for any weight – and consequently, can compute improvements with respect to the errors backwards towards the most upstream layer in the network.
- The *optimizer*, i.e. SGD or an SGD-like optimizer such as Adam, is subsequently capable of altering the weights slightly in an attempt to improve overall network performance.
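The three steps above can be sketched for a single weight and a toy quadratic loss. This is deliberately minimal – no network, no backpropagation chain – just the gradient-and-optimizer loop:

```python
# Gradient descent on a toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# A real network chains such gradients backwards through every layer instead.
w = -10.0               # (pseudo)randomly initialized weight
learning_rate = 0.1

for epoch in range(50):
    gradient = 2 * (w - 3)            # slope of the loss at the current spot
    w = w - learning_rate * gradient  # optimizer step: move downhill

print(round(w, 4))  # close to 3.0, the minimum of the loss
```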

And this often causes a really fast drop in loss at first, while it gets stable over time:

As you know, the dot product between the weight vector and the input (or transformed input) vector computed by the neuron is linear. It flows through an activation function which, generally, makes it non-linear. But neural networks don’t care what kind of function you choose for activating neuron output.

You can thus choose to use f(x) = x, i.e. the linear function, as your activation function.

But this is often a really bad idea.

And it all has to do with the gradient of this linear activation function:

Yep, it’s 1.

What is \(f'(x)\) when \(f(x) = x\)?

\(f'(x) = 1 \cdot x^0 = 1 \cdot 1 = 1\).

**You will thus find the same gradient for any neuron output when you use the linear activation function, namely 1.**

And this impacts neural network training in two fundamental ways:

- You cannot apply backpropagation to find how your neural weights should change based on the errors found. This observation emerges from the simple notion that gradients are no longer dependent on the input values (and by consequence, the errors) – they’re always the same. There’s thus simply no point in attempting to find where to improve your model.
- Your model becomes a linear model, because all layers chained together can be considered to be a *linear combination* of individual linear layers. You’ll thus at best get some good performance on *linear data*. Forget good performance for non-linear data.
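The second point is easy to verify numerically: two stacked linear layers are indistinguishable from a single layer whose weight matrix is the product of the two. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two stacked 'layers' with linear activation f(x) = x: output = W2 @ (W1 @ x)
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((3, 5))
x = rng.standard_normal(4)

two_layers = W2 @ (W1 @ x)

# ...which is exactly one linear layer with combined weight matrix W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```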

And that’s why you shouldn’t use linear activation functions.

The post Why you shouldn’t use a linear activation function appeared first on Machine Curve.

]]>The post Could chaotic neurons reduce machine learning data hunger? appeared first on Machine Curve.

]]>Today’s deep learning models are very data hungry. It’s one of the fundamental challenges of deep artificial neural networks. They don’t learn like humans do. When we learn, we create rules of logic based on first time observations which we can use in the future. Deep neural networks cannot do this. By consequence, they need large amounts of data to learn superficial representations of their target classes.

And this is a problem in scenarios where you have very little data, or a very skewed distribution over the classes. Can we do something about this?

In their work, the authors recognize that deep learning has so far been really promising in many areas. They however argue that although neural networks are loosely inspired by the human brain, they do not include its chaotic properties. That is, they remain relatively predictable over time – for any input, we know the output in advance. Human brains, according to the authors, also contain chaotic neurons, whose predictability reduces substantially after some time… and whose behavior *appears* to become random (but, since the neurons are chaotic, it is not).

The main question the authors investigate in their work is as follows: **what if we create a neuronal architecture based on chaotic neurons?** Does it impact the success rate of learning with very small datasets, and perhaps positively? Let’s find out.

Let’s see if we can intuitively – that is, with a minimum amount of mathematics and merely stimulating one’s sense of intuition – find out how it works.

Suppose that **X** is the *m x n* matrix representing the inputs of our training set. Every row then represents a feature vector. Suppose that our matrix has 4 columns, thus n = 4. Our feature vector can then be represented as follows:

By design, the network proposed by the authors must have four input neurons, one per feature.

The authors call each of those neurons a Chaotic Generalized Luroth Series (GLS) neuron, which takes a real input in [0, 1) and maps it to a real output value in [0, 1) as follows.

\begin{equation} T(x) = \begin{cases} \frac{x}{b}, & \text{if}\ 0 \leq x < b \\ \frac{(1-x)}{(1-b)}, & \text{if}\ b \leq x < 1 \\ \end{cases} \end{equation}

For the [0, 1] domain, it visually looks as follows:

Since this function is *topologically transitive*, chaotic behavior is introduced in model behavior. I do not have the background to fully grasp this behavior – but it is one of the essential characteristics of chaos, at least in mathematical terms. So for this work, let’s just assume that it is, so we can focus on its implications for machine learning.
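To get a feel for the map, here is \(T(x)\) in plain Python. Note that b = 0.47 is just an example value I picked for illustration, not one taken from the paper:

```python
def gls_map(x, b=0.47):
    """Skew tent map T(x) of a GLS neuron; b = 0.47 is just an example value."""
    return x / b if x < b else (1 - x) / (1 - b)

# Sensitivity to initial conditions: two trajectories that start 1e-5 apart
a, c = 0.30, 0.30001
for _ in range(20):
    a, c = gls_map(a), gls_map(c)
print(abs(a - c))  # typically far larger than the initial 1e-5 gap
```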

Neurons generally fire immediately, which emerges from their deterministic nature. That is, they are often continuous functions which take an input which is then mapped to another space, possibly in the same dimension. For example, `f(x) = x` is such a function. Mathematically, there is no delay between input and output.

The chaotic neurons proposed by the authors behave differently.

They do not produce their output immediately. Rather, their chaotic nature ensures that they fire for some time, oscillating around certain values, before they grind to a halt. This is visualized below: the neuron oscillates until its value approximates the input, then returns the number of milliseconds until that moment as its output.

The formulae and the precise pseudo-code algorithm can be found in the paper.
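Based on that pseudo-code, a single chaotic neuron can be sketched as follows; the `error` threshold of 0.1 follows the paper, while the iteration budget `max_iter` is my own safeguard against non-converging trajectories:

```python
def gls_firing_time(x, b, q, error=0.1, max_iter=10000):
    """Return the 'firing time' of a GLS neuron for input x.

    The neuron's state starts at the membrane potential q and is
    iterated under the GLS map until it comes within `error` of x;
    the number of iterations taken is the neuron's output.
    """
    value = q
    for n in range(max_iter):
        if abs(value - x) < error:
            return n
        value = value / b if value < b else (1.0 - value) / (1.0 - b)
    return max_iter  # trajectory never approximated x within the budget
```

If the input already lies within `error` of `q`, the neuron fires with time 0.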

Training the network goes differently than we’re used to. There is no backpropagation and there is no gradient descent. Rather, it looks somewhat like how Support Vector Machines attempt to build a weight vector. The authors propose to train the network as follows:

- Normalize the input data to the domain [0, 1).
- For every cell in the input data, compute the value of the corresponding neuron.
- Once this is completed, you have another matrix of the same shape, but filled with *firing times*. Split this matrix into multiple ones, grouped by class.
- Compute a so-called *representation vector* for each class matrix: the mean vector over all firing-time vectors in that class.

This representation vector represents the ‘average’ input vector for this class. It can be used to classify new inputs. Let’s see how this works.
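The training steps above can be sketched in a few lines; the GLS neuron is repeated here so the snippet is self-contained, and the function names are my own:

```python
import numpy as np

def gls_firing_time(x, b, q, error=0.1, max_iter=10000):
    # Iterate the GLS map from q until it comes within `error` of x.
    value = q
    for n in range(max_iter):
        if abs(value - x) < error:
            return n
        value = value / b if value < b else (1.0 - value) / (1.0 - b)
    return max_iter

def train_representations(X, y, b, q):
    """Mean firing-time ('representation') vector per class."""
    F = np.array([[gls_firing_time(x, b, q) for x in row] for row in X])
    return {c: F[y == c].mean(axis=0) for c in np.unique(y)}
```

Note that, unlike backpropagation, this 'training' is a single pass over the data.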

According to the authors, one would classify new inputs as follows:

- Normalize the input data to the domain [0, 1).
- For every cell in the input vector, compute the output of the respective neuron.
- For the resulting vector of neuron outputs, compute the cosine similarity with each class's representation vector.
- Take the `argmax` over those similarities to find the class you're hopefully looking for.
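A sketch of this classification step, again with the neuron inlined so the snippet is self-contained; the small epsilon guarding the division is my addition:

```python
import numpy as np

def gls_firing_time(x, b, q, error=0.1, max_iter=10000):
    value = q
    for n in range(max_iter):
        if abs(value - x) < error:
            return n
        value = value / b if value < b else (1.0 - value) / (1.0 - b)
    return max_iter

def classify(sample, representations, b, q):
    """Predict the class whose representation vector is most
    cosine-similar to the sample's firing-time vector."""
    f = np.array([gls_firing_time(x, b, q) for x in sample], dtype=float)
    classes = sorted(representations)
    sims = [(f @ representations[c])
            / (np.linalg.norm(f) * np.linalg.norm(representations[c]) + 1e-12)
            for c in classes]
    return classes[int(np.argmax(sims))]
```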

In their work, the authors suggest that they achieve substantial classification performance on *really small subsamples* of the well-known MNIST and Iris datasets. Those are really standard datasets when you're interested in playing around with machine learning models.

And with substantial performance, I really mean substantial: **they claim that combining chaotic behavior with neurons allows one to get high performance with really small data sets**. For example, they achieved 70%+ accuracies on the MNIST dataset with > 5 samples, and accuracies of ≈ 80% with ≈ 20 samples. Note: the authors *do suggest that when the number of samples increases*, regular deep learning models will eventually perform better. But hey, let's see what we find for this type of model in small data scenarios.

Rather unfortunately, the authors did not provide code which means that I had to implement the feature extractor, training algorithm and testing algorithm myself. Fortunately, however, the authors provided pseudo-code for this, which was really beneficial. Let’s take a look at what happened.

According to the paper, there are two parameters that must be configured: `b` and `q`. `b` is used to compute the chaotic map and determines the tipping point of the function (see the visualization above, where b was approximately 0.46). `q`, on the other hand, is the starting point for the neuron's chaotic behavior, and represents the neural membrane potential. In my architecture it's the same for all neurons, since that is what the authors used, but an extension to their work may be a customized `q` for each neuron. The `error` rate was 0.1, in line with the paper.

All right, after implementing the architecture, I could begin with testing. I tested model performance on the MNIST dataset.

The MNIST dataset is a relatively straightforward dataset which contains handwritten numbers. It's a great dataset if one intends to learn building machine learning models for image classification and it's therefore one of the standard data sets. It looks as follows:

First, I created a fancy little test protocol in order to attempt to show that it can both predict and generalize. It is as follows:

- I used the `mnist` dataset available by default in Keras. From the `x_train` sample, I always drew random samples for training, with replacement.
- I trained multiple times with varying numbers of training samples per class, but always with an equal number of samples per class. I trained the model with 1, 2, … 21 samples per class, to see how its performance differs.
- I randomly drew 500 samples per class from the `x_train` sample for testing. It may be the case that some of those overlap with the actual training data. This is obviously considered to be poor practice, and yes, shame on me. But it was relatively easy to make it work this way. What's more, in the ultimate worst case, only 4.2% of the test samples would overlap. Since we're drawing 500 samples from about 5-7k per class, and this 4.2% only occurs in the *worst case* scenario when training with 21 samples, if all 21 overlap (21/500 ≈ 4.2%), I think this won't be too problematic.
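For completeness, a sketch of how such a per-class sampler might look; the seed and function name are my own, and in the actual experiment this would be applied to Keras' `x_train`/`y_train`:

```python
import numpy as np

def sample_per_class(x, y, n_per_class, seed=42):
    """Draw n_per_class random samples per class, with replacement."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for c in np.unique(y):
        # Indices of all samples belonging to class c
        idx = rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
        xs.append(x[idx])
        ys.append(y[idx])
    return np.concatenate(xs), np.concatenate(ys)
```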

And then, there was a setback. I simply could not get it to work with the MNIST dataset. Well, the network worked, but its performance was poor: I achieved accuracies of 20% at max:

Then I read in the paper that it "may be the case that certain values of q may not work, but we can always find a `q` that works".

My problem thus transformed into a search problem: find a value for `q`, and possibly for `b`, that works. The result of this quest is a piece of Python code which iterates over the entire [0, 1) spectrum for both `b` (steps of 0.05) and `q` (steps of 0.01) to allow me to find the optimal combination.
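The search itself is a plain exhaustive scan; a sketch, where `evaluate(b, q)` is assumed to train and test the model with the given parameters and return its accuracy:

```python
import numpy as np

def grid_search(evaluate, b_step=0.05, q_step=0.01):
    """Scan b and q over (0, 1) and return (best_accuracy, b, q)."""
    best = (0.0, None, None)
    for b in np.arange(b_step, 1.0, b_step):
        b = round(float(b), 2)  # avoid floating-point drift in the grid
        for q in np.arange(q_step, 1.0, q_step):
            q = round(float(q), 2)
            accuracy = evaluate(b, q)
            if accuracy > best[0]:
                best = (accuracy, b, q)
    return best
```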

This is the result:

So indeed, it seems to be the case that model performance is very sensitive to the configurable parameters. The `q` I had configured produced a very low accuracy. Slightly altering the value for `q` yielded an entirely different result:

Wow! I could pretty much reproduce the findings of the authors. An accuracy that increases at a decreasing rate with respect to the number of samples, reaching a plateau at > 20 samples for training. Even the maximum accuracy of about 78% comes close to what the authors found.

Next up is the Iris dataset, which is another common dataset used by the machine learning community for playing around with new ideas. I let the search algorithm find optimal `b` and `q` values while it was configured to use 5 samples for training (similar to the authors' work) and 45 samples for testing (the Iris dataset I used contains 50 samples per class). First, I normalized the values into the [0, 1) interval, since otherwise the neurons cannot handle them.

The search plot looks promising, with maximum accuracies of ≈ 98.5%:

By zooming into this plot, I figured that one of the maximum accuracies, possibly the highest, occurs at `q = 0.50` and `b ≈ 0.55`. Let's train and see what happens:

We can see that it performs well. Once again, we can support the authors' findings. However, we must note that performance seems to deteriorate slightly when a relatively large number of samples is used for training (> 5 samples, which is > 10% of the total number of samples available per class).

All right. We just tested the model architecture with two data sets which the authors also used. For any machine learning problem, an engineer would be interested in how well it generalizes to different data sets… so the next obvious step was to train the model on another data set, not used by the authors.

A dataset readily available within the Keras framework is the CIFAR-10 dataset. It contains many images for ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). It looks as follows:

The first step is running the Python code for finding the optimal combination of `q` and `b`.

Unfortunately, the maximum accuracies found by the search algorithm are only about 30% – and they are rather consistent in this behavior. For CIFAR-10, the chaotic model thus performs only marginally better than random guessing (which would yield 10% for ten classes), and far worse than on MNIST. That's not what we want.

I'm however not exactly sure why this behavior occurs, but I do have multiple hypotheses. First, if you inspect the data visualizations for MNIST and CIFAR-10 above, you'll see that the MNIST dataset is rich in contrast, especially compared to the CIFAR-10 dataset. That is, we can clearly see what the number is; this distinction is much more obscure in the CIFAR-10 dataset. It may be that the model cannot handle this well. In that case, we've found our first possible bottleneck for the chaos theory inspired neural network: *it may be that it cannot handle data with relatively poor contrast between areas of interest and areas of non-interest.*

Second, the MNIST dataset provides numbers that have been positioned in the relative center of the image. That’s a huge benefit for machine learning models. Do note for example that CNNs are so effective because the convolution operation allows them to be invariant to the position of the object. That is, they don’t care where in the image the object is. *Hypothesis two:* it may be that this chaos theory inspired network, in line with more traditional machine learning models, is sensitive to the precise location of objects.

We did however see that the chaos theory inspired neural architecture performs relatively well on the Iris dataset. In order to see how well it generalizes with respect to those kinds of datasets (i.e., no images), I finally also tested it on the Pima Indians Diabetes dataset. It is a CC0 dataset usable for getting experience with machine learning models and contains various medical measurements and a prediction about whether patients will have to face diabetes:

Source: Pima Indians Diabetes Dataset

This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes within five years.

The dataset is relatively imbalanced. Class 0, ‘no diabetes’, is present 500 times, whereas class 1, i.e. when one is predicted to get diabetes, is present only 267 times. Nevertheless, we should still have enough samples for training and testing.

Similar to the Iris dataset, I first normalized the individual values into the [0, 1) interval. This should not change the underlying patterns, while the dataset can now be fed into the GLS neurons. Let's inspect the results of searching for good `q`s and `b`s. I'll run it with 15 samples for training and 50 for testing.
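The normalization is a simple column-wise min-max scaling; a sketch, where the small epsilon (my addition) keeps the maximum strictly below 1 so the values stay within the neurons' [0, 1) domain:

```python
import numpy as np

def normalize_01(X, eps=1e-6):
    """Min-max scale every feature column into [0, 1)."""
    mins = X.min(axis=0)
    maxs = X.max(axis=0)
    return (X - mins) / (maxs - mins + eps)
```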

Once again, I'm impressed with the results of the network, this time on a dataset which was not previously tested by the authors. It seems that `b = 0.78` and `q = 0.47` should yield good results, and indeed:

With my experiments, I could reproduce the results reported by the authors in their paper A Novel Chaos Theory Inspired Neuronal Architecture. I was also able to reproduce these results on another dataset (i.e., the Pima Indians Diabetes dataset), but failed to reproduce those findings on yet another (i.e., the CIFAR-10 dataset). I suspect that the relative lack of contrast between object and non-object in the CIFAR-10 dataset results in low performance, together with the variable positions of the objects of interest in this dataset. Consequently, I feel that the work produced by the authors is a really great start… while more work is required to make it work with real-world image datasets, of which CIFAR-10 is a prime example. I'll be happy to test with more non-image datasets in the future… to further investigate its performance.

If you've made it this far, I would like to thank you for reading this blog – I hope you've found it as interesting as I did. It is fun to play around with new ideas about how to improve machine learning – and it's even more rewarding to find that the results reported in the original work could be reproduced. If you feel that I've made any mistakes, if you have questions or if you have any remarks, please feel free to leave a comment below. They are highly appreciated and I'll try to answer them as quickly as I can. Thanks again and happy engineering!

Harikrishnan, N., & Nagaraj, N. (2019). A Novel Chaos Theory Inspired Neuronal Architecture. Retrieved from https://arxiv.org/pdf/1905.12601.pdf

How to Load and Visualize Standard Computer Vision Datasets With Keras. (2019, April 8). Retrieved from https://machinelearningmastery.com/how-to-load-and-visualize-standard-computer-vision-datasets-with-keras/

MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges. (n.d.). Retrieved from http://yann.lecun.com/exdb/mnist/

CIFAR-10 and CIFAR-100 datasets. (n.d.). Retrieved from https://www.cs.toronto.edu/~kriz/cifar.html

pima-indians-diabetes.csv. (n.d.). Retrieved from https://www.kaggle.com/kumargh/pimaindiansdiabetescsv

Iris Species. (n.d.). Retrieved from https://www.kaggle.com/uciml/iris

The post Could chaotic neurons reduce machine learning data hunger? appeared first on Machine Curve.

]]>