Understanding separable convolutions

Last Updated on 30 March 2021

Over the past years, convolutional neural networks (CNNs) have led to massive achievements in machine learning projects. The class of deep learning models has specifically boomed in computer vision, spawning many applications such as snagging parking spaces with a webcam and a CNN.

That’s great!

But those networks come at a cost. Training them is relatively costly. Not necessarily in money, because computing power is relatively cheap (the most powerful deep learning instance at AWS cost $33/hour in February 2021), but in time. When you have a massive dataset, which is a necessity when you aim to achieve extremely high performance, you will face substantial training times. It's not uncommon for training a deep learning model to take two weeks when the dataset is really big.

This is especially unfavorable when your goal is to test whether your model works and, thus, when you want to iterate quickly.

Although the landscape is slowly changing with GPUs that are becoming exponentially more powerful, training convolutional neural networks still takes a lot of time. The main culprit: the number of multiplications during the training process.

In this article, we'll therefore cover:

• Why traditional convolutions yield good performance, but require many computational resources.
• How spatially separable convolutions can reduce the computational requirements, but why they work in only a minority of cases.
• Why depthwise separable convolutions resolve this problem and achieve computational efficiency.

Let’s take a look! 🚀

Update 05/Feb/2021: ensure that the article is up to date.

Summary: how separable convolutions improve neural network performance

Convolutional Neural Networks have allowed significant progress to be made in the area of Computer Vision. This is especially true for really deep networks with many convolutional layers. These layers, however, require significant resources to be trained. For example, one convolutional layer trained on 15x15x3 pixel images will already require more than 45,000 multiplications to be made… per image!

Spatially separable convolutions help solve this problem. They are convolutions that can be separated across their spatial axis, meaning that one large convolution (e.g. the original Conv layer) can be split into smaller ones that, when convolved sequentially, produce the same result. By consequence, the number of multiplications goes down, while getting the same result.

The downside of these convolutions is that they cannot be used everywhere, since only a minority of kernels is spatially separable. To the rescue here are depthwise separable convolutions. This technique splits the convolution differently, into a depthwise convolution and a pointwise convolution. The depthwise convolution applies a kernel to each individual channel only. The pointwise convolution then convolves over all channels at once, but only with a 1×1 kernel. As you can see in the image, you get the same result as with the original Conv layer, but at only 20% of the multiplications required. A substantial reduction!

If you wish to understand everything written above in more detail, make sure to read the rest of this article as well 🚀

Understanding separable convolutions requires understanding traditional ones first. Because I often favor the development side of deep learning over pure theory, I had to look into the inner workings of those traditional layers again. Since this provides valuable insights (or a valuable recap) about convolutions, and I think you'll understand separable ones better because of it, I'll include my review first.

Consequently, we'll first look into traditional convolutions. This is such a convolution:

Specifically, it’s the inner workings of the first convolutional layer in your neural network: it takes an RGB image as its input.

RGB image and channels

As you know, RGB images can be represented by their width, by their height and by their channels.

Channels?

Yes, channels: each RGB image is composed of three channels that each describe the ‘colorness’ of a particular pixel. They do so at the levels of red, green and blue; hence, it's called an RGB image. Above, you'll therefore see the input represented by a cube that is itself composed of the three RGB channels of width W and height H.

Kernels

As you see, the convolutional layer also contains N so-called kernels. A kernel is a very small piece of ‘memory’ that through training becomes capable of deriving particular features from the image. Kernels are typically 1×1, 3×3 or 5×5 pixels and they ‘slide’ over the image:

Essentially, element-wise multiplications are computed between the kernel and the part of the image currently under inspection.

That is, suppose that your kernel is 3×3 pixels and currently in the upper left corner of your image. Pixel (1,1) of the image is multiplied with kernel element (1,1); (1,2) with (1,2), and so forth. All those scalar values are summed together and subsequently represent one scalar in the feature map, illustrated on the right in the image above.
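This single kernel position can be sketched in a few lines of NumPy. The image and kernel values below are arbitrary, chosen only to illustrate the element-wise-multiply-then-sum step:

```python
import numpy as np

# Hypothetical 5x5 grayscale image and a 3x3 kernel (values chosen arbitrarily).
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[-1., 0., 1.],
                   [-2., 0., 2.],
                   [-1., 0., 1.]])

# One 'slide' position: the kernel sits over the upper-left 3x3 patch.
patch = image[0:3, 0:3]
scalar = np.sum(patch * kernel)  # element-wise multiply, then sum

print(scalar)  # 8.0 -> one scalar of the feature map
```

Sliding the kernel to every valid position and collecting these scalars produces the full feature map.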

Kernels and multiple channels

When N=1, we arrive at the situation above: a two-dimensional kernel slides over a one-channel image, and the result is a summary of the image.

What confused me was what happened when there are multiple channels, like in the image we’ve seen before:

The kernel itself here is 3x3x3, and there are N of them; yet, the feature map that results from the convolution operation is HxWxN.

I then found this video which perfectly explained what happens:

In essence, the fact that the kernel is three-dimensional (WxHxM, with M=3 in the RGB situation above) effectively means that a cube convolves over the multichanneled image. Just like the pair-wise multiplications above, the three-dimensional multiplications also result in one scalar value per slide. Hence, each WxHxM kernel produces one two-dimensional slice of the feature map, and using N such kernels yields a feature map with third dimension N.

Very often, your neural network is not composed of just one convolutional layer. Rather, a few of them summarize your image into an abstract representation that can be used for classification with densely connected layers that behave like MLPs.

However, a traditional convolution is expensive in terms of the resources that you’ll need during training.

We’ll investigate next why this is the case.

Suppose that your training set contains 15×15 RGB pixel images (3 channels!) and that you’re using 10 3x3x3 pixel kernels to convolve over your training data.

In one convolution step on one input image (i.e., one 3x3x3 slide over the first 3×3 pixels of your RGB image), you'll do 3x3x3 = 27 multiplications to find the first scalar value.

However, we chose to use 10 kernels, so we’ll have 270 multiplications for the first 3×3 pixels of your image.

Since we’re not using padding, the kernel will have to slide over 13 (15-3+1 = 13) patches, both horizontally and vertically. Hence, per image, we’ll have to make 270 x 13 x 13 = 45630 multiplications.

We can generalize this to the following formula when we’re not using padding:

Multiplications per image = Kernel width x Kernel height x Number of channels x Number of kernels x Number of vertical slides x Number of horizontal slides.

Considering that the MNIST dataset shipped with Keras contains ~60k images, of which ~48k are training data, you get the point: convolutions are expensive – and this was only the first convolutional layer.
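The formula above is easy to encode. Below is a minimal sketch; `conv_multiplications` is a hypothetical helper name, and it assumes stride 1 and no padding, as in the example:

```python
def conv_multiplications(kernel_w, kernel_h, channels, n_kernels,
                         image_w, image_h):
    """Multiplications per image for an unpadded, stride-1 convolution."""
    h_slides = image_w - kernel_w + 1  # horizontal slide positions
    v_slides = image_h - kernel_h + 1  # vertical slide positions
    return kernel_w * kernel_h * channels * n_kernels * h_slides * v_slides

# 15x15 RGB image, ten 3x3x3 kernels -> 45630 multiplications, as computed above.
print(conv_multiplications(3, 3, 3, 10, 15, 15))  # 45630
```

Plugging in other layer configurations shows quickly how the count explodes for larger images and deeper kernels.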

I'm covering separable convolutions in this blog today because they might be the (partial) answer to these computational requirements. They do the same trick while requiring much fewer resources. Let's start with spatially separable convolutions. Following those, we cover depthwise separable convolutions. For both, we'll show how they might reduce the resource requirements for your machine learning projects and save resources when you're developing convolutional neural nets.

Spatially separable convolutions

Spatially separable convolutions, sometimes briefly called separable convolutions (Chollet (2017), although this term does not fully cover depthwise separable convolutions), are convolutions that can be separated across their spatial axes.

That is, they can be split into smaller convolutions that, when convolved sequentially, produce the same result.

In A Basic Introduction to Separable Convolutions, Chi-Feng Wang argues that “[o]ne of the most famous convolutions that can be separated spatially is the Sobel kernel, used to detect edges”:

$$\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} \times \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}$$

Convolution with normal kernel

Suppose that you’re performing a normal convolution operation with this kernel on a 15×15 pixel grayscale image (hence, 1 channel), and only use one kernel and no padding.

Remember the formula?

Multiplications per image = Kernel width x Kernel height x Number of channels x Number of kernels x Number of vertical slides x Number of horizontal slides.

Or: 3x3x1x1x13x13 = 1521 multiplications.

Spatially separated kernel

With the above kernel, you would first convolve with the 3×1 kernel and subsequently with the 1×3 kernel. This yields for both kernels:

3×1 kernel: 3x1x1x1x13x15 = 585 multiplications.

1×3 kernel: 1x3x1x1x15x13 = 585 multiplications.

585 + 585 = 1170 multiplications.

Yet, you’ll have the same result as with the original kernel!
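This equality is easy to check numerically. The sketch below implements a naive 'valid' sliding-window operation (a hypothetical helper, `correlate2d_valid`) and applies the Sobel kernel both ways to a random 15×15 image:

```python
import numpy as np

def correlate2d_valid(img, kernel):
    """Naive 'valid' sliding-window operation (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.random((15, 15))  # stands in for a 15x15 grayscale image

sobel = np.array([[-1., 0., 1.],
                  [-2., 0., 2.],
                  [-1., 0., 1.]])
col = np.array([[1.], [2.], [1.]])   # the 3x1 kernel
row = np.array([[-1., 0., 1.]])      # the 1x3 kernel

full = correlate2d_valid(img, sobel)
separated = correlate2d_valid(correlate2d_valid(img, col), row)

print(np.allclose(full, separated))  # True: same 13x13 result, fewer multiplications
```

The equality holds because sliding with the column kernel and then the row kernel is the same as sliding with their outer product, which is exactly the Sobel kernel.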

Spatially separable kernels can thus yield the same result with fewer multiplications, and hence you require fewer computational resources.

The problem with spatially separable kernels

Then why use traditional convolutions at all, you might ask?

Well, this is perfectly illustrated in A Basic Introduction to Separable Convolutions.

The point is that only a minority of kernels is spatially separable. Most can’t be separated that way. If you would therefore rely on spatially separable kernels while training a convolutional neural network, you would limit the network significantly. Likely, the network won’t perform as well as the one trained with traditional kernels, even though it requires fewer resources.

Depthwise separable convolutions might now come to the rescue 😉

Depthwise separable convolutions

A depthwise separable convolution benefits from the same characteristic as spatially separable convolutions, being that splitting the kernels into two smaller ones yields the same result with fewer multiplications, but does so differently. Effectively, two operations are performed in depthwise separable convolutions – sequentially (Geeks for Geeks, 2019):

1. Depthwise convolutions;
2. Pointwise convolutions.

Depthwise convolutions

As we’ve seen above, normal convolutions over volumes convolve over the entire volume, i.e. over all the channels at once, producing a WidthxHeightx1 volume for every kernel. Using N kernels therefore produces a WidthxHeightxN volume called the feature map.

In depthwise separable convolutions, particularly the first operation – the depthwise convolution – this does not happen in that way. Rather, each channel is considered separately, and one filter per channel is convolved over that channel only. See the example below:

Here, we would use 3 one-channel filters (M=3), since we're interpreting an RGB image. Contrary to traditional convolutions, the result is not an end result; rather, it is an intermediate result that is interpreted further in the second phase of the convolutional layer, the pointwise convolution.

Pointwise convolutions

From the intermediate result onwards, we can then continue with what are called pointwise convolutions. Those are 1×1-pixel filters which cover all the M intermediate channels generated by the depthwise filters; in our case, M=3.

And since we're trying to match the original convolution, we need N of them. Remember that a convolution over a volume produces a SomeWidth x SomeHeight x 1 volume, as the element-wise multiplications performed over three dimensions result in a single scalar value per slide. If we would thus apply one such pointwise filter, we would end up with a Hfm x Wfm x 1 volume. As the original convolution produced a Hfm x Wfm x N volume, we need N such pointwise filters.

I visualized this process below:

Depthwise separable convolutions altogether

When taken altogether, this is how depthwise separable convolutions produce the same result as the original convolution:

First, using depthwise convolutions with M filters, an intermediate result is produced, which is then processed into the final volume by means of the pointwise convolutions. Taking the kernel volumes together shows that the operation covers the original kernel volume: 3x3x1 times 1x1xM = 3x3xM = 3x3x3, the volume of each of our original kernels indeed. Since we have N pointwise filters, we produce a result of the same shape as with our N original kernels.

How many multiplications do we save?

We recall from convolving with our traditional kernel that we required 3x3x3x10x13x13 = 45630 multiplications to do so successfully for one image.

How many multiplications do we need for one image when we’re using a depthwise separated convolutional layer? How many multiplications do we save?

Remember that we used a 15×15 pixel image without padding. We’ll use the same for the depthwise separable convolution. We split our calculation into the number of multiplications for the depthwise and pointwise convolutions and subsequently add them together.

All right, for the depthwise convolution we multiply the number of convolutions in one full range of volume convolving times the number of channels times the number of multiplications per convolution:

• Number of convolutions in one full range of volume convolving is Horizontal movements x Vertical movements:
• Horizontal movements = (15 – 3 + 1) = 13
• Vertical movements = (15 – 3 + 1) = 13
• One full range of convolving has 13 x 13 = 169 individual convolutions.
• The number of channels is 3, so we do 3 full ranges of volume convolving.
• The number of multiplications per individual convolution equals 3x3x1 since that’s the volume of each individual filter.

Hence, the number of multiplications in the depthwise convolutional operation is 13 x 13 x 3 x 3 x 3 x 1 = 4563.

For the pointwise convolution, we compute the number of convolutions in one full range of volume convolving over the intermediate result times the number of filters times the number of multiplications per convolution:

• Number of convolutions in one full range of volume convolving is Horizontal movements x Vertical movements:
• Horizontal movements = 13, since our kernel is 1x1xM;
• Vertical movements = 13 for the same reason;
• Note that the intermediate result was reduced from 15x15x3 to 13x13x3, hence the movements above are 13.
• One full range of convolving therefore has 13 x 13 = 169 individual convolutions.
• The number of filters in our case is N, and we used N = 10 in the original scenario.
• The number of multiplications per convolution in our case is 1x1xM, since that’s our kernel volume, and M = 3 since we used 3 channels, hence 3.

So for the pointwise convolution that’s 13 x 13 x 10 x 1 x 1 x 3 = 5070.

Together, that's 4563 + 5070 = 9633 multiplications, down from the original 45630!

That’s a substantial reduction in the number of multiplications, while keeping the same result!
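The full comparison can be written out as a small sketch; `depthwise_separable_mults` is a hypothetical helper name, assuming square images and kernels, stride 1, and no padding:

```python
def depthwise_separable_mults(kernel, channels, n_kernels, image):
    """Multiplications per image for a depthwise + pointwise convolution
    (square image/kernel, no padding, stride 1)."""
    slides = image - kernel + 1                        # 13 for a 3 px kernel on 15 px
    depthwise = slides * slides * channels * kernel * kernel * 1
    pointwise = slides * slides * n_kernels * 1 * 1 * channels
    return depthwise + pointwise

standard = 3 * 3 * 3 * 10 * 13 * 13                    # 45630, as derived earlier
separable = depthwise_separable_mults(3, 3, 10, 15)
print(separable, separable / standard)                 # 9633, roughly 21% of the original
```

This is where the "only 20% of the multiplications" figure from the summary comes from.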

Recap

Today, we’ve seen how spatially separable and depthwise separable convolutions might significantly reduce the resource requirements for your convolutional neural networks without – in most cases – giving in on accuracy. If you’re looking to optimize your convolutional neural network, you should definitely look into those!

In the discussion, we’ve seen how it’s more likely that you find those improvements with depthwise separable convolutions, since not many kernels can be split spatially – being a drawback for your convnets. However, even with depthwise separable convolutions, you’ll likely find substantial optimization.

I hope that this blog was useful to understand those convolutions more deeply – writing about them has at least helped me gain understanding. I therefore definitely wish to thank the articles I reference below for providing many valuable insights and, when you’re interested in separable convolutions, I definitely recommend checking them out!

References

Wang, C. (2018, August 14). A Basic Introduction to Separable Convolutions. Retrieved from https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728

Geeks for Geeks. (2019, August 28). Depth wise Separable Convolutional Neural Networks. Retrieved from https://www.geeksforgeeks.org/depth-wise-separable-convolutional-neural-networks/

Chollet, F. (2017). Xception: Deep Learning with Depthwise Separable Convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi:10.1109/cvpr.2017.195

9 thoughts on “Understanding separable convolutions”

1. Simon

Hi Chris! First of all: thanks for the great article – when looking for introductory material on depthwise separable convolutions, I found it summarizes the concept pretty well. I have one question or maybe misunderstanding though, regarding the equality of the results between “standard” convolutions and depthwise separable ones:
You say that both produce the same result. I mean, that is obviously true regarding the shape of the output (if one chooses one's kernel dimensions and numbers correctly), but is it also true regarding the values of the output? In other words: can I find, for every set of standard convolution kernels, a set of depthwise separable kernels that produce equivalent output for arbitrary input? I mean, the latter would imply that every standard convolution kernel is depthwise separable – but this is actually not the case, right?

1. Chris

Hi Simon,

Let me first point out that since two instantiations of a neural net (say, a normal ConvNet and a depthwise separable ConvNet) are initialized pseudo-randomly (possibly with Xavier/He init as an extension), no two neural nets will produce the exact same result – rather, they approximate the minimum loss achievable so that their performance is pretty much ‘equal’.

Now, this answer cuts a lot of corners – and does not answer your actual, more formal question: “can I find, for every set of standard convolution kernels, a set of depthwise separable kernels that produce equivalent output for arbitrary input?”.

I thought about it and performed some Google searches, and I cannot see why this is not the case, invalidating your assumption “but this is actually not the case, right?”. Depthwise separation of a standard Conv kernel would, instead of convolving over all the channels at once, convolve on a per-channel basis (depthwise convolution), and subsequently apply the pointwise convolution, to produce an output of the same shape (given two nets, not necessarily the same values, as my cut-corners answer suggests). Still, while this almost never happens in practice, strictly speaking, that should also make it possible to have two sets of values (i.e. one for the depthwise and one for the pointwise convolution) that produce the exact same result – i.e. true equivalence. However, I come from a developer background – and have no strict mathematical roots. Perhaps, one more advanced in mathematics can help out here. Or, perhaps, could you provide a bit more argumentation as to why you think it's not the case? I might be overlooking something.

(Your point of view wrt “not the case” is valid for the spatially separable convolutions, though, I’m just not sure about the depthwise ones.)

Regards, Chris

1. Hi Chris!!!! I landed on your page while googling for the basics of depthwise separable (DPS) convolution. I did go through some other materials before reading yours, and in fact I landed here while googling to find an answer to the exact same question which Simon has posted. Based on my understanding, I lean more towards Simon's comment “but this is actually not the case, right?”. Below I am giving a supporting argument for why I think it is true. This is not rigorous math but more of an intuitive counter-example.

Let's take a case where the input image is 5X5X3 and the output is 3X3X3.
Method1 Regular Convolution:
Kernel (A) size would be (3X3X3)X3. The last three is to match the output channels. Just for convenience, I'll refer to these kernels as A_1, A_2, A_3, where A_n is of 3X3X3 size. A_n, when convolved with the input, gives the nth channel output, whose size would be 3X3. Stacking 3 such per-channel outputs gives us the complete output.

Method2 DPS:
DepthConv: Kernel size (3X3) X 3. The last 3 matches the input channels. Referring to them as B_1, B_2, B_3, where B_n is of size 3X3.
Pointwise Conv: Kernel size (1X1X3) X 3. The last 3 matches the output channels. Let's refer to them as C_1, C_2, C_3, where C_n is of size (1X1X3).

Let's say there exists one example in which C_1 = [1,1,1]. This would imply that B_1, B_2, B_3, when stacked one behind the other, will be the same as A_1. Now, if the claim that method 1 and method 2 are exactly equivalent holds, then B_1, B_2, B_3 when stacked and convolved with C_2 should yield A_2, and similarly C_3 should yield A_3. This is possible only if the channel-1 components of A_1/A_2/A_3 (which are all 3X3) are such that one is a simple multiple of the other, like A_11 = N * A_21, where N is a scalar constant and A_11 is channel 1 of A_1, which is 3X3.

This argument is somewhat similar to the spatial separability explained above, where for separability one row (or column) of the matrix is a simple multiple of the other.

What are your thoughts on this?

1. Chris

Hi Hari,

I must be honest – you've made a complex argument that is a bit hard to follow.
However, if I read my earlier answer carefully, I argue that while the methods do not necessarily have to provide the exact same result, they can do so.
Does this align with your viewpoint?

Best,
Chris

1. Hari

Sorry for making it complex. Let me make one more attempt at explaining where my argument comes from… Let the input be 3X3X3 and the output be 3X3. Now, the second step in DPS, i.e. the pointwise convolution, would have [1,1,1] as its only entry, which is a 1X1X3 matrix. If you think of the calculations in this case, it is exactly the same as a normal convolution. If this concept appeals to you, you can give my previous post another shot and see if it makes sense.

I agree with the viewpoint that it can provide the same result, but only for a subset of matrices, and that subset would have to satisfy spatial separability in 2D space.

2. Onno

Hi Chris,

Thanks a lot for this great article! When studying convolutions, I was wondering why convolving each channel separately is not common practice, and after reading your points this seemed even more interesting.

In terms of generality, however, I think some expressiveness is lost and the model would lose some of its complexity. In that regard, I tend to agree with the argument made by Hari and would like to try to illustrate my thoughts.

Please let me know what you think and also please don’t understand this as criticism of your great work 🙂

Suppose we have a 5x5x3 image (X) as above and we apply two filters to it, which gives us output Z(X) of shape 3x3x2. We will call the first output channel Z_1(X) and the second Z_2(X).

Similarly to what Hari described, we have two methods to get there:

Method1 – Regular Convolution:
Two kernels (A_1 and A_2), each of size (3X3X3), yield output Z(X) when convolved with X and stacked.

Method2 – DPS:
DepthConv: Kernel size (3X3) X 3. The last 3 matches the input channels. Again, let's call them B_1, B_2 and B_3.
Pointwise Conv: Two kernels (C_1 and C_2) of size (1X1X3).

Before going into the calculations, I would like to introduce some notation.
– # will be the symbol for convolution, i.e. A_1#X means filter A_1 is convolved with the input image
– [i] will mean the i-th channel, i.e. X[1] will be the first channel of the input image and C_1[1] will simply be the first value of C_1
– When referring to parts of a tensor I will use parentheses, i.e. X(1, 1) means all three channels of the top-left pixel; X[1](1:3, 1:3) are the top-left 3×3 values of the first channel

Now let’s get back to actually doing convolutions. I would like to look into only the top-left pixels of both output channels, but of course the procedure generalizes to all other pixels.

First pixel of first output channel (convolution of A_1 and top-left 3×3 pixels):
Z_1(X)[1, 1] = A_1[1] # X[1](1:3, 1:3) + A_1[2] # X[2](1:3, 1:3) + A_1[3] # X[3](1:3, 1:3)

Now let’s look at the value of the same pixel calculated with method 2:
Z_1(X)[1, 1] = C_1[1]*B_1 # X[1](1:3, 1:3) + C_1[2]*B_2 # X[2](1:3, 1:3) + C_1[3]*B_3 # X[3](1:3, 1:3)

Assuming both methods give the same results means the following equations hold:

A_1[1] = C_1[1]*B_1
A_1[2] = C_1[2]*B_2
A_1[3] = C_1[3]*B_3

Now let’s calculate the second output channel in the same way.
Z_2(X)[1, 1] = A_2[1] # X[1](1:3, 1:3) + A_2[2] # X[2](1:3, 1:3) + A_2[3] # X[3](1:3, 1:3)

And again the value of the same pixel calculated with method 2:
Z_2(X)[1, 1] = C_2[1]*B_1 # X[1](1:3, 1:3) + C_2[2]*B_2 # X[2](1:3, 1:3) + C_2[3]*B_3 # X[3](1:3, 1:3)

Assuming the two methods are equal gives the following
A_2[1] = C_2[1]*B_1
A_2[2] = C_2[2]*B_2
A_2[3] = C_2[3]*B_3

Now looking at the two sets of equations, we can show that they imply a linear dependence of filters A_1 and A_2:

A_1[1] = C_1[1]*B_1 = C_1[1]/C_2[1] * C_2[1] *B_1 = k * A_2[1]

k is a constant here, to be precise C_1[1]/C_2[1]. The same statement can be made for channels 2 and 3 of A_1 and A_2 as well.

Overall, my conclusion would be that applying depthwise separable convolutions is equivalent to applying a standard convolution when all filters of the standard approach are linearly dependent at channel level.

Another question would be how closely a depthwise separable convolution is able to approximate the result of a standard convolution, in the case of non-linearly dependent filters.

I hope I managed to explain my thoughts. Please let me know what you think! Maybe I missed something.

Best,
Onno