[Repost] Lightweight Neural Network "Tour" (II) — MobileNet, from V1 to V3

Recommended Words

The concept is explained very clearly

Original Link

This article is transcoded by SimpRead, original article link zhuanlan.zhihu.com

Main Text

Since it was proposed by Google in 2017, MobileNet can be regarded as the Inception among lightweight networks, undergoing generation after generation of updates. It has become a must-learn path for lightweight networks.

MobileNet V1

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications paper link: https://arxiv.org/abs/1704.04861 included in: CVPR2017

In April 2017, Google proposed MobileNetV1, a lightweight neural network aimed at mobile devices. There has long been some controversy over why the building blocks of MobileNetV1 and Xception are the same, with both relying heavily on depthwise separable convolution.

Actually, there is a little story here: MobileNetV1 introduced on http://arxiv.org shows:

It can be seen that MobileNetV1 was submitted as version v1 in April 2017, but

Xception was proposed as v1 version as early as October 2016. So, did MobileNet “copy” Xception? Actually not. In the Xception v1 version paper, there is a sentence:

And who is Andrew Howard? That’s right, he is the author of MobileNetV1. In the Xception v3 paper version, that sentence changed to:

The truth is: Google had already developed MobileNetV1 internally by June 2016, but for various reasons it was not uploaded to arXiv until April 2017. Meanwhile, another Google team independently proposed Xception. Hence the controversy over two architectures both built on depthwise separable convolution.

Alright, keyboard warriors can put down their keyboards and take a rest.

Enough nonsense, let’s start the main text (seems like every article has some preamble nonsense!!!)

Actually, the introduction of MobileNetV1 (hereinafter V1) can be summarized in one sentence: MobileNetV1 simply replaces the standard convolution layers in VGG with depthwise separable convolution.

So, what is depthwise separable convolution?

Depthwise Separable Convolution

Depthwise separable convolution, according to historical records, can be traced back to the 2012 paper Simplifying ConvNets for Fast Learning, in which the authors proposed the concept of separable convolution (shown in image (a) below):

Laurent Sifre then extended separable convolution to the depthwise setting during his Google internship in 2013 and described it in detail in his PhD thesis Rigid-motion scattering for image classification; interested readers can refer to the thesis.

Separable convolution mainly has two types: spatially separable convolution and depthwise separable convolution.

Spatially Separable

As the name suggests, spatially separable convolution splits a large convolution kernel into two smaller ones, for example splitting a 3×3 kernel into a 3×1 and a 1×3 kernel:
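As a quick sanity check in code (using the Sobel x-kernel as an illustrative example of my own, not from the original article), a spatially separable 3×3 kernel is exactly the outer product of a 3×1 column and a 1×3 row:

```python
import numpy as np

# The Sobel x-kernel is spatially separable:
# it factors into a 3x1 smoothing column and a 1x3 difference row.
col = np.array([1, 2, 1])       # 3x1 part
row = np.array([-1, 0, 1])      # 1x3 part

sobel = np.outer(col, row)      # reconstructs the full 3x3 kernel
print(sobel)
# Convolving with `sobel` equals convolving with `col` and then `row`:
# 9 multiplies per pixel become 3 + 3 = 6.
```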

Since spatially separable convolution is beyond the scope of MobileNet, we won’t discuss it here.

Depthwise Separable Convolution

Depthwise separable convolution means splitting a standard convolution into a depthwise convolution and a pointwise convolution.

Let’s first look at a standard convolution operation:

An input feature map of 12×12×3 is convolved with a 5×5×3 kernel to produce an 8×8×1 output feature map. If there are 256 such kernels, stacking their outputs gives an 8×8×256 output feature map.

That’s the job of the standard convolution. What about depthwise and pointwise convolutions?

Depthwise Convolution

Unlike standard convolution, depthwise convolution splits the kernel into single-channel form and convolves each input channel separately, without changing the input depth; the output therefore has the same number of channels as the input. As shown above, a 12×12×3 input feature map, after a 5×5×1×3 depthwise convolution, yields an 8×8×3 output feature map, with the channel count unchanged at 3. This raises a question: with so few channels, can the feature map carry enough effective information?

Pointwise Convolution

Pointwise convolution is 1×1 convolution. Its main function is to expand or reduce the dimensionality of feature maps, illustrated below:

The depthwise convolution produced an 8×8×3 output feature map. Convolving it with 256 kernels of size 1×1×3 yields an 8×8×256 output feature map, identical in shape to the standard convolution's output.

Comparison between standard convolution and depthwise separable convolution processes:
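The shape bookkeeping above can be verified with a naive NumPy sketch (loop-based and illustrative only, not how real frameworks implement it):

```python
import numpy as np

def depthwise_conv(x, kernels):
    # x: (H, W, C); kernels: (k, k, C), one single-channel filter per input channel
    H, W, C = x.shape
    k = kernels.shape[0]
    out = np.zeros((H - k + 1, W - k + 1, C))
    for c in range(C):                       # each channel is convolved independently
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[i, j, c] = np.sum(x[i:i + k, j:j + k, c] * kernels[:, :, c])
    return out

def pointwise_conv(x, kernels):
    # x: (H, W, C); kernels: (C, N), i.e. N filters of size 1 x 1 x C
    return np.tensordot(x, kernels, axes=([2], [0]))

x = np.random.rand(12, 12, 3)                      # 12 x 12 x 3 input
dw = depthwise_conv(x, np.random.rand(5, 5, 3))    # depthwise -> 8 x 8 x 3
pw = pointwise_conv(dw, np.random.rand(3, 256))    # pointwise -> 8 x 8 x 256
print(dw.shape, pw.shape)
```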

Why use depthwise separable convolution?

The answer is simple: If there is a method that can achieve almost the same results but with fewer parameters and less computation, would you use it?

Depthwise separable convolution is such a method. Let’s calculate the parameter and computational complexity of standard convolution (only considering multiply-add operations, MAdd):

Parameters of Standard Convolution

The convolution kernel size is Dk×Dk×M, and there are N kernels, so the parameter count of standard convolution is:

Dk × Dk × M × N

Computation of Standard Convolution

The kernel size is Dk×Dk×M, there are N kernels, and each is applied at Dw×Dh output positions, so the computation is:

Dk × Dk × M × N × Dw × Dh

After standard convolution, let’s calculate the parameters and computation of depthwise separable convolution:

Parameters of Depthwise Separable Convolution

Parameters of depthwise separable convolution consist of depthwise convolution and pointwise convolution parts:

The depthwise convolution kernels total Dk×Dk×M; the pointwise convolution kernels are 1×1×M, with N filters, so the total parameter count is:

Dk × Dk × M + M × N

Computation of Depthwise Separable Convolution

Computation is also composed of depthwise convolution and pointwise convolution:

The depthwise kernel is Dk×Dk×M, applied at Dw×Dh positions; the pointwise kernel is 1×1×M, with N filters, also applied at Dw×Dh positions; hence the computation is:

Dk × Dk × M × Dw × Dh + M × N × Dw × Dh

Overall:

The parameter count and the multiply-add count both drop to roughly

(Dk × Dk × M + M × N) / (Dk × Dk × M × N) = 1/N + 1/Dk²

of the original. Since a 3×3 kernel is typically used, the computation falls to between one-ninth and one-eighth of that of standard convolution.

Example

Suppose the input is a 224×224×3 image. In a certain layer of VGG, the input feature map size is 112×112×64, convolution kernel is 3×3×128, so the standard convolution computation is:

3×3×128×64×112×112 = 924,844,032

Depthwise separable convolution computation is:

3×3×64×112×112 + 128×64×112×112 = 109,985,792

In this layer, the computational ratio of MobileNetV1’s depthwise separable convolution to standard convolution is:

109,985,792 / 924,844,032 = 0.1189

Consistent with the one-ninth to one-eighth calculation above.
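The cost formulas and the example numbers above can be checked with a few lines of Python (function names are my own):

```python
# Cost formulas from above: kernel Dk x Dk, M input channels,
# N output channels, Dw x Dh output positions.
def standard_conv_madds(Dk, M, N, Dw, Dh):
    return Dk * Dk * M * N * Dw * Dh

def dsc_madds(Dk, M, N, Dw, Dh):
    depthwise = Dk * Dk * M * Dw * Dh   # one Dk x Dk filter per input channel
    pointwise = M * N * Dw * Dh         # N filters of size 1 x 1 x M
    return depthwise + pointwise

# The VGG-layer example: 112x112x64 input, 3x3 kernel, 128 output channels.
std = standard_conv_madds(3, 64, 128, 112, 112)   # 924,844,032
dsc = dsc_madds(3, 64, 128, 112, 112)             # 109,985,792
print(std, dsc, dsc / std)                        # ratio ~ 0.1189
```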

V1 Convolution Layer

The left image shows the standard convolution layer, the right the V1 convolution layer, with dashed lines marking the differences. The V1 convolution layer first applies a 3×3 depthwise convolution for feature extraction, followed by BN and ReLU6, then a pointwise convolution, and finally BN and ReLU6 again. This is exactly depthwise separable convolution: the standard convolution on the left is split into the depthwise convolution and pointwise convolution on the right.

Wait — what’s this mixed in here??? What is ReLU6?

ReLU6

The left image shows the standard ReLU, which leaves any value greater than zero unchanged; the right shows ReLU6, which caps any input value over 6 to return 6. ReLU6 “has a boundary”. The authors believe ReLU6 as a nonlinear activation function has stronger robustness under low-precision computation. (Here “low precision” doesn’t mean float16, but rather fixed-point arithmetic.)
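A minimal NumPy sketch of ReLU6, as described above:

```python
import numpy as np

def relu6(x):
    # ReLU capped at 6: min(max(x, 0), 6)
    return np.minimum(np.maximum(x, 0.0), 6.0)

x = np.array([-3.0, 0.0, 3.0, 6.0, 9.0])
print(relu6(x))   # [0. 0. 3. 6. 6.]
```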

Now a question: What impact do standard convolution and depthwise separable convolution layers have on the results?

Experiment follows.

It can be seen that using depthwise separable convolution reduces parameters and computation to about one-ninth to one-eighth that of standard convolution, but the accuracy only drops a very small 1%.

V1 Network Architecture

The MobileNet network architecture is shown above. First is a 3×3 standard convolution with stride 2 for downsampling, then stacks of depthwise separable convolutions, with some depthwise convolutions using stride 2 for downsampling. Then average pooling reduces features to 1×1, followed by a fully connected layer determined by number of prediction classes, and finally a softmax layer. The entire network has 28 layers, including 13 depthwise convolution layers.
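The downsampling schedule can be sanity-checked in a few lines of Python (the stride list follows the commonly cited V1 configuration; this is shape bookkeeping only, not a runnable network):

```python
# Spatial-size walk through V1: one standard 3x3 conv with stride 2,
# then 13 depthwise separable blocks, four of which downsample.
dw_strides = [1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1]

size = 224
size //= 2                      # initial 3x3 standard conv, stride 2
for s in dw_strides:
    size //= s                  # each stride-2 depthwise conv halves the map
print(size)                     # 7, then global average pooling brings it to 1x1
```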

Experimental Results

The V1 paper contains additional tweaks to the V1 network, not elaborated here; interested readers can refer to the original paper.

How good is V1? The authors compared V1 with large networks GoogleNet and VGG16:

It can be found that as a lightweight network, V1 has less computation than GoogleNet and roughly the same number of parameters, yet achieves better classification performance than GoogleNet — thanks to depthwise separable convolution. VGG16 has 30 times more computation and parameters than V1 but improves results by less than 1%.

For object detection on the COCO dataset:

Furthermore, the authors analyzed the parameter and computation distribution of the network as shown below. The computation is mostly concentrated in the 1×1 convolutions. Parameters are mainly in 1×1 convolutions and also partly in fully connected layers.

MobileNet V2

MobileNetV2: Inverted Residuals and Linear Bottlenecks paper link: https://arxiv.org/abs/1801.04381 included in: CVPR2018

After MobileNetV1 (hereinafter: V1), now we discuss MobileNetV2 (hereinafter: V2). To better discuss V2, let’s first review V1:

Review MobileNet V1

The core idea of V1 is depthwise separable convolution, which, compared with standard convolution, reduces computation and parameters severalfold at comparable accuracy, thereby speeding up the network.

The block of V1 is shown as below:

First, a 3×3 depthwise convolution extracts features, then a 1×1 pointwise convolution adjusts the channel count. Stacking such blocks, MobileNetV1 reduces parameters and computation, speeds up the network, and achieves results close to standard convolution, which looks very promising.

However!

Some users found in practice that the kernels in the depthwise convolution part tend to be pruned away: after training, many depthwise convolution kernels are empty:

Why is that?

The authors believe this is the fault of the ReLU activation function. (Who would have thought the popular ReLU activation function would betray and revolt???)

What does ReLU do?

The V2 paper also explains this. (The paper is not very easy to understand, so I’ll briefly explain based on some interpretations combined with my thoughts. If there’s anything incorrect, please point it out, thank you!)

This is an example embedding the ReLU transformation of a low-dimensional manifold into a high-dimensional space.

Here we drop the manifold concept and explain it simply.

Suppose in 2D space there is a spiral formed by m points X_m (the input). A random matrix T maps it into an n-dimensional space, and ReLU is then applied:

y = ReLU(T · X_m)

where T · X_m is the embedding of X_m into the n-dimensional space.

Then the (pseudo-)inverse matrix T⁻¹ maps y back to 2D space:

X′_m = T⁻¹ · y

The entire process is shown below:

In other words, the "thing" embedded in n-dimensional space passes through ReLU, is recovered via T⁻¹, and the result after ReLU is compared with the original input.

It can be seen:

When n = 2 or 3, a large portion of information is lost compared to Input. When n = 15 to 30, a considerable amount of information is retained.

That is, performing ReLU on low-dimensional data easily causes information loss, while applying ReLU in high-dimensional space causes less loss.
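The experiment can be roughly re-created in NumPy (the spiral, the random T, and the use of the pseudo-inverse here are my own stand-ins for the paper's setup); the reconstruction error is typically far larger for n = 2 or 3 than for n = 15 to 30:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 3 * np.pi, 100)
X = np.stack([t * np.cos(t), t * np.sin(t)])        # 2 x m spiral (the input)

def relu_roundtrip_error(n, trials=30):
    # Embed with a random T (n x 2), apply ReLU, map back with the
    # pseudo-inverse, and average the mean squared reconstruction error.
    errs = []
    for _ in range(trials):
        T = rng.standard_normal((n, 2))
        Y = np.maximum(T @ X, 0.0)                   # ReLU in n-dim space
        X_rec = np.linalg.pinv(T) @ Y                # back to 2D space
        errs.append(np.mean((X - X_rec) ** 2))
    return float(np.mean(errs))

for n in (2, 3, 15, 30):
    print(n, relu_roundtrip_error(n))
```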

This explains why many depthwise convolution kernels are empty. By identifying the problem, it can be better solved. To address this, since ReLU causes information loss, replace ReLU with a linear activation function.

Linear bottleneck

Of course, not all activation layers can be replaced by linear ones, so we quietly replace only the last ReLU6 with a Linear one. ( Why replace only the last ReLU6 and not the first or second? That will be explained later. )

Separable with linear bottleneck

The authors call this part the linear bottleneck — the namesake of the paper.

Expansion layer

Another problem: depthwise convolution by itself cannot change the channel count, so the output has exactly as many channels as the input. If the input has few channels, the DW depthwise convolution can only work in a low-dimensional space, which, as we just saw, works poorly. So the channels need to be "expanded" first. Since a PW pointwise (1×1) convolution can raise or lower dimensionality, a PW convolution is placed before the DW convolution to expand the channels (with an expansion factor t, here t = 6), so that feature extraction happens in a higher-dimensional space:

That is, regardless of input channels, after the first PW pointwise convolution to expand channels, depthwise convolution works in a higher 6× dimension space.

Inverted residuals

Reviewing the network structure of V1, we find that V1 is a straightforward VGG-style stack. We would like to reuse features the way ResNet does, so a shortcut is introduced, making the block of V2 look like the following diagram:

Let’s compare the blocks of ResNet and V2:

We can see that both adopt the 1×1 → 3×3 → 1×1 pattern and use the Shortcut structure. But the differences are:

  • ResNet first reduces the dimensionality (to 0.25×), then convolves, then raises it back.
  • MobileNetV2 first raises the dimensionality (6×), then convolves, then reduces it.

Coincidentally, V2’s block does the opposite of ResNet’s block, so the authors named it the inverted residual, which is the "Inverted Residuals" in the paper title.

V2’s block

So far, the biggest innovation of V2 ends here. Let’s summarize V2’s block:

Let’s compare the blocks of V1 and V2:

The left is the block of V1, without Shortcut and with the final ReLU6.

The right is V2, which adds 1×1 expansion, introduces Shortcut, and removes the final ReLU, replacing it with Linear. When the stride is 1, it first performs 1×1 convolution to expand dimensions, then uses depthwise convolution to extract features, and finally applies Linear pointwise convolution to reduce dimensions. The input and output are added together to form the residual structure. When the stride is 2, because the input and output sizes do not match, no shortcut structure is added; otherwise, the rest remains the same.
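The shape flow of one V2 block can be sketched as follows (a small bookkeeping helper of my own, using the paper's default expansion factor t = 6):

```python
# Shape walk through one V2 inverted-residual block.
def inverted_residual_shapes(h, w, cin, cout, stride, t=6):
    expand = (h, w, cin * t)                    # 1x1 PW expansion (ReLU6)
    dw = (h // stride, w // stride, cin * t)    # 3x3 DW, may downsample (ReLU6)
    out = (h // stride, w // stride, cout)      # 1x1 linear PW projection
    use_shortcut = (stride == 1 and cin == cout)
    return expand, dw, out, use_shortcut

print(inverted_residual_shapes(56, 56, 24, 24, stride=1))  # shortcut used
print(inverted_residual_shapes(56, 56, 24, 32, stride=2))  # no shortcut
```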

V2’s network structure

If the stride of the 28×28×32 layer is 2, the output should be 14×14, which seems to be an error. According to the author’s paper, I made a correction:

Experimental results

Image Classification

The image classification experiments were mainly conducted on the networks above. ShuffleNet uses grouped convolution and channel shuffle, and likewise adopts a similar residual structure (compare (c) with (b)).

The results are as follows:

Detailed comparison:

Object Detection

SSDLite

In object detection, the authors first proposed SSDLite. It modifies the SSD structure by replacing all standard convolutions in the SSD prediction layers with depthwise separable convolutions. The authors state that this greatly reduces parameters and computational cost, making computation more efficient. Comparison of SSD and SSDLite:

Applied to object detection tasks, comparison between V2 and commonly used detection networks:

It can be seen that SSDLite based on MobileNetV2 surpasses YOLOv2 on the COCO dataset, while its model size is 10 times smaller and speed is 20 times faster.

Semantic Segmentation

Segmentation results:

V1 VS V2

It can be seen that although V2 has many more layers than V1, FLOPs, parameters, and CPU latency are all better than V1.

Comparison of V1 and V2 on Google Pixel 1 for Image Classification tasks:

MobileNetV2 models can achieve the same accuracy faster across the overall speed range.

Results for object detection and semantic segmentation:

In summary, MobileNetV2 offers a highly efficient model for mobile devices and can serve as a basis for many visual recognition tasks.

However!

In my practical applications of V1 and V2, V1’s performance is slightly better. The previous gluonCV result chart is similar to my implementation:

I don’t know why.

MobileNet V3

Having finished V1 and V2, now let’s come to MobileNetV3 (hereafter V3).

Paper link of Searching for MobileNetV3: https://arxiv.org/pdf/1905.02244.pdf

MobileNetV3 was proposed by Google in May 2019. First, the eye-catching part is the title of this paper: the word "Searching" reveals the core concept of V3, namely using Neural Architecture Search (NAS) to build V3. Although I have never worked with NAS, I already smell the money.

“Sorry, money can really…”

Since I have no experience with NAS, I will talk about other aspects of V3 besides NAS.

First, results:

We can see that under the same computational budget, V3 achieves the best ImageNet results.

What did V3 do?

Related technologies in MobileNetV3

    0. The network architecture is based on the NAS-generated MnasNet (which performs better than MobileNetV2).
    1. Introduces MobileNetV1’s depthwise separable convolution.
    2. Introduces MobileNetV2’s inverted residual block with linear bottleneck.
    3. Introduces a lightweight attention module based on the squeeze-and-excitation (SE) structure.
    4. Uses a new activation function, h-swish(x).
    5. In the network architecture search, combines two technologies: resource-constrained NAS (platform-aware NAS) and NetAdapt.
    6. Modifies the last stage of the MobileNetV2 network.

Point 0: MnasNet is also NAS-based, and I’m not very familiar with it. If interested, you can refer to Qu Xiaofeng’s answer How to evaluate Google’s latest model MnasNet? on Zhihu, which is very well written! For now, just treat MnasNet as a model with better accuracy and real-time performance than MobileNet.

Points 1 and 2 were already discussed in the previous MobileNetV1 and V2 sections, so no repetition here.

Point 3: SE module is introduced mainly to strengthen the network’s learning ability by exploiting inter-channel feature relationships. Let’s not go into details now; this will be discussed in depth later in the “Deep Review of Classic Networks” series. For those interested, check this article.

Activation function h-swish

swish

h-swish is an improvement based on swish, which was originally proposed in the Google Brain 2017 paper Searching for Activation Functions (again with “Searching for!!!”).

The authors of the swish paper argue that swish is unbounded above, bounded below, smooth, and non-monotonic, and that it outperforms ReLU on deep models: simply replacing the ReLU units with swish improves ImageNet top-1 classification accuracy by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2.

V3 also uses swish as a replacement for ReLU, which can significantly improve neural network accuracy. However, the authors believe this nonlinear activation, while improving accuracy, has notable costs in embedded environments because computing the sigmoid function on mobile devices is expensive. Hence, they proposed h-swish.

h-swish

h-swish "hardens" swish by replacing the sigmoid with a piecewise-linear approximation: h-swish(x) = x · ReLU6(x + 3) / 6. The authors chose a ReLU6-based implementation because nearly all software and hardware frameworks can optimize ReLU6 effectively, and because it removes the potential numerical-precision loss that different sigmoid approximations can introduce in specific modes.
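Both activations are one-liners in NumPy (a sketch following the formulas in the V3 paper):

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def swish(x):
    return x / (1.0 + np.exp(-x))          # x * sigmoid(x)

def h_swish(x):
    return x * relu6(x + 3.0) / 6.0        # piecewise-linear "hard" version

x = np.linspace(-6, 6, 7)
print(np.round(swish(x), 3))
print(h_swish(x))
```

For x ≤ −3 the hard version is exactly 0, for x ≥ 3 it is exactly x, and in between it is a simple quadratic, so no sigmoid ever needs to be evaluated.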

The figure below shows Sigmoid and swish in hard and soft forms:

We can simply regard the hard form as a low-precision approximation of the soft form. The authors report that, using this nonlinearity, the number of filters in the initial layer can be reduced to 16 while maintaining the same accuracy as 32 filters with ReLU or swish, saving 3 milliseconds and 10 million MAdds of computation.

Moreover, the authors note that the cost of applying a nonlinearity shrinks as the network gets deeper, since each layer’s activation memory typically halves whenever the resolution drops, and they found that most of swish’s benefit comes from the deeper layers. Hence, in V3’s architecture, h-swish (HS) is used only in the latter half of the model.

Neural Architecture Search (NAS)

As I am not familiar, I’ll keep this brief.

It mainly combines two techniques: resource-constrained NAS (platform-aware NAS) and NetAdapt.

Resource-constrained NAS searches for block-level architectures under constraints of computation and parameters, called block-wise search.

NetAdapt fine-tunes the number of convolutional kernels in each layer after specific modules are decided, called layer-wise search.

Once the model is found through architecture search, it is observed that some of the first and last layers are more expensive than necessary. So the authors modified these parts to reduce the latency of the slow layers while maintaining accuracy; these modifications lie outside the current search space.

Modifications to the final stage of V2

The authors believe the current model is based on the inverted residual structure from V2 and its variants (as shown below). The last layers are constructed using 1×1 convolutions, facilitating expansion to higher-dimensional feature spaces. The benefit is richer features for predictions but at the cost of extra computation and latency.

Therefore, the improvement must reduce latency while preserving the high-dimensional features. The fix is to move the final 1×1 expansion layer to after the global average pooling: the last set of features is then computed at 1×1 spatial resolution (yellow box in the V3 structure below) rather than at 7×7 (red box in the V2 structure below).

The benefit is that computing the features costs almost nothing in terms of computation and latency. The redesigned structure is as follows:

Latency is reduced by 10ms, speed increased by 15%, and 30 million MAdds of computation are saved, all without accuracy loss.

V3’s block

Summarizing the above, V3’s block structure is as follows:

Compared with V2’s block:

MobileNetV3 network structure

MobileNetV3 defines two models: MobileNetV3-Large and MobileNetV3-Small. V3-Large is designed for high resource scenarios, while V3-Small targets low resource usage. Both are based on the NAS discussed above.

MobileNetV3-Large

MobileNetV3-Small

As mentioned earlier, only in deeper layers does using h-swish provide significant benefits. So in the above network models, regardless of size, the authors use h-swish only in the latter half of the model.

Test results of large and small V3 on Google Pixel 1/2/3:

Experimental results

Image Classification

Detection

Semantic Segmentation

The experimental results speak for themselves.

By the way, one point worth mentioning is that training V3 used a 4x4 TPU Pod with batch size 409… (left tears of poverty)

Why is MobileNet so fast?

While writing this article, I came across Why MobileNet and Its Variants (e.g. ShuffleNet) Are Fast?, which raised the same question for me. This article mainly discusses this from the structural aspect, from depthwise separable convolutions to grouped convolution parameter calculation, and such. Since it has been discussed in previous articles, I will not repeat here; interested readers can review previous articles.

Here, let’s consider the issue from a different angle: in terms of runtime.

The figure below comes from the doctoral dissertation of Jia Yangqing, the author of Caffe:

This figure shows GPU and CPU runtime consumption in different layers of AlexNet. We can clearly see that regardless of GPU or CPU, the most important “time killer” is conv, the convolutional layers. In other words, to improve network speed, one must improve convolutional layer computation efficiency.

Focusing on MobileNetV1, let’s look at resource distribution:

It can be seen that 95% of MobileNet’s compute is spent on 1×1 convolutions. So what are the benefits of 1×1 convolution?

We know that convolution operations are multiply-add computations as shown:

When operated on computers, data needs to be stored in memory according to “row-major order”:

Thus, the computations of feature map elements y11, y12, y21, y22 proceed as follows:

In convolution calculation, the solid lines mark memory accesses during the multiply process (corresponding to data multiplication). We can see that this process is very scattered and chaotic. Doing direct convolution computation like this is inefficient.

At this point, the im2col operation is applied.

im2col

In a nutshell, im2col is an operation that sacrifices spatial duplication (expanding by about K×K times) to convert feature maps into large matrices for convolution computation.

The idea is quite simple:
In each iteration, the data needed for one output position is arranged into a column vector, and these columns are stacked to form a matrix (channels concatenated along the column direction).
For an input feature map of size Ci×Hi×Wi, a K×K convolution kernel, and an output of size Co×Ho×Wo,
the input feature map is transformed into a matrix of size (Ci∗K∗K)×(Ho∗Wo), and the convolution kernels are reshaped into a Co×(Ci∗K∗K) matrix; their product is the Co×(Ho∗Wo) output.

Then the convolution is completed by calling a GEMM (General Matrix Multiply) library to accelerate the matrix multiplication. Because the data is laid out in exactly the order the computation consumes it, the feature-map data is accessed sequentially, greatly improving convolution speed. (GEMM is not the only route; FFT (Fast Fourier Transform) based convolution is another.)
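A minimal im2col + GEMM sketch (no padding, stride 1; function names are my own):

```python
import numpy as np

def im2col(x, k):
    # x: (C, H, W) -> columns: (C*k*k, Ho*Wo), one column per output pixel
    C, H, W = x.shape
    Ho, Wo = H - k + 1, W - k + 1
    cols = np.empty((C * k * k, Ho * Wo))
    for idx in range(Ho * Wo):
        i, j = divmod(idx, Wo)
        cols[:, idx] = x[:, i:i + k, j:j + k].ravel()
    return cols

def conv_gemm(x, w):
    # w: (Co, C, k, k); convolution expressed as a single matrix multiply
    Co, C, k, _ = w.shape
    Ho, Wo = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = w.reshape(Co, -1) @ im2col(x, k)   # the GEMM call
    return out.reshape(Co, Ho, Wo)

# Sanity check: all-ones 3x3 input, all-ones 2x2 kernel -> every output is 4.
y = conv_gemm(np.ones((1, 3, 3)), np.ones((1, 1, 2, 2)))
print(y)
```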

A different representation can help better understanding. The image is from High Performance Convolutional Neural Networks for Document Processing:

This way, we can more clearly see the definition of convolution operation (upper part of the image). Memory access is very irregular, resulting in poor performance. The Im2col() rearranges data in a memory-access-regular manner, and although the Im2col operation introduces a lot of data redundancy, the performance gain from using Gemm outweighs the disadvantage of this redundancy.

So the standard convolution process roughly looks like this:

Now let’s come back to 1×1 convolution, which is a bit special. According to what we said earlier, the original storage structure of 1×1 convolution and the structure for im2col are shown in the following figure:

We can see the matrices are exactly the same. A comparison between standard convolution operation and 1×1 convolution operation is shown below:

In other words, 1×1 convolution does not require the im2col process at all; it can be computed directly as a GEMM on the raw data layout, which allows faster low-level implementations and saves the time and space spent on data rearrangement.
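The same point in code: for a 1×1 kernel, convolution collapses to a single matrix multiply on the raw layout (an illustrative NumPy sketch):

```python
import numpy as np

# A 1x1 convolution over a (Ci, H, W) feature map is just a GEMM between
# the (Co, Ci) kernel matrix and the flattened (Ci, H*W) feature map.
rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))          # Ci = 3 input feature map
w = rng.random((256, 3))           # 256 kernels of size 1x1x3

y = (w @ x.reshape(3, -1)).reshape(256, 8, 8)   # no im2col needed

# Same result as applying the 1x1 kernels pixel by pixel:
y_ref = np.einsum('oc,chw->ohw', w, x)
print(y.shape, np.allclose(y, y_ref))
```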

Of course, this is not absolute: MobileNet’s speed is closely tied to how well the 1×1 convolution is optimized. On customized hardware (for example, a 3×3 convolution unit implemented directly on an FPGA), im2col loses its purpose and only adds overhead.

Going back to the earlier resource distribution of MobileNet: the fact that 95% of the computation sits in GEMM-friendly 1×1 convolutions, together with the streamlined network architecture, is why MobileNet can be so fast.

Reference

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

MobileNetV2: Inverted Residuals and Linear Bottlenecks

Searching for MobileNetV3

Xception: Deep Learning with Depthwise Separable Convolutions

Simplifying ConvNets for Fast Learning

a-basic-introduction-to-separable-convolution

CNN Models: MobileNet (CNN 模型之 MobileNet)

Network Analysis (II): MobileNets Explained (网络解析(二):MobileNets 详解)

Lightweight Networks: A MobileNet Paper Walkthrough (轻量级网络--MobileNet 论文解读)

https://mp.weixin.qq.com/s/O2Bhn66cWCN_87P52jj8hQ

http://machinethink.net/blog/mobilenet-v2/

How to evaluate Google’s latest model MnasNet? (Qu Xiaofeng’s answer, Zhihu)

Learning Semantic Image Representations at a Large Scale (Yangqing Jia’s PhD thesis)

The Principle and Implementation of im2col (im2col 的原理和实现)

How is convolution computed in Caffe? (Yangqing Jia’s answer, Zhihu)

A Casual Discussion of Convolution Layers (漫谈卷积层)

High Performance Convolutional Neural Networks for Document Processing

Why MobileNet and Its Variants (e.g. ShuffleNet) Are Fast