EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks


This paper was published in May 2019 and has been cited 2,489 times as of this writing.

Abstract

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget and then scaled up for better accuracy if more resources become available. The authors propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient.

They use neural architecture search to design a new baseline network and scale it up to obtain a family of models called EfficientNets.

1. Introduction

Scaling up ConvNets is widely used to achieve better accuracy. The common way is to scale a ConvNet by its depth (#layers), width (#channels), or image resolution. Though it is possible to scale two or three dimensions arbitrarily, doing so requires tedious manual tuning and often still yields sub-optimal accuracy and efficiency.

Their method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. They call it the compound scaling method.

2. Compound Model Scaling

[Figure: width, depth, resolution, and compound scaling applied to a baseline network]

2.1. Scaling Dimensions

The optimal \(d\) (depth), \(w\) (width), and \(r\) (resolution) depend on each other, and their values change under different resource constraints. Due to this difficulty, conventional methods mostly scale ConvNets in only one of these dimensions.

Depth (\(d\)): Intuitively, a deeper ConvNet can capture richer and more complex features and generalize well on new tasks. However, deeper networks are harder to train due to the vanishing gradient problem. For example, ResNet-1000 has similar accuracy to ResNet-101 even though it has many more layers.

Width (\(w\)): Scaling network width is commonly used for small-size models. Wider networks can capture more fine-grained features and are easier to train. However, extremely wide but shallow networks tend to have difficulty capturing higher-level features.

Resolution (\(r\)): With higher resolution images, ConvNets can capture more fine-grained patterns. However, the accuracy gain diminishes for very high resolutions.

[Figure: ImageNet accuracy when scaling up width, depth, or resolution alone]

The above figure shows that scaling up any single dimension of \(d\), \(w\), or \(r\) improves accuracy, but the accuracy gain diminishes for bigger models.

2.2. Compound Scaling

They empirically observe that \(d\), \(w\), and \(r\) are not independent. Intuitively, for higher resolution images we should increase network depth, so that the larger receptive field can capture similar features that span more pixels in bigger images. Correspondingly, we should also increase network width when resolution is higher, to capture the more fine-grained patterns that come with more pixels.

[Figure: width scaling under different baseline depth and resolution settings]

If we only scale network width \(w\) without changing \(d\) and \(r\), the accuracy saturates quickly. To pursue better accuracy and efficiency, it is critical to balance all dimensions of \(w\), \(d\), and \(r\) during ConvNet scaling.

In this paper, they propose a new compound scaling method, which uses a compound coefficient \(\phi\) to uniformly scale network width, depth, and resolution in a principled way:

\[
\begin{aligned}
\text{depth: } & d = \alpha^{\phi} \\
\text{width: } & w = \beta^{\phi} \\
\text{resolution: } & r = \gamma^{\phi} \\
\text{s.t. } & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1, \beta \ge 1, \gamma \ge 1
\end{aligned}
\]

where \(\alpha, \beta, \gamma\) are constants determined by a small grid search, and \(\phi\) is a user-specified coefficient that controls how many more resources are available for scaling.

The FLOPS of a regular convolution op are proportional to \(d\), \(w^{2}\), and \(r^{2}\): doubling depth doubles FLOPS, while doubling width or resolution quadruples them. Since they constrain \(\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2\), scaling with \(\phi\) increases total FLOPS by \((\alpha \cdot \beta^{2} \cdot \gamma^{2})^{\phi} \approx 2^{\phi}\).
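As a quick sanity check on this arithmetic, here is a minimal Python sketch of the compound scaling rule. The constants \(\alpha = 1.2\), \(\beta = 1.1\), \(\gamma = 1.15\) are the grid-search results the paper reports in Section 3; everything else is just the formula above.

```python
# Compound scaling: depth d = alpha^phi, width w = beta^phi, resolution r = gamma^phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # values reported in the paper for EfficientNet-B0

def compound_scale(phi: float):
    """Return (depth, width, resolution) multipliers and the resulting FLOPS factor."""
    d, w, r = ALPHA ** phi, BETA ** phi, GAMMA ** phi
    # FLOPS of a regular convolution scale with d * w^2 * r^2, so total FLOPS
    # grow by (alpha * beta^2 * gamma^2)^phi, which is approximately 2^phi.
    return d, w, r, d * w ** 2 * r ** 2

for phi in range(1, 4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, res x{r:.2f}, FLOPS x{f:.2f} (~{2 ** phi})")
```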

3. EfficientNet Architecture

Since model scaling does not change the layer operators in the baseline network, having a good baseline network is also critical.

They use a multi-objective neural architecture search that optimizes both accuracy and FLOPS. If you want to read more about neural architecture search, please refer to [2].

They use \(ACC(m) \times [FLOPS(m)/T]^{w}\) as the optimization goal, where \(ACC(m)\) and \(FLOPS(m)\) denote the accuracy and FLOPS of model \(m\). \(T\) is the target FLOPS and \(w\) is a hyperparameter for controlling the trade-off between accuracy and FLOPS.
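As a minimal sketch, that objective is a one-liner; the exponent \(w = -0.07\) and the 400M-FLOPS target \(T\) are the values reported in the paper (following MnasNet [2]), while the example accuracies below are made up for illustration.

```python
def nas_objective(acc: float, flops: float, target_flops: float = 400e6, w: float = -0.07) -> float:
    """Multi-objective reward ACC(m) * [FLOPS(m) / T]^w. A negative w penalizes
    models that exceed the FLOPS target T, trading accuracy against cost."""
    return acc * (flops / target_flops) ** w

print(nas_objective(acc=0.77, flops=400e6))  # 0.77: at the target, score equals raw accuracy
print(nas_objective(acc=0.78, flops=800e6))  # ~0.743: 2x the target costs a 2^-0.07 factor
```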

[Table: EfficientNet-B0 baseline network architecture]

Starting from the baseline EfficientNet-B0, they apply the compound scaling method in two steps:

STEP 1: fix \(\phi = 1\), assuming twice more resources are available, and do a small grid search for \(\alpha, \beta, \gamma\) under the constraint \(\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2\). For EfficientNet-B0, the best values are \(\alpha = 1.2\), \(\beta = 1.1\), \(\gamma = 1.15\).

STEP 2: fix \(\alpha, \beta, \gamma\) as constants and scale up the baseline network with different \(\phi\), obtaining EfficientNet-B1 through B7.

They split the method into two steps because searching for \(\alpha, \beta, \gamma\) directly around a large model would make the search cost prohibitively expensive.
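As an illustration, here is a minimal Python sketch of what the STEP 1 grid search could look like. The `evaluate` callable is a hypothetical stand-in for training the scaled model and measuring its validation accuracy; the candidate grid and tolerance are my assumptions, not the paper's exact settings.

```python
import itertools

def grid_search_coefficients(evaluate, step=0.05, tol=0.1):
    """STEP 1 sketch: fix phi = 1 and grid-search alpha, beta, gamma
    under the FLOPS-doubling constraint alpha * beta^2 * gamma^2 ~= 2."""
    best, best_acc = None, float("-inf")
    candidates = [1.0 + i * step for i in range(1, 9)]  # 1.05 .. 1.40
    for alpha, beta, gamma in itertools.product(candidates, repeat=3):
        if abs(alpha * beta ** 2 * gamma ** 2 - 2.0) > tol:
            continue  # violates the constraint from the compound scaling rule
        acc = evaluate(alpha, beta, gamma)  # hypothetical: train & validate the scaled model
        if acc > best_acc:
            best, best_acc = (alpha, beta, gamma), acc
    return best

# Dummy objective for illustration only; a real run would train ImageNet models.
print(grid_search_coefficients(lambda a, b, g: a + b + g))
```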

3.1. Training

They train EfficientNet models on ImageNet using the RMSProp optimizer with decay 0.9 and momentum 0.9, batch norm momentum 0.99, weight decay 1e-5, and an initial learning rate of 0.256 that decays by 0.97 every 2.4 epochs. They also use the SiLU (Swish-1) activation. They linearly increase the dropout ratio from 0.2 for EfficientNet-B0 to 0.5 for B7.
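As a rough illustration of that recipe in PyTorch (a sketch under assumptions: the tiny stand-in model and the LambdaLR wiring are mine, not the paper's code; PyTorch's `alpha` argument is the RMSProp smoothing constant, i.e., the "decay" above):

```python
import torch

model = torch.nn.Conv2d(3, 32, 3)  # tiny stand-in for an EfficientNet variant

# RMSProp with decay 0.9, momentum 0.9, and weight decay 1e-5.
optimizer = torch.optim.RMSprop(
    model.parameters(), lr=0.256, alpha=0.9, momentum=0.9, weight_decay=1e-5
)

# Initial learning rate 0.256, decayed by 0.97 every 2.4 epochs.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 0.97 ** (epoch / 2.4)
)
```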

4. Experiments

[Figure: ImageNet accuracy of compound scaling vs. single-dimension scaling on the same baseline]

Compared to single-dimension scaling methods, the compound scaling method achieves better accuracy on the same baseline network.

[Table/Figure: ImageNet accuracy, parameters, and FLOPS of EfficientNets vs. other ConvNets]

Scaled EfficientNet models achieve better accuracy with far fewer parameters and FLOPS than other ConvNets.

[Figure: class activation maps for models scaled by different methods]

Why is their compound scaling method better than the others? As shown in the above figure, the model with compound scaling tends to focus on more relevant regions with more object details.
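For reference, here is a minimal sketch of how such a class activation map can be computed, assuming a network whose classifier is a linear layer on top of globally average-pooled conv features (the generic CAM recipe, not the paper's exact code; the shapes are illustrative):

```python
import torch

def class_activation_map(features: torch.Tensor, fc_weight: torch.Tensor, class_idx: int) -> torch.Tensor:
    """features:  (C, H, W) activations of the last conv layer.
    fc_weight: (num_classes, C) weights of the final linear classifier.
    Returns an (H, W) map of the regions that drive the chosen class score."""
    cam = torch.einsum("c,chw->hw", fc_weight[class_idx], features)
    cam = torch.relu(cam)              # keep only positive evidence
    return cam / (cam.max() + 1e-8)    # normalize to [0, 1]

# Toy usage with random tensors standing in for real activations and weights.
print(class_activation_map(torch.randn(1280, 7, 7), torch.randn(1000, 1280), 42).shape)
```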

5. Conclusion

They propose a simple and highly effective compound scaling method, which enables us to easily scale up a baseline ConvNet while maintaining model efficiency. They demonstrate that a mobile-size EfficientNet model can be scaled up very effectively, surpassing state-of-the-art accuracy with fewer parameters and FLOPS on ImageNet.

References

[1] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

[2] MnasNet: Platform-Aware Neural Architecture Search for Mobile


Comments

EfficientNet came up while I was talking with a friend, and since it felt similar to MobileNet, which I had looked at before, I decided to read the paper. I'm glad I learned about the new compound scaling method, and I hope reading papers keeps becoming a habit. As I wrote this, it turned into a summary rather than a review, but I think that's fine in its own way. Rather than risk distorting the authors' intent by letting subjective opinions slip into a review, summarizing just the key points separately is... not a bad approach either, I think.