1. Introduction
1.1 Motivation
The existing methods for object detection mainly have two problems.
(i) Most previous works have developed network structures for cross-scale feature fusion. However, they usually contribute to the fused output feature unequally.
(ii) While previous works mainly rely on bigger backbone networks or larger input image sizes for higher accuracy, scaling up feature network and box/class prediction network is also critical when taking into account both accuracy and efficiency.
1.2 Goal
For (i), they propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion. And, in order to solve (ii), they propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time.
Then, they call object detector that has these optimization methods as EfficientDet.
2. BiFPN
First, they formulate the multi-scale feature fusion problem that aims to aggregate features at different resolutions. And then, introduce ideas of BiFPN.
2.1 Problem Formulation
: A list of multi-scale features : The feature at level
: Transformation- (1) which is effective to aggregate different features
- (2) which outputs a list of new features
For an example of a multi-scale feature fusion, FPN [T. Y. Lin et al, 2017] has the conventional top-down pathway. The way of fusion describes in the below figure.

2.2 Cross-Scale Connections

The above figure represents multi-scale feature fusion method of previous approaches.
The problem of existing methods for feature fusion is:
- FPN: one-way information flow
- PANet: better accuracy, but the high cost of more parameters and computations
- NAS-FPN: requiring thousands of GPU hours
To improve model efficiency, this paper proposes several optimizations for cross-scale connections:
- Remove those nodes that only have one input edge.
- Why?
- One input edge with no feature fusion has less contribution to the feature network.
- This leads to a simplified bi-directional network.
- Why?
- Add an extra edge from the original input to the output node if they're at the same level.
- Why?
- This makes to fuse more features without adding much cost.
- Why?
- Treat each bidirectional path as one feature network and repeat the same layer multiple times.
- Why?
- ...
- Why?
2.3 Weighted Feature Fusion
When fusing features with different resolutions, a common way is to first resize them to the same resolution and then sum them up. At this point,
all previous methods treat all input features equally without distinction.
In this paper, in order to contribute to the output feature unequally from the input feature, the authors add an additional weight for each input and let the network learn the importance of each input feature.
Fast normalized fusion
,where
The final BiFPN integrates both the bidirectional cross scale connections and the fast normalized fusion. As a concrete example, they describe the two fused features at level 6 for BiFPN shown in Fig 2 (d):

3. EfficientDet
Based on BiFPN, they developed a detection model named EfficientDet.
3.1 EfficientDet Architecture

Fig 3. shows the overall architecture of EfficientDet, which is the one-stage detectors paradigm. They use ImageNet-pretrained EfficientNets as the backbone network.
BiFPN serves as the feature network and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are fed to a class and box network. The class and box network weights are shared across all levels of features.
3.2 Compound Scaling
Aiming at optimizing both accuracy and efficiency, they would like to develop a family of models that can meet a wide spectrum of resource constraints. A key challenge here is how to scale up a baseline EfficientDet model.
Therefore, they propose a new compound scaling method for object detection, which uses a simple compound coefficient
Backbone network
They reuse the same width/depth scaling coefficients of EfficientNet-B0 to B6.
BiFPN network
They linearly increase BiFPN depth
Especially, they perform a grid-search on a list of values {1.2, 1.25, 1.3, 1.35, 1.4, 1.45}, and pick the best value 1.35 as the BiFPN width scaling factor. Formally, BiFPN width and depth are scaled with the following equation:
Box/class prediction network
They fix their width to be always the same as BiPFN (i.e.,
Input image resolution
Since feature levels 3-7 are used in BiFPN, the input resolution must be dividable by
Following Eq. (1), (2), (3) with different

4. Experiment Setting
- Dataset: COCO 2017 detection datasets.
- Optimizer: SGD optimizer with momentum 0.9 and weight-decay 4e-5.
- Learning rate: linearly increased from 0 to 0.16 in the first training epoch then annealed down using cosine decay rule.
- Epoch size: 300 (D7/D7x is 600)
- Batch size: 128
- Batch norm: The synchronized batch norm is added after every convolution with batch norm decay 0.99 and epsilon 1e-3.
- Activation: SiLU (Swish-1) activation and exponential moving average with decay 0.9998.
- Loss: commonly-used focal loss with
and and aspect ratio {1/2, 1, 2}. - Data augmentation: horizontal flipping and scale jittering [0.1, 2.0], which randomly resizes images between 0.1x and 2.0x of the original size before cropping.
- Evaluation method: soft-NMS
5. Experiment Result

Reference
Tan, Mingxing, Ruoming Pang, and Quoc V. Le. "Efficientdet: Scalable and efficient object detection." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
'AI paper review' 카테고리의 다른 글
Rate-Perception Optimized Preprocessing for Video Coding 논문 리뷰 (1) | 2023.12.31 |
---|---|
LoRA: Low-Rank Adaptation of Large Language Models 논문 리뷰 (0) | 2023.05.16 |
Segment Anything 논문 리뷰 (0) | 2023.04.07 |
GPT-1: Improving Language Understanding by Generative Pre-Training 논문 리뷰 (0) | 2023.02.13 |
The Forward-Forward Algorithm: Some Preliminary Investigations 논문 리뷰 (0) | 2023.01.28 |