Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo
Why Data Augmentation is Needed
Given a limited dataset, a model can be trained on variations of the existing images. Such modifications effectively enlarge the dataset and improve the model's robustness. Data augmentation strategies that enhance localization and generalization performance have therefore been proposed to improve convolutional neural network classifiers.
 | ResNet-50 | Mixup | Cutout | CutMix
---|---|---|---|---
Label | Dog 1.0 | Dog 0.5, Cat 0.5 | Dog 1.0 | Dog 0.6, Cat 0.4
ImageNet Cls (%) | 76.3 (+0.0) | 77.4 (+1.1) | 77.1 (+0.8) | 78.4 (+2.1)
ImageNet Loc (%) | 46.3 (+0.0) | 45.8 (-0.5) | 46.7 (+0.4) | 47.3 (+1.0)
Pascal VOC Det (mAP) | 75.6 (+0.0) | 73.9 (-1.7) | 75.1 (-0.5) | 76.7 (+1.1)
Results of Mixup, Cutout, and our CutMix on ImageNet classification, ImageNet localization, and Pascal VOC 07 detection. CutMix improves performance on all three tasks.
Mixup
Mixup is a strategy that blends two samples, with the ground-truth label of the new sample given by the same linear combination of the one-hot labels. It is implemented by interpolating both the images and the labels. Mixup variants perform feature-level interpolation and other types of transformations, further improving classification performance. Despite the diversity Mixup generates, its samples tend to be locally ambiguous and unnatural, which can confuse the model and degrade its localization ability. In addition, Mixup variants have lacked a deep analysis of localization ability and transfer-learning performance.
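For reference, the interpolation described above fits in a few lines. Below is a minimal PyTorch-style sketch (the function name `mixup_batch` and the Beta(α, α) sampling follow the original Mixup formulation; this is an illustration, not code released with this post):

```python
import numpy as np
import torch

def mixup_batch(images, labels_onehot, alpha=1.0):
    """Mixup: convexly combine each image (and its one-hot float label)
    with another sample drawn from the same mini-batch."""
    lam = np.random.beta(alpha, alpha)        # mixing coefficient in [0, 1]
    perm = torch.randperm(images.size(0))     # random pairing within the batch
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_images, mixed_labels
```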
Cutout
Overfitting occurs when a statistical model fits its training data too closely; Cutout is a regularization method proposed to prevent it and to reinforce the generalization and localization performance of CNNs. However, Cutout and other regional dropout methods remove informative pixels from training images by overlaying a patch of black pixels or random noise. This is undesirable, as it causes information loss and makes training inefficient.
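As an illustration of the information loss mentioned above, a minimal Cutout-style sketch in PyTorch might look as follows (the patch size and the zero fill value are illustrative choices; the original Cutout applies the mask per image during data loading):

```python
import numpy as np
import torch

def cutout_batch(images, size=16):
    """Cutout: zero out a random square patch in each image.
    The masked pixels carry no information during training."""
    _, _, h, w = images.shape
    for img in images:
        cy, cx = np.random.randint(h), np.random.randint(w)            # patch center
        y1, y2 = np.clip(cy - size // 2, 0, h), np.clip(cy + size // 2, 0, h)
        x1, x2 = np.clip(cx - size // 2, 0, w), np.clip(cx + size // 2, 0, w)
        img[:, y1:y2, x1:x2] = 0.0                                      # overlay a black patch
    return images
```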
Introduction of CutMix
This motivates a new data augmentation strategy. CutMix, a crossover of Cutout and Mixup, is a novel augmentation strategy that compensates for the disadvantages of the existing methods. CutMix produces new samples by cutting and pasting patches between images within a mini-batch, which leads to performance boosts in many computer vision tasks, and the additional cost of sample generation is negligible. CutMix is also superior to the aforementioned methods: it outperforms Mixup and Cutout on image classification, weakly supervised object localization, object detection, and image captioning, while improving the model's robustness against input corruptions, alleviating the over-confidence issue, and strengthening out-of-distribution detection.
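Formally, the combining operation defined in the CutMix paper mixes two training samples (x_A, y_A) and (x_B, y_B) with a binary rectangular mask M:

```latex
\tilde{x} = \mathbf{M} \odot x_A + (\mathbf{1} - \mathbf{M}) \odot x_B,
\qquad
\tilde{y} = \lambda\, y_A + (1 - \lambda)\, y_B
```

Here ⊙ denotes element-wise multiplication, and the combination ratio λ is drawn from a Beta(α, α) distribution (the paper sets α = 1, i.e. the uniform distribution), with the rectangular region sampled so that the pasted area corresponds to a 1 − λ fraction of the image.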
How CutMix is Performed
Class activation mapping (CAM) visualizations on 'Saint Bernard' and 'Miniature Poodle' samples under Mixup, Cutout, and CutMix. Note that CutMix can make use of both mixed regions of an image, whereas Mixup and Cutout cannot.

Top-1 test error curves for CIFAR-100 (left) and ImageNet (right) classification. Toward the end of training, CutMix avoids overfitting and reaches lower test errors than the baseline.
To implement CutMix, patches are cut and pasted among training images, and the ground-truth labels are mixed in proportion to the area each patch occupies in the combined image. This retains the regularization effect of regional dropout, yet no pixel is uninformative during training, which keeps training efficient while still encouraging the model to attend to non-discriminative parts of objects. The pasted patches further improve localization ability by requiring the model to identify the object from a partial view, and the training and inference budgets remain unchanged. With these properties, CutMix consistently surpasses state-of-the-art augmentation strategies not only on the CIFAR and ImageNet classification tasks but also on the ImageNet weakly supervised localization task. Furthermore, unlike pre-existing augmentation methods, a CutMix-trained ImageNet classifier, when used as a pre-trained model, yields consistent performance gains on the Pascal VOC detection and MS-COCO image captioning benchmarks. Last but not least, CutMix improves model robustness against input corruptions and its out-of-distribution detection performance.
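As a concrete reference, here is a minimal PyTorch sketch of the batch-level operation described above (the helper names `rand_bbox` and `cutmix_batch` are illustrative; see the official CutMix repository for the reference implementation). The steps are: sample λ from Beta(α, α), cut a random box covering a 1 − λ fraction of the image, paste that region from a shuffled copy of the batch, and recompute λ from the actual pasted area for label mixing.

```python
import numpy as np
import torch

def rand_bbox(width, height, lam):
    """Sample a random box covering roughly a (1 - lam) fraction of the image area."""
    cut_ratio = np.sqrt(1.0 - lam)
    cut_w, cut_h = int(width * cut_ratio), int(height * cut_ratio)
    cx, cy = np.random.randint(width), np.random.randint(height)  # box center
    x1, x2 = np.clip(cx - cut_w // 2, 0, width), np.clip(cx + cut_w // 2, 0, width)
    y1, y2 = np.clip(cy - cut_h // 2, 0, height), np.clip(cy + cut_h // 2, 0, height)
    return x1, y1, x2, y2

def cutmix_batch(images, labels, alpha=1.0):
    """CutMix: paste a random patch from a shuffled copy of the batch and
    return both label sets together with the area-based mixing ratio lam."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    x1, y1, x2, y2 = rand_bbox(images.size(3), images.size(2), lam)
    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # Adjust lam to the exact pasted area, since clipping may shrink the box.
    lam = 1.0 - (x2 - x1) * (y2 - y1) / (images.size(2) * images.size(3))
    return images, labels, labels[perm], lam
```

In the training loop, the loss is then the λ-weighted sum of the two cross-entropy terms, e.g. `lam * criterion(output, labels_a) + (1 - lam) * criterion(output, labels_b)`, so no extra forward or backward passes are required, consistent with the unchanged training budget noted above.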
Experiments
Several experiments were conducted to assess CutMix's ability to improve the localization and generalization of a trained model across multiple tasks. The experiments verify that CutMix outperforms other state-of-the-art regularization methods on these tasks, and we further analyze the inner mechanisms behind this superiority.
All experiments were implemented and evaluated on the NAVER Smart Machine Learning (NSML) platform with PyTorch.
Method | Top-1 Error (%) | Model file |
---|---|---|
PyramidNet-200 [CVPR’17] (baseline) | 16.45 | model |
PyramidNet-200 + CutMix | 14.23 | model |
PyramidNet-200 + ShakeDrop [arXiv’18] + CutMix | 13.81 | – |
PyramidNet-200 + Mixup [ICLR’18] | 15.63 | model |
PyramidNet-200 + Manifold Mixup [ICML’19] | 16.14 | model |
PyramidNet-200 + Cutout [arXiv’17] | 16.53 | model |
PyramidNet-200 + DropBlock [NeurIPS’18] | 15.73 | model |
PyramidNet-200 + Cutout + Label smoothing | 15.61 | model |
PyramidNet-200 + DropBlock + Label smoothing | 15.16 | model |
PyramidNet-200 + Cutout + Mixup | 15.46 | model |
Comparison of state-of-the-art regularization methods and the impact of CutMix on CIFAR-100 classification (PyramidNet-200 backbone).
Method | Top-1 Error (%) | Model file |
---|---|---|
ResNet-50 [CVPR’16] (baseline) | 23.68 | model |
ResNet-50 + CutMix | 21.40 | model |
ResNet-50 + Feature CutMix | 21.80 | model |
ResNet-50 + Mixup [ICLR’18] | 22.58 | model |
ResNet-50 + Manifold Mixup [ICML’19] | 22.50 | model |
ResNet-50 + Cutout [arXiv’17] | 22.93 | model |
ResNet-50 + AutoAugment [CVPR’19] | 22.40* | – |
ResNet-50 + DropBlock [NeurIPS’18] | 21.87* | – |
ResNet-101 + CutMix | 20.17 | model |
ImageNet classification results for ResNet-50 and ResNet-101 and the impact of CutMix (* denotes results reported in the original papers).
Backbone | ImageNet Cls Top-1 Err (%) | ImageNet Loc (%) | CUB200 Loc (%) | Detection (SSD) (mAP) | Detection (Faster-RCNN) (mAP) | Image Captioning (BLEU-4) |
---|---|---|---|---|---|---|
ResNet50 | 23.68 | 46.30 | 49.41 | 76.7 | 75.6 | 22.9
ResNet50+Mixup | 22.58 | 45.84 | 49.30 | 76.6 | 73.9 | 23.2
ResNet50+Cutout | 22.93 | 46.69 | 52.78 | 76.8 | 75.0 | 24.0
ResNet50+CutMix | 21.60 | 46.25 | 54.81 | 77.6 | 76.7 | 24.9 |
The impact of using a CutMix-pretrained model for transfer learning to other tasks: weakly supervised localization, object detection, and image captioning.
Conclusion
We have demonstrated that CutMix is simple, easy to apply, free of computational overhead, yet surprisingly effective. On ImageNet classification, applying CutMix to ResNet-50 and ResNet-101 yields +2.08% and +1.70% top-1 accuracy improvements. On CIFAR classification, CutMix also improves the baseline by a significant +2.22%, reaching a state-of-the-art top-1 error of 14.23%. On weakly supervised object localization (WSOL), CutMix improves localization accuracy and achieves localization performance comparable to state-of-the-art WSOL methods without applying any dedicated WSOL techniques. Furthermore, simply using a CutMix-pretrained ImageNet model as the initialized backbone for object detection and image captioning improves overall performance. Finally, unlike other augmentation methods, CutMix also strengthens the model's robustness and alleviates the over-confidence issue.