What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis (ICCV 2019 Oral)

Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, Hwalsuk Lee

arXiv | GitHub

Motivations for this Research

Examples of regular (IIIT5K, SVT, IC03, IC13) and irregular (IC15, SVTP, CUTE) real-world datasets

Reading text in natural scenes, as shown above, is referred to as scene text recognition (STR) and has long been an essential task in many industrial applications. In recent years, researchers have proposed an increasing number of new STR models, each claiming to have pushed the boundary of the technology.

While existing methods have pushed this boundary, the field has largely lacked a means for holistic and fair comparison because of inconsistent choices of training and evaluation datasets. It is difficult to tell whether, and by how much, a new module improves upon the current art, because differing training and testing environments make reported numbers hard to compare at face value.

Problem: Inconsistent Comparison

The table shows the performance of existing STR models together with their inconsistent training and evaluation settings. This inconsistency hinders fair comparison among the methods. We present the results reported in the original papers alongside our reimplemented results in a unified, consistent setting. In the last row, we also show the best model found by our framework, which is competitive with state-of-the-art methods.
The top accuracy for each benchmark is shown in bold.

We examine the training and evaluation datasets used by prior works and point out their discrepancies. We aim to highlight how each work differs in constructing and using its datasets, and to scrutinize the bias this introduces when comparing performance across works.

Training datasets: Labeling scene text images is costly, so it is difficult to obtain enough labeled data to train an STR model. Instead of real data, most STR models have therefore been trained on synthetic datasets. The two most popular synthetic datasets in recent STR papers are MJSynth (MJ) and SynthText (ST).

Prior works have used different combinations of MJ, ST, and other sources. This raises the question of whether a reported improvement comes from the proposed module or from better or larger training data. In response, we describe the influence of the training datasets on final benchmark performance. Practitioners can analyze their work more reliably by indicating which training datasets were used and by comparing models trained on a uniform training set.

Evaluation datasets: Evaluation datasets also differ across methods, which causes further performance discrepancies. For example, different works use different subsets of the IC13 dataset as part of their evaluation set, which alone can account for a performance gap of more than 15%. Such discrepancies inhibit fair comparison among models.

It is therefore important to compare results on the same datasets, since dataset differences alone can create a performance gap. Throughout our experiments, we train on MJ+ST and evaluate every module combination on the same benchmark sets.
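To make the unified training setup concrete, here is a minimal PyTorch sketch of training on the union of MJ and ST. The tiny ListDataset stub and its placeholder samples are ours for illustration only; in practice both datasets are read from their released archives, and the official implementation on GitHub should be consulted for the exact pipeline.

```python
# Minimal sketch (not the authors' exact pipeline): train every module
# combination on the same MJ+ST union so that accuracy differences can be
# attributed to the modules, not the data.
import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class ListDataset(Dataset):
    """Stand-in for the LMDB-backed readers used for the synthetic datasets."""
    def __init__(self, samples):
        self.samples = samples            # list of (image_tensor, label_string)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

# Placeholder data; in practice these would be the MJ and ST archives.
mj = ListDataset([(torch.zeros(1, 32, 100), "sample")])
st = ListDataset([(torch.zeros(1, 32, 100), "text")])

train_set = ConcatDataset([mj, st])       # the unified MJ+ST training set
loader = DataLoader(train_set, batch_size=2, shuffle=True)
```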

The Four-Stage STR Framework

Visualization of an example flow of scene text recognition

Since STR resembles other computer vision tasks such as object detection, as well as sequence prediction tasks, it has benefited from high-performance convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The Convolutional Recurrent Neural Network (CRNN), the first combined application of a CNN and an RNN for STR, extracts CNN features from the input text image and reconfigures them with an RNN for robust sequence prediction. Many variants have since been proposed to boost performance: transformation modules normalize text images and rectify arbitrary text geometries; improved CNN feature extractors handle complex text images with high intrinsic dimensionality and latent factors such as font style and cluttered backgrounds; some methods omit the RNN stage to reduce inference time; and attention-based decoders improve character sequence prediction.

Building on the commonalities among these independently proposed STR models, we suggest a unified STR framework with four stages: transformation, feature extraction, sequence modeling, and prediction. A code sketch of how the stages compose follows the list below.

  1. Transformation stage 
    The input image is transformed into a normalized image. The module uses a thin-plate spline (TPS) transformation, a variant of the spatial transformer network (STN), to ease the burden on later stages of handling the unaltered, irregular text geometries found in natural scenes. Our framework allows TPS to be selected or deselected.
  2. Feature extraction stage
    A CNN abstracts the input image into a visual feature map that focuses on the attributes relevant to character recognition while suppressing irrelevant factors such as font, color, size, and background. We study three architectures: VGG, RCNN, and ResNet.
  3. Sequence modeling stage 
    This stage captures contextual information within a sequence of characters so that the next stage can predict each character more robustly rather than independently. Our framework allows BiLSTM to be selected or deselected.
  4. Prediction stage 
    This stage predicts the output character sequence from the identified features of the image. There are two options: connectionist temporal classification (CTC) and attention-based sequence prediction (Attn).
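The following is a minimal PyTorch sketch of how the four stages compose. The sub-modules are simplified stand-ins (an identity transform, a toy CNN, an optional BiLSTM, and a per-timestep classifier), not the paper's TPS, ResNet, or attention implementations; the sketch only illustrates the staged structure and the selectable sequence-modeling stage.

```python
import torch
import torch.nn as nn

class FourStageSTR(nn.Module):
    """Toy illustration of the Trans. -> Feat. -> Seq. -> Pred. pipeline."""

    def __init__(self, num_classes, use_bilstm=True, hidden=256):
        super().__init__()
        # 1. Transformation: the framework optionally inserts a TPS-based STN here;
        #    this sketch keeps an identity placeholder instead.
        self.transform = nn.Identity()
        # 2. Feature extraction: a toy CNN standing in for VGG / RCNN / ResNet.
        self.extract = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, hidden, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # collapse height into a width-wise sequence
        )
        # 3. Sequence modeling: optional BiLSTM over the width-wise feature sequence.
        self.use_bilstm = use_bilstm
        if use_bilstm:
            self.sequence = nn.LSTM(hidden, hidden // 2,
                                    bidirectional=True, batch_first=True)
        # 4. Prediction: a per-timestep classifier; CTC decoding would be applied
        #    on top of these logits, or an attention decoder would replace this head.
        self.predict = nn.Linear(hidden, num_classes)

    def forward(self, image):                   # image: (B, 1, 32, 100) grayscale crop
        x = self.transform(image)
        feat = self.extract(x)                  # (B, C, 1, W)
        seq = feat.squeeze(2).permute(0, 2, 1)  # (B, W, C)
        if self.use_bilstm:
            seq, _ = self.sequence(seq)
        return self.predict(seq)                # (B, W, num_classes) logits

model = FourStageSTR(num_classes=37, use_bilstm=True)
logits = model(torch.zeros(2, 1, 32, 100))
print(logits.shape)                             # torch.Size([2, 50, 37])
```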

Analysis of Tradeoffs for Module Combinations I

The scatterplots illustrate the two types of tradeoffs exhibited by STR module combinations. Stars indicate previously proposed models; circular dots represent new module combinations evaluated by our framework. Red solid curves mark the tradeoff frontiers found among the combinations, and the combinations along each frontier are labeled in ascending order of accuracy.

T1-T5 (accuracy-time frontier combinations):

#    Trans.  Feat.    Seq.     Pred.  Acc. (%)  Time (ms)  Params (×10^6)
T1   None    VGG      None     CTC    69.5       1.3        5.6
T2   None    ResNet   None     CTC    80.0       4.7       46.0
T3   None    ResNet   BiLSTM   CTC    81.9       7.8       48.7
T4   TPS     ResNet   BiLSTM   CTC    82.9      10.9       49.6
T5   TPS     ResNet   BiLSTM   Attn   84.0      27.6       49.6

P1-P5 (accuracy-memory frontier combinations):

#    Trans.  Feat.    Seq.     Pred.  Acc. (%)  Time (ms)  Params (×10^6)
P1   None    RCNN     None     CTC    75.4       7.7        1.9
P2   None    RCNN     None     Attn   78.5      24.1        2.9
P3   TPS     RCNN     None     Attn   80.6      26.4        4.6
P4   TPS     RCNN     BiLSTM   Attn   82.3      30.1        7.2
P5   TPS     ResNet   BiLSTM   Attn   84.0      27.6       49.6

T1-T5: Accuracy versus time tradeoff curve and its frontier combinations
P1-P5: Accuracy versus memory tradeoff curve and its frontier combinations
Modules shown in red are those changed from the immediately preceding combination; each such change improves performance over the previous combination while minimizing the added time or memory cost.

Accuracy-Speed (T1-T5)
• On the frontier: Rosetta and STAR-Net; the other four previously proposed models lie inside the frontier.
• T1 is the model with the shortest inference time.
• T1 → T5: each shift sequentially increases the complexity of the overall STR model, improving accuracy at the cost of computational efficiency. For example, ResNet, BiLSTM, and TPS introduce a relatively moderate overall slowdown (1.3 ms → 10.9 ms) while considerably boosting accuracy (69.5% → 82.9%). The last change, Attn, improves accuracy by only 1.1% yet raises the total inference time to 27.6 ms.

Accuracy-Memory (P1-P5)
• On the frontier: R2AM; the other five previously proposed models lie inside the frontier.
• P1 is the model with the smallest memory footprint.
• P1 → P5: each shift sequentially increases accuracy at the cost of memory. Compared to VGG in T1, the RCNN used in P1-P4 gives a favorable accuracy-memory tradeoff by repeatedly applying a small number of distinct CNN layers. The transformation, sequence, and prediction modules contribute little to memory consumption (1.9M → 7.2M parameters) while still providing accuracy improvements (75.4% → 82.3%). In contrast, the last change, ResNet, increases accuracy by only 1.7% at the cost of raising memory consumption from 7.2M to 49.6M parameters.

A practitioner concerned about memory consumption can thus freely choose among the transformation, sequence, and prediction modules, but should refrain from using heavy feature extractors such as ResNet.
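As an illustration of how such a frontier is obtained, the sketch below extracts the non-dominated combinations from (time, accuracy) pairs. The entries reuse the T1-T5 numbers from the table above; applying the same routine to (parameters, accuracy) pairs yields the P1-P5 frontier.

```python
# Minimal sketch: extracting the accuracy-vs-time tradeoff frontier from
# combination results (numbers copied from the T1-T5 table above).
combos = [
    # (name, time_ms, accuracy_%)
    ("None-VGG-None-CTC",        1.3, 69.5),   # T1
    ("None-ResNet-None-CTC",     4.7, 80.0),   # T2
    ("None-ResNet-BiLSTM-CTC",   7.8, 81.9),   # T3
    ("TPS-ResNet-BiLSTM-CTC",   10.9, 82.9),   # T4
    ("TPS-ResNet-BiLSTM-Attn",  27.6, 84.0),   # T5
]

def frontier(points):
    """Keep combinations not dominated by any faster-and-more-accurate one."""
    points = sorted(points, key=lambda p: p[1])   # sort by cost (time)
    best_acc, front = float("-inf"), []
    for name, cost, acc in points:
        if acc > best_acc:                        # strictly better accuracy
            front.append((name, cost, acc))
            best_acc = acc
    return front

for name, cost, acc in frontier(combos):
    print(f"{name:28s} {cost:5.1f} ms  {acc:.1f}%")
```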

Analysis of Tradeoffs for Module Combinations II

Scatterplots grouped by the most speed- and memory-critical modules, namely the prediction and feature extraction modules, respectively.

There are distinct clusters of combinations based on the prediction and feature extraction modules. For the accuracy-speed tradeoff, we identify CTC and Attn* clusters. For the accuracy-memory tradeoff, we observe that the feature extractor has the most significant influence on memory.

Note that the most significant module differs for each criterion. Thus, given a particular application scenario and its constraints, a practitioner should carefully select the module combination that gives the optimal tradeoff for their priorities.

* The addition of Attn significantly slows the overall STR model.
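A small sketch of such constraint-driven selection: filter the combinations by a latency or memory budget and take the most accurate survivor. The entries reuse numbers from the tables above; the budgets themselves are arbitrary examples.

```python
# Pick the most accurate combination that satisfies a time or memory budget.
combos = [
    # (name, time_ms, params_M, accuracy_%)  -- values from the tables above
    ("None-VGG-None-CTC",       1.3,  5.6, 69.5),
    ("None-RCNN-None-CTC",      7.7,  1.9, 75.4),
    ("None-ResNet-None-CTC",    4.7, 46.0, 80.0),
    ("None-ResNet-BiLSTM-CTC",  7.8, 48.7, 81.9),
    ("TPS-ResNet-BiLSTM-CTC",  10.9, 49.6, 82.9),
    ("TPS-ResNet-BiLSTM-Attn", 27.6, 49.6, 84.0),
]

def best_under(combos, max_time_ms=None, max_params_m=None):
    feasible = [c for c in combos
                if (max_time_ms is None or c[1] <= max_time_ms)
                and (max_params_m is None or c[2] <= max_params_m)]
    return max(feasible, key=lambda c: c[3]) if feasible else None

print(best_under(combos, max_time_ms=10))    # latency-constrained choice
print(best_under(combos, max_params_m=10))   # memory-constrained choice
```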

Module-Wise Analysis

Stage   Module   Regular Acc. (%)   Irregular Acc. (%)   Time (ms/image)   Params (×10^6)
Trans.  None     85.6               65.7                 N/A               N/A
        TPS      86.7 (+1.1)        69.1 (+3.4)          3.6               1.7
Feat.   VGG      84.5               63.9                 1.0               5.6
        RCNN     86.2 (+1.7)        67.3 (+3.4)          6.9               1.8
        ResNet   88.3 (+3.8)        71.0 (+7.1)          4.1              44.3
Seq.    None     85.1               65.2                 N/A               N/A
        BiLSTM   87.6 (+2.5)        69.7 (+4.5)          3.1               2.7
Pred.   CTC      85.5               66.1                 0.1               0.0
        Attn     87.2 (+1.7)        68.7 (+2.6)          17.1              0.9

Study of the modules at each of the four stages in terms of accuracy, inference time, and number of parameters. Accuracies are obtained by averaging the results of all combinations that include the given module; inference time and number of parameters are measured for each module individually.

We investigate module-wise performance in terms of accuracy, speed, and memory demand by calculating the marginalized accuracy of each module, i.e., by averaging the accuracy over all combinations that include that module (a sketch of this computation follows below). Upgrading a module at any stage requires additional time or memory but yields a performance improvement. The table shows that the improvement on irregular datasets is roughly twice that on regular benchmarks at every stage. In terms of accuracy gained per unit of time, the most efficient order in which to upgrade modules from the base combination None-VGG-None-CTC is ResNet, BiLSTM, TPS, and then Attn. This order is identical to that of the accuracy-time frontier combinations (T1→T5).
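A minimal sketch of this marginalization: the accuracy attributed to a module choice is the mean over all 24 combinations containing it. The results dictionary below is filled with dummy numbers purely to make the code runnable; it does not reproduce the paper's measurements.

```python
from itertools import product

TRANS = ["None", "TPS"]
FEAT  = ["VGG", "RCNN", "ResNet"]
SEQ   = ["None", "BiLSTM"]
PRED  = ["CTC", "Attn"]

# results[(trans, feat, seq, pred)] = benchmark accuracy; dummy values here.
results = {combo: 70.0 + 0.5 * i
           for i, combo in enumerate(product(TRANS, FEAT, SEQ, PRED))}   # 24 combos

def marginalized(results, stage_index, choice):
    """Mean accuracy over every combination whose stage `stage_index` uses `choice`."""
    accs = [acc for combo, acc in results.items() if combo[stage_index] == choice]
    return sum(accs) / len(accs)

for stage_index, (stage, choices) in enumerate(
        [("Trans.", TRANS), ("Feat.", FEAT), ("Seq.", SEQ), ("Pred.", PRED)]):
    for choice in choices:
        print(f"{stage:6s} {choice:7s} {marginalized(results, stage_index, choice):.2f}")
```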

From an accuracy-memory perspective, on the other hand, the most efficient order is RCNN, Attn, TPS, BiLSTM, and then ResNet, matching the order of the accuracy-memory frontier (P1→P5). Note that the efficient upgrade order for time is the reverse of that for memory; the different properties of the modules thus offer different choices in practice. Furthermore, the module ranking in each perspective matches the order of the frontier module changes, showing that each module contributes to performance to a similar degree across all combinations.

Qualitative Analysis

Each module facilitates text recognition by addressing targeted difficulties of the STR task. The figure shows samples that are correctly recognized only when a specific module is upgraded; each row corresponds to a module upgrade at one stage of our framework (e.g., from a VGG to a ResNet backbone). Before the upgrade, the samples were misrecognized; afterward, they became recognizable.

The following examples offer insights into the contribution points of the modules in real-world applications.

Trans.1 (None→TPS)
Feat.2 (VGG→ResNet)
Seq.3 (None→BiLSTM)
Pred.4 (CTC→Attn)

Challenging examples for the STR combinations without a specific module. All STR combinations without the notated modules failed to recognize text in the examples, but upgrading the module solved the problem.

  1. The TPS transformation normalizes curved and perspective text into a standardized view. Predictions improve dramatically, especially for "POLICE" in a circular brand logo and "AIRWAYS" on a storefront sign viewed in perspective.
  2. The advanced feature extractor, ResNet, provides greater representation power, improving on cases with substantial background clutter ("YMCA," "CITY ARTS") and unseen fonts ("NEUMOS").
  3. BiLSTM leads to better context modeling by adapting the receptive field; it can ignore unrelated, accidentally cropped characters ("I" at the end of "EXIT," "C" at the end of "G20").
  4. Attention-based prediction, with its implicit character-level language modeling, recovers missing or occluded characters, such as "a" in "Hard," "t" in "to," and "S" in "HOUSE."

Failure Case Analysis*

We have probed into the failure cases of all 24 combinations. Because our framework is derived from the commonalities among previously proposed STR models, and because our best model performs competitively with them, the failure cases presented here constitute a common challenge for the field.

Among the 8,539 examples in the benchmark datasets, 644 images (7.5%) are not correctly recognized by any of the 24 models considered. We found six common failure types, shown below, and discuss the challenges they pose along with guidelines for future research: calligraphic fonts, vertical texts, special characters, heavy occlusions, low resolution, and label noise.

Samples of failure cases on all combinations of our framework

For example, calligraphic fonts, such as stylized brand lettering like "Coca Cola" or shop names on the street like "Cafe," still pose challenges. Such diverse renderings of characters call for a new feature extractor that provides more generalized visual features. Another possible approach is regularization, because the model may be overfitting to the font styles in the training dataset.

* All failure cases are available on our GitHub for future research on corner cases of the STR problem.

Conclusion

Despite significant advances in scene text recognition (STR) models, they have been compared on inconsistent benchmarks, making it difficult to determine whether, and how, a proposed module improves the STR baseline. Our work analyzes the contributions of existing STR modules, which were previously obscured by inconsistent experimental settings. It also introduces a comprehensive framework covering the major STR methods along with consistent datasets: seven benchmark evaluation datasets and two training datasets (MJ and ST). We provide a fair comparison among the major STR methods, identify the modules that yield the greatest gains in accuracy, speed, and size, and thoroughly analyze module-wise contributions to typical STR challenges as well as the remaining failure cases.