Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, Hwalsuk Lee
Motivations for this Research
Examples of regular (IIIT5k, SVT, IC03, IC13) and irregular (IC15, SVTP, CUTE) real-world datasets
Reading text in natural scenes, as shown above, is referred to as scene text recognition (STR) and has been an essential task in many industrial applications. In recent years, researchers have proposed an increasing number of new STR models, each claiming to have widened the boundary of the technology.
While existing methods have pushed the boundary of the technology, the means for a holistic and fair comparison have been largely missing from the field because of inconsistent choices of training and evaluation datasets. These differing assessment and testing environments make it hard to compare reported numbers at face value, and thus to determine whether, and by how much, a new module improves upon the current art.
Problem: Inconsistent Comparison
The table shows the performance of existing STR models together with their inconsistent training and evaluation settings; this inconsistency hinders fair comparison among the methods. We present the results reported by the original papers alongside our reimplemented results obtained in a unified, consistent setting. In the last row, we also show the best model found by our analysis, which is competitive with state-of-the-art methods.
The top accuracy for each benchmark is shown in bold.
We examine the different training and evaluation datasets used by prior works and point out their discrepancies. We aim to highlight how each work differs in constructing and using its datasets, and to scrutinize the bias this introduces when comparing performance across works.