Jaejun Yoo, Youngjung Uh, Sanghyuk Chun, Byeongkyu Kang, Jung-Woo Ha
What is Photorealistic Style Transfer and Why is It Needed?
Photorealistic stylization results. Given (a) an input pair (top: content, bottom: style), the results of (b) WCT (c) PhotoWCT and (d) our model are shown. Every result is produced without any post-processing. While WCT and PhotoWCT suffer from spatial distortions, our model successfully transfers the style and preserves the fine details.
Photorealistic style transfer is the task of transferring the look of a reference image onto a content image, for example applying temporal changes such as day to night or summer to winter. To achieve photorealism, a model must simultaneously meet two contradictory objectives, which is indeed tricky: it should precisely preserve the spatial structure of the content image while stylizing the scene to the appropriate degree. For example, to transfer an image from day to night, the sky should darken and the window lights should become bright, whereas the fine structures of the roads and skyscrapers should remain intact.
However, artistic style transfer methods, such as whitening and coloring transforms (WCT), generally suffer from severe spatial distortions. This problem results from their lossy network architectures, which are ill-suited to photorealistic stylization.
Drawbacks of the Existing SOTA Methods
Luan et al. (Deep Photo Style Transfer, DPST)* introduced a photorealism regularizer on top of the traditional optimization-based approach. However, solving the optimization problem incurs substantial computational cost, limiting its practical use. To overcome this issue, Li et al. recently proposed a photorealistic variant of WCT, PhotoWCT**, which substitutes unpooling for the upsampling layers of the VGG decoder. By providing the max-pooling mask to the decoder, PhotoWCT is designed to compensate for the information loss of the encoding step and reduce spatial distortion. Although this approach is partially effective, the mask cannot recover the information discarded by the max-pooling of the VGG network. Consequently, a series of post-processing steps, which require the original image to patch up the result, must be performed to refine the remaining artifacts. These steps demand considerable computation and time, yet they introduce another set of unfavorable blurry artifacts as well as hyper-parameters that must be set manually.
* F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer.
** Y. Li, M.-Y. Liu, X. Li, M.-H. Yang, and J. Kautz. A closed-form solution to photorealistic image stylization. arXiv preprint arXiv:1802.06474, 2018.
Our Solution: WCT2
We aim to resolve the problem by introducing a theoretically valid correction of the downsampling and upsampling operations, instead of providing partial amendments. To achieve photorealism, we propose a wavelet-corrected transfer based on whitening and coloring transforms, WCT2. More specifically, by employing wavelet pooling and unpooling and stylizing an image progressively, our network preserves the content information and faithfully transfers the style.
WCT
• severe distortions due to max-pooling and nearest upsampling (lossy operations)
• multi-level stylization
• unrealistic results (artistic texture, which is not favored in photorealistic stylization)

DPST
• solves an iterative regularized optimization problem (heavy computation, slow)
• unrealistic results (cartoon-like artifacts)

PhotoWCT
• provides a max-pooling mask to the decoder, i.e., unpooling (semi-lossy operations)
• multi-level stylization (artifact amplification)
• heavy post-processing steps using a graph Laplacian matrix (slow)
• unrealistic results (over-smoothed artifacts)

WCT2 (ours)
• wavelet corrected transfer (lossless operation in theory)
• progressive stylization (less error amplification, lighter model)
• no post-processing or cumbersome computation (fast)
• realistic results
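For readers unfamiliar with WCT itself, the following is a minimal NumPy sketch of the closed-form whitening and coloring transform that WCT, PhotoWCT, and WCT2 all build on. The function name, toy feature shapes, and the `eps` regularizer value are our own illustrative choices, not the released implementation.

```python
import numpy as np

def wct(fc, fs, eps=1e-5):
    """Whitening and coloring transform (illustrative sketch).

    fc, fs: content / style features of shape (C, N), i.e., C channels
    flattened over N spatial positions."""
    C = fc.shape[0]
    mc = fc.mean(axis=1, keepdims=True)
    ms = fs.mean(axis=1, keepdims=True)
    fc_c, fs_c = fc - mc, fs - ms

    # Whitening: decorrelate the content features (identity covariance).
    cov_c = fc_c @ fc_c.T / (fc_c.shape[1] - 1) + eps * np.eye(C)
    Ec, Dc, _ = np.linalg.svd(cov_c)
    whitened = Ec @ np.diag(Dc ** -0.5) @ Ec.T @ fc_c

    # Coloring: impose the style covariance, then re-add the style mean.
    cov_s = fs_c @ fs_c.T / (fs_c.shape[1] - 1) + eps * np.eye(C)
    Es, Ds, _ = np.linalg.svd(cov_s)
    return Es @ np.diag(Ds ** 0.5) @ Es.T @ whitened + ms
```

After the transform, the output carries the second-order statistics (mean and covariance) of the style features while retaining the spatial arrangement of the content.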
Wavelet Pooling and Unpooling
First of all, the WCT2 method replaces the pooling and unpooling operations in the VGG encoder and decoder with wavelet pooling and unpooling.
One of the most important properties of wavelet pooling is that the original signal can be exactly reconstructed by mirroring its operation, i.e., wavelet unpooling. Wavelet unpooling fully recovers the original signal by performing a component-wise transposed convolution followed by a summation.* Hence, WCT2 can stylize an image with minimal information loss and noise amplification. Max-pooling, on the other hand, has no exact inverse, so the encoder-decoder networks used in WCT and PhotoWCT cannot fully restore the signal.
* For further details, you may refer to the supplementary documents in our Github.
Thus, WCT2 can fully reconstruct the signal without any post-processing steps and with minimal information loss. Moreover, the disassembled wavelet features offer interesting interpretations of the feature space, such as component-wise stylization, making the stylization results of wavelet pooling more favorable than those of max-pooling.
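To illustrate why wavelet unpooling is exact, here is a small NumPy sketch (our own, not the authors' code) of Haar wavelet pooling and its mirror operation. Because the four 2 × 2 Haar kernels form an orthonormal basis of each 2 × 2 block, the component-wise transposed convolution followed by a summation reproduces the input up to floating-point rounding.

```python
import numpy as np

# Haar low- and high-pass filters: L = [1, 1]/sqrt(2), H = [-1, 1]/sqrt(2).
# The LL/LH/HL/HH naming convention here is our own shorthand.
_L = np.array([1.0, 1.0]) / np.sqrt(2.0)
_H = np.array([-1.0, 1.0]) / np.sqrt(2.0)
KERNELS = {"LL": np.outer(_L, _L), "LH": np.outer(_L, _H),
           "HL": np.outer(_H, _L), "HH": np.outer(_H, _H)}

def wavelet_pool(x):
    """Stride-2 correlation of x (H x W, even sides) with each Haar kernel."""
    h, w = x.shape
    return {name: np.array([[np.sum(x[i:i + 2, j:j + 2] * k)
                             for j in range(0, w, 2)]
                            for i in range(0, h, 2)])
            for name, k in KERNELS.items()}

def wavelet_unpool(comps):
    """Component-wise transposed convolution followed by a summation."""
    h, w = next(iter(comps.values())).shape
    x = np.zeros((2 * h, 2 * w))
    for name, c in comps.items():
        k = KERNELS[name]
        for i in range(h):
            for j in range(w):
                x[2 * i:2 * i + 2, 2 * j:2 * j + 2] += c[i, j] * k
    return x

x = np.random.default_rng(0).random((8, 8))
assert np.allclose(wavelet_unpool(wavelet_pool(x)), x)  # lossless round trip
```

A max-pooling round trip fails this assertion: once the non-maximal entries of each block are discarded, no unpooling mask can bring them back.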
Comparison between max-pooling and wavelet pooling. Given (a) an input pair (inset: style), we compare the results of (b) PhotoWCT without post-processing, (c) ours and (d) ours but stylize only the LL component. Note that the edges are left unstylized (inside the red box).
We first examine the effects of using wavelet pooling. The major disadvantage of PhotoWCT is the loss of spatial information caused by max-pooling, whereas WCT2 preserves fine details. In wavelet pooling, the low-frequency component captures smooth surfaces and textures, while the high-frequency components detect edges. This property enables our model to control the stylization effect separately but effectively by selecting components. More specifically, applying WCT to the LL* component of the encoder affects the overall texture or surface, while applying WCT to the high-frequency components (i.e., LH, HL, HH) stylizes edges. Thus, when we stylize all components, our model transfers the given style to the entire building. In contrast, if we do not perform WCT on the high-frequency components, the boundaries of the windows remain unchanged.
* The first of the four Haar wavelet kernels. Here, the kernels are built from the low-pass filter L⊤ = (1/√2)[1 1] and the high-pass filter H⊤ = (1/√2)[−1 1]. For simplicity, we denote the output of each kernel as LL, LH, HL, and HH, respectively.
Using only the LL component of our wavelet pooling is equivalent to average pooling. Many studies have consistently reported that replacing max-pooling with average pooling yields slightly more appealing results. Our framework explains this tendency: such a model uses only the partial information (LL) of the wavelet-decomposed feature domain. In addition, because each frequency component of the content feature is transformed into the corresponding component of the style feature, we obtain an advantage similar to that of using spatial correspondences.
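This equivalence is easy to verify numerically: the Haar LL kernel has 1/2 in every entry, so its stride-2 response equals the 2 × 2 average-pooled map scaled by a constant factor of 2 (the toy input below is our own sketch).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((6, 6))  # toy feature map

# Plain 2x2 average pooling via block reshaping.
avg = x.reshape(3, 2, 3, 2).mean(axis=(1, 3))

# Stride-2 correlation with the Haar LL kernel (every entry is 1/2).
ll_kernel = np.full((2, 2), 0.5)
ll = np.array([[np.sum(x[i:i + 2, j:j + 2] * ll_kernel)
                for j in range(0, 6, 2)]
               for i in range(0, 6, 2)])

# The LL output is average pooling up to a constant factor of 2,
# which subsequent whitening normalizes away.
assert np.allclose(ll, 2.0 * avg)
```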
WCT2 applies progressive stylization rather than adhering to the multi-level strategy of the existing SOTA methods. WCT and PhotoWCT recursively transform features in a multi-level manner from coarse to fine, whereas progressive stylization transforms features continuously within a single pass. Progressive stylization has two significant advantages. First, because we use only a single decoder at training and inference time, our model is simple and efficient. In contrast, multi-level stylization requires training a decoder for each level without parameter sharing, which is inefficient in terms of both the number of parameters and the training procedure. This overhead persists at inference time, because the model must pass through multiple encoder-decoder pairs to stylize an image. Second, recursively encoding and decoding the signal with the lossy VGG networks amplifies artifacts during multi-level stylization. Thanks to the wavelet operations and progressive stylization, our model does not suffer from this problem; it shows little error amplification even when the multi-level strategy is applied.
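The error-amplification point can be made concrete with a toy NumPy experiment of our own (not from the paper): the deeper a lossy max-pool / nearest-upsample round trip, the more information is lost, whereas a lossless wavelet round trip would keep the reconstruction error at zero regardless of depth.

```python
import numpy as np

def lossy_round_trip(x, depth):
    """Encode with `depth` levels of 2x2 max-pooling, then decode with
    nearest-neighbor upsampling -- a toy stand-in for a lossy VGG pass."""
    y = x
    for _ in range(depth):
        h, w = y.shape
        y = y.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
    for _ in range(depth):
        y = y.repeat(2, axis=0).repeat(2, axis=1)
    return y

x = np.random.default_rng(0).random((64, 64))
errors = [np.abs(lossy_round_trip(x, d) - x).mean() for d in (1, 2, 3)]
assert errors[0] > 0          # a single lossy pass already loses information
assert errors[2] > errors[0]  # deeper lossy round trips lose more
```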
Photorealistic stylization results. Given (a) an input pair (top: content, bottom: style), the results of (b) deep photo style transfer (DPST), (c) and (d) PhotoWCT, and (e) WCT2 are shown. PhotoWCT (full) denotes the results after applying two post-processing steps proposed by the authors. Note that WCT2 does not need any post-processing.
Photorealistic video stylization results (from day-to-sunset). Given a style and video frames (top), we show the results by WCT2 (middle) and PhotoWCT (bottom) without semantic segmentation and post-processing. Despite the lack of a segmentation map, WCT2 shows photorealistic results while keeping temporal consistency. On the other hand, PhotoWCT generates spotty and varying artifacts over frames, which harm the photorealism.
Please refer to our supplementary video to check blinking artifacts of PhotoWCT.
SSIM index (higher is better) versus Style loss (lower is better). The ideal case is the top-right corner (red dot). Dashed lines depict the gap between before and after the post-processing steps, i.e., smoothing. The red asterisk denotes the baseline WCT2 with concatenation.
| Image size | DPST | PhotoWCT (full) | WCT2 (ours) |
|---|---|---|---|
| 128 × 128 | 135.2 | 2.7 + 2.5 | |
| 256 × 256 | 306.9 | 3.2 + 9.2 | 3.2 |
| 512 × 512 | 1020.7 | 3.6 + 40.2 | 3.8 |
| 768 × 768 | 2264.0 | 3.8 + 101.8 | 4.2 |
| 896 × 896 | 2988.6 | 3.8 + OOM | 4.4 |
| 1024 × 1024 | 3887.8 | 3.9 + OOM | 4.7 |
User study results. The percentage indicates the preferred model outputs out of 1640 responses. Note that we compare our results with PhotoWCT (full) that applies two post-processing steps proposed by the authors while we do not perform any post-processing for WCT2.
Runtime comparison of DPST, PhotoWCT (full) and ours in seconds. Every result is an average over ten runs on a single NVIDIA P40 GPU. OOM denotes an out-of-memory error; due to OOM, PhotoWCT fails to process images at resolutions of 896 × 896 and above.