[Figure 4 panels: (a) Multi-level stylization (PhotoWCT); (b) Multi-stylization on Decoder and INSLs (Ours); (c) Content; (d) Style; (e) Vanilla; (f) Vanilla + MS-Dec; (g) Vanilla + MS-Dec + MS-INSL. Diagram labels: VGG-19 Stages 1-5, Inverse VGG, WCT, INSL, BFA.]
Figure 4: Multi-stylization comparison. (a) is the multi-level stylization strategy used by WCT/PhotoWCT, which cascades five distinct auto-encoders to perform style transfer. (b) is the architecture of our method. Note that the computation cost of (b) equals that of the auto-encoder in the top blue box. From (e) to (g), we progressively apply style transfer modules (i.e., WCT) at the bottleneck, decoder, and INSLs, where MS-Dec and MS-INSL denote placing the transfer module at the decoder and at the INSLs, respectively. As demonstrated in (e-g), MS-Dec and MS-INSL enhance style transfer effects without sacrificing fine details of the content; see the colors of the leaves in (e-g).
Figure 5: Comparison of “Concat” and “Sum”. (a) Input; (b) Result by Concat; (c) Result by Sum.
is that SCs placed at low-level layers of an auto-encoder short-circuit the network and block information from flowing into the transfer modules that work at the bottleneck. Interestingly, as shown in Fig. 3 (e), we find that WCT² also fails to stylize if we turn on its proposed High-Frequency Components Skip Links (HFCS) and disable the input region masks. To solve this problem, we introduce Instance Normalized Skip Links (INSLs) as a replacement for SCs, which apply Instance Normalization (Ulyanov, Vedaldi, and Lempitsky 2016) at the skip connections. We find that INSLs alleviate the short-circuit phenomenon and strengthen the detail preservation and distortion elimination abilities of photorealistic style transfer networks. Please refer to Fig. 3 (f) for the result produced with INSLs.
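The idea of normalizing the skipped feature before it rejoins the decoder can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `instance_norm` and `insl_sum`, the absence of learnable affine parameters, and the choice of fusing by summation are all assumptions made for the example.

```python
import numpy as np

def instance_norm(feat, eps=1e-5):
    """Normalize every channel of every sample over its spatial dimensions.

    feat has shape (N, C, H, W). No affine parameters are used (assumption).
    """
    mean = feat.mean(axis=(2, 3), keepdims=True)
    var = feat.var(axis=(2, 3), keepdims=True)
    return (feat - mean) / np.sqrt(var + eps)

def insl_sum(encoder_feat, decoder_feat):
    """Hypothetical INSL: instance-normalize the skipped encoder feature,
    then fuse it with the decoder feature by summation (assumed fusion)."""
    return decoder_feat + instance_norm(encoder_feat)

# Toy encoder feature with non-zero mean and non-unit variance.
rng = np.random.default_rng(0)
enc = rng.normal(loc=3.0, scale=2.0, size=(1, 4, 8, 8))
dec = np.zeros_like(enc)
out = insl_sum(enc, dec)
```

With a zero decoder feature, `out` is just the normalized skip: each channel has approximately zero mean and unit standard deviation, so the channel-wise statistics of the content image are removed before the skip rejoins the decoder.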
Multi-stylization. Multi-stylization means performing style transfer repeatedly. As shown in Fig. 4 (a), WCT and PhotoWCT adopt a strategy called multi-level stylization: they train five auto-encoders and perform stylization for five rounds in
Figure 6: Comparison of “Upsampling” and “Unpooling”. (a) Input; (b) Result by Upsampling; (c) Result by Unpooling.
Figure 7: Comparison of using AdaIN and WCT as the transfer module. (a) Input; (b) Use AdaIN; (c) Use WCT. Using WCT as the transfer module (c) achieves more faithful photorealistic stylization than using AdaIN (b).
a coarse-to-fine manner. In contrast, WCT² proposes progressive stylization, which uses a single-round auto-encoder but progressively executes style transfer modules multiple times at every part of the auto-encoder. Following WCT², we adopt a single-round multi-stylization strategy but only transfer features at the decoder and INSLs. Fig. 4 (b) illustrates our strategy. As demonstrated in Fig. 4 (e-g), MS-Dec and MS-INSL significantly improve the stylization effects of the produced results. Moreover, applying style transfer modules at INSLs (Fig. 4 (g)) further eliminates the short-circuit phenomenon caused by SCs and strengthens the stylization effects.
Concat vs. Sum. The choice between the “concat” and “sum” operators for skip links is a factor that may influence the performance of auto-encoders. However, we find that using “concat” makes no notable difference compared with using “sum”, apart from a slight fluctuation in style. Please refer to Fig. 5 (b) and (c) for a comparison.
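For readers unfamiliar with the two operators, the difference can be illustrated on feature shapes alone. The sketch below is only an illustration; the function names and the tensor sizes are assumptions chosen for the example.

```python
import numpy as np

def fuse_sum(skip_feat, decoder_feat):
    """Element-wise sum: both inputs must share the same shape."""
    return skip_feat + decoder_feat

def fuse_concat(skip_feat, decoder_feat):
    """Channel-wise concatenation for (N, C, H, W) features."""
    return np.concatenate([skip_feat, decoder_feat], axis=1)

# Illustrative feature maps with 64 channels at 32x32 resolution.
skip_feat = np.ones((1, 64, 32, 32))
decoder_feat = np.full((1, 64, 32, 32), 2.0)

summed = fuse_sum(skip_feat, decoder_feat)
concatenated = fuse_concat(skip_feat, decoder_feat)
```

Summation keeps the channel count unchanged, whereas concatenation doubles it, so the layer that follows a “concat” link has to accept twice as many input channels.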
Upsampling vs. Unpooling. PhotoWCT argues that unpooling tends to make the network produce fewer distortions than upsampling. However, we find that the two operators produce almost the same results in our settings. Please refer to Fig. 6 (b) and (c) for a comparison.
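As a reminder of how the two operators differ, the sketch below contrasts them on a single 4x4 map. It assumes 2x2 max pooling, nearest-neighbor upsampling, and unpooling that restores each maximum to its recorded position; the text does not specify which upsampling variant is used, and the function names are hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on an (H, W) map with even H and W.

    Returns the pooled map and, per window, the flat index (0..3)
    of the maximum, which unpooling needs later.
    """
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(h // 2, w // 2, 4)
    return blocks.max(axis=-1), blocks.argmax(axis=-1)

def unpool_2x2(pooled, idx):
    """Put each pooled value back at its recorded position; zeros elsewhere."""
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, 4), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, idx] = pooled
    blocks = blocks.reshape(ph, pw, 2, 2).transpose(0, 2, 1, 3)
    return blocks.reshape(ph * 2, pw * 2)

def upsample_nearest_2x(x):
    """Nearest-neighbor upsampling: repeat every value in a 2x2 block."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 5.],
              [0., 0., 6., 0.],
              [7., 0., 0., 0.]])
pooled, idx = max_pool_2x2(x)
unpooled = unpool_2x2(pooled, idx)
upsampled = upsample_nearest_2x(pooled)
```

Unpooling preserves where each maximum was located in the encoder, while upsampling spreads every value uniformly over its window and needs no information from the encoder.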
WCT vs. AdaIN. WCT and AdaIN are two widely used transfer modules that come from artistic style transfer. As demonstrated in Fig. 7 (b) and (c), WCT produces more faithful transfer results. We think this is because AdaIN needs to work with an auto-encoder trained in a more complicated way, whereas we only train the decoder to reconstruct images, which facilitates the following pruning step.
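The difference between the two modules can be made concrete with a small sketch of their commonly cited formulations: AdaIN matches per-channel mean and standard deviation, while WCT matches the full channel covariance through whitening and coloring. This is an illustrative NumPy version operating on features flattened to shape (C, N), not the code used in this work; the function names, the `eps` regularizer, and the omission of any content/style blending weight are assumptions.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Match per-channel mean and standard deviation of the style feature."""
    c_mean = content.mean(axis=1, keepdims=True)
    c_std = content.std(axis=1, keepdims=True) + eps
    s_mean = style.mean(axis=1, keepdims=True)
    s_std = style.std(axis=1, keepdims=True)
    return (content - c_mean) / c_std * s_std + s_mean

def _cov_power(centered, power, eps):
    """Return cov(centered) raised to `power` via eigendecomposition."""
    channels, n = centered.shape
    cov = centered @ centered.T / (n - 1) + eps * np.eye(channels)
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(vals ** power) @ vecs.T

def wct(content, style, eps=1e-5):
    """Whiten the content feature, then color it with the style covariance."""
    fc = content - content.mean(axis=1, keepdims=True)
    s_mean = style.mean(axis=1, keepdims=True)
    fs = style - s_mean
    whitened = _cov_power(fc, -0.5, eps) @ fc
    return _cov_power(fs, 0.5, eps) @ whitened + s_mean

# Toy features: independent content channels, strongly correlated style channels.
rng = np.random.default_rng(0)
content = rng.normal(size=(3, 500))
mix = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
style = mix @ rng.normal(size=(3, 500)) + np.array([[2.0], [-1.0], [0.5]])

adain_out = adain(content, style)
wct_out = wct(content, style)
```

In this toy setting, `wct_out` reproduces the correlation between the first two style channels, whereas `adain_out` only matches each channel's mean and standard deviation and leaves cross-channel correlations as they were in the content feature.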
C-Step
Based on the above analysis of the architecture components that significantly influence photorealistic style transfer effects, we construct an auto-encoder named PhotoNet.