Title: Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis
Year: NeurIPS 2019
Author: Xihui Liu
School: The Chinese University of Hong Kong
Code: https://github.com/xh-liu/CC-FPSE

Topic and Gap

Topic

Semantic image synthesis, which aims at generating photorealistic images conditioned on semantic layouts.

Gap

How to exploit the semantic layout information in the generator

  • Preserving information: most networks are only fed the semantic label map once, at the input layer, making it hard to preserve the layout information through deeper layers.
  • Spatial locations: different locations carry distinct semantic labels, and each sample has its own unique semantic layout, so different convolutional kernels should be used for generating different stuff and objects.

How to promote high-fidelity details and enhance the spatial semantic alignment in the discriminator

  • High-fidelity details include textures and edges.
  • The spatial semantic alignment between the generated image and the semantic label map.
  • Current discriminators do not explicitly consider whether the generated image matches well with the label map.

Contributions

  • Predicting layout-to-image conditional convolution (CC) kernels from the semantic layout, which the generator then uses to synthesize the image.
  • A feature pyramid semantics-embedding (FPSE) discriminator.

Method Details

Image Generation

Model input

  • A noise map, which is convolved with the conditional convolution (CC) kernels produced by the weight prediction network.

Kernel weight
Traditional convolution layers apply the same kernels to all samples and at all spatial locations. However, different objects should be generated differently, so it is better for each convolution layer to use kernel weights adapted to each spatial location, predicted from the given semantic label map.

Depthwise separable convolution
To avoid excessive computational cost and GPU memory usage, the authors use depthwise separable convolutions when predicting the per-layer weights, so that only the lightweight depthwise kernels need to be predicted (see the sketch below).
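A minimal PyTorch sketch of such a per-location depthwise convolution. The kernels tensor is assumed to come from the weight prediction network; the function name and the `F.unfold`-based implementation are my own, not the paper's:

```python
import torch
import torch.nn.functional as F

def conditional_depthwise_conv(x, kernels, k=3):
    """Apply per-location depthwise kernels predicted from the layout.

    x:       (B, C, H, W) input feature map
    kernels: (B, C * k * k, H, W) predicted weights, i.e. one k x k
             depthwise kernel per channel and per spatial location
    """
    B, C, H, W = x.shape
    # Extract the k x k neighborhood of every location: (B, C * k * k, H * W)
    patches = F.unfold(x, kernel_size=k, padding=k // 2)
    patches = patches.view(B, C, k * k, H, W)
    kernels = kernels.view(B, C, k * k, H, W)
    # Weighted sum over the window, independently for each channel
    return (patches * kernels).sum(dim=2)

x = torch.randn(2, 64, 32, 32)
kernels = torch.randn(2, 64 * 9, 32, 32)       # from the weight prediction network
out = conditional_depthwise_conv(x, kernels)   # (2, 64, 32, 32)
```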

Conditional attention operation

  • Role: gates the information flow passed to the next layer.
  • Computation: an element-wise product with an attention map of the same shape as the layer's output (see the sketch below).
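A sketch of the gating step. The sigmoid squashing is my assumption to keep the gate in (0, 1), and the attention logits are assumed to be predicted by the same weight prediction network:

```python
import torch

def conditional_attention(features, attn_logits):
    """Gate the conv output with a layout-predicted attention map.

    features:    (B, C, H, W) output of the conditional convolution
    attn_logits: (B, C, H, W) attention weights predicted from the layout
    """
    # Element-wise product with a (0, 1) gate of the same shape as the output
    return features * torch.sigmoid(attn_logits)
```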

Conditional Weight Prediction

The generator applies different kernels at different locations, but how are these predicted weights obtained?

Model input

  • Label map

Predicted weights
Semantic label maps carry implicit information, such as the relative locations of different stuff and objects. A small receptive field restricts the weight prediction from incorporating long-range context, so the authors introduce a feature pyramid structure.

A feature pyramid structure

  • Process: features at different levels of the feature pyramid are concatenated with the original semantic map to obtain global-context-aware semantic feature maps.
  • Role: predict weights that are aware not only of the local neighborhood, but also of long-range context and relative locations (see the sketch below).
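A simplified sketch of such a pyramid over the label map; the two-level depth, channel widths, and nearest-neighbor upsampling are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutFeaturePyramid(nn.Module):
    """Build multi-scale features from the label map and fuse them back."""

    def __init__(self, n_labels, width=64):
        super().__init__()
        self.down1 = nn.Conv2d(n_labels, width, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(width, width, 3, stride=2, padding=1)

    def forward(self, label_map):
        f1 = F.relu(self.down1(label_map))   # 1/2 resolution: local context
        f2 = F.relu(self.down2(f1))          # 1/4 resolution: long-range context
        size = label_map.shape[-2:]
        # Upsample every level and concatenate with the original semantic map
        f1_up = F.interpolate(f1, size=size, mode="nearest")
        f2_up = F.interpolate(f2, size=size, mode="nearest")
        return torch.cat([label_map, f1_up, f2_up], dim=1)
```

The fused, context-aware features are then fed to convolutions that output the CC kernels and attention weights for each generator layer.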

Feature Pyramid Semantics Embedding Discriminator

Drawback of PatchGAN

  • It only discriminates whether an image is real or fake, without encouraging semantic alignment with the label map.

Model input

  • An image (ground truth or generated).
  • The label map, used as an additional clue to compute a matching score.

Semantic Embedding for Discriminator

PatchGAN
For each feature map, it classifies whether each patch is real or fake by predicting a score at every spatial location (see the sketch below).
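In code, a PatchGAN head is just a 1-channel convolution over the feature map, so each output location scores the patch inside its receptive field (the channel count and kernel size here are placeholders):

```python
import torch.nn as nn

# Each spatial position of the 1-channel output is a real/fake score
# for the image patch covered by that position's receptive field.
patch_head = nn.Conv2d(in_channels=256, out_channels=1, kernel_size=3, padding=1)
```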

Projection discriminator
Computes the dot product between the class-label embedding and the image feature vector as part of the output discriminator score [1] (see the sketch below).
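A sketch of that projection term for a single global class label, following [1]; the pooled-feature shapes are assumptions:

```python
import torch

def projection_score(feat, class_embedding):
    """Inner product between pooled image features and the label embedding.

    feat:            (B, D) pooled image feature vector
    class_embedding: (B, D) learned embedding of the conditioning label
    """
    return (feat * class_embedding).sum(dim=1)   # (B,) added to the score
```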

Ours

  • Role: forces the discriminator to classify not only whether images are real or fake, but also whether the patch features match the semantic labels of that patch within a joint embedding space.
  • Details: 1) inspired by the projection discriminator [1], compute the inner product between each spatial location of F_i and S_i to obtain a semantic matching score map; 2) the semantic matching score is added to the conventional real/fake score as the final discriminator score (see the sketch below).
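Putting the two pieces together, a sketch of the final score at one feature pyramid level i, with a PatchGAN-style head as sketched above (F_i is the image feature map, S_i the label map embedded into the same space):

```python
import torch

def fpse_score(F_i, S_i, patch_head):
    """Combined discriminator score at pyramid level i.

    F_i: (B, C, H, W) image features
    S_i: (B, C, H, W) semantic label map embedded into the joint space
    patch_head: conv producing a (B, 1, H, W) real/fake score map
    """
    realfake = patch_head(F_i)                       # patch real/fake scores
    matching = (F_i * S_i).sum(dim=1, keepdim=True)  # per-location inner
                                                     # product with the labels
    return realfake + matching                       # final score map
```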

References

  1. Takeru Miyato and Masanori Koyama. cGANs with Projection Discriminator. arXiv preprint arXiv:1802.05637, 2018.

Notes

I have found that some papers whose content I could hardly understand yesterday become easier to read tomorrow…