Title: Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis
Year: NeurIPS 2019
Author: Xihui Liu
School: The Chinese University of Hong Kong
Code: https://github.com/xh-liu/CC-FPSE

Topic and Gap

Topic

Semantic image synthesis, which aims at generating photorealistic images conditioned on semantic layouts.

Gap

How to exploit the semantic layout information in the generator

  • Preserving information: most networks are only fed the semantic label map once, at the input layer, making it hard to preserve the layout information through deeper layers.
  • Spatial locations: different locations carry distinct semantic labels, and each sample has its own unique semantic layout, so different convolutional kernels should be used for generating different stuff and objects.

How to promote high-fidelity details and enhance the spatial semantic alignment in the discriminator

  • High-fidelity details include textures and edges.
  • The spatial semantic alignment between the generated image and the semantic label map.
  • Current discriminators do not explicitly consider whether the generated image matches well with the label map.

Contributions

  • Predicting layout-to-image conditional convolution (CC) kernels from the semantic layout, which the generator then uses to synthesize the image.
  • A feature pyramid semantics-embedding (FPSE) discriminator.

Method Details

Image Generation

Model input

  • A noise map, which is convolved with the conditional convolution (CC) kernels produced by the weight prediction network.

Kernel weight
Traditional convolution layers apply the same kernels to all samples and at all spatial locations. However, different objects should be generated differently, so it is better for each convolution layer to use kernel weights adapted to each spatial location, predicted from the given semantic label map.

Depthwise separable convolution
To avoid excessive computational cost and GPU memory usage, the authors use depthwise separable convolutions when predicting the per-layer weights, so that only the lightweight depthwise kernels need to be predicted (see the sketch below).
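A minimal PyTorch sketch of such a per-location depthwise convolution. The kernels tensor is assumed to come from the weight prediction network; the function name and the `F.unfold`-based implementation are my own, not the paper's:

```python
import torch
import torch.nn.functional as F

def conditional_depthwise_conv(x, kernels, k=3):
    """Apply per-location depthwise kernels predicted from the layout.

    x:       (B, C, H, W) input feature map
    kernels: (B, C * k * k, H, W) predicted weights, i.e. one k x k
             depthwise kernel per channel and per spatial location
    """
    B, C, H, W = x.shape
    # Extract the k x k neighborhood of every location: (B, C * k * k, H * W)
    patches = F.unfold(x, kernel_size=k, padding=k // 2)
    patches = patches.view(B, C, k * k, H, W)
    kernels = kernels.view(B, C, k * k, H, W)
    # Weighted sum over the window, independently for each channel
    return (patches * kernels).sum(dim=2)

x = torch.randn(2, 64, 32, 32)
kernels = torch.randn(2, 64 * 9, 32, 32)       # from the weight prediction network
out = conditional_depthwise_conv(x, kernels)   # (2, 64, 32, 32)
```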

Conditional attention operation

  • Role: gates the information flow passed to the next layer.
  • Computation: an element-wise product with an attention map of the same shape as the layer's output (see the sketch below).
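A sketch of the gating step. The sigmoid squashing is my assumption to keep the gate in (0, 1), and the attention logits are assumed to be predicted by the same weight prediction network:

```python
import torch

def conditional_attention(features, attn_logits):
    """Gate the conv output with a layout-predicted attention map.

    features:    (B, C, H, W) output of the conditional convolution
    attn_logits: (B, C, H, W) attention weights predicted from the layout
    """
    # Element-wise product with a (0, 1) gate of the same shape as the output
    return features * torch.sigmoid(attn_logits)
```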

Conditional Weight Prediction

The generator applies different kernels at different locations, but how are these predicted weights obtained?

Model input

  • Label map

Predicted weights
Semantic label maps carry implicit information, such as the relative locations of different stuff and objects. A small receptive field restricts the weight prediction from incorporating long-range context, so the authors introduce a feature pyramid structure.

A feature pyramid structure

  • Process: features at different levels of the feature pyramid are concatenated with the original semantic map to obtain global-context-aware semantic feature maps.
  • Role: predict weights that are aware not only of the local neighborhood, but also of long-range context and relative locations (see the sketch below).
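A simplified sketch of such a pyramid over the label map; the two-level depth, channel widths, and nearest-neighbor upsampling are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutFeaturePyramid(nn.Module):
    """Build multi-scale features from the label map and fuse them back."""

    def __init__(self, n_labels, width=64):
        super().__init__()
        self.down1 = nn.Conv2d(n_labels, width, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(width, width, 3, stride=2, padding=1)

    def forward(self, label_map):
        f1 = F.relu(self.down1(label_map))   # 1/2 resolution: local context
        f2 = F.relu(self.down2(f1))          # 1/4 resolution: long-range context
        size = label_map.shape[-2:]
        # Upsample every level and concatenate with the original semantic map
        f1_up = F.interpolate(f1, size=size, mode="nearest")
        f2_up = F.interpolate(f2, size=size, mode="nearest")
        return torch.cat([label_map, f1_up, f2_up], dim=1)
```

The fused, context-aware features are then fed to convolutions that output the CC kernels and attention weights for each generator layer.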

Feature Pyramid Semantics Embedding Discriminator

Drawback of PatchGAN

  • It only discriminates whether an image is real or fake, without encouraging semantic alignment with the label map.

Model input

  • An image (ground truth or generated).
  • The label map, used as an additional clue to compute a matching score.

Semantic Embedding for Discriminator

PatchGAN
For each feature map, it classifies whether each patch is real or fake by predicting a score at every spatial location (see the sketch below).
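In code, a PatchGAN head is just a 1-channel convolution over the feature map, so each output location scores the patch inside its receptive field (the channel count and kernel size here are placeholders):

```python
import torch.nn as nn

# Each spatial position of the 1-channel output is a real/fake score
# for the image patch covered by that position's receptive field.
patch_head = nn.Conv2d(in_channels=256, out_channels=1, kernel_size=3, padding=1)
```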

Projection discriminator
Computes the dot product between the class-label embedding and the image feature vector as part of the output discriminator score [1] (see the sketch below).
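A sketch of that projection term for a single global class label, following [1]; the pooled-feature shapes are assumptions:

```python
import torch

def projection_score(feat, class_embedding):
    """Inner product between pooled image features and the label embedding.

    feat:            (B, D) pooled image feature vector
    class_embedding: (B, D) learned embedding of the conditioning label
    """
    return (feat * class_embedding).sum(dim=1)   # (B,) added to the score
```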

Ours

  • Role: forces the discriminator to classify not only whether images are real or fake, but also whether the patch features match the semantic labels of that patch within a joint embedding space.
  • Details: 1) inspired by the projection discriminator [1], compute the inner product between each spatial location of F_i and S_i to obtain a semantic matching score map; 2) the semantic matching score is added to the conventional real/fake score as the final discriminator score (see the sketch below).
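Putting the two pieces together, a sketch of the final score at one feature pyramid level i, with a PatchGAN-style head as sketched above (F_i is the image feature map, S_i the label map embedded into the same space):

```python
import torch

def fpse_score(F_i, S_i, patch_head):
    """Combined discriminator score at pyramid level i.

    F_i: (B, C, H, W) image features
    S_i: (B, C, H, W) semantic label map embedded into the joint space
    patch_head: conv producing a (B, 1, H, W) real/fake score map
    """
    realfake = patch_head(F_i)                       # patch real/fake scores
    matching = (F_i * S_i).sum(dim=1, keepdim=True)  # per-location inner
                                                     # product with the labels
    return realfake + matching                       # final score map
```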

References

  1. Takeru Miyato and Masanori Koyama. cGANs with Projection Discriminator. arXiv preprint arXiv:1802.05637, 2018.

Notes

I have found that some papers whose content I could hardly understand yesterday become easier to read tomorrow…