Diverse Inpainting and Editing with GAN Inversion (DivInversion)

Bilkent University

*Joint first authors, contributed equally.

ICCV 2023
DivInversion Teaser

Our framework achieves diverse inpainting and editing with GAN inversion. Compared to other methods proposed to invert and edit images with StyleGAN, our framework offers the advantages illustrated in the example above.


Recent inversion methods have shown that real images can be inverted into StyleGAN's latent space and numerous edits can be achieved on those images thanks to the semantically rich feature representations of well-trained GAN models. However, extensive research has also shown that image inversion is challenging due to the trade-off between high-fidelity reconstruction and editability.

In this paper, we tackle an even more difficult task: inverting erased images into the GAN's latent space for realistic inpainting and editing. Furthermore, by augmenting inverted latent codes with different latent samples, we achieve diverse inpainting results. Specifically, we propose to learn an encoder and a mixing network that combine encoded features from erased images with StyleGAN's mapped features from random samples. To encourage the mixing network to utilize both inputs, we train the networks on generated data via a novel set-up. We also utilize higher-rate features to prevent color inconsistencies between the inpainted and unerased parts.

We run extensive experiments and compare our method with state-of-the-art inversion and inpainting methods. Quantitative metrics and visual comparisons show significant improvements.


DivInversion Architecture

The first-stage framework consists of a trainable image encoder and mixing network, together with StyleGAN's frozen mapping and generator networks. Our encoder takes an erased image and a binary mask and embeds the image into StyleGAN's latent space.
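The encoder input can be sketched as the erased image concatenated with its binary mask along the channel axis. This is a minimal illustration, not the paper's implementation; the convention that mask value 1 marks erased pixels is an assumption.

```python
import numpy as np

def make_encoder_input(image, mask):
    """Build a 4-channel encoder input from an image and a binary mask.

    image: (H, W, 3) float array in [0, 1]
    mask:  (H, W) binary array, 1 where pixels are erased (assumed convention)
    Returns a (H, W, 4) array: the erased image with the mask appended.
    """
    erased = image * (1.0 - mask)[..., None]            # zero out erased pixels
    return np.concatenate([erased, mask[..., None]], axis=-1)
```

The concatenated mask channel lets the encoder distinguish genuinely dark pixels from erased ones.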

We also sample a latent code via the mapping network to achieve stochasticity. The mixing network combines the available information of the erased image from the encoder with the missing-region information from the mapping network.
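The mixing step above can be illustrated with a toy stand-in: a small network that maps the concatenation of the encoder's latent and a randomly mapped latent to a single mixed latent. The dimensions and the one-layer architecture here are illustrative assumptions, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(0)

class MixingNetwork:
    """Toy mixing network: concatenates the encoder latent (from the
    erased image) and a mapped random latent, then applies a linear
    layer with a ReLU. Purely illustrative dimensions."""

    def __init__(self, dim=512):
        self.W = rng.standard_normal((dim, 2 * dim)) * 0.01
        self.b = np.zeros(dim)

    def __call__(self, w_enc, w_rand):
        x = np.concatenate([w_enc, w_rand], axis=-1)    # (2 * dim,)
        return np.maximum(self.W @ x + self.b, 0.0)     # linear + ReLU

w_enc = rng.standard_normal(512)    # latent from the encoder
w_rand = rng.standard_normal(512)   # latent from StyleGAN's mapping network
w_mixed = MixingNetwork()(w_enc, w_rand)
```

Sampling a different `w_rand` for the same erased image yields a different mixed latent, which is the source of the diversity in the inpainted results.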

The mixed latent representations are fed to the generator via its instance normalization layers to produce the fake image. In a final step, the input image and the fake image are combined based on the mask.
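The final combination step amounts to a per-pixel blend: unerased pixels come from the input image and erased pixels from the generator output. A minimal sketch, again assuming mask value 1 marks erased regions:

```python
import numpy as np

def composite(input_image, fake_image, mask):
    """Keep unerased pixels from the input image and fill erased
    pixels from the generator output.

    input_image, fake_image: (H, W, 3) float arrays
    mask: (H, W) binary array, 1 where pixels are erased (assumed)
    """
    m = mask[..., None]                                 # broadcast over channels
    return input_image * (1.0 - m) + fake_image * m
```

This guarantees the visible region is reproduced exactly, so any reconstruction error of the generator only affects the erased area.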


We compare our inpainting results with state-of-the-art models below and show the diverse inpainting and editing results of our model on the FFHQ dataset. Inpainting results on the AFHQ dataset are shown in the last row.

FFHQ SOTA Comparison
FFHQ Diverse Inpainting Results
FFHQ Edit Results
AFHQ Inpainting Results

ICCV 2023 Video