# Taming Transformers for High-Resolution Image Synthesis
##### CVPR 2021 (Oral)
![teaser](assets/mountain.jpeg)

[**Taming Transformers for High-Resolution Image Synthesis**](https://compvis.github.io/taming-transformers/)<br/>
[Patrick Esser](https://github.com/pesser)\*,
[Robin Rombach](https://github.com/rromb)\*,
[Björn Ommer](https://hci.iwr.uni-heidelberg.de/Staff/bommer)<br/>
\* equal contribution

**tl;dr** We combine the efficiency of convolutional approaches with the expressivity of transformers by introducing a convolutional VQGAN, which learns a codebook of context-rich visual parts, whose composition is modeled with an autoregressive transformer.

![teaser](assets/teaser.png)

[arXiv](https://arxiv.org/abs/2012.09841) | [BibTeX](#bibtex) | [Project Page](https://compvis.github.io/taming-transformers/)

### News
#### 2022
- More pretrained VQGANs (e.g. an f8-model with only 256 codebook entries) are available in our new work on [Latent Diffusion Models](https://github.com/CompVis/latent-diffusion).
- Added scene synthesis models as proposed in the paper [High-Resolution Complex Scene Synthesis with Transformers](https://arxiv.org/abs/2105.06458), see [this section](#scene-image-synthesis).

#### 2021
- Thanks to [rom1504](https://github.com/rom1504) it is now easy to [train a VQGAN on your own datasets](#training-on-custom-data).
- Included a bugfix for the quantizer. For backward compatibility it is disabled by default (which corresponds to always training with `beta=1.0`). Use `legacy=False` in the quantizer config to enable it. Thanks [richcmwang](https://github.com/richcmwang) and [wcshin-git](https://github.com/wcshin-git)!
- Our paper received an update: see https://arxiv.org/abs/2012.09841v3 and the corresponding changelog.
- Added a pretrained, [1.4B transformer model](https://k00.fr/s511rwcv) trained for class-conditional ImageNet synthesis, which obtains state-of-the-art FID scores among autoregressive approaches and outperforms BigGAN.
- Added pretrained, unconditional models on [FFHQ](https://k00.fr/yndvfu95) and [CelebA-HQ](https://k00.fr/2xkmielf).
- Added accelerated sampling via caching of keys/values in the self-attention operation, used in `scripts/sample_fast.py`.
- Added a checkpoint of a [VQGAN](https://heibox.uni-heidelberg.de/d/2e5662443a6b4307b470/) trained with f8 compression and Gumbel-Quantization. See also our updated [reconstruction notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/reconstruction_usage.ipynb).
- We added a [colab notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/reconstruction_usage.ipynb) which compares two VQGANs and OpenAI's [DALL-E](https://github.com/openai/DALL-E). See also [this section](#more-resources).
- We now include an overview of pretrained models in [Tab.1](#overview-of-pretrained-models). We added models for [COCO](#coco) and [ADE20k](#ade20k).
- The streamlit demo now supports image completions.
- We now include a couple of examples from the D-RIN dataset so you can run the [D-RIN demo](#d-rin) without preparing the dataset first.
- You can now jump right into sampling with our [Colab quickstart notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/taming-transformers.ipynb).

## Requirements
A suitable [conda](https://conda.io/) environment named `taming` can be created
and activated with:

```
conda env create -f environment.yaml
conda activate taming
```

## Overview of pretrained models
The following table provides an overview of all models that are currently available.
FID scores were evaluated using [torch-fidelity](https://github.com/toshas/torch-fidelity).
For reference, we also include a link to the recently released autoencoder of the [DALL-E](https://github.com/openai/DALL-E) model.
See the corresponding [colab
notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/reconstruction_usage.ipynb)
for a comparison and discussion of reconstruction capabilities.

| Dataset | FID vs train | FID vs val | Link | Samples (256x256) | Comments |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| FFHQ (f=16) | 9.6 | -- | [ffhq_transformer](https://k00.fr/yndvfu95) | [ffhq_samples](https://k00.fr/j626x093) | |
| CelebA-HQ (f=16) | 10.2 | -- | [celebahq_transformer](https://k00.fr/2xkmielf) | [celebahq_samples](https://k00.fr/j626x093) | |
| ADE20K (f=16) | -- | 35.5 | [ade20k_transformer](https://k00.fr/ot46cksa) | [ade20k_samples.zip](https://heibox.uni-heidelberg.de/f/70bb78cbaf844501b8fb/) [2k] | evaluated on val split (2k images) |
| COCO-Stuff (f=16) | -- | 20.4 | [coco_transformer](https://k00.fr/2zz6i2ce) | [coco_samples.zip](https://heibox.uni-heidelberg.de/f/a395a9be612f4a7a8054/) [5k] | evaluated on val split (5k images) |
| ImageNet (cIN) (f=16) | 15.98/15.78/6.59/5.88/5.20 | -- | [cin_transformer](https://k00.fr/s511rwcv) | [cin_samples](https://k00.fr/j626x093) | different decoding hyperparameters |
| | | | | | |
| FacesHQ (f=16) | -- | -- | [faceshq_transformer](https://k00.fr/qqfl2do8) | | |
| S-FLCKR (f=16) | -- | -- | [sflckr](https://heibox.uni-heidelberg.de/d/73487ab6e5314cb5adba/) | | |
| D-RIN (f=16) | -- | -- | [drin_transformer](https://k00.fr/39jcugc5) | | |
| | | | | | |
| VQGAN ImageNet (f=16), 1024 | 10.54 | 7.94 | [vqgan_imagenet_f16_1024](https://heibox.uni-heidelberg.de/d/8088892a516d4e3baf92/) | [reconstructions](https://k00.fr/j626x093) | Reconstruction-FIDs. |
| VQGAN ImageNet (f=16), 16384 | 7.41 | 4.98 | [vqgan_imagenet_f16_16384](https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/) | [reconstructions](https://k00.fr/j626x093) | Reconstruction-FIDs. |
| VQGAN OpenImages (f=8), 256 | -- | 1.49 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip | --- | Reconstruction-FIDs. Available via [latent diffusion](https://github.com/CompVis/latent-diffusion). |
| VQGAN OpenImages (f=8), 16384 | -- | 1.14 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip | --- | Reconstruction-FIDs. Available via [latent diffusion](https://github.com/CompVis/latent-diffusion). |
| VQGAN OpenImages (f=8), 8192, GumbelQuantization | 3.24 | 1.49 | [vqgan_gumbel_f8](https://heibox.uni-heidelberg.de/d/2e5662443a6b4307b470/) | --- | Reconstruction-FIDs. |
| | | | | | |
| DALL-E dVAE (f=8), 8192, GumbelQuantization | 33.88 | 32.01 | https://github.com/openai/DALL-E | [reconstructions](https://k00.fr/j626x093) | Reconstruction-FIDs. |
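
The FID scores in the table can be reproduced with torch-fidelity, which offers a Python API in addition to its command line interface. A minimal sketch, assuming two folders of images (the folder paths are placeholders):

```
# Minimal sketch: compute FID between two image folders with torch-fidelity.
# The folder paths are placeholders; any two directories of images work.
from torch_fidelity import calculate_metrics

metrics = calculate_metrics(
    input1='samples/',   # generated or reconstructed images
    input2='data/val/',  # reference images
    cuda=True,           # run the Inception network on a GPU
    fid=True,            # request the FID metric
)
print(metrics['frechet_inception_distance'])
```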

## Running pretrained models

The commands below will start a streamlit demo which supports sampling at
different resolutions and image completions. To run a non-interactive version
of the sampling process, replace `streamlit run scripts/sample_conditional.py --`
by `python scripts/make_samples.py --outdir <path_to_write_samples_to>` and
keep the remaining command line arguments.

To sample from unconditional or class-conditional models,
run `python scripts/sample_fast.py -r <path/to/config_and_checkpoint>`.
We describe below how to use this script to sample from the ImageNet, FFHQ, and CelebA-HQ models,
respectively.

### S-FLCKR
![sflckr](assets/sunset_and_ocean.jpg)

You can also [run this model in a Colab
notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/taming-transformers.ipynb),
which includes all necessary steps to start sampling.

Download the
[2020-11-09T13-31-51_sflckr](https://heibox.uni-heidelberg.de/d/73487ab6e5314cb5adba/)
folder and place it into `logs`. Then, run
```
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-09T13-31-51_sflckr/
```

### ImageNet
![imagenet](assets/imagenet.png)

Download the [2021-04-03T19-39-50_cin_transformer](https://k00.fr/s511rwcv)
folder and place it into `logs`. Sampling from the class-conditional ImageNet
model does not require any data preparation. To produce 50 samples for each of
the 1000 classes of ImageNet, with k=600 for top-k sampling, p=0.92 for nucleus
sampling and temperature t=1.0, run

```
python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25
```

To restrict the model to certain classes, provide them via the `--classes` argument, separated by
commas. For example, to sample 50 *ostriches*, *border collies* and *whiskey jugs*, run

```
python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25 --classes 9,232,901
```
We recommend experimenting with the autoregressive decoding parameters (top-k, top-p and temperature) for best results; a sketch of how they interact follows below.
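
For intuition, here is a minimal sketch of how temperature, top-k and nucleus (top-p) filtering combine in a single decoding step. It is not the repository's implementation (see `scripts/sample_fast.py` for that); the function name and the random logits are illustrative only:

```
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=600, top_p=0.92):
    # Temperature scaling: values below 1 sharpen the distribution.
    logits = logits / temperature
    # Top-k: keep only the k most likely codebook entries (sorted descending).
    topk_vals, topk_idx = torch.topk(logits, top_k)
    # Nucleus (top-p): of those, keep the smallest prefix whose cumulative
    # probability exceeds p; mask out everything beyond it.
    probs = F.softmax(topk_vals, dim=-1)
    cumulative = torch.cumsum(probs, dim=-1)
    topk_vals[cumulative - probs > top_p] = float('-inf')
    # Renormalize and sample one codebook index.
    choice = torch.multinomial(F.softmax(topk_vals, dim=-1), num_samples=1)
    return topk_idx[choice]

# Example with random logits over a 16384-entry codebook.
next_token = sample_next_token(torch.randn(16384))
```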

### FFHQ/CelebA-HQ

Download the [2021-04-23T18-19-01_ffhq_transformer](https://k00.fr/yndvfu95) and
[2021-04-23T18-11-19_celebahq_transformer](https://k00.fr/2xkmielf)
folders and place them into `logs`.
Again, sampling from these unconditional models does not require any data preparation.
To produce 50000 samples, with k=250 for top-k sampling,
p=1.0 for nucleus sampling and temperature t=1.0, run

```
python scripts/sample_fast.py -r logs/2021-04-23T18-19-01_ffhq_transformer/
```

for FFHQ and

```
python scripts/sample_fast.py -r logs/2021-04-23T18-11-19_celebahq_transformer/
```

to sample from the CelebA-HQ model.
For both models it can be advantageous to vary the top-k/top-p parameters for sampling.

### FacesHQ
![faceshq](assets/faceshq.jpg)

Download [2020-11-13T21-41-45_faceshq_transformer](https://k00.fr/qqfl2do8) and
place it into `logs`. Follow the data preparation steps for
[CelebA-HQ](#celeba-hq) and [FFHQ](#ffhq). Run
```
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-13T21-41-45_faceshq_transformer/
```

### D-RIN
![drin](assets/drin.jpg)

Download [2020-11-20T12-54-32_drin_transformer](https://k00.fr/39jcugc5) and
place it into `logs`. To run the demo on a couple of example depth maps
included in the repository, run

```
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.imagenet.DRINExamples}}}"
```

To run the demo on the complete validation set, first follow the data preparation steps for
[ImageNet](#imagenet) and then run
```
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/
```
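
The `data` override above follows the `target`/`params` convention used throughout the configs: `target` names a class by its import path and `params` holds its constructor arguments. A standalone sketch of the idea (the repository implements this as `instantiate_from_config` in `main.py`):

```
# Standalone sketch of the target/params convention from the configs:
# "target" is an import path, "params" are constructor kwargs.
import importlib

def instantiate_from_config(config):
    module, cls = config["target"].rsplit(".", 1)
    return getattr(importlib.import_module(module), cls)(**config.get("params", {}))

# e.g. build the D-RIN example dataset used by the demo above
dataset = instantiate_from_config({"target": "taming.data.imagenet.DRINExamples"})
```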

### COCO
Download [2021-01-20T16-04-20_coco_transformer](https://k00.fr/2zz6i2ce) and
place it into `logs`. To run the demo on a couple of example segmentation maps
included in the repository, run

```
streamlit run scripts/sample_conditional.py -- -r logs/2021-01-20T16-04-20_coco_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.coco.Examples}}}"
```

### ADE20k
Download [2020-11-20T21-45-44_ade20k_transformer](https://k00.fr/ot46cksa) and
place it into `logs`. To run the demo on a couple of example segmentation maps
included in the repository, run

```
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T21-45-44_ade20k_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.ade20k.Examples}}}"
```

## Scene Image Synthesis
![scene images](assets/scene_images_samples.svg)
Scene image generation based on bounding box conditionals, as done in our CVPR 2021 AI4CC workshop paper [High-Resolution Complex Scene Synthesis with Transformers](https://arxiv.org/abs/2105.06458) (see talk on the [workshop page](https://visual.cs.brown.edu/workshops/aicc2021/#awards)). Supports the COCO and Open Images datasets.

### Training
Download the first-stage models [COCO-8k-VQGAN](https://heibox.uni-heidelberg.de/f/78dea9589974474c97c1/) for COCO or [COCO/Open-Images-8k-VQGAN](https://heibox.uni-heidelberg.de/f/461d9a9f4fcf48ab84f4/) for Open Images.
Change `ckpt_path` in `data/coco_scene_images_transformer.yaml` and `data/open_images_scene_images_transformer.yaml` to point to the downloaded first-stage models.
Download the full COCO/OI datasets and adapt `data_path` in the same files, unless the 100 files provided for training and validation already suit your needs.

Training can be run with
`python main.py --base configs/coco_scene_images_transformer.yaml -t True --gpus 0,`
or
`python main.py --base configs/open_images_scene_images_transformer.yaml -t True --gpus 0,`

### Sampling
Train a model as described above or download a pre-trained model:
- [Open Images 1 billion parameter model](https://drive.google.com/file/d/1FEK-Z7hyWJBvFWQF50pzSK9y1W_CJEig/view?usp=sharing) trained for 100 epochs. On 256x256 pixels: FID 41.48±0.21, SceneFID 14.60±0.15, Inception Score 18.47±0.27. The model was trained with 2d crops of images and is thus well-prepared for the task of generating high-resolution images, e.g. 512x512.
- [Open Images distilled version of the above model with 125 million parameters](https://drive.google.com/file/d/1xf89g0mc78J3d8Bx5YhbK4tNRNlOoYaO), which allows for sampling on smaller GPUs (4 GB is enough for sampling 256x256 px images). The model was trained for 60 epochs with 10% soft loss and 90% hard loss. On 256x256 pixels: FID 43.07±0.40, SceneFID 15.93±0.19, Inception Score 17.23±0.11.
- [COCO 30 epochs](https://heibox.uni-heidelberg.de/f/0d0b2594e9074c7e9a33/)
- [COCO 60 epochs](https://drive.google.com/file/d/1bInd49g2YulTJBjU32Awyt5qnzxxG5U9/) (find model statistics for both COCO versions in `assets/coco_scene_images_training.svg`)

When downloading a pre-trained model, remember to change `ckpt_path` in `configs/*project.yaml` to point to your downloaded first-stage model (see the [Training](#training) section above).

Scene image generation can be run with
`python scripts/make_scene_samples.py --outdir=/some/outdir -r /path/to/pretrained/model --resolution=512,512`

## Training on custom data

Training on your own dataset can be beneficial to get better tokens and hence better images for your domain.
These are the steps to follow to make this work:
1. install the repo with `conda env create -f environment.yaml`, `conda activate taming` and `pip install -e .`
2. put your .jpg files in a folder `your_folder`
3. create two text files, `xx_train.txt` and `xx_test.txt`, that point to the files in your training and test set respectively (for example `find $(pwd)/your_folder -name "*.jpg" > train.txt`); a scripted alternative is sketched below
4. adapt `configs/custom_vqgan.yaml` to point to these 2 files
5. run `python main.py --base configs/custom_vqgan.yaml -t True --gpus 0,1` to
   train on two GPUs. Use `--gpus 0,` (with a trailing comma) to train on a single GPU.
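
If you prefer scripting step 3, a minimal sketch of a random train/test split (the folder name, output names and the 90/10 ratio are arbitrary choices):

```
# Minimal sketch: write train/test file lists with a random 90/10 split.
# Folder name, output names and split ratio are arbitrary; adapt as needed.
import random
from pathlib import Path

files = sorted(str(p.resolve()) for p in Path("your_folder").glob("*.jpg"))
random.seed(0)  # reproducible split
random.shuffle(files)
split = int(0.9 * len(files))
Path("xx_train.txt").write_text("\n".join(files[:split]) + "\n")
Path("xx_test.txt").write_text("\n".join(files[split:]) + "\n")
```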

## Data Preparation

### ImageNet
The code will try to download (through [Academic
Torrents](http://academictorrents.com/)) and prepare ImageNet the first time it
is used. However, since ImageNet is quite large, this requires a lot of disk
space and time. If you already have ImageNet on your disk, you can speed things
up by putting the data into
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/` (which defaults to
`~/.cache/autoencoders/data/ILSVRC2012_{split}/data/`), where `{split}` is one
of `train`/`validation`. It should have the following structure:

```
${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   ├── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   ├── ...
├── ...
```

If you haven't extracted the data, you can also place
`ILSVRC2012_img_train.tar`/`ILSVRC2012_img_val.tar` (or symlinks to them) into
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/` /
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/`, which will then be
extracted into the above structure without downloading it again. Note that this
will only happen if neither a folder
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/` nor a file
`${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready` exist. Remove them
if you want to force running the dataset preparation again.

You will then need to prepare the depth data using
[MiDaS](https://github.com/intel-isl/MiDaS). Create a symlink
`data/imagenet_depth` pointing to a folder with two subfolders `train` and
`val`, each mirroring the structure of the corresponding ImageNet folder
described above and containing a `png` file for each of ImageNet's `JPEG`
files. The `png` encodes `float32` depth values obtained from MiDaS as RGBA
images. We provide the script `scripts/extract_depth.py` to generate this data.
**Please note** that this script uses [MiDaS via PyTorch
Hub](https://pytorch.org/hub/intelisl_midas_v2/). When we prepared the data,
the hub provided the [MiDaS
v2.0](https://github.com/intel-isl/MiDaS/releases/tag/v2) version, but it now
provides v2.1. We haven't tested our models with depth maps obtained via v2.1,
so if you want to make sure that things work as expected, you must adjust the
script to make sure it explicitly uses
[v2.0](https://github.com/intel-isl/MiDaS/releases/tag/v2)!
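
The float32-as-RGBA encoding simply reinterprets the four raw bytes of each float32 depth value as four uint8 channels, so the depth map survives a lossless `png` round trip. A minimal sketch of the idea (the hub call and the reduced pre-/post-processing are simplifications; `scripts/extract_depth.py` is the version that was actually used):

```
# Minimal sketch: predict depth with MiDaS via torch.hub and store the
# float32 map losslessly as an RGBA png. Simplified: proper normalization
# is omitted, and the hub checkpoint should be pinned to the v2.0 release.
import numpy as np
import torch
from PIL import Image

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS")
midas.eval()

# Resize to the MiDaS working resolution (dimensions must be divisible by 32).
img = Image.open("input.JPEG").convert("RGB").resize((384, 384))
x = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0).permute(2, 0, 1)[None]
with torch.no_grad():
    depth = midas(x)[0].numpy().astype(np.float32)  # HxW float32

# Reinterpret the float32 bytes as 4 uint8 channels and save as RGBA.
rgba = depth.view(np.uint8).reshape(*depth.shape, 4)
Image.fromarray(rgba, mode="RGBA").save("output.png")
```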

### CelebA-HQ
Create a symlink `data/celebahq` pointing to a folder containing the `.npy`
files of CelebA-HQ (instructions to obtain them can be found in the [PGGAN
repository](https://github.com/tkarras/progressive_growing_of_gans)).

### FFHQ
Create a symlink `data/ffhq` pointing to the `images1024x1024` folder obtained
from the [FFHQ repository](https://github.com/NVlabs/ffhq-dataset).

### S-FLCKR
Unfortunately, we are not allowed to distribute the images we collected for the
S-FLCKR dataset and can therefore only describe how it was produced.
There are many resources on [collecting images from the
web](https://github.com/adrianmrit/flickrdatasets) to get started.
We collected sufficiently large images from [flickr](https://www.flickr.com)
(see `data/flickr_tags.txt` for a full list of tags used to find images)
and various [subreddits](https://www.reddit.com/r/sfwpornnetwork/wiki/network)
(see `data/subreddits.txt` for all subreddits that were used).
Overall, we collected 107625 images, and split them randomly into 96861
training images and 10764 validation images. We then obtained segmentation
masks for each image using [DeepLab v2](https://arxiv.org/abs/1606.00915)
trained on [COCO-Stuff](https://arxiv.org/abs/1612.03716). We used a [PyTorch
reimplementation](https://github.com/kazuto1011/deeplab-pytorch) and include an
example script for this process in `scripts/extract_segmentation.py`.

### COCO
Create a symlink `data/coco` containing the images from the 2017 split in
`train2017` and `val2017`, and their annotations in `annotations`. Files can be
obtained from the [COCO webpage](https://cocodataset.org/). In addition, we use
the [Stuff+thing PNG-style annotations on COCO 2017
trainval](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip)
from [COCO-Stuff](https://github.com/nightrome/cocostuff), which
should be placed under `data/cocostuffthings`.

### ADE20k
Create a symlink `data/ade20k_root` containing the contents of
[ADEChallengeData2016.zip](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip)
from the [MIT Scene Parsing Benchmark](http://sceneparsing.csail.mit.edu/).

## Training models

### FacesHQ

Train a VQGAN with
```
python main.py --base configs/faceshq_vqgan.yaml -t True --gpus 0,
```

Then, adjust the checkpoint path of the config key
`model.params.first_stage_config.params.ckpt_path` in
`configs/faceshq_transformer.yaml` (or download
[2020-11-09T13-33-36_faceshq_vqgan](https://k00.fr/uxy5usa9) and place it into `logs`, which
corresponds to the preconfigured checkpoint path), then run
```
python main.py --base configs/faceshq_transformer.yaml -t True --gpus 0,
```
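
Instead of editing the YAML by hand, you can also adjust such nested config keys programmatically; the configs are plain OmegaConf files. A minimal sketch (the checkpoint path is a placeholder for wherever your first-stage checkpoint lives):

```
# Minimal sketch: point the transformer config at a first-stage checkpoint
# with OmegaConf. The .ckpt path is a placeholder.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/faceshq_transformer.yaml")
cfg.model.params.first_stage_config.params.ckpt_path = "logs/my_faceshq_vqgan/checkpoints/last.ckpt"
OmegaConf.save(cfg, "configs/faceshq_transformer.yaml")
```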

### D-RIN

Train a VQGAN on ImageNet with
```
python main.py --base configs/imagenet_vqgan.yaml -t True --gpus 0,
```

or download a pretrained one from [2020-09-23T17-56-33_imagenet_vqgan](https://k00.fr/u0j2dtac)
and place it under `logs`. If you trained your own, adjust the path in the config
key `model.params.first_stage_config.params.ckpt_path` of
`configs/drin_transformer.yaml`.

Train a VQGAN on Depth Maps of ImageNet with
```
python main.py --base configs/imagenetdepth_vqgan.yaml -t True --gpus 0,
```

or download a pretrained one from [2020-11-03T15-34-24_imagenetdepth_vqgan](https://k00.fr/55rlxs6i)
and place it under `logs`. If you trained your own, adjust the path in the config
key `model.params.cond_stage_config.params.ckpt_path` of
`configs/drin_transformer.yaml`.

To train the transformer, run
```
python main.py --base configs/drin_transformer.yaml -t True --gpus 0,
```

## More Resources
### Comparing Different First Stage Models
The reconstruction and compression capabilities of different first stage models can be analyzed in this [colab notebook](https://colab.research.google.com/github/CompVis/taming-transformers/blob/master/scripts/reconstruction_usage.ipynb).
In particular, the notebook compares two VQGANs with a downsampling factor of f=16 each and codebook dimensionalities of 1024 and 16384,
a VQGAN with f=8 and 8192 codebook entries, and the discrete autoencoder of OpenAI's [DALL-E](https://github.com/openai/DALL-E) (which has f=8 and 8192
codebook entries). A minimal reconstruction sketch follows the figures below.
![firststages1](assets/first_stage_squirrels.png)
![firststages2](assets/first_stage_mushrooms.png)
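
For a programmatic flavor of what the notebook does, here is a minimal reconstruction sketch. The config/checkpoint paths are placeholders for a downloaded VQGAN folder, and preprocessing is reduced to a simple resize:

```
# Minimal sketch: reconstruct an image with a pretrained VQGAN.
# Paths are placeholders for a downloaded model folder.
import numpy as np
import torch
from omegaconf import OmegaConf
from PIL import Image
from taming.models.vqgan import VQModel

config = OmegaConf.load("logs/vqgan_imagenet_f16_16384/configs/model.yaml")
model = VQModel(**config.model.params)
sd = torch.load("logs/vqgan_imagenet_f16_16384/checkpoints/last.ckpt", map_location="cpu")["state_dict"]
model.load_state_dict(sd, strict=False)
model.eval()

# The model expects inputs in [-1, 1], shape 1x3xHxW, H and W divisible by 16.
img = Image.open("input.jpg").convert("RGB").resize((256, 256))
x = torch.from_numpy(np.asarray(img, dtype=np.float32) / 127.5 - 1.0).permute(2, 0, 1)[None]

with torch.no_grad():
    quant, _, _ = model.encode(x)  # quantized codebook features
    xrec = model.decode(quant)     # decoded reconstruction

out = ((xrec[0].clamp(-1, 1) + 1) * 127.5).permute(1, 2, 0).numpy().astype(np.uint8)
Image.fromarray(out).save("reconstruction.jpg")
```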

### Other
- A [video summary](https://www.youtube.com/watch?v=o7dqGcLDf0A&feature=emb_imp_woyt) by [Two Minute Papers](https://www.youtube.com/channel/UCbfYPyITQ-7l4upoX8nvctg).
- A [video summary](https://www.youtube.com/watch?v=-wDSDtIAyWQ) by [Gradient Dude](https://www.youtube.com/c/GradientDude/about).
- A [Weights & Biases report summarizing the paper](https://wandb.ai/ayush-thakur/taming-transformer/reports/-Overview-Taming-Transformers-for-High-Resolution-Image-Synthesis---Vmlldzo0NjEyMTY) by [ayulockin](https://github.com/ayulockin).
- A [video summary](https://www.youtube.com/watch?v=JfUTd8fjtX8&feature=emb_imp_woyt) by [What's AI](https://www.youtube.com/channel/UCUzGQrN-lyyc0BWTYoJM_Sg).
- Take a look at [ak9250's notebook](https://github.com/ak9250/taming-transformers/blob/master/tamingtransformerscolab.ipynb) if you want to run the streamlit demos on Colab.

### Text-to-Image Optimization via CLIP
VQGAN has been successfully used as an image generator guided by the [CLIP](https://github.com/openai/CLIP) model, both for pure image generation
from scratch and for image-to-image translation. We recommend the following notebooks/videos/resources:

- [Advadnouns](https://twitter.com/advadnoun/status/1389316507134357506) Patreon and corresponding LatentVision notebooks: https://www.patreon.com/patronizeme
- The [notebook](https://colab.research.google.com/drive/1L8oL-vLJXVcRzCFbPwOoMkPKJ8-aYdPN) of [Rivers Have Wings](https://twitter.com/RiversHaveWings).
- A [video](https://www.youtube.com/watch?v=90QDe6DQXF4&t=12s) explanation by [Dot CSV](https://www.youtube.com/channel/UCy5znSnfMsDwaLlROnZ7Qbg) (in Spanish, but English subtitles are available)

A sketch of the core optimization loop follows the example image below.

|
|
|
|
Text prompt: *'A bird drawn by a child'*
|
|
|
|
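
At its core, this family of methods optimizes the VQGAN latent so that CLIP's embedding of the decoded image matches the embedding of a text prompt. A minimal sketch, assuming `model` is a pretrained `VQModel` loaded as in the reconstruction sketch above; the prompt, resolution, step count and learning rate are arbitrary choices, and CLIP's input normalization is omitted:

```
# Minimal sketch of CLIP-guided VQGAN generation: optimize the latent grid
# so the decoded image matches a text prompt. Assumes a loaded VQModel
# named `model` with embed_dim=256 (f=16); all hyperparameters are arbitrary.
import clip  # https://github.com/openai/CLIP
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
perceptor, _ = clip.load("ViT-B/32", device=device)
text = perceptor.encode_text(clip.tokenize("A bird drawn by a child").to(device)).detach()

# A 16x16 latent grid decodes to a 256x256 image at f=16.
z = torch.randn(1, 256, 16, 16, device=device, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.1)

for step in range(200):
    z_q, _, _ = model.quantize(z)  # straight-through estimator: gradients reach z
    img = model.decode(z_q).clamp(-1, 1)
    img224 = F.interpolate((img + 1) / 2, size=224, mode="bilinear")  # CLIP input size
    loss = -torch.cosine_similarity(perceptor.encode_image(img224), text).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```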

## Shout-outs
Thanks to everyone who makes their code and models available. In particular,

- The architecture of our VQGAN is inspired by [Denoising Diffusion Probabilistic Models](https://github.com/hojonathanho/diffusion)
- The very hackable transformer implementation [minGPT](https://github.com/karpathy/minGPT)
- The good ol' [PatchGAN](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) and [Learned Perceptual Similarity (LPIPS)](https://github.com/richzhang/PerceptualSimilarity)

## BibTeX

```
@misc{esser2020taming,
      title={Taming Transformers for High-Resolution Image Synthesis},
      author={Patrick Esser and Robin Rombach and Björn Ommer},
      year={2020},
      eprint={2012.09841},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```