mirror of https://github.com/bmaltais/kohya_ss
# Docker Setup Guide for Kohya_ss
This guide provides comprehensive instructions for running Kohya_ss in Docker containers.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Usage](#usage)
- [Troubleshooting](#troubleshooting)
- [Advanced Configuration](#advanced-configuration)
## Prerequisites

### System Requirements

- **GPU**: NVIDIA GPU with CUDA support (compute capability 7.0+)
- **RAM**: Minimum 16GB recommended
- **Storage**: At least 50GB free space for models and datasets
- **OS**: Linux, Windows 10/11 with WSL2, or macOS (limited support)

### Required Software

#### Windows

1. **Docker Desktop** (version 4.0+)
   - Download from: <https://www.docker.com/products/docker-desktop/>
   - Ensure the WSL2 backend is enabled

2. **NVIDIA CUDA Toolkit**
   - Download from: <https://developer.nvidia.com/cuda-downloads>
   - Version 12.8 or compatible

3. **NVIDIA Windows Driver**
   - Download from: <https://www.nvidia.com/Download/index.aspx>
   - Version 525.60.11 or newer

4. **WSL2 with GPU Support**
   - Enable WSL2: <https://docs.docker.com/desktop/wsl/#turn-on-docker-desktop-wsl-2>
   - Verify GPU support: <https://docs.docker.com/desktop/wsl/use-wsl/#gpu-support>

**Official Documentation:**

- <https://docs.nvidia.com/cuda/wsl-user-guide/index.html#nvidia-compute-software-support-on-wsl-2>

#### Linux

1. **Docker Engine** or **Docker Desktop**
   - Install guide: <https://docs.docker.com/engine/install/>

2. **NVIDIA GPU Driver**
   - Install the latest driver for your GPU
   - Guide: <https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html>

3. **NVIDIA Container Toolkit**
   - Required for GPU access inside containers
   - Install guide: <https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html>

#### macOS

Docker on macOS does not support NVIDIA GPU acceleration. For GPU-accelerated training on a Mac:

- Use cloud-based solutions (see [Cloud Alternatives](#cloud-alternatives))
- Or install natively using the installation guides in `/docs/Installation/`
## Quick Start

### Using Pre-built Images (Recommended)

This is the fastest way to get started. The images are automatically built and published to the GitHub Container Registry.

```bash
# Clone the repository recursively (important!)
git clone --recursive https://github.com/bmaltais/kohya_ss.git
cd kohya_ss

# Start the services
docker compose up -d

# View logs
docker compose logs -f
```

**Access the GUI:**

- Kohya GUI: <http://localhost:7860>
- TensorBoard: <http://localhost:6006>
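To confirm both services are actually listening before opening a browser, a small stdlib-only check can help. This is a sketch; the ports are simply the compose defaults listed above:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in [("Kohya GUI", 7860), ("TensorBoard", 6006)]:
        status = "up" if port_open("localhost", port) else "not reachable yet"
        print(f"{name} (port {port}): {status}")
```

If a service reports "not reachable yet" shortly after `docker compose up -d`, give the container a minute to finish initializing before retrying.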
### Building Locally

If you need to modify the Dockerfile or want to build from source:

```bash
# Clone recursively to include submodules
git clone --recursive https://github.com/bmaltais/kohya_ss.git
cd kohya_ss

# Build and start
docker compose up -d --build
```

**Note:** The initial build may take 15-30 minutes depending on your internet connection and hardware.
## Configuration

### Environment Variables

Create a `.env` file in the root directory to customize settings:

```bash
# .env file example
TENSORBOARD_PORT=6006
UID=1000
```

**Available Variables:**

| Variable | Description | Default |
|----------|-------------|---------|
| `TENSORBOARD_PORT` | Port for the TensorBoard web interface | `6006` |
| `UID` | User ID for file permissions | `1000` |
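Inside `docker-compose.yaml` these variables are consumed through Compose variable substitution. A sketch of the relevant fragment (the service name and port mapping are assumptions; check the shipped compose file):

```yaml
services:
  tensorboard:
    ports:
      - "${TENSORBOARD_PORT:-6006}:6006"  # falls back to 6006 if unset
```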
### User ID Configuration

The `UID` parameter is critical for file permissions. To find your user ID:

```bash
# Linux/macOS/WSL
id -u

# Then set it in docker-compose.yaml or .env
```

If you encounter permission errors, ensure the UID in docker-compose.yaml matches your host user ID.
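The two steps above can be combined into one idempotent command (run it from the repository root; assumes a POSIX shell):

```shell
# Record the current user ID in .env unless it is already set there
grep -q '^UID=' .env 2>/dev/null || echo "UID=$(id -u)" >> .env
```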
### Volume Mounts

The Docker setup uses the following directory structure:

```
kohya_ss/
├── dataset/              # Your training datasets
│   ├── images/           # Training images
│   ├── logs/             # TensorBoard logs
│   ├── outputs/          # Trained models output
│   └── regularization/   # Regularization images
├── models/               # Pre-trained models
└── .cache/               # Cache directories
    ├── config/
    ├── user/
    ├── triton/
    ├── nv/
    └── keras/
```

**Important:** All training data must be placed in the `dataset/` directory or its subdirectories.
### Directory Setup

Before first use, ensure these directories exist:

```bash
mkdir -p dataset/images dataset/logs dataset/outputs dataset/regularization
mkdir -p models
mkdir -p .cache/{config,user,triton,nv,keras}
```
## Usage

### Starting the Services

```bash
# Start in detached mode
docker compose up -d

# Start with logs visible
docker compose up

# Start only a specific service
docker compose up -d kohya-ss-gui
```
### Stopping the Services

```bash
# Stop all services
docker compose down

# Stop and remove volumes (warning: deletes data)
docker compose down -v
```
### Updating

To update to the latest version:

```bash
# Pull the latest images
docker compose down
docker compose pull
docker compose up -d

# Or with auto-pull
docker compose down && docker compose up -d --pull always
```

If you're building locally:

```bash
# Update the code
git pull
git submodule update --init --recursive

# Rebuild and restart
docker compose down
docker compose up -d --build --pull always
```
### Viewing Logs

```bash
# All services
docker compose logs -f

# A specific service
docker compose logs -f kohya-ss-gui

# Last 100 lines
docker compose logs --tail=100
```
## Troubleshooting

### GPU Not Detected

**Symptoms:** Training is slow, no GPU utilization in `nvidia-smi`

**Solutions:**

1. Verify the GPU is visible to Docker:

   ```bash
   docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi
   ```

2. Check the NVIDIA Container Toolkit:

   ```bash
   # Linux
   nvidia-ctk --version

   # If not installed, see prerequisites
   ```

3. Windows WSL2 users:
   - Ensure Docker Desktop is using the WSL2 backend
   - Verify CUDA is working in WSL: run `nvidia-smi` in a WSL terminal
### Permission Denied Errors

**Symptoms:** Cannot read/write files in mounted volumes

**Solutions:**

1. Check your user ID:

   ```bash
   id -u
   ```

2. Update docker-compose.yaml:

   ```yaml
   services:
     kohya-ss-gui:
       user: "YOUR_UID:0"  # Replace YOUR_UID with your actual UID
       build:
         args:
           - UID=YOUR_UID  # Same here
   ```

3. Fix ownership of existing files:

   ```bash
   sudo chown -R YOUR_UID:YOUR_UID dataset/ models/ .cache/
   ```
### Out of Memory Errors

**Symptoms:** Container crashes, training fails with OOM

**Solutions:**

1. Add memory limits to docker-compose.yaml:

   ```yaml
   services:
     kohya-ss-gui:
       deploy:
         resources:
           limits:
             memory: 32G  # Adjust based on your system
   ```

2. Reduce the batch size in the training parameters
3. Use gradient checkpointing
4. Enable CPU offloading in the training settings
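A related failure mode worth knowing about: PyTorch dataloader workers exchange tensors through `/dev/shm`, and Docker's default shared-memory allocation is small. If the crash mentions shared memory rather than plain OOM, raising `shm_size` in docker-compose.yaml may help (a sketch; the value is an assumption to adjust for your system):

```yaml
services:
  kohya-ss-gui:
    shm_size: "8gb"  # assumption: size this to your RAM and dataloader worker count
```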
### Container Won't Start

**Symptoms:** Container exits immediately or shows errors

**Solutions:**

1. Check the logs:

   ```bash
   docker compose logs kohya-ss-gui
   ```

2. Verify all submodules are cloned:

   ```bash
   git submodule update --init --recursive
   ```

3. Remove old containers and images:

   ```bash
   docker compose down
   docker system prune -a
   docker compose up -d --build
   ```
### File Picker Not Working

**Note:** This is a known limitation of the Docker setup.

**Workaround:** Manually type the full path instead of using the file picker. Use absolute container paths under `/app` or `/dataset`:

Examples:

- Training images: `/dataset/images/my_dataset`
- Model output: `/dataset/outputs/my_model`
- Pretrained model: `/app/models/sd_xl_base_1.0.safetensors`
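The host-to-container mapping can be sketched as a small helper (hypothetical code, not part of kohya_ss; the mount points are assumed from the volume layout described earlier: `./dataset` → `/dataset`, `./models` → `/app/models`):

```python
from pathlib import PurePosixPath

# Host directory -> container mount point (assumed from the volume layout above)
MOUNTS = {
    "dataset": "/dataset",
    "models": "/app/models",
}

def container_path(host_relpath: str) -> str:
    """Translate a repo-relative host path into the path to type in the GUI."""
    parts = PurePosixPath(host_relpath).parts
    if not parts or parts[0] not in MOUNTS:
        raise ValueError(f"{host_relpath!r} is not under a mounted directory")
    return str(PurePosixPath(MOUNTS[parts[0]]).joinpath(*parts[1:]))

print(container_path("dataset/images/my_dataset"))  # /dataset/images/my_dataset
```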
### TensorBoard Not Accessible

**Symptoms:** Cannot access TensorBoard at localhost:6006

**Solutions:**

1. Check whether the container is running:

   ```bash
   docker compose ps
   ```

2. Verify logs are being written:

   ```bash
   ls -la dataset/logs/
   ```

3. Check for port conflicts:

   ```bash
   # Linux/macOS
   sudo lsof -i :6006

   # Windows PowerShell
   netstat -ano | findstr :6006
   ```

4. Change the port in the `.env` file if needed:

   ```bash
   # Note: `>` overwrites .env; append or edit it instead if it holds other settings
   echo "TENSORBOARD_PORT=6007" > .env
   docker compose down && docker compose up -d
   ```
## Advanced Configuration

### Custom CUDA Version

If you need a different CUDA version, modify the Dockerfile:

```dockerfile
# Lines 39-40
ENV CUDA_VERSION=12.8
ENV NVIDIA_REQUIRE_CUDA="cuda>=12.8"

# Line 61
ENV UV_INDEX=https://download.pytorch.org/whl/cu128
```
### Resource Limits

Add resource limits to prevent the container from consuming all system resources:

```yaml
# docker-compose.yaml
services:
  kohya-ss-gui:
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 32G
        reservations:
          cpus: '4'
          memory: 16G
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ["0"]  # Specific GPU
```
### Multiple GPU Setup

To use specific GPUs:

```yaml
# Use GPU 0 and 1
device_ids: ["0", "1"]

# Use all GPUs (the Compose spec uses `count` rather than a device id list)
count: all
```

Inside the container, you can also use `CUDA_VISIBLE_DEVICES`:

```yaml
environment:
  CUDA_VISIBLE_DEVICES: "0,1"
```
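How CUDA processes interpret that variable can be illustrated with a stdlib-only sketch (a hypothetical helper for illustration, not part of kohya_ss): an unset variable leaves every GPU visible, an empty string hides them all, and a comma-separated list selects and reorders devices.

```python
import os

def visible_devices(env=None):
    """Mimic how a CUDA process resolves CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (all GPUs visible),
    otherwise the ordered list of selected device ids.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unset: every GPU is visible
    return [d for d in (s.strip() for s in value.split(",")) if d]

print(visible_devices({"CUDA_VISIBLE_DEVICES": "0,1"}))  # ['0', '1']
```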
### Restart Policies

Add automatic restart on failure:

```yaml
services:
  kohya-ss-gui:
    restart: unless-stopped
  tensorboard:
    restart: unless-stopped
```
### Using Different Base Images

For development or debugging, you can switch base images:

```dockerfile
# Use the full CUDA toolkit instead of the minimal runtime
FROM docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 AS base
```
## Docker Design Philosophy

This Docker setup follows these principles:

1. **Disposable Containers**: Containers can be destroyed and recreated at any time. All important data is stored in mounted volumes.
2. **Data Separation**: Training data, models, and outputs are kept outside the container in the `dataset/` directory.
3. **No Built-in File Picker**: Due to container isolation, the GUI file picker is disabled. Use manual path entry instead.
4. **Separate TensorBoard**: TensorBoard runs in its own container for better resource isolation and easier updates.
5. **Minimal Image Size**: Only essential CUDA libraries are included, reducing the image size from ~8GB to ~3GB.
## Cloud Alternatives

If Docker on your local machine isn't suitable:

- **RunPod**: See [docs/installation_runpod.md](installation_runpod.md)
- **Novita**: See [docs/installation_novita.md](installation_novita.md)
- **Colab**: See [README.md](../README.md#-colab) for a free cloud-based option
## Community Docker Builds

Alternative Docker implementations with different features:

- **P2Enjoy's Linux-optimized build**: <https://github.com/P2Enjoy/kohya_ss-docker>
  - Fewer limitations on Linux
  - Different architecture
- **Ashley Kleynhans' RunPod templates**:
  - Standalone: <https://github.com/ashleykleynhans/kohya-docker>
  - With Auto1111: <https://github.com/ashleykleynhans/stable-diffusion-docker>
## Getting Help

If you encounter issues:

1. Check this troubleshooting guide
2. Review the container logs: `docker compose logs`
3. Search existing issues: <https://github.com/bmaltais/kohya_ss/issues>
4. Open a new issue with:
   - Your OS and Docker version
   - Complete error logs
   - Steps to reproduce
## Performance Tips

1. **Use SSD storage** for the dataset and model directories
2. **Increase the Docker memory limit** in Docker Desktop settings (Windows/macOS)
3. **Use tmpfs for temporary files** (already configured in docker-compose.yaml)
4. **Enable BuildKit** for faster builds:

   ```bash
   export DOCKER_BUILDKIT=1
   ```

5. **Use pillow-simd** (automatically enabled on x86_64 in the Dockerfile)
## Security Notes

1. The container runs as a non-root user (UID 1000 by default)
2. Only necessary ports are exposed
3. Sensitive data should not be included in the image build
4. Use `.dockerignore` to exclude credentials and secrets
5. Keep base images updated for security patches
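Point 4 might look like this in practice (a sketch; the entries are examples to adapt — excluding the large data directories also keeps the build context small, since they are volume-mounted at runtime anyway):

```
# .dockerignore (example)
.env
.git/
.cache/
dataset/
models/
*.pem
```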