# Docker Setup Guide for Kohya_ss
This guide provides comprehensive instructions for running Kohya_ss in Docker containers.
## Table of Contents
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Usage](#usage)
- [Troubleshooting](#troubleshooting)
- [Advanced Configuration](#advanced-configuration)
## Prerequisites
### System Requirements
- **GPU**: NVIDIA GPU with CUDA support (compute capability 7.0+)
- **RAM**: Minimum 16GB recommended
- **Storage**: At least 50GB free space for models and datasets
- **OS**: Linux, Windows 10/11 with WSL2, or macOS (limited support)
### Required Software
#### Windows
1. **Docker Desktop** (version 4.0+)
   - Download from: <https://www.docker.com/products/docker-desktop/>
   - Ensure the WSL2 backend is enabled
2. **NVIDIA CUDA Toolkit**
   - Download from: <https://developer.nvidia.com/cuda-downloads>
   - Version 12.8 or compatible
3. **NVIDIA Windows Driver**
   - Download from: <https://www.nvidia.com/Download/index.aspx>
   - Version 525.60.11 or newer
4. **WSL2 with GPU Support**
   - Enable WSL2: <https://docs.docker.com/desktop/wsl/#turn-on-docker-desktop-wsl-2>
   - Verify GPU support: <https://docs.docker.com/desktop/wsl/use-wsl/#gpu-support>
**Official Documentation:**
- <https://docs.nvidia.com/cuda/wsl-user-guide/index.html#nvidia-compute-software-support-on-wsl-2>
#### Linux
1. **Docker Engine** or **Docker Desktop**
   - Install guide: <https://docs.docker.com/engine/install/>
2. **NVIDIA GPU Driver**
   - Install the latest driver for your GPU
   - Guide: <https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html>
3. **NVIDIA Container Toolkit**
   - Required for GPU access inside containers
   - Install guide: <https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html>
#### macOS
Docker on macOS does not support NVIDIA GPU acceleration. For GPU-accelerated training on Mac:
- Use cloud-based solutions (see [Cloud Alternatives](#cloud-alternatives))
- Or install natively using the installation guides in `/docs/Installation/`
## Quick Start
### Using Pre-built Images (Recommended)
This is the fastest way to get started. The images are automatically built and published to GitHub Container Registry.
```bash
# Clone the repository recursively (important!)
git clone --recursive https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
# Start the services
docker compose up -d
# View logs
docker compose logs -f
```
**Access the GUI:**
- Kohya GUI: <http://localhost:7860>
- TensorBoard: <http://localhost:6006>
### Building Locally
If you need to modify the Dockerfile or want to build from source:
```bash
# Clone recursively to include submodules
git clone --recursive https://github.com/bmaltais/kohya_ss.git
cd kohya_ss
# Build and start
docker compose up -d --build
```
**Note:** Initial build may take 15-30 minutes depending on your internet connection and hardware.
## Configuration
### Environment Variables
Create a `.env` file in the root directory to customize settings:
```bash
# .env file example
TENSORBOARD_PORT=6006
UID=1000
```
**Available Variables:**
| Variable | Description | Default |
|----------|-------------|---------|
| `TENSORBOARD_PORT` | Port for TensorBoard web interface | `6006` |
| `UID` | User ID for file permissions | `1000` |
### User ID Configuration
The `UID` parameter is critical for file permissions. To find your user ID:
```bash
# Linux/macOS/WSL
id -u
# Then set it in docker-compose.yaml or .env
```
If you encounter permission errors, ensure the UID in docker-compose.yaml matches your host user ID.
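As a sketch, you can write the UID into `.env` straight from the shell (assumes you run this from the kohya_ss root, and appends rather than overwrites so other settings survive):

```shell
# Append the current user's UID to .env so docker compose picks it up
echo "UID=$(id -u)" >> .env

# Confirm the value was written
grep "^UID=" .env
```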
### Volume Mounts
The Docker setup uses the following directory structure:
```
kohya_ss/
├── dataset/            # Your training datasets
│   ├── images/         # Training images
│   ├── logs/           # TensorBoard logs
│   ├── outputs/        # Trained models output
│   └── regularization/ # Regularization images
├── models/             # Pre-trained models
└── .cache/             # Cache directories
    ├── config/
    ├── user/
    ├── triton/
    ├── nv/
    └── keras/
```
**Important:** All training data must be placed in the `dataset/` directory or its subdirectories.
### Directory Setup
Before first use, ensure these directories exist:
```bash
mkdir -p dataset/images dataset/logs dataset/outputs dataset/regularization
mkdir -p models
mkdir -p .cache/{config,user,triton,nv,keras}
```
## Usage
### Starting the Services
```bash
# Start in detached mode
docker compose up -d
# Start with logs visible
docker compose up
# Start only specific service
docker compose up -d kohya-ss-gui
```
### Stopping the Services
```bash
# Stop all services
docker compose down
# Stop and remove volumes (warning: deletes data)
docker compose down -v
```
### Updating
To update to the latest version:
```bash
# Pull latest images
docker compose down
docker compose pull
docker compose up -d
# Or with auto-pull
docker compose down && docker compose up -d --pull always
```
If you're building locally:
```bash
# Update code
git pull
git submodule update --init --recursive
# Rebuild and restart
docker compose down
docker compose up -d --build --pull always
```
### Viewing Logs
```bash
# All services
docker compose logs -f
# Specific service
docker compose logs -f kohya-ss-gui
# Last 100 lines
docker compose logs --tail=100
```
## Troubleshooting
### GPU Not Detected
**Symptoms:** Training is slow, no GPU utilization in `nvidia-smi`
**Solutions:**
1. Verify GPU is visible to Docker:

   ```bash
   docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi
   ```

2. Check the NVIDIA Container Toolkit:

   ```bash
   # Linux
   nvidia-ctk --version
   # If not installed, see the prerequisites above
   ```

3. Windows WSL2 users:
   - Ensure Docker Desktop is using the WSL2 backend
   - Verify CUDA works inside WSL: run `nvidia-smi` in a WSL terminal
### Permission Denied Errors
**Symptoms:** Cannot read/write files in mounted volumes
**Solutions:**
1. Check your user ID:

   ```bash
   id -u
   ```

2. Update docker-compose.yaml:

   ```yaml
   services:
     kohya-ss-gui:
       user: YOUR_UID:0  # Replace YOUR_UID with actual UID
       build:
         args:
           - UID=YOUR_UID  # Same here
   ```

3. Fix ownership of existing files:

   ```bash
   sudo chown -R YOUR_UID:YOUR_UID dataset/ models/ .cache/
   ```
### Out of Memory Errors
**Symptoms:** Container crashes, training fails with OOM
**Solutions:**
1. Add memory limits to docker-compose.yaml:

   ```yaml
   services:
     kohya-ss-gui:
       deploy:
         resources:
           limits:
             memory: 32G  # Adjust based on your system
   ```
2. Reduce batch size in training parameters
3. Use gradient checkpointing
4. Enable CPU offloading in training settings
### Container Won't Start
**Symptoms:** Container exits immediately or shows errors
**Solutions:**
1. Check the logs:

   ```bash
   docker compose logs kohya-ss-gui
   ```

2. Verify all submodules are cloned:

   ```bash
   git submodule update --init --recursive
   ```

3. Remove old containers and images, then rebuild:

   ```bash
   docker compose down
   docker system prune -a  # Warning: removes ALL unused containers and images
   docker compose up -d --build
   ```
### File Picker Not Working
**Note:** This is a known limitation of the Docker setup.
**Workaround:** Type the full path manually instead of using the file picker. Use paths as they appear inside the container, under `/app` or `/dataset`:
Examples:
- Training images: `/dataset/images/my_dataset`
- Model output: `/dataset/outputs/my_model`
- Pretrained model: `/app/models/sd_xl_base_1.0.safetensors`
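If you translate host paths into container paths often, a small helper can do the mapping. This is an illustrative sketch, not part of kohya_ss; `to_container_path` is a hypothetical name, and it assumes the default mounts (`dataset/` → `/dataset`, `models/` → `/app/models`) and that you pass paths relative to the repository root:

```shell
# Hypothetical helper: map a repo-relative host path to its in-container path
to_container_path() {
  case "$1" in
    dataset/*) echo "/$1" ;;          # dataset/ is mounted at /dataset
    models/*)  echo "/app/$1" ;;      # models/ is mounted at /app/models
    *) echo "error: $1 is not under a mounted directory" >&2; return 1 ;;
  esac
}

to_container_path "dataset/images/my_dataset"   # → /dataset/images/my_dataset
```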
### TensorBoard Not Accessible
**Symptoms:** Cannot access TensorBoard at localhost:6006
**Solutions:**
1. Check that the container is running:

   ```bash
   docker compose ps
   ```

2. Verify logs are being written:

   ```bash
   ls -la dataset/logs/
   ```

3. Check for port conflicts:

   ```bash
   # Linux/macOS
   sudo lsof -i :6006
   # Windows PowerShell
   netstat -ano | findstr :6006
   ```

4. Change the port in the .env file if needed:

   ```bash
   echo "TENSORBOARD_PORT=6007" >> .env  # append, don't overwrite existing settings
   docker compose down && docker compose up -d
   ```
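The port-conflict commands above are platform-specific. As a bash-only sketch (it relies on bash's `/dev/tcp` redirection, which plain `sh` lacks), you can also probe a port without extra tools:

```shell
# Succeeds (exit 0) if something accepts TCP connections on localhost:$1
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

if port_in_use "${TENSORBOARD_PORT:-6006}"; then
  echo "TensorBoard port is already taken - pick another in .env"
else
  echo "TensorBoard port looks free"
fi
```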
## Advanced Configuration
### Custom CUDA Version
If you need a different CUDA version, modify the Dockerfile:
```dockerfile
# Line 39-40
ENV CUDA_VERSION=12.8
ENV NVIDIA_REQUIRE_CUDA=cuda>=12.8
# Line 61
ENV UV_INDEX=https://download.pytorch.org/whl/cu128
```
### Resource Limits
Add resource limits to prevent container from consuming all system resources:
```yaml
# docker-compose.yaml
services:
  kohya-ss-gui:
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 32G
        reservations:
          cpus: '4'
          memory: 16G
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ["0"]  # Specific GPU
```
### Multiple GPU Setup
To use specific GPUs:
```yaml
# Under deploy.resources.reservations.devices, use GPU 0 and 1
device_ids: ["0", "1"]
# Use all GPUs (count and device_ids are mutually exclusive)
count: all
```
In the container, you can also use `CUDA_VISIBLE_DEVICES`:
```yaml
environment:
  CUDA_VISIBLE_DEVICES: "0,1"
```
### Restart Policies
Add automatic restart on failure:
```yaml
services:
  kohya-ss-gui:
    restart: unless-stopped
  tensorboard:
    restart: unless-stopped
```
### Using Different Base Images
For development or debugging, you can switch base images:
```dockerfile
# Use full CUDA toolkit instead of minimal
FROM docker.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 AS base
```
## Docker Design Philosophy
This Docker setup follows these principles:
1. **Disposable Containers**: Containers can be destroyed and recreated at any time. All important data is stored in mounted volumes.
2. **Data Separation**: Training data, models, and outputs are kept outside the container in the `dataset/` directory.
3. **No Built-in File Picker**: Due to container isolation, the GUI file picker is disabled. Use manual path entry instead.
4. **Separate TensorBoard**: TensorBoard runs in its own container for better resource isolation and easier updates.
5. **Minimal Image Size**: Only essential CUDA libraries are included to reduce image size from ~8GB to ~3GB.
## Cloud Alternatives
If Docker on your local machine isn't suitable:
- **RunPod**: See [docs/installation_runpod.md](installation_runpod.md)
- **Novita**: See [docs/installation_novita.md](installation_novita.md)
- **Colab**: See [README.md](../README.md#-colab) for free cloud-based option
## Community Docker Builds
Alternative Docker implementations with different features:
- **P2Enjoy's Linux-optimized build**: <https://github.com/P2Enjoy/kohya_ss-docker>
  - Fewer limitations on Linux
  - Different architecture
- **Ashley Kleynhans' RunPod templates**:
  - Standalone: <https://github.com/ashleykleynhans/kohya-docker>
  - With Auto1111: <https://github.com/ashleykleynhans/stable-diffusion-docker>
## Getting Help
If you encounter issues:
1. Check this troubleshooting guide
2. Review container logs: `docker compose logs`
3. Search existing issues: <https://github.com/bmaltais/kohya_ss/issues>
4. Open a new issue with:
   - Your OS and Docker version
   - Complete error logs
   - Steps to reproduce
## Performance Tips
1. **Use SSD storage** for dataset and model directories
2. **Increase Docker memory limit** in Docker Desktop settings (Windows/macOS)
3. **Use tmpfs for temporary files** (already configured in docker-compose.yaml)
4. **Enable BuildKit** for faster builds:

   ```bash
   export DOCKER_BUILDKIT=1
   ```
5. **Use pillow-simd** (automatically enabled on x86_64 in Dockerfile)
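For reference, a tmpfs mount in compose looks like the fragment below. This is illustrative only; tip 3 above says the repo's docker-compose.yaml already configures this, so treat the exact mount points as assumptions:

```yaml
services:
  kohya-ss-gui:
    tmpfs:
      - /tmp  # keep temporary files in RAM instead of on disk
```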
## Security Notes
1. The container runs as a non-root user (UID 1000 by default)
2. Only necessary ports are exposed
3. Sensitive data should not be included in the image build
4. Use `.dockerignore` to exclude credentials and secrets
5. Keep base images updated for security patches
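Following point 4, a minimal `.dockerignore` sketch; the entries are illustrative, so adjust them to your actual tree (excluding `dataset/` and `models/` also keeps the build context small, since those are mounted at runtime rather than baked into the image):

```
# .dockerignore (illustrative)
.env
.git
.cache/
dataset/
models/
**/*.pem
**/id_rsa*
```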