- Remove _get_device_dtype() indirection, inline device/dtype at call sites
- Remove commented-out fallback blocks and try/finally wrappers
- Add modules/sharpfin to ruff and pylint excludes in pyproject.toml
- Fix import ordering in joytag.py and pixelart.py
Changes based on vladmandic and Disty0 feedback:
- Fix logging: use direct `from installer import log` instead of lazy _get_log()
- Remove unused is_available() function
- Remove defensive getattr() calls in _resolve_kernel/_resolve_linearize
- Simplify _get_device_dtype() to use devices module directly
- Refactor to_pil() with single Image.fromarray() call and explicit mode
- Add cross-platform fallback: sharpfin only runs on CUDA and falls back to
  PIL/F.interpolate on other devices (CPU, MPS, OpenVINO); see the dispatch sketch after this list
- Replace lambdas with functools.partial in functional.py for torch.compile safety (sketch after this list)
- Add modules/sharpfin to pylint ignore-paths (vendored code)
- Remove superfluous SimpleNamespace import in cli/api-caption.py, use Map instead
- Drop _ prefix from internal helper functions in modules/api/caption.py
- Move DeepDanbooru model path to top-level models folder instead of nesting under CLIP
- Rename shadowing import in waifudiffusion batch to avoid F823/E0606
- Fix import order in cli/api-caption.py (stdlib before third-party)
- Rename local variable shadowing function name in cli/api-caption.py
- Remove unnecessary global statement in devices.bypass_sdpa_hijacks
- Add _load_blip_model helper with explicit cache_dir so downloads
go to hfcache_dir instead of default HF cache
- Pre-load BLIP model/processor before creating Interrogator config
to control download location and avoid redundant loads
- Set clip_model_path on config for CLIP model cache location
- Add cache_dir to Moondream model and tokenizer loading
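For the cross-platform fallback above, the dispatch reduces to a device-type check at the call site. A minimal sketch, under assumed names: `modules.sharpfin.resize` is a hypothetical entry point, not the actual sharpfin API.

```python
import torch
import torch.nn.functional as F

def resize_tensor(image: torch.Tensor, size: tuple) -> torch.Tensor:
    # sharpfin ships CUDA-only kernels; everywhere else (CPU, MPS, OpenVINO)
    # degrade gracefully to plain F.interpolate.
    if image.device.type == 'cuda':
        from modules.sharpfin import resize as sharpfin_resize  # hypothetical entry point
        return sharpfin_resize(image, size)
    return F.interpolate(image, size=size, mode='bicubic', antialias=True)
```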
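For the lambda replacement: torch.compile treats every lambda as a fresh closure object, while functools.partial over a known function gives Dynamo a stable, traceable target. Illustrative pattern only, not the actual functional.py code:

```python
from functools import partial
import torch.nn.functional as F

# Before: a new closure object per definition site, which Dynamo may fail
# to reuse across compiled graphs.
# downsample = lambda x: F.interpolate(x, scale_factor=0.5, mode='bicubic')

# After: partial binds the same underlying function with fixed kwargs,
# giving torch.compile a stable callable identity.
downsample = partial(F.interpolate, scale_factor=0.5, mode='bicubic')
```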
Move all caption/interrogate/tagger/VQA API code out of the monolithic
endpoints.py and models.py into a new self-contained modules/api/caption.py,
following the loras.py / nudenet.py self-registering pattern.
- Move 15 Pydantic models (ReqCaption, ResCaption, ReqVQA, ResVQA,
ReqTagger, ResTagger, dispatch union types, etc.) from models.py
- Move 11 handler functions from endpoints.py
- Deduplicate ~150 lines via shared _do_openclip, _do_tagger, _do_vqa
core functions called by both direct and dispatch endpoints
- Add register_api() that registers all 8 caption routes
- Add promptgen field to ResVLMPrompts (bug fix: handler returned it
but response model silently dropped it)
- Improve all endpoint docstrings and Field descriptions for API docs
- Add use_safetensors=True to all 16 model from_pretrained calls to
avoid downloading redundant .bin files alongside safetensors
- Add device property to JoyTag VisionModel so move_model can relocate
  it to CUDA (fixes 'ViT object has no attribute device'); property sketch after this list
- Fix Pix2Struct dtype mismatch by casting float inputs to the model dtype
  while preserving integer tensor types (cast sketch after this list)
- Patch AutoConfig.register with exist_ok=True during Ovis loading to
handle duplicate aimv2 registration on model reload
- Detect Qwen VL fine-tune architecture from config model_type instead
of repo name, fixing ToriiGate and similar third-party fine-tunes
- Change UI default task from Short Caption to Normal Caption, and
preserve it on model switch instead of resetting to Use Prompt
- Add dual-prefill testing across 5 VQA test methods using a shared
_check_prefill helper
- Fix pre-existing ruff W605 in strip_think_xml_tags docstring
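For the JoyTag fix, the usual pattern is to expose a device property derived from the parameters, since nn.Module has no built-in one. A sketch; the real VisionModel wiring differs:

```python
import torch
from torch import nn

class VisionModel(nn.Module):  # sketch of the JoyTag ViT wrapper
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(768, 768)  # placeholder layer

    @property
    def device(self) -> torch.device:
        # Generic helpers like move_model query `model.device` the way they
        # do for transformers models; derive it from the parameters.
        return next(self.parameters()).device
```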
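For the Pix2Struct fix, the cast must skip integer tensors (input_ids, attention_mask) or generation breaks. A sketch with a hypothetical helper name:

```python
import torch

def cast_to_model_dtype(inputs: dict, dtype: torch.dtype) -> dict:
    # Cast floating-point tensors (e.g. flattened_patches) to the model
    # dtype while leaving integer tensors (input_ids, attention_mask) alone.
    return {
        key: value.to(dtype) if torch.is_tensor(value) and torch.is_floating_point(value) else value
        for key, value in inputs.items()
    }
```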
Fix update_caption_params(): it was setting caption_max_length, chunk_size, and
flavor_intermediate_count on the Interrogator instance, but the clip_interrogator
library reads them from self.config, so the overrides were silently ignored.
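The fix is to write the overrides where the library actually looks. A sketch of the corrected helper; the signature is assumed, the config attribute names come from clip_interrogator:

```python
from clip_interrogator import Interrogator

def update_caption_params(ci: Interrogator, max_length: int, chunk_size: int, intermediate_count: int) -> None:
    # The library reads these from ci.config, not from the instance, so
    # setting e.g. ci.caption_max_length was a silent no-op.
    ci.config.caption_max_length = max_length
    ci.config.chunk_size = chunk_size
    ci.config.flavor_intermediate_count = intermediate_count
```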
- Add parse_florence_detections() and format_florence_response() to
vqa_detection for handling Florence-2 detection output formats
- Add bypass_sdpa_hijacks() context manager to devices.py for models
  incompatible with SageAttention or other SDPA hijacks (sketch after this list)
- Add OpenCLIP model offload support when caption_offload is enabled
- Remove caption_openclip_min_length from settings, API models, endpoints, and UI
(clip_interrogator library has no min_length support; parameter was never functional)
- Split vlm_prompts_florence into base Florence prompts and PromptGen-only prompts
(GENERATE_TAGS, Analyze, Mixed Caption require MiaoshouAI PromptGen fine-tune)
- Add 'promptgen' category to /vqa/prompts API endpoint
- Fix gaze detection: move DETECT_GAZE check before generic 'detect ' prefix
to prevent "Detect Gaze" matching as detect target="Gaze"
- Update test suite: remove min_length tests, fix min_flavors to use mode='best',
add acceptance-only notes, fix thinking trace detection, improve bracket/OCR tests,
split Florence/PromptGen test coverage
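One way such a context manager can be built (a sketch; the actual devices.py implementation may track the hijack differently, but a module-level save like this is also what makes the removed `global` statement unnecessary):

```python
from contextlib import contextmanager
import torch

# Saved at import time, before any SDPA hijack is installed
_original_sdpa = torch.nn.functional.scaled_dot_product_attention

@contextmanager
def bypass_sdpa_hijacks():
    # Temporarily restore stock SDPA for models that misbehave under
    # SageAttention or other scaled_dot_product_attention replacements.
    hijacked = torch.nn.functional.scaled_dot_product_attention
    torch.nn.functional.scaled_dot_product_attention = _original_sdpa
    try:
        yield
    finally:
        torch.nn.functional.scaled_dot_product_attention = hijacked
```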
Update cli/test-caption-api.py:
- Update test structure for new caption API endpoints
- Fix Moondream gaze detection test prompt to use 'Detect Gaze'
instead of 'Where is the person looking?' to match handler trigger
- Improve test result categorization and tracking
- Rename cli/api-interrogate.py to cli/api-caption.py
- Update cli/options.py, cli/process.py for new module paths
- Update cli/test-tagger.py for caption module imports
Move all caption-related modules from modules/interrogate/ to modules/caption/
for better naming consistency:
- Rename deepbooru, deepseek, joycaption, joytag, moondream3, openclip, tagger,
vqa, vqa_detection, waifudiffusion modules
- Add new caption.py dispatcher module
- Remove old interrogate.py (functionality moved to caption.py)
- Update cli/api-interrogate.py to use /sdapi/v1/tagger for DeepBooru
- Handle tagger response format, which is either a scores dict or a tags string (helper sketch below)
- Remove DeepBooru test from interrogate endpoint tests
- Update API model descriptions to reference tagger for anime tagging
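Normalizing the two response shapes in the CLI is a small helper (hypothetical name):

```python
def tags_from_response(caption) -> str:
    # /sdapi/v1/tagger returns either a {tag: score} dict (show_scores=True)
    # or a preformatted tag string; normalize both to a comma-separated string.
    if isinstance(caption, dict):
        return ', '.join(sorted(caption, key=caption.get, reverse=True))
    return caption
```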
DeepBooru/DeepDanbooru should only be accessed via the tagger endpoint.
The interrogate endpoint is now exclusively for OpenCLIP/BLIP.
- Remove DeepDanbooru handling from post_interrogate
- Update docstring to reference tagger endpoint for anime tagging
- Simplify code by removing if/else branching
Add model architecture coverage tests:
- VQA model family detection for 19 architectures
- Florence special prompts test (<OD>, <OCR>, <CAPTION>, etc.)
- Moondream detection features test
- VQA architecture capabilities test
- Tagger model types and WD version comparison tests
Improve test validation:
- Add is_meaningful_answer() to reject degenerate responses like "." (sketch after this list)
- Verify parameters have actual effect (not just accepted)
- Show actual output traces in PASS/FAIL messages
- Fix prefill tests to verify keep_prefill behavior
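A plausible shape for the validator (thresholds are assumptions, not the actual test code):

```python
def is_meaningful_answer(text: str) -> bool:
    # Reject degenerate responses such as '.', whitespace, or bare
    # punctuation that would otherwise count as a PASS.
    stripped = text.strip()
    return len(stripped) > 2 and any(ch.isalnum() for ch in stripped)
```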
Add configurable timeout:
- Default timeout increased to 300s for slow models
- Add --timeout CLI argument for customization
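In the test script this is plain argparse plus a shared requests timeout (illustrative):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--timeout', type=int, default=300, help='per-request timeout in seconds; slow VLMs can need several minutes on first load')
args = parser.parse_args()

# every request then passes the shared timeout, e.g.:
# res = requests.post(url, json=payload, timeout=args.timeout)
```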
Other improvements:
- Add JoyCaption to recognized model families
- Reduce the set of BLIP models exercised to avoid reloading large models
- Better detection result validation for annotated images
Add prompt field to VQA endpoint and advanced settings to OpenCLIP endpoint
to achieve full parity between UI and API capabilities.
VLM endpoint changes:
- Add prompt field for custom text input (required for 'Use Prompt' task)
- Pass prompt to vqa.interrogate instead of hardcoded empty string
OpenCLIP endpoint changes:
- Add 7 optional per-request override fields: min_length, max_length,
chunk_size, min_flavors, max_flavors, flavor_count, num_beams
- Add get_clip_setting() helper for override support in openclip.py (sketch after this list)
- Apply overrides via update_interrogate_params() before interrogation
All new fields are optional with None defaults for backwards compatibility.
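The override helper likely reduces to a request-first lookup. A hedged sketch; the shared.opts option naming here is an assumption:

```python
from modules import shared  # SD.Next server-side options

def get_clip_setting(req, name: str, default=None):
    # Per-request override wins; otherwise fall back to the server-wide
    # setting (option key naming is assumed, not the actual schema).
    value = getattr(req, name, None)
    if value is not None:
        return value
    return getattr(shared.opts, f'interrogate_clip_{name}', default)
```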
Update API model field descriptions to match the hints in locale_en.json
for consistency between UI and API documentation.
Updated models:
- ReqInterrogate: clip_model, blip_model, mode
- ReqVQA: model, question, system
- ReqTagger: model, threshold, character_threshold, max_tags,
include_rating, sort_alpha, use_spaces, escape_brackets,
exclude_tags, show_scores
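These description strings map directly onto Pydantic Field metadata. An illustrative fragment with assumed defaults; the authoritative versions live in the API models:

```python
from typing import Optional
from pydantic import BaseModel, Field

class ReqTagger(BaseModel):
    model: str = Field(default='wd-v1-4-vit-tagger-v2', description='WaifuDiffusion/DeepBooru tagger model to use')
    threshold: float = Field(default=0.35, description='Minimum confidence for general tags')
    character_threshold: float = Field(default=0.85, description='Minimum confidence for character tags')
    max_tags: Optional[int] = Field(default=None, description='Limit the number of returned tags')
```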
Comprehensive test script for all Caption API endpoints:
- GET/POST /sdapi/v1/interrogate (OpenCLIP/DeepBooru)
- POST /sdapi/v1/vqa (VLM captioning)
- GET /sdapi/v1/vqa/models, /sdapi/v1/vqa/prompts
- POST /sdapi/v1/tagger
- GET /sdapi/v1/tagger/models
Usage: python cli/test-caption-api.py [--url URL] [--image PATH]
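A minimal call against one of these endpoints looks like the following; payload fields are illustrative, the script holds the authoritative versions:

```python
import base64
import requests

with open('test.png', 'rb') as f:  # any local test image
    image = base64.b64encode(f.read()).decode()

res = requests.post('http://127.0.0.1:7860/sdapi/v1/tagger', json={'image': image}, timeout=300)
print(res.json())
```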
Add comprehensive caption/interrogate API with documentation:
- GET /sdapi/v1/interrogate: List available interrogation models
- POST /sdapi/v1/interrogate: Interrogate with OpenCLIP/BLIP/DeepDanbooru
- POST /sdapi/v1/vqa: Caption with Vision-Language Models (VLM)
- GET /sdapi/v1/vqa: List available VLM models
- POST /sdapi/v1/vqa/batch: Batch caption multiple images
- POST /sdapi/v1/tagger: Tag images with WaifuDiffusion/DeepBooru
Updates:
- Add detailed docstrings with usage examples
- Fix analyze_image response parsing for Gradio update dicts (unwrap sketch below)
- Add request/response models for all endpoints
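Gradio handlers can return update dicts instead of raw values; the unwrap the analyze_image fix needs looks like this (sketch):

```python
def unwrap_gradio_update(value):
    # gr.update(...) returns {'__type__': 'update', 'value': ...}; extract
    # the payload before building the API response.
    if isinstance(value, dict) and value.get('__type__') == 'update':
        return value.get('value')
    return value
```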
Add comprehensive tooltips to Caption tab UI elements in locale_en.json:
- Add new "llm" section for shared LLM/VLM parameters:
System prompt, Prefill, Top-K, Top-P, Temperature, Num Beams,
Use Samplers, Thinking Mode, Keep Thinking Trace, Keep Prefill
- Add new "caption" section for caption-specific settings:
VLM, OpenCLIP, Tagger tab labels and all their parameters
including thresholds, tag formatting, batch options
- Consolidate accordion labels in ui_caption.py:
"Caption: Advanced Options" and "Caption: Batch" shared across
VLM, OpenCLIP, and Tagger tabs (localized to "Advanced Options"
and "Batch" in UI)
- Remove duplicate entries from missing section