LiteLLM Local LLM Setup

Status: 🟡 In progress — connected but cold-start timeouts persist
Started: 2026-05-11
Goal: Run Claude Code against local qwen3_8b / DeepSeek via LiteLLM proxy

Progress log

2026-05-11

Got /v1/chat/completions working ✅
Got /v1/messages (Anthropic pass-through) working ✅
Claude Code connecting but timing out on first message
Discovered root cause: 5000ms+ cold-start on local inference server

2026-05-12 (morning)

Added all Claude model name aliases to config.yaml ✅
Confirmed streaming works ✅
Cold-start still causing Claude Code retries (attempt 4/10)
DISABLE_INTERLEAVED_THINKING=1 set — reduced noise but timeouts persist

2026-05-12 (evening)

Tested with deepseek-r1-distill-qwen-32B model specifically

settings.json config used:

"ANTHROPIC_AUTH_TOKEN": "sk-i5Qh...",
"ANTHROPIC_BASE_URL": "http://172.18.0.1:4001",
"API_TIMEOUT_MS": "3000000",
"ANTHROPIC_DEFAULT_HAIKU_MODEL": "deepseek-r1-distill-qwen-32B",
"ANTHROPIC_DEFAULT_SONNET_MODEL": "deepseek-r1-distill-qwen-32B",
"ANTHROPIC_DEFAULT_OPUS_MODEL": "deepseek-r1-distill-qwen-32B",
"CLAUDE_CODE_SUBAGENT_MODEL": "deepseek-r1-distill-qwen-32B",
"CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "TRUE"

curl to LiteLLM works fine; Claude Code still fails with “Retrying in 0s · attempt 1/10”
Even with API_TIMEOUT_MS=3000000, Claude Code shows timeout immediately
Suspicion: Claude Code now requires /v1/responses endpoint (Responses API), not just /v1/messages
LiteLLM v1.66.3+ added Responses API support — need to verify version in use

Current config snapshot

# config.yaml (working)
model_list:
  - model_name: claude-sonnet-4-20250514
    litellm_params:
      model: openai/qwen3_8b
      api_base: http://192.168.35.9:8000/v1
      api_key: "dummy"
      supports_system_message: false
      timeout: 300
 
general_settings:
  enable_anthropic_pass_through: true

# .env (Claude Code side)
ANTHROPIC_BASE_URL=http://172.18.0.1:4000
ANTHROPIC_AUTH_TOKEN=sk-...
DISABLE_PROMPT_CACHING=1
DISABLE_INTERLEAVED_THINKING=1
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1

Open tasks — for this week

Check LiteLLM version: litellm --version — need ≥1.66.3 for Responses API support
Test /v1/responses endpoint directly with curl to confirm LiteLLM supports it
If not supported: pin LiteLLM to a version that has /v1/responses or find workaround
Implement pre-warm script (ping every 20s before claude)
Investigate if inference server has a /preload or keep-alive endpoint
Check docker logs -f litellm while launching claude to trace exact failure point

Reference

→ reference/litellm-claude-code-setup

Week of 2026-W20 (Mon 12 – Sun 17 May)

Debugging continued on May 12. Root cause hypothesis: LiteLLM may not support Claude Code’s /v1/responses endpoint (Responses API), only /v1/messages (Messages API). Claude Code calls Responses API; LiteLLM proxy must handle that endpoint for compatibility. DeepSeek-R1-Distill-Qwen-32B config confirmed: ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, API_TIMEOUT_MS=3000000, CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=TRUE all set correctly. curl to LiteLLM works but Claude Code still timeouts with “Retrying in 0s · attempt 1/10”. Keepalive ping strategy (ping every 20s before launching claude) identified as likely cold-start fix.

Status: 🔴 Blocked — /v1/responses endpoint gap in LiteLLM unresolved.

Open tasks (W21):

Investigate LiteLLM /v1/responses endpoint support — check release notes or GitHub issues
Implement 20s keepalive ping before launching claude to avoid cold-start timeouts
Test Tailscale on mobile hotspot to confirm ISP blocking (Thailand ISP blocking 192.200.0.0/24)

2026-05-29

Expanded LiteLLM config to include MiniMax cloud API alongside existing local Ollama model. MiniMax uses minimax/MiniMax-Text-01 model string with MINIMAX_API_KEY env var. Also troubleshooting rm -f /tmp/minimax-quota-monitor.state failing — likely a permissions or path issue, unresolved. Config now supports dual backend: local (Ollama/LM Studio) + cloud (MiniMax) under one proxy. → See reference/litellm-minimax-config

2026-05-30

Explored y-router as an alternative approach to the LiteLLM proxy problem. y-router converts Anthropic API format → OpenAI format, enabling Claude Code to use OpenRouter-hosted models (GLM-4.5, Gemini, etc.) without LiteLLM. Hosted version at https://cc.yovy.app requires zero local setup. This is a simpler path than fixing LiteLLM’s /v1/responses gap — worth testing for the CC-Autonomous use case. → See reference/claude-code-openrouter-y-router

Phuriwaj

Effort — LiteLLM Local LLM Setup