LiteLLM Local LLM Setup
Status: π‘ In progress β connected but cold-start timeouts persist
Started: 2026-05-11
Goal: Run Claude Code against local qwen3_8b / DeepSeek via LiteLLM proxy
Progress log
2026-05-11
- Got
/v1/chat/completionsworking β - Got
/v1/messages(Anthropic pass-through) working β - Claude Code connecting but timing out on first message
- Discovered root cause: 5000ms+ cold-start on local inference server
2026-05-12 (morning)
- Added all Claude model name aliases to
config.yamlβ - Confirmed streaming works β
- Cold-start still causing Claude Code retries (
attempt 4/10) DISABLE_INTERLEAVED_THINKING=1set β reduced noise but timeouts persist
2026-05-12 (evening)
- Tested with
deepseek-r1-distill-qwen-32Bmodel specifically settings.jsonconfig used:"ANTHROPIC_AUTH_TOKEN": "sk-i5Qh...", "ANTHROPIC_BASE_URL": "http://172.18.0.1:4001", "API_TIMEOUT_MS": "3000000", "ANTHROPIC_DEFAULT_HAIKU_MODEL": "deepseek-r1-distill-qwen-32B", "ANTHROPIC_DEFAULT_SONNET_MODEL": "deepseek-r1-distill-qwen-32B", "ANTHROPIC_DEFAULT_OPUS_MODEL": "deepseek-r1-distill-qwen-32B", "CLAUDE_CODE_SUBAGENT_MODEL": "deepseek-r1-distill-qwen-32B", "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "TRUE"- curl to LiteLLM works fine; Claude Code still fails with βRetrying in 0s Β· attempt 1/10β
- Even with
API_TIMEOUT_MS=3000000, Claude Code shows timeout immediately - Suspicion: Claude Code now requires
/v1/responsesendpoint (Responses API), not just/v1/messages - LiteLLM v1.66.3+ added Responses API support β need to verify version in use
Current config snapshot
# config.yaml (working)
model_list:
- model_name: claude-sonnet-4-20250514
litellm_params:
model: openai/qwen3_8b
api_base: http://192.168.35.9:8000/v1
api_key: "dummy"
supports_system_message: false
timeout: 300
general_settings:
enable_anthropic_pass_through: true# .env (Claude Code side)
ANTHROPIC_BASE_URL=http://172.18.0.1:4000
ANTHROPIC_AUTH_TOKEN=sk-...
DISABLE_PROMPT_CACHING=1
DISABLE_INTERLEAVED_THINKING=1
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1Open tasks β for this week
- Check LiteLLM version:
litellm --versionβ need β₯1.66.3 for Responses API support - Test
/v1/responsesendpoint directly with curl to confirm LiteLLM supports it - If not supported: pin LiteLLM to a version that has
/v1/responsesor find workaround - Implement pre-warm script (ping every 20s before
claude) - Investigate if inference server has a
/preloador keep-alive endpoint - Check
docker logs -f litellmwhile launchingclaudeto trace exact failure point
Reference
β reference/litellm-claude-code-setup
Week of 2026-W20 (Mon 12 β Sun 17 May)
Debugging continued on May 12. Root cause hypothesis: LiteLLM may not support Claude Codeβs /v1/responses endpoint (Responses API), only /v1/messages (Messages API). Claude Code calls Responses API; LiteLLM proxy must handle that endpoint for compatibility. DeepSeek-R1-Distill-Qwen-32B config confirmed: ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, API_TIMEOUT_MS=3000000, CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=TRUE all set correctly. curl to LiteLLM works but Claude Code still timeouts with βRetrying in 0s Β· attempt 1/10β. Keepalive ping strategy (ping every 20s before launching claude) identified as likely cold-start fix.
Status: π΄ Blocked β /v1/responses endpoint gap in LiteLLM unresolved.
Open tasks (W21):
- Investigate LiteLLM
/v1/responsesendpoint support β check release notes or GitHub issues - Implement 20s keepalive ping before launching
claudeto avoid cold-start timeouts - Test Tailscale on mobile hotspot to confirm ISP blocking (Thailand ISP blocking
192.200.0.0/24)
2026-05-29
Expanded LiteLLM config to include MiniMax cloud API alongside existing local Ollama model. MiniMax uses minimax/MiniMax-Text-01 model string with MINIMAX_API_KEY env var. Also troubleshooting rm -f /tmp/minimax-quota-monitor.state failing β likely a permissions or path issue, unresolved. Config now supports dual backend: local (Ollama/LM Studio) + cloud (MiniMax) under one proxy.
β See reference/litellm-minimax-config
2026-05-30
Explored y-router as an alternative approach to the LiteLLM proxy problem. y-router converts Anthropic API format β OpenAI format, enabling Claude Code to use OpenRouter-hosted models (GLM-4.5, Gemini, etc.) without LiteLLM. Hosted version at https://cc.yovy.app requires zero local setup. This is a simpler path than fixing LiteLLMβs /v1/responses gap β worth testing for the CC-Autonomous use case.
β See reference/claude-code-openrouter-y-router