Phuriwaj

LiteLLM Local LLM Setup

Status: 🟑 In progress β€” connected but cold-start timeouts persist
Started: 2026-05-11
Goal: Run Claude Code against local qwen3_8b / DeepSeek via LiteLLM proxy


Progress log

2026-05-11

  • Got /v1/chat/completions working βœ…
  • Got /v1/messages (Anthropic pass-through) working βœ…
  • Claude Code connecting but timing out on first message
  • Discovered root cause: 5000ms+ cold-start on local inference server

2026-05-12 (morning)

  • Added all Claude model name aliases to config.yaml βœ…
  • Confirmed streaming works βœ…
  • Cold-start still causing Claude Code retries (attempt 4/10)
  • DISABLE_INTERLEAVED_THINKING=1 set β€” reduced noise but timeouts persist

2026-05-12 (evening)

  • Tested with deepseek-r1-distill-qwen-32B model specifically
  • settings.json config used:
    "ANTHROPIC_AUTH_TOKEN": "sk-i5Qh...",
    "ANTHROPIC_BASE_URL": "http://172.18.0.1:4001",
    "API_TIMEOUT_MS": "3000000",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "deepseek-r1-distill-qwen-32B",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "deepseek-r1-distill-qwen-32B",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "deepseek-r1-distill-qwen-32B",
    "CLAUDE_CODE_SUBAGENT_MODEL": "deepseek-r1-distill-qwen-32B",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "TRUE"
  • curl to LiteLLM works fine; Claude Code still fails with β€œRetrying in 0s Β· attempt 1/10”
  • Even with API_TIMEOUT_MS=3000000, Claude Code shows timeout immediately
  • Suspicion: Claude Code now requires /v1/responses endpoint (Responses API), not just /v1/messages
  • LiteLLM v1.66.3+ added Responses API support β€” need to verify version in use

Current config snapshot

# config.yaml (working)
model_list:
  - model_name: claude-sonnet-4-20250514
    litellm_params:
      model: openai/qwen3_8b
      api_base: http://192.168.35.9:8000/v1
      api_key: "dummy"
      supports_system_message: false
      timeout: 300
 
general_settings:
  enable_anthropic_pass_through: true
# .env (Claude Code side)
ANTHROPIC_BASE_URL=http://172.18.0.1:4000
ANTHROPIC_AUTH_TOKEN=sk-...
DISABLE_PROMPT_CACHING=1
DISABLE_INTERLEAVED_THINKING=1
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1

Open tasks β€” for this week

  • Check LiteLLM version: litellm --version β€” need β‰₯1.66.3 for Responses API support
  • Test /v1/responses endpoint directly with curl to confirm LiteLLM supports it
  • If not supported: pin LiteLLM to a version that has /v1/responses or find workaround
  • Implement pre-warm script (ping every 20s before claude)
  • Investigate if inference server has a /preload or keep-alive endpoint
  • Check docker logs -f litellm while launching claude to trace exact failure point

Reference

β†’ reference/litellm-claude-code-setup

Week of 2026-W20 (Mon 12 – Sun 17 May)

Debugging continued on May 12. Root cause hypothesis: LiteLLM may not support Claude Code’s /v1/responses endpoint (Responses API), only /v1/messages (Messages API). Claude Code calls Responses API; LiteLLM proxy must handle that endpoint for compatibility. DeepSeek-R1-Distill-Qwen-32B config confirmed: ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, API_TIMEOUT_MS=3000000, CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=TRUE all set correctly. curl to LiteLLM works but Claude Code still timeouts with β€œRetrying in 0s Β· attempt 1/10”. Keepalive ping strategy (ping every 20s before launching claude) identified as likely cold-start fix.

Status: πŸ”΄ Blocked β€” /v1/responses endpoint gap in LiteLLM unresolved.

Open tasks (W21):

  • Investigate LiteLLM /v1/responses endpoint support β€” check release notes or GitHub issues
  • Implement 20s keepalive ping before launching claude to avoid cold-start timeouts
  • Test Tailscale on mobile hotspot to confirm ISP blocking (Thailand ISP blocking 192.200.0.0/24)

2026-05-29

Expanded LiteLLM config to include MiniMax cloud API alongside existing local Ollama model. MiniMax uses minimax/MiniMax-Text-01 model string with MINIMAX_API_KEY env var. Also troubleshooting rm -f /tmp/minimax-quota-monitor.state failing β€” likely a permissions or path issue, unresolved. Config now supports dual backend: local (Ollama/LM Studio) + cloud (MiniMax) under one proxy. β†’ See reference/litellm-minimax-config

2026-05-30

Explored y-router as an alternative approach to the LiteLLM proxy problem. y-router converts Anthropic API format β†’ OpenAI format, enabling Claude Code to use OpenRouter-hosted models (GLM-4.5, Gemini, etc.) without LiteLLM. Hosted version at https://cc.yovy.app requires zero local setup. This is a simpler path than fixing LiteLLM’s /v1/responses gap β€” worth testing for the CC-Autonomous use case. β†’ See reference/claude-code-openrouter-y-router