
Cloud GPU & Self-Hosted Models

Don't want to pay per-token API fees but don't have local GPU hardware? You can run open-source models on cloud GPUs and point OpenClaw at them. This guide covers GPU cloud providers, inference-as-a-service, configuration, cost comparisons, and model recommendations.

tip

For running models on your own hardware, see Local Models. This page focuses on self-hosted inference running on cloud GPUs.

Quick Decision Guide

  • Intermittent usage and best quality: stick with Anthropic/OpenAI API keys
  • Open models with zero infrastructure: Cerebras, Groq, or OpenRouter (pay per token)
  • Bursty self-hosted workloads: RunPod Serverless
  • 24/7 agents, privacy, or fine-tuning: rent a dedicated GPU (Vast.ai, RunPod, Lambda Labs)
  • Free experimentation: AMD Developer Cloud credits

GPU Cloud Providers

Dedicated GPU Cloud

Rent GPUs by the hour and run your own inference server (vLLM, Ollama, etc.).

| Provider | GPU | VRAM | Price/hr | Monthly (24/7) | Notes |
|---|---|---|---|---|---|
| Vast.ai | RTX 4090 | 24 GB | ~$0.44 | ~$320 | Auction model, cheapest |
| RunPod | RTX 4090 | 24 GB | $0.69 | ~$500 | Per-second billing |
| Lambda Labs | A100 PCIe | 40 GB | $1.25 | ~$900 | Simple pricing |
| RunPod | A100 80GB | 80 GB | $1.74 | ~$1,250 | Best for 70B models |
| CoreWeave | A100 80GB | 80 GB | $2.21 | ~$1,590 | Enterprise, InfiniBand |
| RunPod | H100 80GB | 80 GB | $2.79 | ~$2,010 | Fastest inference |
| AMD Dev Cloud | MI300X | 192 GB | Free | Free ($100 credits) | ~50 hours free |

GPU Size Guide

| GPU | VRAM | Max Model Size | Best For |
|---|---|---|---|
| RTX 4090 | 24 GB | 14B FP16, 32B Q4 | Daily driver, coding |
| A100 40GB | 40 GB | 32B FP16, 70B Q4 | Production serving |
| A100 80GB | 80 GB | 70B FP16, 120B Q4 | Large models |
| H100 80GB | 80 GB | 70B FP16, 120B Q4 | Maximum throughput |
| MI300X | 192 GB | 139B FP8 | Largest models (free credits!) |

RunPod Setup

RunPod has a native vLLM deployment template:

  1. Create a RunPod account at runpod.io
  2. Deploy a GPU Pod with the vLLM template
  3. Set the model name (e.g., Qwen/Qwen3-32B-Instruct)
  4. Set Max Model Length to 8192 (or your preferred context window)
  5. Note the endpoint URL once initialized
~/.openclaw/config.yml
brain:
  provider: "local"
  local:
    endpoint: "https://your-runpod-endpoint/v1"
    model: "Qwen/Qwen3-32B-Instruct"
    type: "openai-compatible"

RunPod Serverless

For bursty usage, RunPod Serverless scales to zero when idle:

| Tier | H100/hr | A100/hr | Billing |
|---|---|---|---|
| Flex Workers | $4.18 | $2.72 | Pay only when running |
| Active Workers | $3.35 | $2.17 | 30% off, always warm |
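
Serverless workers expose the same OpenAI-compatible API as a dedicated pod, so the standard local-provider config works. The endpoint URL below is only illustrative; copy the actual OpenAI-compatible URL shown for your endpoint in the RunPod console:

~/.openclaw/config.yml
brain:
  provider: "local"
  local:
    # Placeholder URL; use the OpenAI-compatible endpoint RunPod shows for your endpoint ID
    endpoint: "https://api.runpod.ai/v2/your-endpoint-id/openai/v1"
    model: "Qwen/Qwen3-32B-Instruct"
    type: "openai-compatible"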

Vast.ai Setup

Vast.ai uses an auction model, typically 20-30% cheaper than RunPod but less reliable:

  1. Browse available GPUs at vast.ai
  2. Select a GPU and specify your Docker image (e.g., vllm/vllm-openai:latest)
  3. Set your start command and model
  4. Note the endpoint URL
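
As a sketch of steps 2-3, the vllm/vllm-openai image accepts vLLM serve arguments directly, so a start command might look like this (model and context length are just examples):

# Serve an OpenAI-compatible endpoint on port 8000 inside the instance
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-32B-Instruct \
  --max-model-len 8192

Then point OpenClaw at the instance's mapped port with the usual /v1 path, as in the RunPod config above.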

AMD Developer Cloud (Free!)

AMD offers $100 free credits (~50 hours on MI300X with 192 GB HBM3 memory):

  1. Sign up at amd.com/developer
  2. Launch an MI300X instance with vLLM pre-installed
  3. Run models up to 139B parameters in FP8
  4. Additional credits available for public project showcases
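
Since vLLM comes pre-installed, serving a model is a single command. A minimal sketch, assuming you have access to the model weights (the model and context length below are examples only):

# Example: a 70B FP16 model fits comfortably in 192 GB of HBM3
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8090 \
  --max-model-len 8192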

Inference-as-a-Service

These providers host open-source models and expose OpenAI-compatible APIs. You pay per token instead of per GPU hour — no infrastructure to manage.

| Provider | 70B Model Cost | Speed | OpenClaw Integration | Best For |
|---|---|---|---|---|
| Cerebras | $0.10/M tokens | 2,957 tok/s | Native (CEREBRAS_API_KEY) | Cheapest per-token |
| Groq | Per-token | 18x faster | Native (GROQ_API_KEY) | Lowest latency |
| OpenRouter | Varies by route | Varies | Native (first-class) | Auto-cheapest routing |
| Together AI | ~$0.88/M tokens | Fast | openai-completions | Fine-tuning support |
| Fireworks AI | ~$0.90/M tokens | Fast | openai-completions | Batch inference (50% off) |
| Replicate | Per-prediction | Cold starts | openai-completions | Occasional use |
| Modal | Per GPU cycle | Scales to zero | Custom endpoint | Bursty workloads |

Cerebras Configuration

The cheapest per-token option at $0.10/M tokens with the fastest output speed.

~/.openclaw/env
CEREBRAS_API_KEY=your-key-here

OpenClaw picks it up natively — no additional config needed. Base URL: https://api.cerebras.ai/v1
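
To sanity-check the key before using it in OpenClaw, the standard OpenAI-compatible model listing endpoint works:

# Lists the models your key can access
curl -s https://api.cerebras.ai/v1/models \
  -H "Authorization: Bearer $CEREBRAS_API_KEY"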

Groq Configuration

LPU hardware delivers up to 18x faster inference than traditional GPUs.

~/.openclaw/env
GROQ_API_KEY=your-key-here

Native provider — set key and go.

OpenRouter Configuration

Meta-provider that aggregates Together AI, Fireworks, Groq, Cerebras, and more. Automatically routes to the cheapest or fastest provider.

~/.openclaw/env
OPENROUTER_API_KEY=your-key-here

Use models with the format openrouter/<author>/<slug>. First-class support in OpenClaw.
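
For example, to pin the brain to a specific route, reuse the brain config shown elsewhere on this page (the slug below is illustrative; pick any model listed on openrouter.ai):

~/.openclaw/config.yml
brain:
  # openrouter/<author>/<slug> format
  model: "openrouter/meta-llama/llama-3.1-70b-instruct"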

Custom OpenAI-Compatible Provider

For Together AI, Fireworks, or any provider with an OpenAI-compatible API:

~/.openclaw/openclaw.json
{
  "models": {
    "providers": {
      "together-ai": {
        "baseUrl": "https://api.together.xyz/v1",
        "apiKey": "${TOGETHER_API_KEY}",
        "api": "openai-completions",
        "models": [
          {"id": "meta-llama/Llama-3.1-70B-Instruct", "name": "Llama 3.1 70B"}
        ]
      }
    }
  }
}
danger

The api field is required. Omitting it causes a crash with "No API provider registered" (Issue #6054). Use "openai-completions" or "openai-responses".


Inference Server Options

If you're renting your own GPU, you need an inference server. All of these expose OpenAI-compatible endpoints.

| Server | Best For | Throughput | Setup |
|---|---|---|---|
| vLLM | Production, multi-user | Highest | pip install vllm |
| Ollama | Quick start | Good (single-user) | Single binary |
| llama.cpp | CPU-only, edge | Good | C++ build |
| TGI (HuggingFace) | Enterprise | High | Docker |
| TabbyAPI | Consumer GPUs (EXL2) | Fast | Python |
| LocalAI | Multi-backend | Varies | Docker |
| LM Studio | Desktop with GUI | Good | App download |

vLLM

The gold standard for cloud GPU inference:

pip install vllm

vllm serve Qwen/Qwen3-32B-Instruct \
  --host 0.0.0.0 \
  --port 8090 \
  --max-model-len 8192
~/.openclaw/config.yml
brain:
  provider: "local"
  local:
    endpoint: "http://your-server:8090/v1"
    model: "Qwen/Qwen3-32B-Instruct"
    type: "openai-compatible"

vLLM supports both NVIDIA (CUDA) and AMD (ROCm) GPUs.

TabbyAPI (Best for Consumer GPUs)

Uses ExllamaV2/V3 for maximum speed on RTX 3090/4090 with EXL2/GPTQ quantized models:

git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI && pip install -e .
python main.py --model Qwen3-32B-EXL2 --port 5000

Exposes both OpenAI-compatible and KoboldAI-compatible APIs.
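
Pointing OpenClaw at TabbyAPI looks just like the vLLM example, swapping in TabbyAPI's port (5000 in the command above); the model name should match whatever you loaded:

~/.openclaw/config.yml
brain:
  provider: "local"
  local:
    endpoint: "http://your-server:5000/v1"
    model: "Qwen3-32B-EXL2"
    type: "openai-compatible"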


OpenClaw Configuration Deep Dive

Config File Locations

| File | Purpose | Permissions |
|---|---|---|
| ~/.openclaw/openclaw.json | Provider settings, model IDs, agent config | 644 |
| ~/.openclaw/env | API keys and secrets | 600 |
| ~/.openclaw/config.yml | Gateway and brain config | 644 |
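
If you create these files by hand, a quick way to match the permissions above (standard chmod, nothing OpenClaw-specific):

# Secrets readable only by you; config files readable by the system
chmod 600 ~/.openclaw/env
chmod 644 ~/.openclaw/openclaw.json ~/.openclaw/config.yml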

Full Custom Provider Example

~/.openclaw/openclaw.json
{
  "models": {
    "providers": {
      "my-runpod-vllm": {
        "baseUrl": "https://abc123-8090.proxy.runpod.net/v1",
        "apiKey": "${RUNPOD_API_KEY}",
        "api": "openai-completions",
        "models": [
          {"id": "qwen3-32b", "name": "Qwen3 32B"},
          {"id": "deepseek-r1-distill-32b", "name": "DeepSeek R1 Distill 32B"}
        ]
      }
    }
  }
}

Key Configuration Rules

| Rule | Detail |
|---|---|
| baseUrl must end with /v1 | Common misconfiguration; OpenClaw won't connect without it |
| api field is required | Use "openai-completions" or "openai-responses" |
| Environment variable substitution | Use ${ENV_VAR} syntax in config files |
| Model format | Use provider-name/model-id when selecting models |
| Context window | Must match --max-model-len in your inference server |

Provider Priority

When multiple providers are configured, OpenClaw selects in this order:

Anthropic > OpenAI > OpenRouter > Gemini > xAI > Groq > Mistral > Cerebras > ...

To force a specific provider, set the model explicitly using provider/model-id format.
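
For example, to pin the brain to the custom provider defined above, a sketch reusing the provider and model IDs from the Full Custom Provider Example:

~/.openclaw/config.yml
brain:
  # provider-name/model-id pins selection to a specific provider
  model: "my-runpod-vllm/qwen3-32b"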

Known Issues

| Issue | Problem | Workaround / Notes |
|---|---|---|
| #6054 | Missing api field crashes gateway | Always include "api": "openai-completions" |
| #7211 | Sub-agents ignore local providers | Sub-agents may default to cloud APIs |
| #4857 | "Unhandled API in mapOptionsForApi" | Caused by missing api field |

Cost Comparison

Monthly Costs by Approach

| Approach | Monthly Cost | Quality | Latency |
|---|---|---|---|
| Anthropic/OpenAI API (light) | $5-20 | Best | Low |
| Cerebras/Groq/OpenRouter | $5-30 | Good | Very low |
| Anthropic/OpenAI API (heavy) | $50-100 | Best | Low |
| RunPod serverless (bursty) | $50-200 | Good | Medium |
| Vast.ai RTX 4090 (24/7) | ~$320 | Good | Low |
| RunPod RTX 4090 (24/7) | ~$500 | Good | Low |
| RunPod A100 80GB (24/7) | ~$1,250 | Great | Low |
| Anthropic/OpenAI (agent-heavy) | $200-700 | Best | Low |

When Self-Hosting Wins

  • You run agents 24/7 and would spend $200+/mo on API keys
  • Privacy is a hard requirement — no data leaves your infrastructure
  • You already have GPU hardware or free credits
  • You want to fine-tune models for your specific tasks

When API Keys Win

  • You need frontier-quality reasoning (Claude Opus, GPT-5 level)
  • Usage is intermittent ($5-50/mo)
  • You value time over money — API keys are zero maintenance
  • Complex multi-step agent tasks that need the best model available

Smart Multi-Model Routing

The most cost-effective approach uses different models for different tasks:

| Task | Recommended Model | Cost/M Tokens | Why |
|---|---|---|---|
| Heartbeats | Gemini 2.5 Flash-Lite | ~$0.50 | Ultra-cheap coordination |
| Sub-agents | DeepSeek R1 | ~$2.74 | Good reasoning, low cost |
| Coding | DeepSeek Coder v2 | ~$2 | Strong code generation |
| Complex tasks | Claude Opus / GPT-5 | $15-30 | Frontier quality |

Configure hybrid routing in OpenClaw:

~/.openclaw/config.yml
brain:
  provider: "anthropic"
  model: "claude-opus-4-6"

heartbeat_override:
  provider: "local"
  local:
    endpoint: "http://localhost:11434"
    model: "qwen3:14b"
    type: "ollama"

This approach can cut costs by 50%+ while maintaining quality for critical tasks.


Model Recommendations

By VRAM Budget

| VRAM | Model | Use Case |
|---|---|---|
| 8 GB (minimum) | Llama 3 8B, DeepSeek-R1 7B | Basic tasks, simple chat |
| 12-16 GB | Qwen3 14B, DeepSeek-R1-Distill-Qwen-14B | General use, coding |
| 24 GB (RTX 4090) | Qwen3 32B (Q4), DeepSeek-R1-Distill-32B (Q4) | Daily driver |
| 40-80 GB (A100) | Qwen3 32B (FP16), Llama 3.1 70B (Q4) | Production quality |
| 192 GB (MI300X) | Llama 3.1 70B (FP16), 139B models (FP8) | Near cloud-quality |

Community Consensus

From community discussions and benchmarks:

  • Qwen3 32B — the "daily driver" for coding, reasoning, and tool use
  • DeepSeek-R1-Distill-32B — best for complex planning and reasoning
  • Qwen 2.5 Coder 32B — when building new skills or code-heavy work
  • Qwen3 14B — recommended by OpenClaw docs for 16 GB setups
warning

Avoid models under 8B parameters for OpenClaw — they hallucinate tool calls and lose context mid-task. OpenClaw's agentic workflows require reliable instruction following.

Quality Reality Check

"The most impressive OpenClaw demos run on Claude Opus via API, not local models. If you want that level of capability locally, budget for serious hardware (48GB+ VRAM) and accept you're still behind the frontier." — GitHub Discussion #5719


Security

If running inference servers on cloud GPUs:

  • Always use a reverse proxy with TLS — never expose raw inference endpoints
  • Use VPN or Cloudflare Zero-Trust tunnel for remote access
  • Set API keys even on self-hosted servers (vLLM supports --api-key; see the sketch after this list)
  • Run openclaw security audit regularly
  • Restrict network access to only the OpenClaw gateway
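
A minimal sketch of the API-key point: start vLLM with a key, then give OpenClaw the same secret (the key name and values are placeholders):

# Requests without this bearer token are rejected by the server
vllm serve Qwen/Qwen3-32B-Instruct \
  --host 0.0.0.0 \
  --port 8090 \
  --api-key "$VLLM_API_KEY"

On the OpenClaw side, set the matching key in the provider entry (the apiKey field of a custom provider, as in the RunPod example above).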

See Security Hardening for full guidance.

See Also