Hosted vs Self-Hosted LLMs — Real Cost Analysis for Engineering Teams
What you actually pay when you run Llama 4, DeepSeek V4, or Qwen 3.5 on your own infra vs. Groq, Together, and Replicate. A clear breakdown for teams with 10–500 engineers.
Last Updated: May 2026
Everyone talks about "running LLMs locally" like it's free. It's not free. And hosted isn't always expensive. Here's the honest cost breakdown across the four realistic options for engineering teams in 2026.
The Four Options We Evaluated
1. Groq — hosted, specialized LPU inference hardware
2. Together AI — hosted, multi-model marketplace
3. Ollama + local GPU — self-hosted, your hardware
4. Replicate + open weights — hosted, but you bring your own model
Cost Model: 10 Engineers, Moderate Usage
Baseline: 10 engineers, each running ~20 AI-assisted tasks/day. Agentic tasks fan out into multiple model calls (tool loops, retries, context refreshes), averaging 800 input + 400 output tokens per call, which works out to roughly 160,000 calls/month across the team.
Monthly throughput: 160,000 calls × 1,200 tokens = 192M tokens/month
| Provider | Model | Cost per Million Tokens | Monthly Cost |
|---|---|---|---|
| Groq | Llama 4 70B | $0.08 | $15 |
| Groq | DeepSeek V4 | $0.12 | $23 |
| Together AI | Llama 4 70B | $0.90 | $173 |
| Replicate | Llama 4 70B | $1.10 | $211 |
| Ollama local (RTX 4090) | Llama 4 8B | $0 marginal (hardware amortized) | ~$120 (electricity) |
| Ollama local (A100 40GB) | Llama 4 70B | $0 marginal (hardware amortized) | ~$280 (electricity) |
Groq's LPU hardware is genuinely in a different cost league for this workload. At 192M tokens/month, you're paying less than $25 for the month.
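If you want to sanity-check the table, the arithmetic is one multiplication per row. A minimal Python sketch, using the per-million-token rates and the 192M-token baseline from above:

```python
# Minimal sketch: reproduce the hosted monthly costs from the table above.
# Rates ($/million tokens) come from the table; volume from the baseline.

MONTHLY_TOKENS = 160_000 * 1_200  # calls/month x tokens/call = 192M tokens

hosted_rates = {  # $ per million tokens, from the comparison table
    "Groq / Llama 4 70B": 0.08,
    "Groq / DeepSeek V4": 0.12,
    "Together AI / Llama 4 70B": 0.90,
    "Replicate / Llama 4 70B": 1.10,
}

for provider, rate in hosted_rates.items():
    monthly_cost = rate * MONTHLY_TOKENS / 1_000_000
    print(f"{provider}: ${monthly_cost:,.2f}/month")
```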
Where Self-Hosting Actually Wins
Self-hosting wins on two axes: privacy and high-volume batch processing.
If you're running 10M+ tokens per day (300M/month), the math flips, at least against marketplace pricing in the $0.90–$1.10 per million range (Groq's $0.08 is much harder to beat). At that scale, a GPU's fixed monthly cost spreads across so many tokens that local inference undercuts marketplace-hosted pricing by 60–80%. A break-even sketch follows below.
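Here's that break-even as a quick sketch. The assumptions are deliberately loud: the GPU capex is already sunk, so electricity (the ~$280/month A100 figure from the table) is the only marginal self-host cost, and Together's $0.90 per million tokens stands in for hosted pricing.

```python
# Rough break-even sketch: hosted per-token cost vs. self-hosted fixed cost.
# Assumptions (ours, not a vendor's): GPU capex already sunk, so the only
# marginal self-host cost is ~$280/month electricity (A100 figure above).

ELECTRICITY_PER_MONTH = 280.0   # $/month, A100 40GB estimate from the table
HOSTED_RATE = 0.90              # $/million tokens (Together AI, Llama 4 70B)

# Volume at which self-hosting becomes cheaper than hosted:
breakeven_millions = ELECTRICITY_PER_MONTH / HOSTED_RATE
print(f"Break-even: ~{breakeven_millions:.0f}M tokens/month")  # ~311M

# At 1B tokens/month, the hosted bill vs. the fixed electricity cost:
hosted_at_1b = HOSTED_RATE * 1_000  # 1B tokens = 1,000 million
savings = 1 - ELECTRICITY_PER_MONTH / hosted_at_1b
print(f"At 1B tokens/month: hosted ${hosted_at_1b:,.0f} vs. "
      f"self-hosted ${ELECTRICITY_PER_MONTH:,.0f} ({savings:.0%} cheaper)")
```

One caveat: at the A100's measured 38 tok/s (see the speed table below), a single card running flat out generates roughly 100M tokens/month, so volumes in this range also mean buying more GPUs, which pushes the break-even higher.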
More importantly: data sovereignty. If your LLM workload touches user data, healthcare records, financial data, or anything with GDPR implications, hosted providers may not be an option regardless of cost. Self-hosting in your own VPC is the only path that keeps data in your jurisdiction.
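In practice that looks like pointing your clients at an inference endpoint inside your own network. A minimal sketch against Ollama's /api/generate endpoint; the ollama.internal hostname and the llama4:70b tag are placeholders, not real defaults:

```python
import requests

# Hypothetical internal endpoint: the Ollama server lives inside your VPC,
# so prompts and completions never cross a third-party boundary.
OLLAMA_URL = "http://ollama.internal:11434/api/generate"

resp = requests.post(
    OLLAMA_URL,
    json={
        "model": "llama4:70b",  # placeholder tag; use whatever `ollama list` shows
        "prompt": "Summarize this patient record: ...",
        "stream": False,        # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```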
The Real Cost Nobody Talks About
GPU amortization math looks simple until you add:
- Engineering time: Someone needs to maintain the Ollama deployment, handle model updates, and manage GPU fleet health. Estimate 0.1–0.3 FTE ongoing for a small team; at $150k/year loaded cost, that's $15k–$45k/year (see the sketch after this list).
- Downtime risk: Self-hosted means you're on-call for GPU failures. Model restarts take 5–15 minutes. Do you have ops coverage?
- Model freshness: Hosted providers update models automatically. Self-hosting means you manage that lifecycle yourself.
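Folding engineering time into the earlier break-even math changes the picture dramatically. A sketch reusing the assumptions above, plus the 0.1–0.3 FTE estimate:

```python
# Sketch: self-hosted fixed costs including engineering time, vs. hosted
# per-token pricing. Assumptions: ~$280/mo electricity (table above),
# 0.1-0.3 FTE at $150k/yr loaded cost, Together AI's $0.90/M hosted rate.

ELECTRICITY = 280.0            # $/month
FTE_LOADED_ANNUAL = 150_000.0  # $/year, loaded cost per engineer
HOSTED_RATE = 0.90             # $/million tokens

for fte in (0.1, 0.3):
    ops_monthly = fte * FTE_LOADED_ANNUAL / 12
    fixed = ELECTRICITY + ops_monthly
    breakeven_m = fixed / HOSTED_RATE  # millions of tokens/month
    print(f"{fte:.1f} FTE: ${fixed:,.0f}/mo fixed -> "
          f"break-even ~{breakeven_m:,.0f}M tokens/month")
```

With ops salary included, the break-even lands around 1.7–4.5B tokens/month, which is why the recommendation matrix below reserves self-hosting on cost grounds for the >1B tokens/month tier.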
Speed Comparison (Tokens/Second)
| Setup | Model | Tokens/Second |
|---|---|---|
| Groq | DeepSeek V4 | 320 |
| Groq | Llama 4 70B | 280 |
| Together AI | Llama 4 70B | 85 |
| Ollama (RTX 4090) | Llama 4 8B | 45 |
| Ollama (RTX 4090) | Llama 4 70B | 12 |
| Ollama (A100 40GB) | Llama 4 70B | 38 |
For the 70B model, Groq is over 20x faster than a local RTX 4090 and roughly 7x faster than an A100. For interactive coding agents where latency matters, this is the difference between an agent that feels responsive and one that feels sluggish.
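To make the difference concrete, here's what those decode speeds mean for a single 400-token completion, the average output size from the baseline:

```python
# Sketch: wall-clock time to generate a 400-token completion (the baseline's
# average output size) at each measured decode speed from the table above.

OUTPUT_TOKENS = 400

speeds = {  # tokens/second, from the speed comparison table
    "Groq / DeepSeek V4": 320,
    "Groq / Llama 4 70B": 280,
    "Together AI / Llama 4 70B": 85,
    "Ollama RTX 4090 / Llama 4 8B": 45,
    "Ollama A100 / Llama 4 70B": 38,
    "Ollama RTX 4090 / Llama 4 70B": 12,
}

for setup, tps in speeds.items():
    print(f"{setup}: {OUTPUT_TOKENS / tps:.1f}s per 400-token response")
```

A 1.4-second response feels instant; a 33-second one breaks the interactive loop.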
Recommendation Matrix
| Scenario | Recommended |
|---|---|
| Startup with sensitive data, < 500M tokens/month | Self-hosted Ollama on AWS A100 spot |
| Growing team, need speed, limited ops capacity | Groq + Together AI mix |
| Enterprise with compliance requirements | Self-hosted on-prem or private cloud |
| Prototypes / MVPs | Groq (cheapest and fastest to start) |
| High-volume production (>1B tokens/month) | Self-hosted A100 cluster |
The hosted vs self-hosted debate usually isn't really about cost — it's about ops capacity and data constraints. If you have a small team without dedicated infra engineers, the mental overhead of self-hosting almost never pays back in money saved.