Running an LLM on your own hardware makes sense when your data cannot leave your infrastructure, when your call volume is high enough that API pricing becomes expensive, or when you need predictable availability without upstream rate limits. None of those reasons has gone away in 2026. What has changed is the quality you can get at each hardware tier.
Three decisions determine which model is right: what hardware you have or are willing to buy, what license restrictions you can live with, and what quality bar the task actually needs. The models below are organized by hardware class, not marketing tier, because the "smallest" model in a headline is often not the one that fits on what you actually own.
Tier 1: Consumer GPU or Mac mini (≤24GB VRAM)
This tier covers a single RTX 4090, any 24GB datacenter card, or an Apple Silicon Mac with 16–24GB unified memory. These machines are available for $500–$2,000. A surprising amount of production-grade work fits here.
Phi-4 Mini (3.8B) | Microsoft | MIT license
At 3.8 billion parameters, Phi-4 Mini needs about 7.6GB for its weights at FP16, which is tight on an 8GB card once the KV cache is counted, or well under 4GB at Q4 quantization. Microsoft trained it specifically for reasoning density at small size; it outperforms several 7B models on coding and math benchmarks while using roughly half the memory. The MIT license imposes no usage restrictions and no user-count limits; the only obligation is to keep the copyright and license notice with any copies you redistribute. For automated pipelines running lightweight extraction, classification, or structured output tasks, this is the lowest-overhead starting point.
Mistral 7B Instruct v0.3 | Mistral AI | Apache 2.0
Seven billion parameters, Apache 2.0 license, roughly 14GB VRAM at FP16 or about 4–5GB at Q4_K_M quantization. Mistral 7B remains one of the most widely deployed self-hosted models in 2026 because the inference tooling is mature: Ollama, vLLM, LM Studio, and llama.cpp all treat it as a first-class target. At Q4, it runs on an 8GB gaming GPU. The gap between Mistral 7B and frontier models is real for long-form reasoning, but for customer support triage, document classification, and simple RAG retrieval it is often good enough, with no per-token charge on any call.
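The memory figures quoted for these small models follow from parameter count times bits per weight. The sketch below is a rough estimate only: the function name is illustrative, the 4.85 bits per weight used for Q4_K_M is an assumed average rather than an exact specification, and the result covers weights only, not the KV cache or runtime buffers.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


FP16_BITS = 16.0
Q4_K_M_BITS = 4.85  # assumption: average bits per weight, varies by model

# Parameter counts (in billions) as quoted in the text.
for name, params in [("Phi-4 Mini", 3.8), ("Mistral 7B", 7.0)]:
    fp16 = weight_memory_gb(params, FP16_BITS)
    q4 = weight_memory_gb(params, Q4_K_M_BITS)
    print(f"{name}: ~{fp16:.1f} GB at FP16, ~{q4:.1f} GB at Q4_K_M")
```

When sizing a card, leave headroom on top of the weight estimate, since context length and batch size both grow the KV cache.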
Gemma 3 4B | Google | Gemma Terms of Use
Google's Gemma 3 4B sits at about 8GB VRAM in FP16. It performs well on multilingual tasks and holds its own on instruction following for its size class. The Gemma Terms allow commercial use but include restrictions: you cannot use the output to train competing foundation models, and Google reserves the right to modify terms. For internal tooling and products not competing with Google's AI offerings, the terms are workable. For anything sensitive to license changes over time, Apache 2.0 (Mistral) or MIT (Phi) gives you more stability.
Tier 2: Professional GPU or Mac M4 Max (40–80GB VRAM)
This tier covers a single A100 80GB, an H100 80GB, or a Mac M3/M4 Max with 64–128GB unified memory. Cloud rental runs $2–$3 per hour for an H100.
Llama 4 Scout (109B MoE, 17B active) | Meta | Llama 4 License
Scout uses a Mixture of Experts design: 109B total parameters, but only about 17B are active per token. This means inference speed and memory bandwidth behave more like a 17B model than a 109B one, but VRAM requirements are determined by total weights: at Q4_K_M quantization that is roughly 55–61GB, comfortably within a single H100 80GB. At very aggressive 1.78-bit quantization, it reportedly fits in 24GB, but at quality loss. The Llama 4 license allows commercial use with no royalty up to 700 million monthly active users, which covers virtually every startup and most mid-sized companies. Our breakdown of Mixture of Experts architecture explains why Scout's inference cost per token is lower than its parameter count suggests.
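The split between "memory follows total parameters" and "compute follows active parameters" can be made concrete with a small sketch. The class and method names here are illustrative, the 4.0–4.5 bits per weight range is an assumed approximation of 4-bit quantization, and the 2 FLOPs per active parameter per token figure is a common rule of thumb, not a measured value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float   # all experts; every one must be resident in memory
    active_params_b: float  # parameters actually used for each token

    def weight_memory_gb(self, bits_per_weight: float) -> float:
        # Memory scales with TOTAL parameters.
        return self.total_params_b * bits_per_weight / 8

    def gflops_per_token(self) -> float:
        # Rule of thumb: ~2 FLOPs per ACTIVE parameter per generated token.
        return 2 * self.active_params_b


scout = MoEModel("Llama 4 Scout", total_params_b=109, active_params_b=17)
low = scout.weight_memory_gb(4.0)
high = scout.weight_memory_gb(4.5)
print(f"{scout.name}: ~{low:.1f}-{high:.1f} GB of weights at 4.0-4.5 bits")
print(f"compute per token: ~{scout.gflops_per_token():.0f} GFLOPs "
      f"(a dense 109B model would need ~{2 * 109} GFLOPs)")
```

Under these assumptions the weight estimate lands in the same range as the Q4 figure quoted above, while per-token compute is that of a much smaller dense model.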
Qwen 2.5 32B | Alibaba | Apache 2.0
Qwen 2.5 32B needs about 64GB at FP16 or roughly 18GB at Q4 quantization, fitting cleanly in an A100 40GB at Q4. Apache 2.0 license, no strings. It performs particularly well on coding and mathematics tasks for its size, and the Qwen series has strong multilingual coverage across European and Asian languages. For teams building on-prem tools that handle mixed-language content, this is worth benchmarking before reaching for the 72B variant.
Tier 3: Multi-GPU server (160GB+ VRAM)
Two or more A100/H100 GPUs, or equivalent enterprise hardware. At this tier, on-prem requires real infrastructure commitment.
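Qwen 2.5 72B | Alibaba | Qwen License
72B parameters, released under the Qwen License rather than Apache 2.0: commercial use is allowed, but services with more than 100 million monthly active users must request a separate license. The weights take approximately 144GB at FP16, which fits across two A100 80GBs with little headroom for the KV cache, or 40–45GB at Q4 quantization, which fits on a single 80GB card with room to spare. This is where quality starts to approach frontier models on structured tasks: instruction following, summarization, long-document analysis. At Q4 on a two-GPU setup, throughput is roughly 30–50 tokens per second, fine for async batch workloads, borderline for interactive chat at high concurrency.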
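To see what a throughput figure in the 30–50 tokens per second range means for a workload, it helps to convert it into calls per day. This is a minimal sketch that treats the quoted figure as aggregate generation throughput and assumes a hypothetical 400 output tokens per call; the function name and the `utilization` parameter are illustrative, and prompt processing time is ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_calls_per_day(tokens_per_second: float,
                      output_tokens_per_call: int,
                      utilization: float = 1.0) -> int:
    """Upper bound on calls per day when generation speed is the bottleneck."""
    tokens_per_day = tokens_per_second * SECONDS_PER_DAY * utilization
    return int(tokens_per_day // output_tokens_per_call)


for tps in (30, 50):
    calls = max_calls_per_day(tps, output_tokens_per_call=400)
    print(f"{tps} tok/s -> up to {calls:,} calls/day at 400 output tokens each")
```

Real capacity will be lower than this bound, because no server runs at full utilization around the clock; passing a `utilization` below 1.0 gives a more conservative number.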
DeepSeek V3 | DeepSeek | MIT license
DeepSeek V3 uses a 671B-parameter MoE architecture (the same design family as Scout) with about 37B active parameters per token. The full weights require substantial multi-GPU hardware to run; cloud estimates put it at eight H200s for production inference. DeepSeek's MIT license is about as clean as terms get: use, modify, and redistribute freely, provided the copyright and license notice stay with redistributed copies. For teams who need frontier-class quality and can provision the hardware, V3 is one of the few open-weights models in that quality tier as of mid-2026.
License comparison
| Model | License | Commercial use | No usage cap | Open weights |
|---|---|---|---|---|
| Phi-4 Mini | MIT | Yes | Yes | Yes |
| Mistral 7B | Apache 2.0 | Yes | Yes | Yes |
| Gemma 3 4B | Gemma Terms | Yes (restrictions) | Yes | Yes |
| Llama 4 Scout | Llama 4 License | Yes (< 700M MAU) | No (700M MAU cap) | Yes |
| Qwen 2.5 32B | Apache 2.0 | Yes | Yes | Yes |
| Qwen 2.5 72B | Qwen License | Yes (< 100M MAU) | No (100M MAU cap) | Yes |
| DeepSeek V3 | MIT | Yes | Yes | Yes |
Apache 2.0 and MIT are the most permissive and legally predictable. If your legal team needs to sign off, start with those.
Cost: on-prem vs API
The math depends on utilization. Here is a concrete example:
A team running 500,000 calls per month at 1,000 input tokens and 400 output tokens each:
- Via GPT-4o API: at roughly $2.50/M input + $10/M output = $1,250 + $2,000 = $3,250/month
- Llama 4 Scout on a rented H100 ($2.50/hr): ~730 hours/month = $1,825/month (with no token cost, just time)
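- Llama 4 Scout on owned hardware (RTX 5090, ~$2,000 amortized over 3 years, plus $50/month electricity): roughly $100–120/month at this volume. A 32GB card holds Scout only at the very aggressive quantization mentioned earlier, with the quality loss that implies.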
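The three figures above can be reproduced with a few lines of arithmetic, which also makes it easy to substitute your own volumes and prices. This is a sketch using only the numbers quoted in the example; the function names are illustrative, and engineering time is not modeled.

```python
def api_cost(calls: int, in_tokens: int, out_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Monthly API bill in dollars for a fixed per-call token profile."""
    input_cost = calls * in_tokens / 1_000_000 * in_price_per_m
    output_cost = calls * out_tokens / 1_000_000 * out_price_per_m
    return input_cost + output_cost


def rented_gpu_cost(hourly_rate: float, hours_per_month: float = 730) -> float:
    """Monthly cost of a GPU rented around the clock."""
    return hourly_rate * hours_per_month


def owned_gpu_cost(hardware_price: float, amortization_months: int,
                   electricity_per_month: float) -> float:
    """Monthly cost of owned hardware: straight-line amortization plus power."""
    return hardware_price / amortization_months + electricity_per_month


api = api_cost(500_000, 1_000, 400, in_price_per_m=2.50, out_price_per_m=10.00)
rented = rented_gpu_cost(2.50)
owned = owned_gpu_cost(2_000, amortization_months=36, electricity_per_month=50)

print(f"API:    ${api:,.0f}/month")
print(f"Rented: ${rented:,.0f}/month")
print(f"Owned:  ${owned:,.0f}/month")
```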
On-prem hardware pays back quickly at high sustained volume. At low volume (under 50,000 calls/month), the API is typically cheaper once you factor in engineering time for setup and maintenance.
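The break-even point for a single RTX 4090 running Mistral 7B versus GPT-4o depends on how quickly avoided API billing covers the card. At $2.50/M input tokens, 1 million input tokens per day costs about $75 per month, so recovering a card that costs roughly $2,000 within six months takes on the order of 4–5 million input tokens per day of sustained usage (less once output tokens at $10/M are counted). Below that range, API wins. Above it, your GPU is cheaper than API billing within a few months.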
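A break-even estimate of this kind is easy to check for your own volume. The sketch below assumes a hypothetical $2,000 hardware cost and $30/month electricity (neither comes from a price list), counts only input tokens at $2.50/M, and ignores engineering time; the function name and defaults are illustrative.

```python
from typing import Optional


def payback_months(daily_input_tokens: float,
                   api_price_per_m: float = 2.50,
                   hardware_price: float = 2_000.0,      # assumption
                   electricity_per_month: float = 30.0,  # assumption
                   days_per_month: int = 30) -> Optional[float]:
    """Months until avoided API spend covers the hardware, or None if never."""
    api_per_month = (daily_input_tokens * days_per_month
                     / 1_000_000 * api_price_per_m)
    monthly_saving = api_per_month - electricity_per_month
    if monthly_saving <= 0:
        return None  # the API bill is smaller than the power bill
    return hardware_price / monthly_saving


for tokens in (1_000_000, 5_000_000, 20_000_000):
    months = payback_months(tokens)
    label = "never" if months is None else f"{months:.1f} months"
    print(f"{tokens:>12,} input tokens/day -> payback in {label}")
```

Counting output tokens as well would shorten the payback period, since they are billed at a higher rate in the example above.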
If you want to benchmark these models on your actual prompts before buying hardware, LLMTest routes to every major provider from a single endpoint. You can test quality and latency without committing to any one model. For a broader view of where API costs accumulate in production, the three hidden LLM costs covers the categories most teams miss before they calculate their break-even.
On-prem is not simpler than API. You own the inference server, the GPU maintenance, the model updates, and the uptime. Start with a clear workload profile and a specific privacy or cost requirement before committing; otherwise the operational overhead rarely pays off.