# Field Guide: Running Gemma 4 Locally
Local LLMs have quietly crossed an important threshold.
With the release of Gemma 4, you can now run a capable reasoning model locally—fast enough, cheap enough, and simple enough to actually use in production workflows.
## Step 0 — Install llama.cpp

llama.cpp now builds with CMake, so older `make`-based commands will fail.

### Mac (fastest path)

```shell
brew install llama.cpp
```

### Mac / Linux (recommended, latest version)

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 8
```

### NVIDIA GPU (CUDA)

```shell
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
```

### Verify the install

```shell
./build/bin/llama-cli --version
```
## Step 1 — Run Gemma 4 (One Line)

Start with this:

```shell
./build/bin/llama-cli -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 -c 32768 --jinja --reasoning-budget 0
```

That’s it. You are now running a modern reasoning model locally.
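If you'd rather talk to the model over HTTP than through the CLI, the same build also produces `llama-server`, which exposes an OpenAI-compatible `/v1/chat/completions` endpoint. Below is a minimal stdlib-only sketch for calling it; the default port 8080, the helper name `build_chat_request`, and the exact payload fields shown are illustrative assumptions, not something this guide prescribes:

```python
import json
import urllib.request

# Assumes llama-server is running locally, e.g.:
#   ./build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL
def build_chat_request(prompt: str, base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    """Build an OpenAI-compatible chat request for a local llama-server instance."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,   # mirrors the --temp setting used above
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (requires a running server):
#   with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#       reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```

Keeping the request construction in a plain function like this makes it easy to swap the local server for a hosted API later without touching the rest of your pipeline.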
## Running Other Model Variants

### Small Models (E2B / E4B)

E2B:

```shell
./build/bin/llama-cli -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_M --temp 1.0 --top-p 0.95 --top-k 64 -c 16384 --jinja --reasoning-budget 0
```

E4B:

```shell
./build/bin/llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M --temp 1.0 --top-p 0.95 --top-k 64 -c 32768 --jinja --reasoning-budget 0
```

### Large Model (31B)

```shell
./build/bin/llama-cli -hf unsloth/gemma-4-31B-it-GGUF:Q4_K_M --temp 1.0 --top-p 0.95 --top-k 64 -c 32768 --jinja
```
## Model Comparison

This is the only section where you need to think about model choice.

### Quick Summary
| Model | Speed | Quality | Hardware | Use Case |
|---|---|---|---|---|
| E2B | ⚡⚡⚡ | Low | 8–16GB | automation, tagging |
| E4B | ⚡⚡ | Medium | 16GB | OCR, parsing |
| 26B A4B | ⚡⚡ | High | 32–64GB | default, coding |
| 31B | ⚡ | Very High | 64GB+ | max quality |
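The table above reduces to a simple decision rule on available RAM. Here is a small helper that encodes it; the thresholds come from the rough hardware column, not from any official requirements, and the function name is purely illustrative:

```python
def pick_gemma4_variant(ram_gb: int, prefer_quality: bool = False) -> str:
    """Map available RAM (GB) to a Gemma 4 GGUF variant, per the table above."""
    if ram_gb >= 64 and prefer_quality:
        return "gemma-4-31B-it-GGUF:Q4_K_M"       # max quality
    if ram_gb >= 32:
        return "gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL"  # default, coding
    if ram_gb >= 16:
        return "gemma-4-E4B-it-GGUF:Q4_K_M"       # OCR, parsing
    return "gemma-4-E2B-it-GGUF:Q4_K_M"           # automation, tagging
```

When in doubt, default to 26B A4B: the table marks it as the best balance, and the 31B only wins when quality matters more than speed.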
## What Changed With Gemma 4
Local models used to force a compromise. You either ran something small and fast that struggled with anything non-trivial, or something large that was technically impressive but too slow and heavy to use day-to-day.
Gemma 4 is one of the first releases where that tradeoff starts to soften.
The change becomes obvious when you try the smaller models. In previous generations, 2B or 4B models were mostly limited to simple classification or basic text tasks. Anything involving structure—like parsing documents or extracting fields—would quickly break down. With Gemma 4, they still have limits, but they fail more gracefully. Instead of collapsing on slightly harder inputs, they remain usable. That shift is subtle but important, because it turns small models into something you can actually plug into pipelines rather than just experiment with.
The bigger shift comes from the 26B A4B model. On paper, it looks like a standard mid-sized model, but its behavior is different because of its Mixture-of-Experts architecture. Only a small portion of the model is active at any given time, which means it runs much closer to a small model in terms of speed while retaining much of the reasoning ability of a larger one. In practice, this removes the usual decision you had to make between “fast” and “capable.” For most local setups, this balance is what makes the model feel immediately useful rather than something you need to optimize around.
What this enables is a shift in how local workflows feel. Before, building a fully local pipeline—processing documents, extracting structure, embedding data, querying it—was possible, but fragile. You would constantly run into edge cases where the model simply wasn’t good enough. With Gemma 4, those workflows start to hold up. Not perfectly, but reliably enough that you can use them without constantly falling back to cloud APIs.
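The "fails gracefully rather than collapsing" point is exactly why local extraction pipelines usually pair a strict prompt with a tolerant parser. A minimal sketch of that pattern, with function names and prompt wording that are illustrative rather than from any particular library:

```python
import json
import re

def extraction_prompt(document: str, fields: list[str]) -> str:
    """Ask the model to return only a JSON object with the requested fields."""
    return (
        "Extract the following fields from the document and reply with a single "
        f"JSON object, nothing else. Fields: {', '.join(fields)}\n\n"
        f"Document:\n{document}"
    )

def parse_model_json(text: str) -> dict:
    """Pull the first JSON object out of a model reply, tolerating extra prose."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object in model output")
    return json.loads(match.group(0))
```

The tolerant parser is what lets a small model's imperfect output ("Sure! Here is the JSON...") still feed cleanly into the rest of a pipeline instead of crashing it.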
The real change is not a single feature or benchmark improvement. It’s that local models have crossed a threshold where they are no longer just interesting to run—they are practical to build on. Once that happens, the constraint shifts away from model capability and toward what you actually want to create with it.