Igor's Techno Club

Field Guide: Running Gemma 4 Locally

Local LLMs have quietly crossed an important threshold.

With the release of Gemma 4, you can now run a capable reasoning model locally—fast enough, cheap enough, and simple enough to actually use in production workflows.

Step 0 — Install llama.cpp

llama.cpp now uses CMake, so older make commands will fail.

Mac (fastest path)

brew install llama.cpp

Or build from source (works on any platform):

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j 8

NVIDIA GPU (CUDA)

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8

Verify install

./build/bin/llama-cli --version

(If you installed via Homebrew instead of building from source, the binary is already on your PATH: llama-cli --version.)

Step 1 — Run Gemma 4 (One Line)

Start with this:

./build/bin/llama-cli -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 -c 32768 --jinja --reasoning-budget 0

That’s it.

You are now running a modern reasoning model locally.
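For the production workflows mentioned above, you will usually want the model behind an HTTP endpoint rather than an interactive CLI. llama.cpp ships llama-server for this, which exposes an OpenAI-compatible API. A minimal sketch, assuming a server was started separately with the same -hf model reference and is listening on the default port 8080:

```python
# Sketch: query a local llama-server over its OpenAI-compatible API.
# Assumes the server is already running, e.g.:
#   ./build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL
import json
import urllib.request

def build_request(prompt, temperature=1.0, top_p=0.95):
    """Build an OpenAI-style chat completion payload using Gemma's recommended sampling settings."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }

def ask(prompt, url="http://127.0.0.1:8080/v1/chat/completions"):
    """Send one prompt to the local server and return the model's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Summarize llama.cpp in one sentence."))
```

The endpoint path and response shape follow the OpenAI chat-completions convention that llama-server implements; only the model reference and port are assumptions here.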

Running Other Model Variants

Small Models (E2B / E4B)

E2B

./build/bin/llama-cli -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_M --temp 1.0 --top-p 0.95 --top-k 64 -c 16384 --jinja --reasoning-budget 0

E4B

./build/bin/llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M --temp 1.0 --top-p 0.95 --top-k 64 -c 32768 --jinja --reasoning-budget 0

Large Model (31B)

./build/bin/llama-cli -hf unsloth/gemma-4-31B-it-GGUF:Q4_K_M --temp 1.0 --top-p 0.95 --top-k 64 -c 32768 --jinja

Model Comparison

This is the only section where you need to think about model choice.

Quick Summary

Model     Speed   Quality    Hardware   Use Case
E2B       ⚡⚡⚡     Low        8–16GB     automation, tagging
E4B       ⚡⚡      Medium     16GB       OCR, parsing
26B A4B   ⚡⚡      High       32–64GB    default, coding
31B       ⚡       Very High  64GB+      max quality
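The table reduces to a rough rule of thumb. A sketch, with thresholds taken from the hardware column above (they are not official requirements):

```python
def pick_gemma_variant(ram_gb: int) -> str:
    """Pick a Gemma 4 variant by available RAM, per the comparison table."""
    if ram_gb >= 64:
        return "31B"       # max quality
    if ram_gb >= 32:
        return "26B A4B"   # default, coding
    if ram_gb >= 16:
        return "E4B"       # OCR, parsing
    return "E2B"           # automation, tagging

print(pick_gemma_variant(48))  # → 26B A4B
```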


What Changed With Gemma 4

Local models used to force a compromise. You either ran something small and fast that struggled with anything non-trivial, or something large that was technically impressive but too slow and heavy to use day-to-day.

Gemma 4 is one of the first releases where that tradeoff starts to soften.

The change becomes obvious when you try the smaller models. In previous generations, 2B or 4B models were mostly limited to simple classification or basic text tasks. Anything involving structure—like parsing documents or extracting fields—would quickly break down. With Gemma 4, they still have limits, but they fail more gracefully. Instead of collapsing on slightly harder inputs, they remain usable. That shift is subtle but important, because it turns small models into something you can actually plug into pipelines rather than just experiment with.

The bigger shift comes from the 26B A4B model. On paper, it looks like a standard mid-sized model, but its behavior is different because of its Mixture-of-Experts architecture. Only a small portion of the model is active at any given time, which means it runs much closer to a small model in terms of speed while retaining much of the reasoning ability of a larger one. In practice, this removes the usual decision you had to make between “fast” and “capable.” For most local setups, this balance is what makes the model feel immediately useful rather than something you need to optimize around.
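The speed claim can be made concrete with back-of-the-envelope arithmetic. A sketch, assuming the "A4B" suffix means roughly 4B active parameters out of 26B total, and that per-token decode cost scales with active parameters:

```python
# Why a sparse MoE feels fast: per-token compute tracks ACTIVE parameters,
# not total. Assumption: "26B A4B" means ~26B total, ~4B active per token.
total_params = 26e9
active_params = 4e9

# Rough decoder cost: ~2 FLOPs per (active) parameter per generated token.
dense_flops_per_token = 2 * total_params
moe_flops_per_token = 2 * active_params

ratio = moe_flops_per_token / dense_flops_per_token
print(f"per-token compute vs a dense 26B model: {ratio:.0%}")  # ≈ 15%
```

Memory is the catch: all 26B parameters still have to fit in RAM, which is why the table above lists 32–64GB despite the small-model speed.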

What this enables is a shift in how local workflows feel. Before, building a fully local pipeline—processing documents, extracting structure, embedding data, querying it—was possible, but fragile. You would constantly run into edge cases where the model simply wasn’t good enough. With Gemma 4, those workflows start to hold up. Not perfectly, but reliably enough that you can use them without constantly falling back to cloud APIs.
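One such pipeline step, sketched as code: structured field extraction, with the model call injected so the same function works against any local endpoint. The prompt wording and the field names here are illustrative, not from the article:

```python
import json

def extract_fields(document: str, ask) -> dict:
    """Pull structured fields out of free text via a local model.

    `ask` is any callable mapping a prompt string to the model's reply,
    e.g. a thin wrapper around llama-server's HTTP API.
    """
    prompt = (
        "Extract the invoice number and total from the text below. "
        "Reply with JSON only, using keys 'number' and 'total'.\n\n" + document
    )
    return json.loads(ask(prompt))

# Illustration with a stand-in model so the sketch runs offline:
fake_model = lambda _prompt: '{"number": "INV-7", "total": 120.50}'
print(extract_fields("Invoice INV-7 ... total $120.50", fake_model))
```

Keeping the model behind a plain callable is also what makes "fails more gracefully" testable: you can swap in a smaller variant and see whether the JSON still parses.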

The real change is not a single feature or benchmark improvement. It’s that local models have crossed a threshold where they are no longer just interesting to run—they are practical to build on. Once that happens, the constraint shifts away from model capability and toward what you actually want to create with it.