Local AI in Your Terminal: Ollama + Gemma 4 + ShellGPT
Run Google's Gemma 4 locally with Ollama and wire it into your terminal workflow via ShellGPT — from shell commands to image OCR.
Running AI locally means zero API costs, no rate limits, and your data never leaves your machine. With Google's Gemma 4 model, Ollama, and ShellGPT, you can handle everything from generating shell commands to extracting text from images, all from the terminal.
Here's the full setup and the workflows that actually matter.
Why Run AI Locally?
Cloud APIs like OpenAI and Claude are great, but they come with tradeoffs:
| Approach | Cost | Privacy | Speed | Offline |
|---|---|---|---|---|
| Cloud API (OpenAI, Claude) | $5-20/mo | Data sent to server | Depends on network | No |
| Local Ollama + Gemma 4 | Free | 100% local | Fast on Apple Silicon | Yes |
| Hybrid (local + cloud) | Varies | Local for sensitive tasks | Best of both | Partial |
Local AI works best for quick shell tasks, file processing, and anything involving private data. Cloud APIs stay in the picture for complex reasoning and web-connected tasks.
What is Gemma 4?
Gemma 4 is Google DeepMind's latest open-weights model family, released April 2, 2026. It comes in four sizes:
| Model | RAM | Active params | Best for |
|---|---|---|---|
| `gemma4:e2b` | ~3 GB | 2B | Edge devices, fastest |
| `gemma4:e4b` | ~5 GB | 4B | Daily terminal use (default) |
| `gemma4:26b` | ~10 GB | 4B (MoE) | Quality close to 13B, speed of 4B |
| `gemma4:31b` | ~20 GB | 31B | Flagship, needs beefy RAM |
The e4b is the default when you pull gemma4. It runs great on Apple Silicon's unified memory. The 26B MoE variant is the sleeper pick: only 4B parameters activate per token, so you get near-13B quality at 4B speed.
Key upgrades over Gemma 3: built-in thinking mode, native function calling, 128K-256K context window, and image input on all sizes.
Setup: 4 Steps
Step 1: Install Ollama
Ollama manages and serves local models. On macOS:
brew install ollama
Or grab the latest directly:
curl -fsSL https://ollama.com/install.sh | sh
Start the server:
ollama serve
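Before going further, it's worth confirming the server is actually listening. Ollama answers HTTP on `localhost:11434`, and `/api/tags` (the endpoint behind `ollama list`) makes a cheap probe. A small helper (the `ollama_up` name is ours):

```shell
# ollama_up: print "up" if the local Ollama API responds, "down" otherwise
ollama_up() {
  if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
    echo "up"
  else
    echo "down"
  fi
}
```

Handy in scripts that should fall back to a cloud API when the server isn't running.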
Step 2: Pull Gemma 4
# Default (e4b, ~9.6 GB download)
ollama pull gemma4
# Or pick a specific size
ollama pull gemma4:e2b # smallest
ollama pull gemma4:26b # MoE sweet spot
Verify it's installed:
ollama list
# NAME SIZE
# gemma4:latest 9.6 GB
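If you're scripting this setup across machines, a guard that checks `ollama list` before pulling saves a multi-gigabyte re-download. A sketch (`has_model` is our name; it assumes the NAME column stays first, as in current Ollama releases):

```shell
# has_model: succeed if the named model is already pulled locally;
# `ollama list` prints a header row, then the model name in column 1
has_model() {
  ollama list | awk 'NR>1 {print $1}' | grep -qx "$1"
}
```

Usage: `has_model gemma4:latest || ollama pull gemma4`.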
Step 3: Create a Fast Model Variant
Gemma 4 has thinking mode enabled by default. For daily terminal use, you want a faster variant with thinking suppressed. Create a custom Modelfile:
cat > /tmp/Modelfile.fast << 'EOF'
FROM gemma4
SYSTEM "Do not use thinking mode. Respond directly without chain-of-thought reasoning. Be concise."
EOF
ollama create gemma4-fast -f /tmp/Modelfile.fast
Now you have two models:
- `gemma4-fast` for quick tasks (shell commands, short answers)
- `gemma4` for complex tasks (code review, detailed analysis)
Step 4: Wire Up ShellGPT
ShellGPT (sgpt) is a CLI tool that sends your prompts to an LLM. Point it at your local Ollama server by editing ~/.config/shell_gpt/.sgptrc:
DEFAULT_MODEL=gemma4-fast
OPENAI_API_KEY=dummy
OPENAI_BASE_URL=http://localhost:11434/v1
API_BASE_URL=http://localhost:11434/v1
REQUEST_TIMEOUT=120
ShellGPT now talks to Gemma 4 running locally.
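One loose end: the workflows below mention an `sgpt-full` command for queries where you want the full model, thinking included. ShellGPT doesn't ship that name; it's a small wrapper worth adding to `~/.zshrc`, built on ShellGPT's standard `--model` override:

```shell
# sgpt-full: same sgpt, but force the full gemma4 model (thinking enabled);
# --model overrides DEFAULT_MODEL from .sgptrc for this one invocation
sgpt-full() {
  sgpt --model gemma4 "$@"
}
```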
How the Modes Connect
ShellGPT uses different system prompts (invisible instructions) depending on what you ask it to do. Your question is the user prompt. The AI sees both, but you only type the user prompt.
You type something
│
▼
┌────────┐
│ sgpt │ reads .sgptrc (which model? where?)
└───┬────┘
│
├── sgpt "question" ──▶ loads ShellGPT.json as system prompt
│ tells AI: "be concise, no thinking"
│
├── sgpt -s / Ctrl+L ──▶ loads Shell Command Generator.json
│ tells AI: "command only, no words"
│
└── sgpt-full ──────────▶ loads ShellGPT.json BUT uses gemma4
(full model with thinking enabled)
System prompt + your question ──▶ Ollama ──▶ Gemma 4 ──▶ answer
Three config files control this:
1. ~/.config/shell_gpt/roles/ShellGPT.json (default mode)
Tells the AI: you're a programming assistant, keep answers under 100 words, use markdown, don't think step by step.
2. ~/.config/shell_gpt/roles/Shell Command Generator.json (shell mode)
Tells the AI: output only a valid shell command, no explanation, no markdown, don't think.
3. ~/.config/shell_gpt/.sgptrc (connection config)
Tells sgpt: use gemma4-fast model, connect to Ollama at localhost:11434, timeout after 120 seconds.
Quick Reference
| What you want | What you type |
|---|---|
| Turn English into a command | Type it, press Ctrl+L, press Enter |
| Quick question | sgpt "your question" |
| Detailed answer (with thinking) | sgpt-full "your question" |
| Read text from an image | ollama run gemma4-fast "read this" image.png |
| Fast answer via Ollama directly | ollama run gemma4-fast "your question" |
Ctrl+L: AI-Powered Command Line
The most frequent use case. Type what you want in plain English, press Ctrl+L, and it turns your words into a terminal command.
You type: find all python files changed this week
Press: Ctrl+L
It becomes: find . -name "*.py" -mtime -7
Press: Enter to run it
To set this up, add to your ~/.zshrc:
_sgpt_zsh() {
if [[ -n "$BUFFER" ]]; then
_sgpt_prev_cmd=$BUFFER
BUFFER+=" (loading...)"
zle -R
output=$(sgpt --shell --no-interaction <<< "$_sgpt_prev_cmd" 2>/dev/null)
exit_code=$?
if [[ -n "$output" && $exit_code -eq 0 ]]; then
BUFFER="$output"
print -s "# AI: $_sgpt_prev_cmd -> $output"
else
BUFFER="$_sgpt_prev_cmd"
zle -M "sgpt failed, restored original input"
fi
zle end-of-line
fi
}
zle -N _sgpt_zsh
bindkey ^l _sgpt_zsh
This also logs what AI generated to your shell history and restores your original input if sgpt fails.
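If you live in bash rather than zsh, the same trick works through readline: `bind -x` runs a function that rewrites the `READLINE_LINE` variable in place. A minimal version, adapted from ShellGPT's documented bash integration (error handling trimmed for brevity):

```shell
_sgpt_bash() {
  if [[ -n "$READLINE_LINE" ]]; then
    # Replace the typed English with the generated command
    READLINE_LINE=$(sgpt --shell --no-interaction <<< "$READLINE_LINE")
    # Move the cursor to the end of the rewritten line
    READLINE_POINT=${#READLINE_LINE}
  fi
}
# bind only works in interactive shells
[[ $- == *i* ]] && bind -x '"\C-l": _sgpt_bash'
```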
Image Processing and OCR
Gemma 4 supports image input on all model sizes. Pass an image path directly to Ollama:
# Extract text from a document image
ollama run gemma4-fast "Extract all text from this image" /path/to/document.png
# Describe a screenshot
ollama run gemma4-fast "What does this screenshot show?" ~/Desktop/screenshot.png
# Read a receipt
ollama run gemma4-fast "List all items and prices" receipt.jpg
We tested this on a battery test report cover page. Gemma 4 correctly read the model number (L135F72), chemistry (LiFePO4), serial number (EU7223092221143), and all specs without errors.
Note: Image input only works through ollama run, not through sgpt. ShellGPT doesn't pass image files to the API.
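For scripted OCR (batch jobs, cron), you can also skip the interactive CLI and hit Ollama's REST API directly: `/api/generate` accepts base64-encoded images in an `images` array. A sketch (the `ocr` helper name and the prompt are our choices):

```shell
# ocr: base64-encode an image and send it to Ollama's /api/generate,
# which is roughly what `ollama run` does under the hood
ocr() {
  local img_b64
  img_b64=$(base64 < "$1" | tr -d '\n')
  curl -s http://localhost:11434/api/generate -d '{
    "model": "gemma4-fast",
    "prompt": "Extract all text from this image",
    "images": ["'"$img_b64"'"],
    "stream": false
  }'
}
```

The reply comes back as JSON; pipe it through `jq -r .response` to get just the text.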
Chat Mode and Code Review
# Multi-turn conversation
sgpt --chat debug "I'm getting a segfault in my C program"
sgpt --chat debug "here's the backtrace: ..."
# Pipe a git diff for review
git diff | sgpt "review this diff for bugs"
# Explain unfamiliar code
cat script.py | sgpt "explain what this code does"
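A natural next step is wiring the diff review into your commit habit (the `review` name is ours; `--cached` limits the diff to staged hunks):

```shell
# review: send whatever is staged to the local model for a quick bug pass
review() {
  git diff --cached | sgpt "Review this diff for bugs and risky changes. Be concise."
}
```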
About Gemma 4's Thinking Mode
Gemma 4 has chain-of-thought thinking enabled by default. This means every query goes through an internal reasoning step before answering, which adds 30-60 seconds of latency.
The honest truth: there's no reliable way to fully disable thinking through ShellGPT. Ollama's --think=false flag only works for DeepSeek R1, Qwen 3, and GPT-OSS, not Gemma 4. System prompt instructions like "don't think" help but aren't guaranteed.
What works best (from community testing):
- The `gemma4-fast` Modelfile approach reduces thinking time through `ollama run` (~33s vs ~110s)
- Adding `<no thought>` tags in system prompts has "some resonance" with the model
- For guaranteed speed, use `ollama run gemma4-fast` directly instead of going through sgpt
Best Practices
- Keep Ollama running as a service so it's always ready. First query loads the model (~5 seconds), subsequent queries are instant.
- Use Ctrl+L for commands, `sgpt` for questions, `ollama run` for images. Each tool has its strength.
- Use `gemma4-fast` for 90% of tasks. Switch to `gemma4` (full thinking) only when you need deeper reasoning.
- Clean up old models to save disk: `ollama rm gemma3:4b` frees gigabytes.
- Set `REQUEST_TIMEOUT=120` in `.sgptrc`. Local models can take 30-60 seconds on complex prompts.
- Image tasks go through `ollama run` directly. ShellGPT can't pass image files.
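One more Modelfile trick: the long context window only materializes if Ollama allocates it, and Ollama's default `num_ctx` is a few thousand tokens, so long piped inputs get silently truncated. A long-context variant is one Modelfile away (the `gemma4-ctx` name and the 32768 value are our choices; larger windows cost proportionally more RAM):

```shell
# Modelfile for a long-context variant; num_ctx sets the context window
cat > /tmp/Modelfile.ctx << 'EOF'
FROM gemma4
PARAMETER num_ctx 32768
EOF
```

Then `ollama create gemma4-ctx -f /tmp/Modelfile.ctx`, and reach for `gemma4-ctx` when piping whole files or large diffs.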
Key Takeaways
- Ollama + Gemma 4 gives you a capable local AI that runs free on Apple Silicon
- ShellGPT bridges the gap between your terminal and the model, with different modes for different tasks
- Three ways to use it: Ctrl+L (commands), sgpt (questions), ollama run (images)
- Create a `gemma4-fast` variant for daily use, keep `gemma4` for deep thinking tasks
- Thinking mode can't be fully disabled through sgpt, but the Modelfile approach helps through `ollama run`
- Keep cloud APIs for web search and complex reasoning. Local AI handles the rest.