Local AI in Your Terminal: Ollama + Gemma 4 + ShellGPT
Run Google's Gemma 4 locally with Ollama and wire it into your terminal workflow via ShellGPT — from shell commands to image OCR.
Running AI locally means zero API costs, no rate limits, and your data never leaves your machine. With Google's Gemma 4 model, Ollama, and ShellGPT, you can handle everything from generating shell commands to extracting text from images, all from the terminal.
Here's the full setup and the workflows that actually matter.
Why Run AI Locally?
Cloud APIs like OpenAI and Claude are great, but they come with tradeoffs:
| Approach | Cost | Privacy | Speed | Offline |
|---|---|---|---|---|
| Cloud API (OpenAI, Claude) | $5-20/mo | Data sent to server | Depends on network | No |
| Local Ollama + Gemma 4 | Free | 100% local | Fast on Apple Silicon | Yes |
| Hybrid (local + cloud) | Varies | Local for sensitive tasks | Best of both | Partial |
Local AI works best for quick shell tasks, file processing, and anything involving private data. Cloud APIs stay in the picture for complex reasoning and web-connected tasks.
What is Gemma 4?
Gemma 4 is Google DeepMind's latest open-weights model family, released April 2, 2026. It comes in four sizes:
| Model | RAM | Active params | Best for |
|---|---|---|---|
| `gemma4:e2b` | ~3 GB | 2B | Edge devices, fastest |
| `gemma4:e4b` | ~5 GB | 4B | Daily terminal use (default) |
| `gemma4:26b` | ~10 GB | 4B (MoE) | Quality close to 13B, speed of 4B |
| `gemma4:31b` | ~20 GB | 31B | Flagship, needs beefy RAM |
The e4b is the default when you pull gemma4. It runs great on Apple Silicon's unified memory. The 26B MoE variant is the sleeper pick: only 4B parameters activate per token, so you get near-13B quality at 4B speed.
Key upgrades over Gemma 3: built-in thinking mode, native function calling, 128K-256K context window, and image input on all sizes.
Setup: 4 Steps
Step 1: Install Ollama
Ollama manages and serves local models. On macOS:
brew install ollama
Or grab the latest directly:
curl -fsSL https://ollama.com/install.sh | sh
Start the server:
ollama serve
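Before going further, it's worth confirming the server is actually listening. Ollama answers HTTP on `localhost:11434`, and `/api/tags` (the endpoint behind `ollama list`) makes a cheap probe. A small helper (the `ollama_up` name is ours):

```shell
# ollama_up: print "up" if the local Ollama API responds, "down" otherwise
ollama_up() {
  if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
    echo "up"
  else
    echo "down"
  fi
}
```

Handy in scripts that should fall back to a cloud API when the server isn't running.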
Step 2: Pull Gemma 4
# Default (e4b, ~9.6 GB download)
ollama pull gemma4
# Or pick a specific size
ollama pull gemma4:e2b # smallest
ollama pull gemma4:26b # MoE sweet spot
Verify it's installed:
ollama list
# NAME SIZE
# gemma4:latest 9.6 GB
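If you're scripting this setup across machines, a guard that checks `ollama list` before pulling saves a multi-gigabyte re-download. A sketch (`has_model` is our name; it assumes the NAME column stays first, as in current Ollama releases):

```shell
# has_model: succeed if the named model is already pulled locally;
# `ollama list` prints a header row, then the model name in column 1
has_model() {
  ollama list | awk 'NR>1 {print $1}' | grep -qx "$1"
}
```

Usage: `has_model gemma4:latest || ollama pull gemma4`.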
Step 3: Create a Fast Model Variant
Gemma 4 has thinking mode enabled by default. For daily terminal use, you want a faster variant with thinking suppressed. Create a custom Modelfile:
cat > /tmp/Modelfile.fast << 'EOF'
FROM gemma4
SYSTEM "Do not use thinking mode. Respond directly without chain-of-thought reasoning. Be concise."
EOF
ollama create gemma4-fast -f /tmp/Modelfile.fast
Now you have two models:
- `gemma4-fast` for quick tasks (shell commands, short answers)
- `gemma4` for complex tasks (code review, detailed analysis)
Step 4: Wire Up ShellGPT
ShellGPT (sgpt) is a CLI tool that sends your prompts to an LLM. Point it at your local Ollama server by editing ~/.config/shell_gpt/.sgptrc:
DEFAULT_MODEL=gemma4-fast
OPENAI_API_KEY=dummy
OPENAI_BASE_URL=http://localhost:11434/v1
API_BASE_URL=http://localhost:11434/v1
REQUEST_TIMEOUT=120
ShellGPT now talks to Gemma 4 running locally.
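One loose end: the workflows below mention an `sgpt-full` command for queries where you want the full model, thinking included. ShellGPT doesn't ship that name; it's a small wrapper worth adding to `~/.zshrc`, built on ShellGPT's standard `--model` override:

```shell
# sgpt-full: same sgpt, but force the full gemma4 model (thinking enabled);
# --model overrides DEFAULT_MODEL from .sgptrc for this one invocation
sgpt-full() {
  sgpt --model gemma4 "$@"
}
```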
How the Modes Connect
ShellGPT uses different system prompts (invisible instructions) depending on what you ask it to do. Your question is the user prompt. The AI sees both, but you only type the user prompt.
You type something
│
▼
┌────────┐
│ sgpt │ reads .sgptrc (which model? where?)
└───┬────┘
│
├── sgpt "question" ──▶ loads ShellGPT.json as system prompt
│ tells AI: "be concise, no thinking"
│
├── sgpt -s / Ctrl+L ──▶ loads Shell Command Generator.json
│ tells AI: "command only, no words"
│
└── sgpt-full ──────────▶ loads ShellGPT.json BUT uses gemma4
(full model with thinking enabled)
System prompt + your question ──▶ Ollama ──▶ Gemma 4 ──▶ answer
Three config files control this:
1. ~/.config/shell_gpt/roles/ShellGPT.json (default mode)
Tells the AI: you're a programming assistant, keep answers under 100 words, use markdown, don't think step by step.
2. ~/.config/shell_gpt/roles/Shell Command Generator.json (shell mode)
Tells the AI: output only a valid shell command, no explanation, no markdown, don't think.
3. ~/.config/shell_gpt/.sgptrc (connection config)
Tells sgpt: use gemma4-fast model, connect to Ollama at localhost:11434, timeout after 120 seconds.
Quick Reference
| What you want | What you type |
|---|---|
| Turn English into a command | Type it, press Ctrl+L, press Enter |
| Quick question | sgpt "your question" |
| Detailed answer (with thinking) | sgpt-full "your question" |
| Read text from an image | ollama run gemma4-fast "read this" image.png |
| Fast answer via Ollama directly | ollama run gemma4-fast "your question" |
Ctrl+L: AI-Powered Command Line
The most frequent use case. Type what you want in plain English, press Ctrl+L, and it turns your words into a terminal command.
You type: find all python files changed this week
Press: Ctrl+L
It becomes: find . -name "*.py" -mtime -7
Press: Enter to run it
To set this up, add to your ~/.zshrc:
_sgpt_zsh() {
if [[ -n "$BUFFER" ]]; then
_sgpt_prev_cmd=$BUFFER
BUFFER+=" (loading...)"
zle -R
output=$(sgpt --shell --no-interaction <<< "$_sgpt_prev_cmd" 2>/dev/null)
exit_code=$?
if [[ -n "$output" && $exit_code -eq 0 ]]; then
BUFFER="$output"
print -s "# AI: $_sgpt_prev_cmd -> $output"
else
BUFFER="$_sgpt_prev_cmd"
zle -M "sgpt failed, restored original input"
fi
zle end-of-line
fi
}
zle -N _sgpt_zsh
bindkey ^l _sgpt_zsh
This also logs what AI generated to your shell history and restores your original input if sgpt fails.
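If you live in bash rather than zsh, the same trick works through readline: `bind -x` runs a function that rewrites the `READLINE_LINE` variable in place. A minimal version, adapted from ShellGPT's documented bash integration (error handling trimmed for brevity):

```shell
_sgpt_bash() {
  if [[ -n "$READLINE_LINE" ]]; then
    # Replace the typed English with the generated command
    READLINE_LINE=$(sgpt --shell --no-interaction <<< "$READLINE_LINE")
    # Move the cursor to the end of the rewritten line
    READLINE_POINT=${#READLINE_LINE}
  fi
}
# bind only works in interactive shells
[[ $- == *i* ]] && bind -x '"\C-l": _sgpt_bash'
```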
Image Processing and OCR
Gemma 4 supports image input on all model sizes. Pass an image path directly to Ollama:
# Extract text from a document image
ollama run gemma4-fast "Extract all text from this image" /path/to/document.png
# Describe a screenshot
ollama run gemma4-fast "What does this screenshot show?" ~/Desktop/screenshot.png
# Read a receipt
ollama run gemma4-fast "List all items and prices" receipt.jpg
We tested this on a battery test report cover page. Gemma 4 correctly read the model number (L135F72), chemistry (LiFePO4), serial number (EU7223092221143), and all specs without errors.
Note: Image input only works through ollama run, not through sgpt. ShellGPT doesn't pass image files to the API.
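For scripted OCR (batch jobs, cron), you can also skip the interactive CLI and hit Ollama's REST API directly: `/api/generate` accepts base64-encoded images in an `images` array. A sketch (the `ocr` helper name and the prompt are our choices):

```shell
# ocr: base64-encode an image and send it to Ollama's /api/generate,
# which is roughly what `ollama run` does under the hood
ocr() {
  local img_b64
  img_b64=$(base64 < "$1" | tr -d '\n')
  curl -s http://localhost:11434/api/generate -d '{
    "model": "gemma4-fast",
    "prompt": "Extract all text from this image",
    "images": ["'"$img_b64"'"],
    "stream": false
  }'
}
```

The reply comes back as JSON; pipe it through `jq -r .response` to get just the text.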
Chat Mode and Code Review
# Multi-turn conversation
sgpt --chat debug "I'm getting a segfault in my C program"
sgpt --chat debug "here's the backtrace: ..."
# Pipe a git diff for review
git diff | sgpt "review this diff for bugs"
# Explain unfamiliar code
cat script.py | sgpt "explain what this code does"
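A natural next step is wiring the diff review into your commit habit (the `review` name is ours; `--cached` limits the diff to staged hunks):

```shell
# review: send whatever is staged to the local model for a quick bug pass
review() {
  git diff --cached | sgpt "Review this diff for bugs and risky changes. Be concise."
}
```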
About Gemma 4's Thinking Mode
Gemma 4 has chain-of-thought thinking enabled by default. This means every query goes through an internal reasoning step before answering, which adds 30-60 seconds of latency.
The honest truth: there's no reliable way to fully disable thinking through ShellGPT. Ollama's --think=false flag only works for DeepSeek R1, Qwen 3, and GPT-OSS, not Gemma 4. System prompt instructions like "don't think" help but aren't guaranteed.
What works best (from community testing):
- The `gemma4-fast` Modelfile approach reduces thinking time through `ollama run` (~33s vs ~110s)
- Adding `<no thought>` tags in system prompts has "some resonance" with the model
- For guaranteed speed, use `ollama run gemma4-fast` directly instead of going through sgpt
Best Practices
- Keep Ollama running as a service so it's always ready. First query loads the model (~5 seconds), subsequent queries are instant.
- Use Ctrl+L for commands, `sgpt` for questions, `ollama run` for images. Each tool has its strength.
- Use `gemma4-fast` for 90% of tasks. Switch to `gemma4` (full thinking) only when you need deeper reasoning.
- Clean up old models to save disk: `ollama rm gemma3:4b` frees gigabytes.
- Set `REQUEST_TIMEOUT=120` in `.sgptrc`. Local models can take 30-60 seconds on complex prompts.
- Image tasks go through `ollama run` directly. ShellGPT can't pass image files.
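One more Modelfile trick: the long context window only materializes if Ollama allocates it, and Ollama's default `num_ctx` is a few thousand tokens, so long piped inputs get silently truncated. A long-context variant is one Modelfile away (the `gemma4-ctx` name and the 32768 value are our choices; larger windows cost proportionally more RAM):

```shell
# Modelfile for a long-context variant; num_ctx sets the context window
cat > /tmp/Modelfile.ctx << 'EOF'
FROM gemma4
PARAMETER num_ctx 32768
EOF
```

Then `ollama create gemma4-ctx -f /tmp/Modelfile.ctx`, and reach for `gemma4-ctx` when piping whole files or large diffs.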
Key Takeaways
- Ollama + Gemma 4 gives you a capable local AI that runs free on Apple Silicon
- ShellGPT bridges the gap between your terminal and the model, with different modes for different tasks
- Three ways to use it: Ctrl+L (commands), sgpt (questions), ollama run (images)
- Create a `gemma4-fast` variant for daily use, keep `gemma4` for deep thinking tasks
- Thinking mode can't be fully disabled through sgpt, but the Modelfile approach helps through `ollama run`
- Keep cloud APIs for web search and complex reasoning. Local AI handles the rest.