Run an AI Agent with Ollama (No API Key Needed)

Cloud AI services charge per token, require API keys, and send your data to external servers. If you want a private AI agent that costs nothing to run, Ollama lets you run open-source models directly on your hardware.

This guide sets up PocketPaw with Ollama for a completely self-contained AI agent. No accounts, no API keys, no internet needed for inference.

Hardware Requirements

Ollama runs models on your CPU or GPU. Here’s what you need:

Model SizeRAM NeededSpeedQuality
3B (e.g., phi-3)2-4 GBFastBasic tasks, simple Q&A
7B (e.g., qwen2.5:7b)4-8 GBGoodGeneral purpose, recommended starting point
14B (e.g., qwen2.5:14b)8-16 GBModerateBetter reasoning, code generation
32B+ (e.g., qwen2.5:32b)16-32 GBSlowerNear cloud-quality for most tasks
Info

GPU acceleration makes a huge difference. With a decent NVIDIA or Apple Silicon GPU, even 14B models respond in seconds. CPU-only works but expect longer wait times on larger models.

Setup

Install Ollama

macOS or Linux:

Terminal window
curl -fsSL https://ollama.com/install.sh | sh

Windows: Download from ollama.com/download

Verify it’s running:

Terminal window
ollama --version

Pull a model

Start with a 7B model for a good balance of speed and quality:

Terminal window
ollama pull qwen2.5:7b

Other good options:

Terminal window
ollama pull llama3.1:8b # Meta's Llama 3.1
ollama pull mistral:7b # Mistral AI's flagship
ollama pull gemma2:9b # Google's Gemma 2
ollama pull deepseek-r1:7b # DeepSeek reasoning model

Install PocketPaw

Terminal window
pip install pocketpaw

Configure for Ollama

Terminal window
export POCKETPAW_LLM_PROVIDER="ollama"
export POCKETPAW_OLLAMA_MODEL="qwen2.5:7b"

That’s all the configuration you need. No API keys.

Start your agent

Terminal window
pocketpaw

Open http://localhost:8888 and start chatting. Everything runs on your machine.

Choosing the Right Model

Different models suit different tasks:

Use CaseRecommended ModelWhy
General assistantqwen2.5:7bGood all-rounder, fast, handles most tasks
Code generationdeepseek-coder-v2:16bPurpose-built for code, understands many languages
Reasoning & mathdeepseek-r1:7bChain-of-thought reasoning, step-by-step problem solving
Creative writingllama3.1:8bStrong at narrative, varied writing styles
Quick responsesphi-3:3.8bSmallest useful model, very fast on modest hardware

You can switch models anytime by changing the environment variable:

Terminal window
export POCKETPAW_OLLAMA_MODEL="deepseek-coder-v2:16b"

Tool Support with Local Models

PocketPaw’s tools (web search, file management, browser) work with Ollama models, but tool-calling quality depends on the model. Larger models handle tools more reliably.

For the best tool-calling experience with local models, use:

  • qwen2.5:14b or larger
  • llama3.1:8b (good tool-calling support)
  • mistral:7b (decent tool support)

Smaller models (3B) may struggle with complex multi-tool tasks.

Mixing Local and Cloud

You can use Ollama for most tasks and a cloud provider for complex ones:

Terminal window
# Default to Ollama
export POCKETPAW_LLM_PROVIDER="ollama"
export POCKETPAW_OLLAMA_MODEL="qwen2.5:7b"
# Optional: add cloud key for the model router to use on hard tasks
export POCKETPAW_ANTHROPIC_API_KEY="sk-ant-..."

PocketPaw’s model router can automatically escalate complex tasks to a cloud model while keeping simple queries local.

Performance Tips

  1. Use GPU acceleration. Ollama auto-detects NVIDIA CUDA and Apple Metal.
  2. Keep the model loaded. First response is slow (loading), subsequent ones are fast.
  3. Match model to RAM. If the model is larger than available RAM, it spills to disk and gets very slow.
  4. Close other heavy apps. ML inference is memory-hungry.

Next Steps