Ollama: Run AI Models on Your Machine Without the Cloud

Updated March 2026: This article has been revised to reflect the latest versions of Ollama (v0.18.x) and newly available models, including Llama 4, Gemma 3, Phi-4, Qwen 3, and DeepSeek-R1.

What is Ollama?

Imagine being able to talk to an AI like ChatGPT, but without internet, without subscriptions, and without your data ever leaving your machine. Sounds good, right? That’s exactly what Ollama lets you do.

Ollama is an open-source tool that lets you download and run large language models (LLMs) on your own computer. Simple as that. No third-party APIs, no tokens running out, no worrying about someone reading your prompts. Everything runs locally.

Think of Ollama as the Docker of AI models: you tell it which model you want, it downloads it, and spins it up ready for you to ask questions from your terminal or from any app that connects to its local API.


Why Run an LLM Locally?

You might be thinking: “If ChatGPT, Claude, and Gemini already exist… why bother running something on my machine?” Great question. Here’s why:

  • Total privacy: your data never leaves your computer. If you work with proprietary code, sensitive documents, or client data, this is huge.
  • No recurring costs: you don’t pay for tokens or monthly subscriptions. Once the model is downloaded, it’s yours.
  • No internet required: works offline. On a plane, on the subway, in a village with no signal. Doesn’t matter.
  • Customization: you can build your own variants on top of a base model (with a custom system prompt, for example), or adjust parameters like temperature and context length.
  • Free experimentation: you can try 20 different models without your credit card crying.
  • Local integration: Ollama exposes a REST API on localhost:11434, letting you connect your own code, scripts, or apps directly.

The Most Notable Models

Ollama has a huge catalog of models. Here are the ones worth knowing:

| Model | Parameters | What it’s good for |
| --- | --- | --- |
| Llama 4 | 109B (Scout) / 400B (Maverick) | Meta’s newest. Multimodal (text + image), MoE architecture. Scout: up to 10M context, Maverick: up to 1M |
| Llama 3.3 | 70B | Near-405B Llama 3.1 performance but much lighter. 128K context |
| Llama 3.1 | 8B / 70B / 405B | The most downloaded model on Ollama (112M+ pulls). Excellent for general conversation, code, and reasoning |
| DeepSeek-R1 | 1.5B - 671B | Second most popular (81M+ pulls). Specialized in deep reasoning and complex tasks |
| Mistral | 7B / 12B (Nemo) / 24B (Small) | Fast and efficient. Great balance between quality and resources |
| Gemma 3 | 1B / 4B / 12B / 27B | By Google. Includes vision capabilities and up to 128K context |
| Phi-4 | 14B | By Microsoft. Ideal for reasoning tasks with moderate resources |
| Qwen 3 | 0.6B - 235B | By Alibaba. Very good at multiple languages and up to 256K context |
| CodeLlama | 7B / 13B / 34B / 70B | Specialized in code generation and explanation |

Tip
You can browse the full catalog at ollama.com/library. Each model comes in different sizes (versions with different parameter counts), letting you choose based on your machine’s resources.

Why Are We Using a Small Model?

Let’s be honest here. Running a 70 billion parameter model on your laptop isn’t the same as running it in an NVIDIA datacenter with $40,000 GPUs. Even if you have a solid machine (in my case, a MacBook Pro with M4 Max chip and 36GB of RAM), we need to be realistic about what we can run smoothly.

Small models (7B-8B parameters) are the ideal choice for local use because:

  • Fast responses: they generate text almost instantly on modern hardware.
  • Low memory usage: a 7B model in the 4-bit quantized form Ollama typically ships by default uses ~4-5GB of RAM, leaving room for the rest of your apps.
  • Surprising quality: models like Llama 3.1 8B or Mistral 7B are impressively good for their size. They can write code, explain concepts, translate, summarize, and much more.
  • Quick iteration: being lightweight, you can try different prompts and configurations without waiting forever.
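
To see where that ~4-5GB figure comes from, here is a back-of-the-envelope sketch. It assumes 4-bit quantized weights and a 20% allowance for the context cache and runtime buffers. Both numbers are my own rough assumptions, not values reported by Ollama, and real usage shifts with the quantization scheme and context length.

```python
def estimate_model_memory_gb(
    params_billions: float,
    bits_per_weight: float = 4.0,  # assumption: 4-bit quantized weights
    overhead: float = 0.20,        # assumption: rough allowance for cache and buffers
) -> float:
    """Rule-of-thumb RAM estimate for a quantized model, in GB (1 GB = 1e9 bytes)."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9


if __name__ == "__main__":
    for size in (7, 8, 70):
        print(f"{size}B parameters: ~{estimate_model_memory_gb(size):.1f} GB")
```

Under these assumptions a 7B model lands around 4.2 GB, while a 70B model needs roughly 42 GB, which is why it doesn't fit comfortably on a 36GB machine.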

For this tutorial, we’ll use Llama 3.1 with 8B parameters, which is probably the best general-purpose model you can run locally today.


Installing Ollama

macOS

The easiest way is with the official script or Homebrew:

# Option 1: Official script (also works on macOS)
curl -fsSL https://ollama.com/install.sh | sh

# Option 2: With Homebrew
brew install ollama

# Option 3: Direct download from ollama.com/download

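That’s it. The desktop app keeps Ollama running in the background; if you installed only the command-line tool through Homebrew, start the server yourself with ollama serve.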

Linux

Just as easy on Linux with the official script:

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama and configures it as a system service. If you’d rather not use curl | sh (totally valid, it’s good to be cautious), you can manually download the binary from GitHub.

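The script normally starts that service for you. If the server isn’t running (for example, after a manual install), start it yourself: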

# Start Ollama
ollama serve

Windows

For Windows, you have three options:

Option 1: Native installer

Download the installer from ollama.com/download and run it. It installs like any other Windows application.

Option 2: PowerShell (one-liner)

Open PowerShell as administrator and run:

irm https://ollama.com/install.ps1 | iex

Option 3: WSL2 (recommended for developers)

If you already use WSL2 (Windows Subsystem for Linux), you can install Ollama inside your Linux distribution exactly as in the previous section:

curl -fsSL https://ollama.com/install.sh | sh
Note
If you have an NVIDIA GPU, Ollama will detect it automatically. For AMD GPUs, support is improving but may require additional configuration.
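
Whichever platform you're on, a quick way to confirm the install worked is to check that something answers on Ollama's default address, localhost:11434. The function name below is my own, and the check only confirms that an HTTP server responds there, not which version is running. Treat it as a minimal sketch.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar errors.
        return False


if __name__ == "__main__":
    print("Ollama is up" if is_ollama_running() else "Ollama is not reachable")
```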

Downloading and Running Your First Model

Once Ollama is installed, downloading and running a model is a single command:

ollama run llama3.1

The first time, Ollama will download the model (~4.9GB for the 8B version). After that, it loads from your local disk. You’ll see something like:

pulling manifest
pulling 8eeb52dfb3bb... 100% ▕████████████████████████████████▏ 4.9 GB
pulling 948af2743fc7... 100% ▕████████████████████████████████▏ 1.5 KB
pulling 0ba8f0e314b4... 100% ▕████████████████████████████████▏   12 KB
pulling 56bb8bd477a5... 100% ▕████████████████████████████████▏   96 B
pulling 1e0dd4591101... 100% ▕████████████████████████████████▏  485 B
verifying sha256 digest
writing manifest
success
>>>

And you’re inside the interactive chat. That easy.


Essential Commands

Before we start chatting, here are the commands you’ll use all the time:

# See which models you have downloaded
ollama list

# Download a model without running it
ollama pull mistral

# Run a model (downloads it if you don't have it)
ollama run llama3.1

# Delete a model you no longer need
ollama rm mistral

# See which models are currently running
ollama ps

# View model information
ollama show llama3.1
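
If you'd rather check your downloaded models from code, the local API has an endpoint, /api/tags, that returns the same information as ollama list. The helper names below are my own. The sketch relies only on the "models", "name", and "size" fields of the response and uses the standard library.

```python
import json
import urllib.request


def fetch_local_models(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags, the HTTP counterpart of `ollama list`."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


def summarize_models(payload: dict) -> list[str]:
    """Turn the /api/tags payload into 'name  size' lines, largest model first."""
    models = sorted(payload.get("models", []), key=lambda m: m.get("size", 0), reverse=True)
    return [f"{m['name']:<30}{m.get('size', 0) / 1e9:.1f} GB" for m in models]
```

With the server running, print("\n".join(summarize_models(fetch_local_models()))) gives you a compact listing.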

Hands-on Example: Your First Prompt

Let’s get to the good stuff. With the model running, you can do pretty much anything. Here’s a real example to show you the power of a local LLM:

>>> I'm a backend developer and I need a Python function that validates
... whether an email is valid using regex. Give me the code with explanatory
... comments and a usage example.

And Llama 3.1 will reply with something like:

import re

def validate_email(email: str) -> bool:
    """
    Validates if a string has a valid email format.

    Uses a regular expression that checks:
    - Local part: letters, numbers, dots, underscores, percent signs, plus signs, hyphens
    - @ symbol
    - Domain: letters, numbers, dots, hyphens
    - TLD: at least 2 characters
    """
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

# Usage examples
test_emails = [
    "user@example.com",            # valid
    "first.last@example.org",      # valid
    "invalid@",                    # invalid
    "@nouser.com",                 # invalid
    "user+tag@example.co",         # valid
]

for email in test_emails:
    result = "valid" if validate_email(email) else "invalid"
    print(f"{email:35}{result}")

Not bad for something running 100% on your machine, right?


Using Ollama from the Terminal (Without Interactive Chat)

You can also use Ollama directly from the command line without entering interactive mode. Super useful for scripts:

# Direct prompt
ollama run llama3.1 "Explain what a REST API is in 3 lines"

# Combined with pipes
echo "Translate this to Spanish: Hello world, this is a test" | ollama run llama3.1
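
The same one-shot mode is easy to call from a Python script. This is a minimal sketch that shells out to the ollama run command shown above; the function names and the 120-second timeout are my own choices. Passing the arguments as a list means the prompt needs no shell quoting.

```python
import subprocess


def build_command(model: str, prompt: str) -> list[str]:
    """Argument list for a one-shot, non-interactive `ollama run`."""
    return ["ollama", "run", model, prompt]


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Run the prompt once and return the model's reply as text."""
    completed = subprocess.run(
        build_command(model, prompt),
        capture_output=True,
        text=True,
        check=True,       # raise CalledProcessError on a non-zero exit code
        timeout=timeout,
    )
    return completed.stdout.strip()
```

For example, ask("llama3.1", "Explain what a REST API is in 3 lines") returns the answer as a string you can post-process.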

The Local REST API

This is where things get really interesting. Ollama exposes a REST API on http://localhost:11434 that you can use from any programming language:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "What is Docker in one sentence?",
  "stream": false
}'
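
Here is roughly the same request from Python, using only the standard library. The function names are mine. The sketch assumes the non-streaming response carries the generated text in a "response" field, and it passes settings such as temperature or num_ctx (context length) under "options".

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_generate_request(prompt: str, model: str = "llama3.1", **options) -> urllib.request.Request:
    """Build the POST request for /api/generate with streaming disabled."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if options:
        # Sampling and runtime settings (temperature, num_ctx, ...) go under "options".
        body["options"] = options
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1", **options) -> str:
    """Send the prompt and return the generated text."""
    request = build_generate_request(prompt, model, **options)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["response"]
```

With the server running, generate("What is Docker in one sentence?", temperature=0.2) returns the model's answer as a plain string.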

Ollama also has a dedicated chat endpoint at /api/chat, which is more natural for conversations with message history:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    { "role": "user", "content": "What is Docker in one sentence?" }
  ],
  "stream": false
}'
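
The chat endpoint is stateless: the model only "remembers" what you send in messages, so your code has to keep the history. Below is a small sketch of that idea. The Conversation class and post_chat helper are hypothetical names of my own, and the code assumes the non-streaming reply contains a "message" object with "role" and "content".

```python
import json
import urllib.request
from typing import Callable

Message = dict[str, str]


def post_chat(messages: list[Message], model: str = "llama3.1") -> Message:
    """POST the full history to /api/chat and return the assistant message."""
    body = json.dumps({"model": model, "messages": messages, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["message"]


class Conversation:
    """Keeps the message history so each turn sees the previous ones."""

    def __init__(self, send: Callable[[list[Message]], Message] = post_chat) -> None:
        self.messages: list[Message] = []
        self._send = send

    def ask(self, text: str) -> str:
        """Append the user turn, send the whole history, and store the reply."""
        self.messages.append({"role": "user", "content": text})
        reply = self._send(self.messages)
        self.messages.append({"role": reply["role"], "content": reply["content"]})
        return reply["content"]
```

Calling ask() twice on the same Conversation sends both turns the second time, so a follow-up like "Give me an example" has the context it needs. The send parameter exists so you can swap in a fake function when testing without a server.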

This API is compatible with the OpenAI format (at http://localhost:11434/v1/), which means many tools and libraries you already use with GPT-4 work directly with Ollama just by changing the base URL. It also has experimental compatibility with the Anthropic format (at /api/anthropic/), although this integration is more recent and may have limitations with streaming and tool calling.
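
To make the "just change the base URL" point concrete, here is a sketch that sends an OpenAI-style chat completion request to the local server with the standard library. The function names are my own. It assumes the usual OpenAI response shape, where the text lives at choices[0].message.content. Official OpenAI client libraries generally also ask for an API key; as far as I know Ollama ignores it, so any placeholder works there.

```python
import json
import urllib.request


def build_chat_completion_request(
    prompt: str,
    model: str = "llama3.1",
    base_url: str = "http://localhost:11434/v1",
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at base_url."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(payload: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response body."""
    return payload["choices"][0]["message"]["content"]


def chat_completion(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply."""
    request = build_chat_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return extract_reply(json.load(response))
```

Only base_url ties this code to Ollama; the request body and response parsing follow the OpenAI format.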


What’s Next?

You now have Ollama running on your machine with a powerful model ready to answer whatever you need. But this is just the beginning.

In the next post, we’re taking this to the next level: we’re going to build a Python application that uses this LLM locally. We’ll create something practical that you can adapt to your own projects, connecting to the Ollama API to build a real tool — not just a terminal chat.

If you’re interested in AI applied to development, don’t miss it.


Useful Resources