Ollama + Python and Node.js: Build Your Own Local AI Chat

2026-01-12

Updated March 2026: This article has been revised with the latest library versions: Ollama Python SDK v0.6.1, Ollama JS SDK v0.6.3, FastAPI v0.115, and Tailwind CSS v4.

Before We Start

In the previous post we installed Ollama and learned how to run an LLM on our machine. If you haven’t read it yet, I recommend starting there — you’ll have everything ready in 10 minutes.

Today we’re taking the next step: building a ChatGPT-like web application that connects to your local LLM. We’ll do it in two flavors:

Python with FastAPI + the official Ollama library
Node.js with Express + the official Ollama library

Both versions use Tailwind CSS v4 for the frontend and stream responses in real time over Server-Sent Events, so you can watch each answer being generated token by token, just like ChatGPT.

Prerequisites

Before we begin, make sure you have installed:

Tool	Minimum Version	Purpose
Ollama	v0.18+	Run the local LLM
Python	3.13+	Python demo
Node.js	22+	Node.js demo
Git	2.x	Clone the repos

And you need to have at least one model downloaded in Ollama:

# If you don't have it yet, download it
# (ollama run llama3.1 also downloads it, then opens an interactive prompt)
ollama pull llama3.1

Tip

You can verify Ollama is running by visiting http://localhost:11434 in your browser. You should see “Ollama is running”.
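
The same check can be scripted. Here is a minimal sketch using only the Python standard library. The function names `looks_like_ollama` and `is_ollama_running` are illustrative (they are not part of any Ollama SDK), and the default URL is simply the address from the tip above.

```python
import urllib.request


def looks_like_ollama(body: str) -> bool:
    """Return True if the body contains the banner Ollama serves at its root URL."""
    return "Ollama is running" in body


def is_ollama_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Fetch the root URL and look for the banner; connection problems count as 'not running'."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return looks_like_ollama(resp.read().decode("utf-8", errors="replace"))
    except OSError:
        # urllib's URLError/HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Calling `is_ollama_running()` before starting either demo gives you a quick yes/no answer without opening a browser.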

App Architecture

Before diving into the code, let’s see how this works:

The user types a message in the browser.
The backend receives the message and sends it to Ollama via its local API.
Ollama generates the response token by token (streaming).
The backend relays each token to the browser using Server-Sent Events (SSE).
The browser displays each token in real time, like ChatGPT.
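
Steps 3 to 5 can be sketched without any server at all. The snippet below is an illustration only: `fake_model_tokens` stands in for Ollama, and `to_sse` and `render` are hypothetical helper names. The `data: ...` framing is the standard SSE line format, while the `[DONE]` marker is a convention of this app rather than part of SSE itself.

```python
import json
from typing import Iterable, Iterator


def fake_model_tokens() -> Iterator[str]:
    """Stand-in for step 3: a real backend would iterate over Ollama's streamed chunks."""
    yield from ["Hello", ",", " world", "!"]


def to_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Step 4: wrap each token in an SSE 'data:' frame, then send an end-of-stream marker."""
    for token in tokens:
        yield f"data: {json.dumps({'content': token})}\n\n"
    yield "data: [DONE]\n\n"


def render(frames: Iterable[str]) -> str:
    """Step 5: what the browser does, decoding each frame and appending its text."""
    shown = ""
    for frame in frames:
        payload = frame.removeprefix("data: ").strip()
        if payload == "[DONE]":
            break
        shown += json.loads(payload)["content"]
    return shown


print(render(to_sse(fake_model_tokens())))  # prints: Hello, world!
```

Each token is JSON-encoded before it is framed, so newlines or quotes inside a token cannot break the `data:` line.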

Demo 1: Python + FastAPI

Clone the Repository

git clone https://github.com/pescarcena/blog-pescarcena-code.git
cd blog-pescarcena-code/ollama-local-app-01

Project Structure

ollama-local-app-01/
├── app.py                 # FastAPI backend
├── requirements.txt       # Python dependencies
├── Dockerfile             # For running with Docker
├── templates/
│   └── index.html         # Frontend with Tailwind v4
└── static/                # Static files

Install Dependencies and Run

# Create virtual environment
python3 -m venv venv
source venv/bin/activate   # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the app
uvicorn app:app --reload

The app will be available at http://localhost:8000.

The Backend: `app.py`

Let’s look at the key parts of the code. The heart is the /api/chat endpoint:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from ollama import chat
import json

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    model: str = "llama3.1"
    history: list = []

@app.post("/api/chat")
async def chat_endpoint(req: ChatRequest):
    messages = [{"role": m["role"], "content": m["content"]} for m in req.history]
    messages.append({"role": "user", "content": req.message})

    def generate():
        stream = chat(
            model=req.model,
            messages=messages,
            stream=True,
        )
        for chunk in stream:
            content = chunk["message"]["content"]
            yield f"data: {json.dumps({'content': content})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

What’s happening here?

ChatRequest: receives the user’s message, the model to use, and the conversation history.
chat(stream=True): tells Ollama to send the response token by token instead of waiting for it to finish.
StreamingResponse: FastAPI relays each token to the browser as a Server-Sent Event.
history: maintains conversation context so the model “remembers” what was discussed before.
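
Because the backend keeps no state between requests, the browser has to send the whole conversation back every time. The sketch below shows how that list grows over two turns. Note that `build_messages`, the role whitelist, and the `max_turns` trimming are assumptions added for illustration; the app shown above forwards the history as-is.

```python
from typing import TypedDict


class Message(TypedDict):
    role: str
    content: str


# Assumption: only these roles are accepted from the client.
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_messages(history: list[Message], message: str, max_turns: int = 20) -> list[Message]:
    """Rebuild the list sent to the model: validated, trimmed history plus the new user message."""
    cleaned = [
        {"role": m["role"], "content": m["content"]}
        for m in history
        if m.get("role") in ALLOWED_ROLES and isinstance(m.get("content"), str)
    ]
    # Keep only the most recent entries so the prompt does not grow without bound.
    recent = cleaned[-max_turns:] if max_turns > 0 else []
    return recent + [{"role": "user", "content": message}]


# Turn 1: empty history, so the model sees only the new question.
history: list[Message] = []
first = build_messages(history, "What is Ollama?")

# The client stores the question and the reply, then sends both back on turn 2.
history = first + [{"role": "assistant", "content": "A tool for running LLMs locally."}]
second = build_messages(history, "Does it need a GPU?")
print([m["role"] for m in second])  # prints: ['user', 'assistant', 'user']
```

Trimming is one simple way to keep long chats within the model's context window; a real app might summarize older turns instead.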

We also have an endpoint to list available models:

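import ollama

@app.get("/api/models")
def list_models():
    # Plain def: FastAPI runs it in a worker thread, so the blocking
    # ollama.list() call does not stall the event loop.
    models = ollama.list()
    return {"models": [m.model for m in models.models]}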

The Frontend: Tailwind CSS v4

The frontend is a single HTML file that uses Tailwind v4 via CDN. This makes it super lightweight — no build step, no webpack, nothing:

<head>
    <script src="https://cdn.jsdelivr.net/npm/@tailwindcss/browser@4"></script>
</head>

Note

The Tailwind CDN is perfect for demos and prototypes. For production, you’d use the Tailwind CLI or PostCSS.

The streaming magic on the frontend comes from reading the response as a stream:

const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message, model: modelSelect.value, history })
});

const reader = res.body.getReader();
const decoder = new TextDecoder();
let fullResponse = '';   // text accumulated so far
let buffer = '';         // holds a partial line until the rest arrives

while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop();   // the last element may be an incomplete line

    for (const line of lines) {
        if (line.startsWith('data: ') && line !== 'data: [DONE]') {
            const data = JSON.parse(line.slice(6));
            fullResponse += data.content;
            aiBubble.textContent = fullResponse;
        }
    }
}

This reads data chunks as they arrive and updates the DOM in real time. The result is a smooth experience where you see words appearing one by one.
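
One detail worth knowing: network chunks do not necessarily line up with SSE frames. A single read may contain several frames, or end in the middle of a line or even in the middle of a multi-byte character. A robust reader therefore buffers incomplete data until the rest arrives. Here is a sketch of that idea in Python, mirroring the browser logic; the class name `SSEAccumulator` is hypothetical.

```python
import codecs
import json


class SSEAccumulator:
    """Collects 'content' fields from an SSE byte stream that may be split at arbitrary points."""

    def __init__(self) -> None:
        # The incremental decoder holds partial multi-byte UTF-8 sequences between chunks.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""
        self.text = ""
        self.done = False

    def feed(self, chunk: bytes) -> None:
        """Add one network chunk and process every line that it completes."""
        self._buffer += self._decoder.decode(chunk)
        # Everything before the last newline is complete; the remainder stays buffered.
        *lines, self._buffer = self._buffer.split("\n")
        for line in lines:
            if not line.startswith("data: "):
                continue  # blank separator lines and non-data fields
            payload = line[len("data: "):]
            if payload == "[DONE]":
                self.done = True
            else:
                self.text += json.loads(payload)["content"]


raw = 'data: {"content": "café"}\n\ndata: {"content": " ☕"}\n\ndata: [DONE]\n\n'.encode("utf-8")
acc = SSEAccumulator()
for i in range(0, len(raw), 5):  # deliberately awkward 5-byte chunks
    acc.feed(raw[i:i + 5])
print(acc.text, acc.done)  # prints: café ☕ True
```

Without the buffer, a frame cut in half would reach `json.loads` as invalid JSON and the stream would fail partway through a response.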

Demo 2: Node.js + Express

Clone the Repository

git clone https://github.com/pescarcena/blog-pescarcena-code.git
cd blog-pescarcena-code/ollama-local-app-02

Project Structure

ollama-local-app-02/
├── server.js              # Express backend
├── package.json           # Node.js dependencies
├── Dockerfile             # For running with Docker
└── public/
    └── index.html         # Frontend with Tailwind v4

Install Dependencies and Run

# Install dependencies
npm install

# Run the app
npm start

# Or in dev mode (auto-reload)
npm run dev

The app will be available at http://localhost:3000.

The Backend: `server.js`

The Node.js version is very similar in structure. Here’s the main endpoint:

import express from 'express';
import { Ollama } from 'ollama';

const app = express();
app.use(express.json());   // parse JSON request bodies so req.body is populated
const ollama = new Ollama();

app.post('/api/chat', async (req, res) => {
    const { message, model = 'llama3.1', history = [] } = req.body;

    const messages = history.map(m => ({ role: m.role, content: m.content }));
    messages.push({ role: 'user', content: message });

    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Connection', 'keep-alive');

    const stream = await ollama.chat({ model, messages, stream: true });

    for await (const chunk of stream) {
        const content = chunk.message.content;
        res.write(`data: ${JSON.stringify({ content })}\n\n`);
    }

    res.write('data: [DONE]\n\n');
    res.end();
});

The ollama library API for Node.js is practically identical to Python’s:

ollama.chat({ model, messages, stream: true }) returns an async iterable.
We use for await...of to iterate over chunks.
Each chunk is sent to the browser as SSE.

The Frontend

The frontend is identical in functionality to Python’s (same HTML/JS). The only visual difference is the theme color: indigo for Python and emerald for Node.js, so you can easily tell them apart.

Running with Docker

Both projects include a Dockerfile so you can run them without installing local dependencies.

Python

cd ollama-local-app-01

# Build the image
docker build -t ollama-chat-python .

# Run the container
docker run -p 8000:8000 --add-host=host.docker.internal:host-gateway ollama-chat-python

Node.js

cd ollama-local-app-02

# Build the image
docker build -t ollama-chat-nodejs .

# Run the container
docker run -p 3000:3000 --add-host=host.docker.internal:host-gateway ollama-chat-nodejs

Important

The --add-host=host.docker.internal:host-gateway flag is required on Linux so the container can access Ollama running on your host. On macOS and Windows with Docker Desktop, host.docker.internal already works automatically.

How Would This Work on Kubernetes?

If you’re thinking “hey, can I put this in a Kubernetes cluster?”, the answer is absolutely yes. In fact, it’s a very interesting use case.

On Kubernetes you could:

Scale the chat app horizontally with HPA based on demand.
Dedicate GPU nodes to run Ollama and serve the model.
Use an internal Service so the chat app communicates with Ollama without exposing the LLM externally.
Configure resource limits so the model doesn’t eat up all the node’s memory.
Implement health checks to automatically restart if Ollama stops responding.

In an upcoming post, we’ll tackle exactly this: how to deploy an LLM with Ollama on Kubernetes, including GPU configuration, manifests, and best practices for running models in a cluster.

Summary

	Python (FastAPI)	Node.js (Express)
Repository	ollama-local-app-01	ollama-local-app-02
Ollama SDK	Python v0.6.1	JS v0.6.3
Port	8000	3000
Streaming	`StreamingResponse` + SSE	`res.write()` + SSE
Frontend	Tailwind v4 CDN	Tailwind v4 CDN

Both versions do exactly the same thing: they give you a working chat with your local LLM, with real-time streaming and a model selector. Choose the one that best fits your stack.
