Ollama + Python and Node.js: Build Your Own Local AI Chat

Updated March 2026: This article has been revised with the latest library versions: Ollama Python SDK v0.6.1, Ollama JS SDK v0.6.3, FastAPI v0.115, and Tailwind CSS v4.
Before We Start
In the previous post we installed Ollama and learned how to run an LLM on our machine. If you haven’t read it yet, I recommend starting there — you’ll have everything ready in 10 minutes.
Today we’re taking the next step: building a ChatGPT-like web application that connects to your local LLM. We’ll do it in two flavors:
- Python with FastAPI + the official Ollama library
- Node.js with Express + the official Ollama library
Both versions use Tailwind CSS v4 for the frontend and stream responses in real time over Server-Sent Events (SSE), so you can watch each response being generated token by token, just like ChatGPT.
Prerequisites
Before we begin, make sure you have installed:
| Tool | Minimum Version | Purpose |
|---|---|---|
| Ollama | v0.18+ | Run the local LLM |
| Python | 3.13+ | Python demo |
| Node.js | 22+ | Node.js demo |
| Git | 2.x | Clone the repos |
And you need to have at least one model downloaded in Ollama:
# If you don't have it yet
ollama run llama3.1
Then open http://localhost:11434 in your browser. You should see “Ollama is running”.
App Architecture
Before diving into the code, let’s see how this works:
- The user types a message in the browser.
- The backend receives the message and sends it to Ollama via its local API.
- Ollama generates the response token by token (streaming).
- The backend relays each token to the browser using Server-Sent Events (SSE).
- The browser displays each token in real time, like ChatGPT.
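The relay step in the list above can be sketched in a few lines of Python. This is an illustration only: `format_sse` and `relay_tokens` are hypothetical helper names, and the token list stands in for what Ollama would generate. The frame layout (`data: ...` followed by a blank line, with a final `[DONE]` marker) matches what both demos send.

```python
import json
from typing import Iterable, Iterator


def format_sse(payload: dict) -> str:
    """Serialize one payload as a Server-Sent Event frame."""
    # An SSE frame is a "data: ..." line terminated by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


def relay_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Turn a stream of model tokens into a stream of SSE frames."""
    for token in tokens:
        yield format_sse({"content": token})
    # Sentinel frame so the browser knows the answer is complete.
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    # Hypothetical tokens standing in for what Ollama would generate.
    for frame in relay_tokens(["Hello", ",", " world"]):
        print(repr(frame))
```

Because `relay_tokens` is a generator, each frame can be written to the HTTP response as soon as its token exists, which is what makes the token-by-token effect possible.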
Demo 1: Python + FastAPI
Clone the Repository
git clone https://github.com/pescarcena/blog-pescarcena-code.git
cd blog-pescarcena-code/ollama-local-app-01
Project Structure
ollama-local-app-01/
├── app.py              # FastAPI backend
├── requirements.txt    # Python dependencies
├── Dockerfile          # For running with Docker
├── templates/
│   └── index.html      # Frontend with Tailwind v4
└── static/             # Static files
Install Dependencies and Run
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run the app
uvicorn app:app --reload
The app will be available at http://localhost:8000.
The Backend: app.py
Let’s look at the key parts of the code. The heart is the /api/chat endpoint:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from ollama import chat
import json

app = FastAPI()


class ChatRequest(BaseModel):
    message: str
    model: str = "llama3.1"
    history: list = []


@app.post("/api/chat")
async def chat_endpoint(req: ChatRequest):
    # Rebuild the conversation so far, then append the new user message.
    messages = [{"role": m["role"], "content": m["content"]} for m in req.history]
    messages.append({"role": "user", "content": req.message})

    def generate():
        # stream=True makes chat() return an iterator of partial responses.
        stream = chat(
            model=req.model,
            messages=messages,
            stream=True,
        )
        for chunk in stream:
            content = chunk["message"]["content"]
            # Each SSE frame is a "data:" line followed by a blank line.
            yield f"data: {json.dumps({'content': content})}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

What’s happening here?
- ChatRequest: receives the user’s message, the model to use, and the conversation history.
- chat(stream=True): tells Ollama to send the response token by token instead of waiting for it to finish.
- StreamingResponse: FastAPI relays each token to the browser as a Server-Sent Event.
- history: maintains conversation context so the model “remembers” what was discussed before.
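The model keeps no state between requests, so the “memory” lives entirely in the history the client sends back each time. A minimal sketch of that bookkeeping, assuming the same role/content message shape used in the endpoint; `build_messages` and `record_turn` are hypothetical helper names, not functions from the demo repo:

```python
from typing import TypedDict


class Message(TypedDict):
    role: str
    content: str


def build_messages(history: list[Message], message: str) -> list[Message]:
    """Copy the prior turns and append the new user message."""
    messages = [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})
    return messages


def record_turn(history: list[Message], user_message: str, assistant_reply: str) -> list[Message]:
    """Return a new history that includes the turn that just finished."""
    return history + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_reply},
    ]


if __name__ == "__main__":
    history = record_turn([], "My name is Ana.", "Nice to meet you, Ana!")
    # The follow-up question is sent together with the earlier turns.
    for m in build_messages(history, "What is my name?"):
        print(m["role"], "->", m["content"])
```

If the history were dropped, the second request would contain only the follow-up question and the model would have nothing to refer back to.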
We also have an endpoint to list available models:
@app.get("/api/models")
async def list_models():
    import ollama
    models = ollama.list()
    return {"models": [m.model for m in models.models]}
The Frontend: Tailwind CSS v4
The frontend is a single HTML file that uses Tailwind v4 via CDN. This makes it super lightweight — no build step, no webpack, nothing:
<head>
  <script src="https://cdn.jsdelivr.net/npm/@tailwindcss/browser@4"></script>
</head>
The streaming magic on the frontend comes from reading the response as a stream:
// message, history, modelSelect, fullResponse, and aiBubble are defined elsewhere in the page script.
const res = await fetch('/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ message, model: modelSelect.value, history })
});
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // A chunk can end mid-line, so keep the unfinished tail for the next read.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop();
  for (const line of lines) {
    if (line.startsWith('data: ') && line !== 'data: [DONE]') {
      const data = JSON.parse(line.slice(6));
      fullResponse += data.content;
      aiBubble.textContent = fullResponse;
    }
  }
}
This reads data chunks as they arrive and updates the DOM in real time. The result is a smooth experience where you see words appearing one by one.
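One detail worth keeping in mind when reading a stream by hand: network chunks are not guaranteed to line up with SSE event boundaries, so a single read may deliver half an event or several events at once. The sketch below shows the buffering idea in Python (to match the backend; the same logic applies in the browser). `SSEParser` is a hypothetical helper, not part of the demo code.

```python
import json


class SSEParser:
    """Accumulate raw text chunks and emit the content of complete SSE events."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> list[str]:
        """Add a chunk; return the content of every event completed by it."""
        self._buffer += chunk
        contents = []
        # Events end with a blank line; whatever follows the last one is partial.
        *complete, self._buffer = self._buffer.split("\n\n")
        for event in complete:
            if not event.startswith("data: "):
                continue
            data = event[len("data: "):]
            if data == "[DONE]":
                continue
            contents.append(json.loads(data)["content"])
        return contents


if __name__ == "__main__":
    parser = SSEParser()
    # The first chunk stops in the middle of a JSON payload.
    print(parser.feed('data: {"content": "Hel'))
    print(parser.feed('lo"}\n\ndata: {"content": " world"}\n\n'))
```

The incomplete tail stays in the buffer until the rest of the event arrives, so the JSON parser never sees a truncated payload.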
Demo 2: Node.js + Express
Clone the Repository
git clone https://github.com/pescarcena/blog-pescarcena-code.git
cd blog-pescarcena-code/ollama-local-app-02
Project Structure
ollama-local-app-02/
├── server.js       # Express backend
├── package.json    # Node.js dependencies
├── Dockerfile      # For running with Docker
└── public/
    └── index.html  # Frontend with Tailwind v4
Install Dependencies and Run
# Install dependencies
npm install
# Run the app
npm start
# Or in dev mode (auto-reload)
npm run dev
The app will be available at http://localhost:3000.
The Backend: server.js
The Node.js version is very similar in structure. Here’s the main endpoint:
import express from 'express';
import { Ollama } from 'ollama';

const app = express();
const ollama = new Ollama();

// Parse JSON request bodies so req.body is populated.
app.use(express.json());

app.post('/api/chat', async (req, res) => {
  const { message, model = 'llama3.1', history = [] } = req.body;
  const messages = history.map(m => ({ role: m.role, content: m.content }));
  messages.push({ role: 'user', content: message });

  // Headers that mark the response as a Server-Sent Events stream.
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await ollama.chat({ model, messages, stream: true });
  for await (const chunk of stream) {
    const content = chunk.message.content;
    res.write(`data: ${JSON.stringify({ content })}\n\n`);
  }
  res.write('data: [DONE]\n\n');
  res.end();
});

The ollama library API for Node.js is practically identical to Python’s:
- ollama.chat({ model, messages, stream: true }) returns an async iterable.
- We use for await...of to iterate over chunks.
- Each chunk is sent to the browser as SSE.
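For comparison, here is roughly what the same loop looks like with Python's `async for`, the counterpart of `for await...of`. This is a sketch under stated assumptions: `fake_chat_stream` is a hypothetical stand-in that mimics the chunk shape used in this article, so the example runs without Ollama; `relay` is likewise an illustrative name.

```python
import asyncio
import json
from typing import AsyncIterator


async def fake_chat_stream() -> AsyncIterator[dict]:
    """Hypothetical stand-in for a streaming chat call; yields chunk-like dicts."""
    for token in ["Local", " models", " rock"]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"message": {"content": token}}


async def relay(stream: AsyncIterator[dict]) -> list[str]:
    """Consume an async stream and collect the SSE frames it would produce."""
    frames = []
    async for chunk in stream:  # Python's counterpart to `for await...of`
        content = chunk["message"]["content"]
        frames.append(f"data: {json.dumps({'content': content})}\n\n")
    frames.append("data: [DONE]\n\n")
    return frames


if __name__ == "__main__":
    for frame in asyncio.run(relay(fake_chat_stream())):
        print(repr(frame))
```

In a real server the frames would be written to the response as they are produced instead of being collected in a list; the list is only there to keep the sketch easy to inspect.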
The Frontend
The frontend works the same way as in the Python demo (same HTML/JS). The only visual difference is the theme color: indigo for Python and emerald for Node.js, so you can easily tell them apart.
Running with Docker
Both projects include a Dockerfile so you can run them without installing local dependencies.
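Inside a container, localhost refers to the container itself, so the app has to be told where Ollama lives. How the demo images handle this is not shown in this article; the sketch below assumes an `OLLAMA_HOST` environment variable and a hypothetical helper, `resolve_ollama_host`, that falls back to the default local address when the variable is unset.

```python
from collections.abc import Mapping

DEFAULT_HOST = "http://localhost:11434"  # Ollama's default local address


def resolve_ollama_host(env: Mapping[str, str]) -> str:
    """Pick the Ollama base URL from the environment, falling back to localhost."""
    host = env.get("OLLAMA_HOST", "").strip()
    if not host:
        return DEFAULT_HOST
    # Accept values with or without an explicit scheme.
    if not host.startswith(("http://", "https://")):
        host = f"http://{host}"
    return host.rstrip("/")


if __name__ == "__main__":
    import os

    # Outside Docker this typically prints the default address.
    print(resolve_ollama_host(os.environ))
    # Assumed container setup: the variable points at the Docker host.
    print(resolve_ollama_host({"OLLAMA_HOST": "host.docker.internal:11434"}))
```

The resolved URL could then be passed to the SDK when constructing a client, for example `ollama.Client(host=...)` in Python.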
Python
cd ollama-local-app-01
# Build the image
docker build -t ollama-chat-python .
# Run the container
docker run -p 8000:8000 --add-host=host.docker.internal:host-gateway ollama-chat-python
Node.js
cd ollama-local-app-02
# Build the image
docker build -t ollama-chat-nodejs .
# Run the container
docker run -p 3000:3000 --add-host=host.docker.internal:host-gateway ollama-chat-nodejs
The --add-host=host.docker.internal:host-gateway flag is required on Linux so the container can access Ollama running on your host. On macOS and Windows with Docker Desktop, host.docker.internal already works automatically.
How Would This Work on Kubernetes?
If you’re thinking “hey, can I put this in a Kubernetes cluster?”, the answer is absolutely yes. In fact, it’s a very interesting use case.
Imagine this architecture:
On Kubernetes you could:
- Scale the chat app horizontally with HPA based on demand.
- Dedicate GPU nodes to run Ollama and serve the model.
- Use an internal Service so the chat app communicates with Ollama without exposing the LLM externally.
- Configure resource limits so the model doesn’t eat up all the node’s memory.
- Implement health checks to automatically restart if Ollama stops responding.
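As a rough idea of what the health check in the last item might test, here is a sketch that looks for the “Ollama is running” banner on the root endpoint. It is an assumption-laden illustration: `is_ollama_healthy` and `fetch_text` are hypothetical names, the service URL in the usage line is made up, and in a real cluster the check would more likely be declared as a probe in the manifest than written in application code.

```python
import urllib.error
import urllib.request
from typing import Callable


def fetch_text(url: str, timeout: float = 2.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def is_ollama_healthy(base_url: str, fetch: Callable[[str], str] = fetch_text) -> bool:
    """Return True when the Ollama root endpoint answers with its banner."""
    try:
        return "Ollama is running" in fetch(base_url)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar errors.
        return False


if __name__ == "__main__":
    # Hypothetical in-cluster service name; adjust to your own setup.
    print(is_ollama_healthy("http://ollama:11434"))
```

The `fetch` parameter exists so the check can be exercised without a running server, by passing in a function that returns a canned response.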
In an upcoming post, we’ll tackle exactly this: how to deploy an LLM with Ollama on Kubernetes, including GPU configuration, manifests, and best practices for running models in a cluster.
Summary
| | Python (FastAPI) | Node.js (Express) |
|---|---|---|
| Repository | ollama-local-app-01 | ollama-local-app-02 |
| Ollama SDK | Python v0.6.1 | JS v0.6.3 |
| Port | 8000 | 3000 |
| Streaming | StreamingResponse + SSE | res.write() + SSE |
| Frontend | Tailwind v4 CDN | Tailwind v4 CDN |
Both versions do exactly the same thing: they give you a working chat with your local LLM, with real-time streaming and model selector. Choose the one that best fits your stack.