Running LLMs Locally with Ollama: A Practical, Hands-On Guide


Versatile Software Engineer with 2+ years of experience in full-stack development, specializing in JavaScript frameworks (React JS, Vue JS, Svelte), Node.js and Python. Proven track record of delivering scalable, user-centric web applications in Agile environments.

For a long time, running Large Language Models felt like something reserved for big companies with deep pockets and cloud budgets. If you wanted to experiment seriously, you almost always ended up sending prompts to an external API and hoping for the best.

That’s changed.

With tools like Ollama, running modern LLMs locally is no longer a novelty; it’s a genuinely practical option. You get better privacy, tighter feedback loops during development, and far more control over how your models behave.

In this post, I’ll walk through how to get started with Ollama, run your first model, and integrate it into a real application. No theory-heavy detours, just the things you actually need to know to be productive.

What Ollama Actually Does (and Why It’s Useful)

At its core, Ollama is a local runtime for large language models. It handles the annoying parts (model downloads, quantized weights, GPU acceleration) and exposes a simple interface you can use from the command line or over HTTP.

What makes Ollama especially nice is how little ceremony it requires. You don’t need to manage Python environments, juggle model weights manually, or think too hard about hardware detection. You install it, run a model, and start experimenting.

That simplicity is why Ollama has become a popular choice for:

  • local experimentation,

  • internal tools,

  • privacy-sensitive applications,

  • and prototyping RAG systems without cloud dependencies.

Installing Ollama

Getting Ollama running takes only a minute.

On macOS and Linux:

curl -fsSL https://ollama.com/install.sh | sh

On Windows, you can download the official installer from the Ollama website.

Once installed, Ollama runs as a background service, so you usually don’t need to start it manually. If the server isn’t running, you can start it yourself with ollama serve.

You can confirm everything is working with:

ollama --version
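
The version check only confirms that the CLI is installed. To check that the background server is also answering, a small script can probe it over HTTP. This is a minimal sketch that assumes the server listens on its default address, http://localhost:11434; the function name is_ollama_running is made up for this example.

```python
import urllib.error
import urllib.request

# Assumption: the Ollama server is on its default port.
OLLAMA_URL = "http://localhost:11434"


def is_ollama_running(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    # Skip any configured HTTP proxy: the server lives on this machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False
```

Calling is_ollama_running() before sending prompts gives a clearer failure than a connection error halfway through a script.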

Running Your First Model

This is where Ollama really shines. To start a model, you just run:

ollama run llama3.2:1b

If the model isn’t already on your machine, Ollama downloads it for you and starts an interactive session in your terminal. From there, you can chat with the model immediately.

You’re not limited to Llama either. Switching models is as simple as changing the name:

ollama run mistral 
ollama run gemma3 
ollama run phi4

That ease of switching makes it incredibly simple to compare models and decide which one fits your use case best.
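
The comparison can also be scripted. Here is a rough sketch, assuming the ollama Python package (covered in the next section) is installed and the models you list have already been pulled. The helper names compare_models and _ollama_generate are invented for illustration, and the timing is plain wall-clock time, so the first call to each model also includes its load time.

```python
import time
from typing import Callable, Dict, List


def _ollama_generate(model: str, prompt: str) -> str:
    """Call a local model through the `ollama` Python package."""
    import ollama  # imported lazily so compare_models can be tested without it

    return ollama.generate(model=model, prompt=prompt)["response"]


def compare_models(
    models: List[str],
    prompt: str,
    generate: Callable[[str, str], str] = _ollama_generate,
) -> Dict[str, dict]:
    """Run the same prompt through each model; record the answer and wall time."""
    results = {}
    for name in models:
        start = time.perf_counter()
        answer = generate(name, prompt)
        results[name] = {
            "seconds": time.perf_counter() - start,
            "answer": answer,
        }
    return results
```

A call such as compare_models(["llama3.2:1b", "mistral"], "Explain recursion in one sentence.") returns one entry per model, which is enough to eyeball quality against speed on your own hardware.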

Using Ollama from Python

Once you move past quick experiments, you’ll probably want to integrate Ollama into an actual project. Fortunately, it exposes a local HTTP API (on port 11434 by default), and there’s a small Python client, installed with pip install ollama, that makes things straightforward.

Here’s a minimal example for text generation:

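import ollama

# The model must already be available locally, e.g. via `ollama pull llama3`.
response = ollama.generate(
    model="llama3",
    prompt="Explain quantum computing in simple terms"
)

# The generated text is stored under the "response" key.
print(response["response"])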

If you’re building conversational experiences, the chat API feels very natural:

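import ollama

# Earlier turns are passed back in so the model has the conversation context.
messages = [
    {"role": "user", "content": "Who won the World Cup in 2022?"},
    {"role": "assistant", "content": "Argentina won the FIFA World Cup in 2022."},
    {"role": "user", "content": "Who was the top scorer?"}
]

response = ollama.chat(
    model="llama3",
    messages=messages
)

# The reply to the latest user turn is returned as an assistant message.
print(response["message"]["content"])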

Because everything runs locally, there is no network round trip and no rate limit to worry about. How fast responses arrive depends on your hardware and the size of the model.
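
The Python client is a convenience layer over the local HTTP API, so the same request can be made with nothing but the standard library. This is a minimal sketch, assuming the server is on its default port (11434) and using the /api/generate endpoint with streaming turned off; the helper names build_generate_request and generate are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default port


def build_generate_request(model: str, prompt: str,
                           base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("llama3", "Explain quantum computing in simple terms") should return the same kind of text as the client example above. Going through HTTP directly is mostly useful from languages or environments where the Python package isn’t available.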

Streaming Responses for a Better User Experience

If you’ve ever built a chat interface, you know how important streaming is. Waiting for a full response before showing anything just feels slow. Ollama supports streaming out of the box:

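import ollama

# With stream=True the call returns an iterator of partial responses.
for chunk in ollama.generate(
    model="llama3",
    prompt="Write a short story about a robot learning to paint",
    stream=True,
):
    # Print each fragment as it arrives, without a trailing newline.
    print(chunk["response"], end="", flush=True)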

This small change makes a huge difference in perceived performance, especially in web or desktop UIs.
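
To put a number on that perceived difference, you can time how long the first chunk takes compared with the full response. The sketch below is one way to do it; stream_with_timing is a hypothetical helper that accepts any iterable of chunks shaped like the ones ollama.generate yields with stream=True.

```python
import time
from typing import Iterable, Tuple


def stream_with_timing(chunks: Iterable[dict]) -> Tuple[str, float, float]:
    """Print chunks as they arrive.

    Returns (full_text, seconds_to_first_chunk, total_seconds).
    """
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in chunks:
        if first is None:
            first = time.perf_counter() - start
        piece = chunk["response"]
        parts.append(piece)
        print(piece, end="", flush=True)
    total = time.perf_counter() - start
    # For an empty stream, report the total as the time to first chunk.
    return "".join(parts), (first if first is not None else total), total
```

Passing it the streaming call directly, as in stream_with_timing(ollama.generate(model="llama3", prompt="...", stream=True)), gives you both numbers. The gap between them is roughly the wait a user would sit through without streaming.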


A Simple Local Chatbot with Streamlit

To make things more concrete, let’s put Ollama behind a simple UI.

Using Streamlit, you can build a local chatbot in just a few lines of code:

import streamlit as st
import ollama

st.title("Local LLM Chatbot")

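# List locally available models. Recent versions of the Python client expose
# the model identifier under "model" rather than "name".
models = ollama.list()
model_names = [m["model"] for m in models["models"]]
selected_model = st.selectbox("Choose a model", model_names)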

if "messages" not in st.session_state:
    st.session_state.messages = []

for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

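if prompt := st.chat_input("Ask something"):
    st.session_state.messages.append(
        {"role": "user", "content": prompt}
    )
    # Show the new user message right away; the history loop above has already run.
    with st.chat_message("user"):
        st.markdown(prompt)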

    with st.chat_message("assistant"):
        placeholder = st.empty()
        response_text = ""

        for chunk in ollama.chat(
            model=selected_model,
            messages=st.session_state.messages,
            stream=True,
        ):
            response_text += chunk["message"]["content"]
            placeholder.markdown(response_text + "▌")

        placeholder.markdown(response_text)

    st.session_state.messages.append(
        {"role": "assistant", "content": response_text}
    )

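This gives you a fully local, streaming chatbot with no cloud services and no API keys. The only dependencies are Ollama itself plus the streamlit and ollama Python packages. Save the script as, say, app.py and start it with streamlit run app.py.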

Custom Models with Modelfiles

As your use case becomes more specific, you may want more control over how the model behaves. That’s where Modelfiles come in.

A Modelfile lets you define a model configuration (base model, system prompt, and parameters) in a single place:

FROM llama3

SYSTEM You are a concise and helpful technical assistant.
PARAMETER temperature 0.7
PARAMETER num_ctx 8192

You can then create and run the model like this:

ollama create mymodel -f Modelfile
ollama run mymodel

This approach works especially well for internal tools, assistants with a fixed personality, or RAG-based systems where consistency matters.
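
If you’d rather not create a named model, the same settings can be supplied per request from Python: the system prompt as a system message and the PARAMETER lines through the client’s options argument. This is a small sketch under that assumption; build_chat_kwargs is a made-up helper that only assembles the arguments.

```python
from typing import Any, Dict


def build_chat_kwargs(model: str, system: str, user: str,
                      temperature: float = 0.7,
                      num_ctx: int = 8192) -> Dict[str, Any]:
    """Assemble keyword arguments for ollama.chat that mirror the Modelfile above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        # `options` carries the same settings as the PARAMETER lines.
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }
```

You would then call ollama.chat(**build_chat_kwargs("llama3", "You are a concise and helpful technical assistant.", "What is a Modelfile?")). A Modelfile is the better fit when several tools should share one configuration; per-request options are handier while you are still tuning values.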

Performance Tips from Real Use

Running LLMs locally is very doable, but a few practical tips go a long way:

  • Use GPU acceleration if you have it; Ollama enables it automatically, and ollama ps shows whether a loaded model is running on the GPU or the CPU.

  • Smaller models are often good enough; don’t jump to the biggest one by default.

  • Quantized models can dramatically reduce memory usage, usually with only a modest quality loss; 4-bit weights need roughly a quarter of the memory of 16-bit weights.

  • Be mindful of context length; larger contexts cost more RAM and VRAM than you might expect.
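
The last two tips are easy to sanity-check with back-of-the-envelope arithmetic. The sketch below estimates weight size at a given precision and the size of the key/value cache that grows with context length. The layer and head counts are assumptions for an 8B Llama-3-style model (32 layers, 8 KV heads, head size 128), and the cache is assumed to hold 16-bit values. Real usage will be higher because of runtime overhead, and practical 4-bit formats store slightly more than 4 bits per weight.

```python
def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights at a given precision."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


GIB = 1024 ** 3

# Assumed shape of an 8B Llama-3-style model: 32 layers, 8 KV heads, head size 128.
print(kv_cache_bytes(32, 8, 128, context_len=8192) / GIB)   # 1.0
print(kv_cache_bytes(32, 8, 128, context_len=32768) / GIB)  # 4.0
print(weights_bytes(8e9, 16) / 1e9)                         # 16.0 (GB at 16-bit)
print(weights_bytes(8e9, 4) / 1e9)                          # 4.0 (GB at 4-bit)
```

Under these assumptions, quadrupling the context from 8K to 32K tokens adds about 3 GiB of cache, roughly three quarters of what the 4-bit weights themselves take.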

Ollama has made local LLMs genuinely accessible. What used to require a lot of setup and guesswork is now something you can try in minutes. If you care about privacy, fast iteration, or simply understanding how these models behave under the hood, running them locally is worth your time. And with Ollama, the barrier to entry is low enough that there’s really no excuse not to experiment.

If you’ve been relying entirely on hosted APIs, give local models a try; you might be surprised by how far they’ve come.
