
Demystifying running a language model locally

Last updated: May 6, 2025
***

Today everyone is suddenly an expert in AI, and I'm no exception. Even though I had some academic knowledge in the field, I was never motivated or curious enough to dive into it before it became mainstream. Well, better late than never, and today I'm going to explore the topic of running language models (and basically any model) locally.

Let's go.

# The big "WHY?"

Nowadays there are many vendors that provide LLM inference as a service: OpenAI, Google, Anthropic, etc. So it's only natural to ask the big question: "Why would I want to run a model locally?" Before diving into the technicalities, let's first discuss the reasons why it may be beneficial.


# Reason 1: Privacy

Any time you use a cloud-based service, you have to trust the vendor to handle your data correctly, and LLM SaaS providers are no exception. Furthermore, it would be foolish of them not to log what users type in, primarily for the following reasons:

  • To improve model accuracy over time and to fine-tune the models
  • To keep track of user needs (IMO, the time when LLM providers start selling ads is not far away)
  • To detect and block harmful content

With all that being said, users should be aware that anything they type in may be stored for further use by the vendor. This poses a problem for organizations that are not willing to share their data, primarily those in the following categories:

  • Banking sector
  • Insurance companies
  • Health care and hospitals
  • Government/military structures

The only way for such organizations to avoid this is to build their own LLM infrastructure.

# Reason 2: Price

The price is typically quoted per million tokens, and for personal use it doesn't feel like a big deal. As of May 2025, the price of 1M input/output tokens is:

However, for large organizations, the expenses may add up quickly. Even for simple use cases, like a website chatbot, the expenses may soon go as high as $1000/month and above.
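
To put a rough (and entirely hypothetical) number on it: a chatbot handling 3,000 conversations a day, each consuming about 2,000 input and 500 output tokens, burns through roughly 180M input and 45M output tokens a month; at, say, $2.50 per 1M input tokens and $10 per 1M output tokens, that's about $450 + $450 = $900 per month, before retries and system prompts are even counted.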

Cost is something that should be considered carefully, especially for startups.

# Reason 3: Censorship

When it comes to LLMs, censorship is a tricky topic. Certainly, common sense, people's dignity and legal regulations should prevail, but sometimes a certain degree of freedom is desirable. Most vendors apply restrictions on what a model can accept and produce, be it memes featuring celebrities or offensive language. Some models are forbidden to speculate about politics; DeepSeek is a well-known example.

To avoid unwanted restrictions, it's possible to run debiased models locally.

# Reason 4: Model customization

What if you want a model that speaks in a certain way? Or a model that can provide information about a certain topic? There are techniques such as RAG, agentic workflows and so on, but there is one thing that not every provider offers: model fine-tuning.

Fine-tuning is the process of further training a pre-trained model on a specific dataset. It's a powerful tool that allows you to adapt a model to your needs.
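
To make this less abstract, here is a minimal sketch of what parameter-efficient fine-tuning (LoRA) looks like with the Hugging Face peft library; the base model and hyperparameters are purely illustrative, not a recommendation:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Load a base model to fine-tune (illustrative choice)
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA trains small adapter matrices instead of all 7B parameters
lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the weights

# From here, train on your domain-specific dataset with transformers' Trainer
# (or trl's SFTTrainer) and export/merge the adapters when done.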

# Reasons why you shouldn't

There are certain reasons why you may consider refraining from running a model locally:

  • You don't have powerful hardware or a fast internet connection on premises
  • You can't find a good open source model that matches your needs

The first problem can be solved by renting GPU power from a cloud provider, but then you are back to the problems of privacy and cost. The second problem is more complex and requires engineering effort.

# Getting ready

This time I'm going to run a model on a local machine, as if a company had its own data center and wanted to run LLMs for internal use.

# Choosing a model

To find a model that suits your needs, Hugging Face is a great place to start, as it hosts a huge number of models along with good documentation. It's basically a catalog of models you can run on premises.

Just go to the Models section and search for a model that matches your needs by browsing through the categories, licenses, etc. Good examples are DeepSeek-R1, Llama-3.1-8B or Gemma-2-9B.

Another prominent example is Perplexity's R1 model, a fine-tuned version of DeepSeek R1 with the censorship removed.

When choosing a model there are a few things to consider:

  • The number of parameters, i.e. the size of the model. A model can have 2B, 70B, 175B parameters or more. Typically, the more the merrier, but keep in mind that in order to run, the model needs to fit into memory completely. Models with fewer parameters are typically faster, but they tend to hallucinate more. The rough correlation between the number of parameters and the memory the model needs is the following (a quick way to estimate it yourself is sketched right after this list):
    • 1.5B => ~1GB
    • 7B => ~4.5GB
    • 14B => ~9GB
    • 32B => ~20GB
    • 650B => ~400GB
  • Context window, i.e. the maximum number of tokens that the model can handle in a single request. The larger the context window, the more information the model can use to generate a response.
  • Model specialty, i.e. the type of the model. There are models that are specialized in certain tasks, like translation, summarization, etc.
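
These figures are easy to ballpark yourself: memory ≈ number of parameters × bits per parameter (which depends on the quantization), plus some overhead for the runtime and the context. A minimal sketch, where the ~4.5 bits per parameter and the 20% overhead are rough assumptions for a typical 4-bit quantization:

def estimate_model_memory_gb(params_billions: float,
                             bits_per_param: float = 4.5,
                             overhead: float = 1.2) -> float:
    """Rough RAM/VRAM estimate: parameters * bits per parameter, plus ~20% overhead."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total * overhead / 1024**3

print(f"{estimate_model_memory_gb(7):.1f} GB")    # ~4.4 GB, close to the 7B figure above
print(f"{estimate_model_memory_gb(32):.1f} GB")   # ~20 GB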

# Inference: CPU vs GPU

A model is just a bunch of numbers; it's pretty useless by itself. It needs an inference engine to run on.

Now that we know how to choose a model, let's discuss how it's going to be run. Take a look at the hardware at your disposal. There are two options to choose from: CPU-based inference and GPU-based inference (technically speaking, there are also TPUs by Google, but they aren't available at retail).


Currently there are numerous inference engines, each specializing in CPU inference, GPU inference, or both. Here are some of the most popular ones:

| Inference Engine | Vendor / Maintainer | CPU/GPU Support |
| --- | --- | --- |
| ONNX Runtime | Microsoft | CPU and GPU |
| llama.cpp | Community (Georgi Gerganov) | Primarily CPU, some GPU |
| vLLM | UC Berkeley / Community | Primarily GPU |
| DeepSpeed | Microsoft | GPU (NVIDIA, AMD) |
| Hugging Face Transformers | Hugging Face | CPU and GPU |
| TensorRT-LLM | NVIDIA | GPU (NVIDIA) |
| TensorFlow Lite | Google | CPU and GPU (Mobile/Edge) |
| OpenVINO | Intel | CPU and Integrated GPU |
| TVM | Apache Software Foundation | CPU and GPU (Customizable) |

What about projects like Ollama or LM Studio? Well, they are not just inference engines, but rather full-stack LLM frameworks that include a model runner, a model downloader, a model converter, etc. While they are great, they are not relevant for this project, as we only need an inference engine.

🚨 It's important to understand that CPU-based inference is typically less performant, for the following reasons:

  • CPUs are not optimized for tensor operations (matrix and vector multiplications), whereas GPUs are built for exactly this kind of massively parallel math.
  • A CPU-based setup cannot be scaled horizontally as effectively, whereas with GPUs you can have arrays of compute units (say "Hi!" to NVIDIA).

# Service architecture

In a nutshell, a model running in the cloud is nothing more than a microservice with the following functionality:

  • Exposes API for model inference. It must support certain formats and allow streaming of the response.
  • Has a way to serve requests in parallel.
  • Has rate limiting enabled.
  • Can efficiently scale horizontally.
  • Has observability enabled, i.e. it can be monitored and managed.
  • Has a way to update the model with zero downtime.

Let's build one! Of course, it won't have all the features, but it will be a good starting point.

# My choice

At my disposal I have a PC with an AMD Ryzen 7 5800X 8-core CPU, 64GB of RAM and an NVIDIA GeForce RTX 3060 Ti with 8GB of VRAM. I need to keep in mind that besides loading the model itself, I have to reserve some memory for running the inference, the microservice and other processes. I also have a MacBook that I'd prefer to run the frontend on.

With all that being said, my best choice is to go with CPU-based inference and a model with no more than 70B parameters.

The inference engine I chose is llama.cpp, as it supports CPU inference. To make things easier and quicker, I picked the Zephyr 7B model, derived from Mistral-7B-v0.1 and already converted to the GGUF format. Depending on the quantization, the model weighs about 5-8GB.

In theory I could have taken the original Zephyr-7B-beta model and converted it to GGUF format myself, as llama.cpp supports conversion from Hugging Face.
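
For the record, the conversion would look roughly like this (a sketch only: the script name and flags vary between llama.cpp versions, so check the repository's docs):

# inside a cloned llama.cpp repository
pip install -r requirements.txt
huggingface-cli download HuggingFaceH4/zephyr-7b-beta --local-dir ./zephyr-7b-beta
python convert_hf_to_gguf.py ./zephyr-7b-beta --outfile zephyr-7b-beta.Q8_0.gguf --outtype q8_0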

Another good option would be to use Zephyr with the original Hugging Face Transformers library, which natively supports CPU inference as well.
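
For comparison, a minimal Transformers-based variant might look like the sketch below (based on the Zephyr model card; keep in mind that the unquantized weights need noticeably more RAM than the GGUF file, and CPU generation will be slower):

from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a friendly chat bot"},
    {"role": "user", "content": "Hey, are you there?"},
]
# Let the tokenizer build the Zephyr prompt format for us
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
print(outputs[0]["generated_text"])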

On my PC I have Windows 11 installed, and honestly I had no plans to install Linux as a second system, as I would have had to do 10 years ago. Today it's not even needed, as Windows now ships with WSL 2, which runs a Linux kernel right next to it.

So the plan was the following:

  • Download the model, create a simple microservice in Python that exposes the /chat endpoint, and run it on my PC inside WSL.
  • Create a small frontend app in React and run it on my MacBook.
  • Optionally, create a metrics service that will expose information about CPU and memory utilization, and connect it to Prometheus and Grafana.
  • Create a tunnel using Ngrok or Localtunnel to expose the service to the internet.

Architecture

Let's get to work!

# Work

# Set up the PC

First, we need to install WSL 2, if it isn't already installed, and then install Ubuntu 22.04 LTS on it:

wsl --install
wsl --install -d Ubuntu-22.04
wsl --set-default Ubuntu-22.04

Inside WSL 2 we can do pretty much anything a real Ubuntu allows, and it shares memory and hardware with the host. It's important to note that WSL is not a sandbox, so it's not safe to run potentially harmful code there, as it can affect the host machine.

At this point we need to have Python and the Hugging Face CLI installed.
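
One possible way to get them inside Ubuntu (the package names below are the standard Ubuntu ones; adjust to your setup):

sudo apt update && sudo apt install -y python3 python3-venv python3-pip
pip install -U "huggingface_hub[cli]"
huggingface-cli --help   # sanity check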

On Windows, we should install and configure Ngrok.

# Pull the model

Create a project directory inside WSL 2 and pull the model:

mkdir llm-local
cd llm-local

We can go to the Files and versions section of the model page and pick the variant that fits best: Q8_0, the largest and highest-quality quantization.

Then start downloading the model, sit back, relax and enjoy the flight, as it may take a while (it took me about an hour):

huggingface-cli download TheBloke/zephyr-7B-beta-GGUF zephyr-7b-beta.Q8_0.gguf --local-dir .

# The backend

We set up the environment and install the needed dependencies:

python -m venv .venv
source .venv/bin/activate
pip install llama-cpp-python fastapi pydantic psutil uvicorn
pip freeze > requirements.txt

As I am no expert in Python (yet), I quickly vibecoded a simple service that exposes the /chat endpoint and streams the response back:

from llama_cpp import Llama
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List, Dict, AsyncGenerator
from fastapi.responses import StreamingResponse, PlainTextResponse
import asyncio
from fastapi.middleware.cors import CORSMiddleware
import psutil
import time
import re

# GLOBAL VARIABLES
zephyr_model_path = "./zephyr-7b-beta.Q8_0.gguf"
START_TIME = time.time()

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

zephyr_llm = Llama(
    model_path=zephyr_model_path,
    n_ctx=4096,
    n_threads=8,
)


class ChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    stream: bool = False
    max_tokens: int = 2048


def convert_to_chatml(messages: List[Dict[str, str]]) -> str:
    """
    Convert messages from the OpenAI/Llama-2 format to ChatML format for Zephyr 7B.
    """
    chatml_prompt = ""
    for message in messages:
        role = message["role"]
        content = message["content"]
        if role == "system":
            chatml_prompt += f"<|system|>\n{content}\n</s>\n"
        elif role == "user":
            chatml_prompt += f"<|user|>\n{content}\n</s>\n"
        elif role == "assistant":
            chatml_prompt += f"<|assistant|>\n{content}\n</s>\n"
    # Add the assistant tag at the end to prompt the model to generate
    if not chatml_prompt.endswith("<|assistant|>\n"):
        chatml_prompt += "<|assistant|>\n"
    return chatml_prompt


@app.post("/chat")
async def chat(request: ChatRequest):
    chatml_prompt = convert_to_chatml(request.messages)
    response = zephyr_llm.create_completion(
        prompt=chatml_prompt,
        max_tokens=request.max_tokens,
        stream=True,
        stop=["</s>", "<|user|>"],  # Stop generation at these tokens
    )
    return StreamingResponse(stream_chatml_generator(response), media_type="text/event-stream")


async def stream_chatml_generator(response):
    for chunk in response:
        if "choices" in chunk and len(chunk["choices"]) > 0:
            choice = chunk["choices"][0]
            if "text" in choice:
                content = choice["text"]
                if content:
                    yield f"data: {content}\n\n"
    yield "data: [DONE]\n\n"


@app.get("/")
async def root():
    return {"message": "Zephyr API is running"}
The code is licensed under the MIT license

What's important here:

  • The model is loaded once at startup and then reused for multiple requests.
  • The response is streamed back to the client. The MIME type text/event-stream indicates that the Server-Sent Events protocol is used.
  • We won't store the conversation history on the server side; the client provides it with each request. Normally this isn't good for production, but for this small PoC it's okay. The model will load the entire conversation into the context window and provide a completion after the prompt.
  • The endpoint accepts data in the OpenAI/Llama-2 format (an array of objects with role and content) and converts it into the ChatML format that Zephyr 7B expects, as clearly stated in its readme, before sending it to the model. If this isn't done, the model won't understand the input and will get confused between the user and assistant roles. A sample converted prompt is shown right after this list.
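
For illustration, the system prompt plus a single user message gets converted by convert_to_chatml into a prompt like this:

<|system|>
You are a friendly chat bot
</s>
<|user|>
Hey, are you there?
</s>
<|assistant|>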

# The frontend

The time was almost up, so I vibecoded a simple frontend in React that uses the /chat endpoint to get the response from the model. The most interesting part of the application is the useAIStream hook in the chat.ts file:

./chat.ts
import { useState, useCallback } from 'react';
import { useMutation } from '@tanstack/react-query';

// Define types for messages and API responses
type Role = 'system' | 'user' | 'assistant';

interface Message {
  role: Role;
  content: string;
}

interface AIStreamParams {
  messages: Message[];
}

export function useAIStream() {
  const [streamedContent, setStreamedContent] = useState<string>('');
  const [isComplete, setIsComplete] = useState<boolean>(false);

  // Reset the streamed content
  const resetStream = useCallback(() => {
    setStreamedContent('');
    setIsComplete(false);
  }, []);

  // React Query mutation for handling the streaming request
  const mutation = useMutation({
    mutationFn: async ({ messages }: AIStreamParams) => {
      try {
        // Reset the stream content when starting a new request
        resetStream();

        // Prepare the request body
        const body = JSON.stringify({
          messages,
          stream: true,
        });

        // Make the fetch request
        const response = await fetch(`${process.env.REACT_APP_TUNNEL}chat`, {
          method: 'POST',
          headers: {
            'Content-Type': 'application/json',
            Accept: 'text/event-stream',
          },
          body,
        });

        if (!response.ok) {
          throw new Error(`HTTP error! status: ${response.status}`);
        }

        // Get the response body as a readable stream
        const reader = response.body?.getReader();
        if (!reader) {
          throw new Error('Failed to get stream reader');
        }

        // Process the stream
        const decoder = new TextDecoder();
        let buffer = '';

        while (true) {
          const { done, value } = await reader.read();
          if (done) {
            setIsComplete(true);
            break;
          }

          // Decode the chunk and add it to the buffer
          buffer += decoder.decode(value, { stream: true });

          // Process the buffer for SSE events
          const lines = buffer.split('\n');
          buffer = lines.pop() || ''; // Keep the last incomplete line in buffer

          for (const line of lines) {
            if (line.startsWith('data: ')) {
              try {
                const data = line.slice(6); // Remove 'data: ' prefix
                if (data === '[DONE]') {
                  setIsComplete(true);
                  continue;
                }
                setStreamedContent((prev) => prev + data);
              } catch (e) {
                console.error('Error parsing SSE data:', e);
              }
            }
          }
        }

        return streamedContent;
      } catch (error) {
        console.error('Streaming error:', error);
        throw error;
      }
    },
  });

  // Return the hook API
  return {
    streamContent: streamedContent,
    isStreaming: mutation.isPending,
    isComplete,
    error: mutation.error,
    startStream: mutation.mutate,
    resetStream,
  };
}
The code is licensed under the MIT license

...and the ChatComponent.tsx component:

./_ChatComponent.tsx
'use client';

import React, { useState } from 'react';
import Input from '@mui/joy/Input';
import Button from '@mui/joy/Button';
import { useAIStream } from '../../hooks/chat';

type Message = {
  role: 'system' | 'user' | 'assistant';
  content: string;
};

const ChatComponent = () => {
  const [messages, setMessages] = useState<Message[]>([
    { role: 'system', content: 'You are a friendly chat bot' },
  ]);
  const [input, setInput] = useState('');
  const { streamContent, isStreaming, isComplete, startStream, resetStream } =
    useAIStream();

  // Handle sending a new message
  const handleSendMessage = () => {
    if (!input.trim() || isStreaming) return;

    // Add user message to the messages array
    const updatedMessages = [
      ...messages,
      { role: 'user', content: input },
    ] as Message[];
    console.log('MESSAGES:');
    console.log(updatedMessages);
    setMessages(updatedMessages);

    // Clear input field
    setInput('');

    // Start streaming the AI response
    startStream({ messages: updatedMessages });
  };

  // When streaming is complete, add the assistant's message to the chat history
  React.useEffect(() => {
    if (isComplete && streamContent) {
      setMessages((prev) => [
        ...prev,
        { role: 'assistant', content: streamContent },
      ]);
      resetStream();
    }
  }, [isComplete, streamContent, resetStream]);

  return (
    <div className="chat-container">
      <div className="messages-container">
        {messages.slice(1).map((message, index) => (
          <div
            key={index}
            className={`message ${message.role}`}
            style={{ marginTop: '1rem' }}
          >
            <strong>{message.role}:</strong> {message.content}
          </div>
        ))}
        {/* Display the streaming content */}
        {isStreaming && (
          <div
            className="message assistant streaming"
            style={{ marginTop: '1rem' }}
          >
            <strong>assistant:</strong> {streamContent}
          </div>
        )}
      </div>
      <div
        className="input-container"
        style={{
          display: 'flex',
          justifyContent: 'space-between',
          marginTop: '1rem',
        }}
      >
        <Input
          type="text"
          value={input}
          onChange={(e) => setInput(e.target.value)}
          onKeyDown={(e) => {
            if (e.key === 'Enter' && !isStreaming && input.trim()) {
              e.preventDefault();
              handleSendMessage();
            }
          }}
          placeholder="Type your message..."
          disabled={isStreaming}
          style={{ flexGrow: 1, marginRight: 10 }}
        />
        <Button
          onClick={handleSendMessage}
          disabled={isStreaming || !input.trim()}
        >
          {isStreaming ? 'Thinking...' : 'Send'}
        </Button>
      </div>
    </div>
  );
};

export default ChatComponent;
The code is licensed under the MIT license

Kudos to Claude for the frontend code.

BTW, if the dashboard seems ugly, there are good alternatives to it, such as OpenWebUI.

# Putting it all together

Inside WSL 2, start the backend:

uvicorn backend:app --host 0.0.0.0 --port 8000

On Windows, start the Ngrok tunnel:

ngrok http http://localhost:8000

On the MacBook, paste the Ngrok URL (with a trailing slash, since the frontend appends chat directly to it) into the REACT_APP_TUNNEL environment variable and start the frontend:

npm start

Once the frontend is running, try sending a message to the model: "Hey, are you there?" 👽🛸

# Conclusion and further steps

Now that we understand the basics of running models locally, we can apply this knowledge to running other models on other platforms. For instance, it would be particularly interesting to try running lightweight models on a Raspberry Pi.

Another vector of exploration would be to dive into the world of agentic AI (e.g. using LangGraph) and MCP, but this is a topic for another post.

One way or another, I definitely need to purchase a decent GPU :)

***

The code for this project is available on GitHub. Have fun!

Sergei Gannochenko
Business-focused product engineer,  in ❤️ with amazing products 
AI, Golang/Node, React, TypeScript,  Docker/K8s, AWS/GCP, NextJS 
20+ years in dev