AI Hardware

Setting Up a Local LLM Workstation: Step-by-Step

From bare metal to running your first 70B model locally — the complete setup guide with real commands and zero fluff.

12 min read · April 2025

Running LLMs locally gives you privacy, speed, and zero per-token costs. Getting there requires navigating driver installations, quantization formats, memory constraints, and model selection. This guide compresses what took most developers a weekend of trial and error into a single readable document.

Setting up a local LLM workstation properly takes a few hours — and then runs indefinitely at near-zero cost.

Step 1: Hardware Prerequisites

Before installing anything, confirm your hardware meets the minimum requirements. As a rough guide: an NVIDIA GPU with 8–12GB of VRAM runs 7B–8B models comfortably at 4-bit quantization, a 24GB card (RTX 3090/4090 class) handles models up to roughly 34B, and 70B models at 4-bit need about 40–48GB of VRAM, typically two 24GB cards or a single workstation GPU. You'll also want 32–64GB of system RAM and fast NVMe storage, since model files run tens of gigabytes.
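As a quick sanity check (a sketch of ours, not part of the official setup), you can list each GPU and its total VRAM before deciding which models are in reach:

```python
import shutil
import subprocess

def gpu_inventory() -> str:
    """Return GPU names and total VRAM, or a hint if no NVIDIA driver is present."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found - install the NVIDIA driver first"
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

print(gpu_inventory())
```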

Step 2: NVIDIA Driver and CUDA Setup (Linux)

# Add NVIDIA package repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install CUDA toolkit (12.4 as of 2025)
sudo apt-get install -y cuda-toolkit-12-4

# Verify installation
nvidia-smi
nvcc --version
Proper CUDA setup is the foundation — everything else builds on top of it.

Step 3: Install Ollama (Easiest Path)

Ollama is the fastest way to get a model running locally. One command installs a local model server with an OpenAI-compatible API:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3.1 8B (fastest; ~4.7GB download)
ollama run llama3.1

# Pull Llama 3.1 70B (best quality, ~40GB needed)
ollama pull llama3.1:70b

# List downloaded models
ollama list
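Before wiring anything else up, it's worth confirming the server actually answers. A small sketch (the function name is ours) against Ollama's native `/api/tags` endpoint, which lists downloaded models:

```python
import json
import urllib.request
from urllib.error import URLError

def list_local_models(host: str = "http://localhost:11434"):
    """Query Ollama's /api/tags endpoint; return model names, or an error string."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (URLError, OSError) as exc:
        return f"Ollama server not reachable: {exc}"

print(list_local_models())
```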

Step 4: API Access

Ollama exposes an OpenAI-compatible API at localhost:11434. You can drop it into any existing OpenAI SDK code by just changing the base URL:

# Python example
from openai import OpenAI

client = OpenAI(
  base_url="http://localhost:11434/v1",
  api_key="ollama", # any string works
)

response = client.chat.completions.create(
  model="llama3.1:70b",
  messages=[{"role": "user", "content": "Hello"}]
)
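Under the hood, the OpenAI-compatible endpoint wraps Ollama's native REST API at `/api/chat`. A standard-library-only sketch, no `openai` package needed (the helper names are ours; the payload shape follows Ollama's API docs):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "llama3.1:70b") -> dict:
    """Build a single-turn, non-streaming request body for Ollama's /api/chat."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a chunk stream
    }

def chat(prompt: str, host: str = "http://localhost:11434") -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```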

Step 5: Performance Tuning

Default Ollama settings leave performance on the table. Key tuning options:

# Extend the context window per model (Ollama's default num_ctx is 2048)
cat > Modelfile <<'EOF'
FROM llama3.1:70b
PARAMETER num_ctx 8192
EOF
ollama create llama3.1-8k -f Modelfile

# Keep all layers on the GPU (999 offloads every layer that fits)
# add to the Modelfile: PARAMETER num_gpu 999

# Parallel requests (increase when serving multiple users)
OLLAMA_NUM_PARALLEL=4 ollama serve

# Note: if Ollama was installed via the script, it runs as a systemd service,
# so set environment variables with `sudo systemctl edit ollama` instead
Tuning GPU memory allocation and context window size can double effective inference speed.
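The context window isn't free: the KV cache grows linearly with context length. A back-of-the-envelope helper (ours), plugged with Llama 3.1 8B's published architecture (32 transformer layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x context x element size."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Llama 3.1 8B at an 8192-token context, fp16 cache
gib = kv_cache_bytes(32, 8, 128, 8192) / 2**30
print(f"{gib:.2f} GiB")  # prints 1.00 GiB
```

So quadrupling the context from 2048 to 8192 adds roughly 0.75 GiB of VRAM pressure for this model before weights and activations are counted.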

Choosing the Right Quantization

Quantization reduces model size and speeds up inference at the cost of a small quality loss. The practical tradeoffs, roughly:

- Q8_0 — ~8.5 bits/weight; near-lossless, largest files
- Q5_K_M — ~5.7 bits/weight; very close to Q8 quality, noticeably smaller
- Q4_K_M — ~4.9 bits/weight; the usual sweet spot between size and quality
- Q2_K / Q3_K — aggressive; visible quality degradation, only when VRAM is tight

Rule of thumb: Use the largest model that fits in your VRAM at Q4_K_M before dropping to a smaller model at Q8. A 70B Q4_K_M outperforms a 13B Q8 by a significant margin on most reasoning tasks.
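To make that rule of thumb concrete, here's a rough size estimate for the quantized weights alone (the helper and the ~4.85 bits/weight figure for Q4_K_M are our approximations; real files add metadata, and inference adds the KV cache on top):

```python
def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# 70B at Q4_K_M (~4.85 bits/weight) lands near the ~40GB the guide mentions
print(f"{weights_gib(70e9, 4.85):.1f} GiB")  # prints 39.5 GiB

# 8B at Q8_0 (~8.5 bits/weight) is far smaller, but the 70B still wins on quality
print(f"{weights_gib(8e9, 8.5):.1f} GiB")  # prints 7.9 GiB
```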

Need Help Setting Up Your AI Infrastructure?

We set up production-grade local LLM infrastructure for development teams. Start with a free consultation.

Talk to the Team
Devin Mallonee

Founder & AI Agent Architect · CodeStaff

Devin has been building software products and remote teams since 2017. He founded CodeStaff to deploy purpose-built AI agents and workstations that replace repetitive work and scale operations for businesses of every size. He writes about AI strategy, agent architecture, and the practical reality of deploying AI in production.