# Local AI Setup

## Overview
Run LLM inference on the laptop (powerful GPU, not always-on), expose it to the home network, and optionally access it via Open WebUI on the always-on Ubuntu VM.
| Component | Where | Notes |
|---|---|---|
| LM Studio (inference) | HP EliteBook laptop | Radeon 890M via Vulkan — full GPU offload confirmed |
| Ollama (alternative) | HP EliteBook laptop | Works, but LM Studio has better AMD GPU support on Windows |
| Open WebUI (UI) | Ubuntu VM Docker | Points at laptop's inference instance |
## LM Studio on Laptop (Windows) - Recommended
LM Studio is more reliable than Ollama for AMD GPU acceleration on Windows. Ollama on Windows defaulted to CPU on the Radeon 890M (very slow responses); LM Studio offloaded all model layers via Vulkan.
Confirmed: a 9B-parameter model ran with all layers offloaded, ~5GB resident in GPU memory.
Thinking mode (Qwen 3.5+): leave thinking mode off by default. Toggle it via the button in LM Studio's chat UI, or use a `/think` prefix for complex prompts. Leaving it on caused multi-minute reasoning loops in casual use.
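Since the prefix travels with the prompt itself, it should work over the API as well as in the chat UI. A minimal sketch against LM Studio's OpenAI-compatible server, with the model ID as a placeholder for whatever is loaded:

```bash
# /think at the start of the message requests reasoning for this prompt only
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<loaded-model-id>",
    "messages": [
      {"role": "user", "content": "/think Plan a safe migration of Home Assistant from a VM to a container."}
    ]
  }'
```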
### Recommended models
| Model | Size | Notes |
|---|---|---|
| Qwen 3.5 9B | ~6GB | Strong all-rounder, full GPU offload |
| Qwen 2.5 14B | ~9GB | Best coding + reasoning in the 14B range |
| Gemma 3 12B | ~8GB | Google model, efficient |
| Gemma 4 26B A4B | ~8GB active | MoE — only 4B parameters active at a time, very efficient |
| Phi-4 | ~9GB | Microsoft, MIT licensed, great quality/size ratio |
| Qwen 2.5 32B | ~20GB | Noticeably more capable, but needs roughly 20GB of GPU-accessible memory |
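As a rough sanity check before downloading a model, weight memory is about parameters × bits-per-weight ÷ 8; KV cache and runtime overhead come on top. A back-of-envelope sketch, assuming a Q4-style quant at ~4.8 bits/weight (both figures are assumptions, not LM Studio output):

```bash
# 9B parameters at ~4.8 bits/weight ≈ 5.4 GB of weights, in line with the ~5GB observed above
awk 'BEGIN { params_b = 9; bits_per_weight = 4.8; printf "%.1f GB\n", params_b * bits_per_weight / 8 }'
```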
## Ollama on Laptop (Windows) - Alternative
Install: download the .exe from ollama.com/download — installs as a background service.
Verify the install: `ollama list` in PowerShell.
Check GPU acceleration: run `ollama ps` while a model is loaded; the PROCESSOR column should say GPU, not CPU. AMD GPU support on Windows uses Vulkan rather than ROCm.
### Expose to network
By default Ollama only listens on localhost. To allow other devices to reach it, launch the server with `OLLAMA_HOST` set (Command Prompt syntax below; in PowerShell, use `$env:OLLAMA_HOST = "0.0.0.0"` instead of `set`):

```cmd
set OLLAMA_HOST=0.0.0.0 && ollama serve
```
Or set `OLLAMA_HOST=0.0.0.0` as a permanent Windows environment variable so it persists across restarts, as shown below.
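One way to make it permanent is `setx`, which writes a user-level environment variable; it only applies to newly started processes, so restart Ollama afterwards:

```cmd
setx OLLAMA_HOST 0.0.0.0
```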
Other devices reach Ollama at: http://<laptop-ip>:11434
Laptop must be awake and Ollama running. Windows restarts will stop the service until relaunched.
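To confirm reachability from another device, query the tags endpoint, which lists installed models:

```bash
# Should return JSON naming the same models as `ollama list`
curl http://<laptop-ip>:11434/api/tags
```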
## Open WebUI on Ubuntu VM
ChatGPT-style browser interface with web search, file uploads, and model switching.
```bash
# Container listens on 8080 internally, so publish 3000:8080 (not 3000:80)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```
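To confirm the container came up (using the `--name` from the command above):

```bash
docker ps --filter name=open-webui   # should show the container as Up
docker logs open-webui               # startup log, useful if the UI doesn't load
```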
Access at http://<ubuntu-vm-ip>:3000. Under Settings → Connections, point Open WebUI at the backend: set the Ollama API URL to http://<laptop-ip>:11434, or for LM Studio add an OpenAI-compatible connection with base URL http://<laptop-ip>:1234/v1 (port 1234 is LM Studio's default; the /v1 suffix matters).
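Before wiring up LM Studio, it's worth confirming its server is running and reachable from the VM (the local server has to be started in LM Studio, and set to serve on the network rather than localhost only):

```bash
# Lists the models LM Studio exposes via its OpenAI-compatible API
curl http://<laptop-ip>:1234/v1/models
```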
Enable the web search toggle in Open WebUI settings to partially address the training cutoff limitation.
## Caveats
- No internet access by default: local models only know their training data. Paste in relevant docs (Proxmox wiki, UniFi docs, etc.) or use Open WebUI's web search toggle.
- AMD on Windows: Vulkan is used instead of ROCm — works well on Radeon 890M via LM Studio. Ollama's AMD support on Windows is less reliable.
- Quality ceiling: 14B models are good for focused tasks (code, HA automations, doc Q&A) but not a full replacement for frontier models on complex reasoning.
- Always-on limitation: inference depends on the laptop being awake. Not suitable for HA automations that need to run overnight.