# Local AI Setup

## Overview
Run LLM inference on the laptop (powerful GPU, not always-on), expose it to the home network, and optionally access it via Open WebUI on the always-on Ubuntu VM.
| Component | Where | Notes |
|---|---|---|
| LM Studio (inference) | HP EliteBook laptop | Radeon 890M via Vulkan — full GPU offload confirmed |
| Ollama (alternative) | HP EliteBook laptop | Works, but LM Studio has better AMD GPU support on Windows |
| Open WebUI (UI) | Ubuntu VM Docker | Points at laptop's inference instance |
## LM Studio on Laptop (Windows) - Recommended
LM Studio is more reliable than Ollama for AMD GPU acceleration on Windows. Ollama on Windows defaulted to CPU on the Radeon 890M (very slow responses); LM Studio offloaded all model layers via Vulkan.
Confirmed: a 9B-parameter model ran with all layers offloaded, ~5GB resident in GPU memory.
Thinking mode (Qwen 3.5+): leave thinking mode off by default. Toggle it via the button in LM Studio's chat UI, or use a `/think` prefix for complex prompts. Leaving it on caused multi-minute reasoning loops in casual use.
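Since the prefix travels with the prompt itself, it should work over the API as well as in the chat UI. A minimal sketch against LM Studio's OpenAI-compatible server, with the model ID as a placeholder for whatever is loaded:

```bash
# /think at the start of the message requests reasoning for this prompt only
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<loaded-model-id>",
    "messages": [
      {"role": "user", "content": "/think Plan a safe migration of Home Assistant from a VM to a container."}
    ]
  }'
```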
### Recommended models
| Model | Size | Notes |
|---|---|---|
| Qwen 3.5 9B | ~6GB | Strong all-rounder, full GPU offload |
| Qwen 2.5 14B | ~9GB | Best coding + reasoning in the 14B range |
| Gemma 3 12B | ~8GB | Google model, efficient |
| Gemma 4 26B A4B | ~8GB active | MoE — only 4B parameters active at a time, very efficient |
| Phi-4 | ~9GB | Microsoft, MIT licensed, great quality/size ratio |
| Qwen 2.5 32B | ~20GB | Noticeably more capable, but needs roughly 20GB of GPU-accessible memory |
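As a rough sanity check before downloading a model, weight memory is about parameters × bits-per-weight ÷ 8; KV cache and runtime overhead come on top. A back-of-envelope sketch, assuming a Q4-style quant at ~4.8 bits/weight (both figures are assumptions, not LM Studio output):

```bash
# 9B parameters at ~4.8 bits/weight ≈ 5.4 GB of weights, in line with the ~5GB observed above
awk 'BEGIN { params_b = 9; bits_per_weight = 4.8; printf "%.1f GB\n", params_b * bits_per_weight / 8 }'
```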
## Ollama on Laptop (Windows) - Alternative
Install: download the .exe from ollama.com/download — installs as a background service.
Verify the install: `ollama list` in PowerShell.
Check GPU acceleration: run `ollama ps` while a model is loaded; the PROCESSOR column should say GPU, not CPU. AMD GPU support on Windows uses Vulkan rather than ROCm.
### Expose to network
By default Ollama only listens on localhost. To allow other devices to reach it, launch the server with `OLLAMA_HOST` set (Command Prompt syntax below; in PowerShell, use `$env:OLLAMA_HOST = "0.0.0.0"` instead of `set`):

```cmd
set OLLAMA_HOST=0.0.0.0 && ollama serve
```
Or set `OLLAMA_HOST=0.0.0.0` as a permanent Windows environment variable so it persists across restarts, as shown below.
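One way to make it permanent is `setx`, which writes a user-level environment variable; it only applies to newly started processes, so restart Ollama afterwards:

```cmd
setx OLLAMA_HOST 0.0.0.0
```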
Other devices reach Ollama at: http://<laptop-ip>:11434
Laptop must be awake and Ollama running. Windows restarts will stop the service until relaunched.
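To confirm reachability from another device, query the tags endpoint, which lists installed models:

```bash
# Should return JSON naming the same models as `ollama list`
curl http://<laptop-ip>:11434/api/tags
```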
## Open WebUI on Ubuntu VM
ChatGPT-style browser interface with web search, file uploads, and model switching.
```bash
# Container listens on 8080 internally, so publish 3000:8080 (not 3000:80)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```
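To confirm the container came up (using the `--name` from the command above):

```bash
docker ps --filter name=open-webui   # should show the container as Up
docker logs open-webui               # startup log, useful if the UI doesn't load
```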
Access at http://<ubuntu-vm-ip>:3000. Under Settings → Connections, point Open WebUI at the backend: set the Ollama API URL to http://<laptop-ip>:11434, or for LM Studio add an OpenAI-compatible connection with base URL http://<laptop-ip>:1234/v1 (port 1234 is LM Studio's default; the /v1 suffix matters).
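Before wiring up LM Studio, it's worth confirming its server is running and reachable from the VM (the local server has to be started in LM Studio, and set to serve on the network rather than localhost only):

```bash
# Lists the models LM Studio exposes via its OpenAI-compatible API
curl http://<laptop-ip>:1234/v1/models
```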
Enable the web search toggle in Open WebUI settings to partially address the training cutoff limitation.
## Caveats
- No internet access by default: local models only know their training data. Paste in relevant docs (Proxmox wiki, UniFi docs, etc.) or use Open WebUI's web search toggle.
- AMD on Windows: Vulkan is used instead of ROCm — works well on Radeon 890M via LM Studio. Ollama's AMD support on Windows is less reliable.
- Quality ceiling: 14B models are good for focused tasks (code, HA automations, doc Q&A) but not a full replacement for frontier models on complex reasoning.
- Always-on limitation: inference depends on the laptop being awake. Not suitable for HA automations that need to run overnight.