Running a Local AI Agent on Proxmox with AMD ROCm GPU Passthrough

I run a Beelink EQR6 as my homelab server. It’s a compact AMD Ryzen 9 6900HX mini PC with 64GB of DDR5 RAM and a Radeon 680M integrated GPU. No discrete graphics card. Just a small, silent box running Proxmox that handles most of my self-hosted infrastructure.

Recently I decided to push it further. I wanted to run a local AI agent. Not just Ollama in isolation, but a full agentic framework that could draft content, manage tasks, respond to me on Telegram, and generally act like an AI employee I could delegate to.

This is the story of how that went. Spoiler: it worked. But the road had some sharp turns.

What We’re Building

Here’s the stack:

Proxmox VE 9.1.7 - bare metal hypervisor on the EQR6
Debian 12 Bookworm LXC - lightweight container (not a full VM)
AMD ROCm 6.0 - GPU compute framework for the Radeon 680M
Ollama - local LLM inference server
qwen2.5:7b - the language model doing the actual thinking
Hermes Agent - the agentic framework tying it all together

Why Hermes? It’s the fastest growing agent framework on GitHub right now. It has a compounding skills system — the more you use it, the smarter it gets at your specific workflows. Built-in Telegram/messaging gateway, MCP support, persistent memory, and a self-improvement loop. NetworkChuck covered it recently and it’s worth the attention.

Why a container instead of a VM? LXC containers share the host kernel. That means less overhead than a full VM, which matters when your “GPU” is actually integrated into the CPU and shares system memory. Every megabyte counts.

Step 1: Setting Up the Proxmox LXC

Create a Debian 12 Bookworm LXC with these specs:

Cores: 8
RAM: 32GB
Disk: 40GB
Nesting: enabled (required for systemd inside the container)

pct create 400 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
  --hostname hermes-osms \
  --cores 8 \
  --memory 32768 \
  --rootfs local-lvm:40 \
  --features nesting=1

Why 32GB? The Radeon 680M is an integrated GPU. It carves its VRAM directly out of system memory. Running a 7B model means roughly 8GB goes to VRAM, leaving 24GB for the OS, Hermes, and everything else.

Step 2: GPU Passthrough (The Fun Part)

This is what separates GPU-accelerated inference from CPU crawl. We need the container to see the GPU.

The GPU lives on the host machine. By default, containers are sandboxed and cannot touch it. We need to poke specific holes in that sandbox so the container can talk to the GPU driver.

First, identify the GPU device nodes on the Proxmox host:

stat -c "%t %T" /dev/dri/card0       # 226:0
stat -c "%t %T" /dev/dri/renderD128  # 226:128
stat -c "%t %T" /dev/kfd             # 234:0

Then add these lines to /etc/pve/lxc/400.conf on the host:

lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.cgroup2.devices.allow: c 234:0 rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file

cgroup2 is the kernel’s resource control layer. We’re telling it to allow the container to access those specific device numbers with read/write/mknod permissions, then bind-mounting the actual device files into the container.

After restarting the container, verify inside:

ls /dev/dri   # Should show card0 and renderD128
ls /dev/kfd   # Should exist

Step 3: ROCm 6.0 Inside the Container

ROCm is AMD’s compute framework — the equivalent of NVIDIA’s CUDA. Without it, Ollama falls back to CPU inference.

Important: Install userspace libraries only. Do NOT install DKMS kernel modules. The container shares the host kernel and this will cause conflicts.

apt-get install -y rocm-hip-libraries rocm-opencl-runtime --no-install-recommends
usermod -aG render,video root

The systemd-logind Fix

Proxmox LXC drops certain kernel namespaces for isolation. Modern systemd does not like this and crashes into a loop (status=226/NAMESPACE), causing severe SSH latency.

Fix it permanently:

mkdir -p /etc/systemd/system/systemd-logind.service.d/
cat > /etc/systemd/system/systemd-logind.service.d/override.conf << 'EOF'
[Service]
PrivateUsers=no
RestrictNamespaces=no
EOF
systemctl daemon-reload && systemctl restart systemd-logind

Verify ROCm sees the GPU:

rocminfo | grep -A5 "Agent 2"

You should see gfx1030 or gfx1035 (the Radeon 680M identifier).

Step 4: Installing Ollama

curl -fsSL https://ollama.com/install.sh | sh

The installer detects ROCm and configures GPU acceleration automatically. But here’s where we hit the first real gotcha.

The gfx1035 Problem

When you check the Ollama logs, you might see:

error="runner crashed" detail="error: Cannot read TensileLibrary.dat: Illegal seek for GPU arch: gfx1035"

The Radeon 680M identifies itself as gfx1035 to the driver. Ollama’s ROCm library has precompiled GPU kernels for gfx1030 (RDNA2) but not specifically for gfx1035. It crashes looking for code that does not exist.

The fix: tell ROCm to pretend it’s a gfx1030. They’re close enough architecturally that this works fine.

mkdir -p /etc/systemd/system/ollama.service.d/
cat > /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
EOF
systemctl daemon-reload && systemctl restart ollama

Pull and test the model:

ollama pull qwen2.5:7b
ollama run qwen2.5:7b "respond with one word: hello"
ollama ps

You want to see 100% GPU in the PROCESSOR column. That confirms ROCm passthrough is working.

Step 5: Installing Hermes Agent

apt-get install -y python3 python3-pip pipx
pipx install hermes-agent
pipx ensurepath

Configure the provider. Hermes does not have a native local Ollama option in its wizard. Use Custom endpoint:

URL: http://localhost:11434/v1
API mode: Chat Completions
Model: qwen2.5:7b
API key: none required

The Context Window Gotcha

Hermes requires a minimum 64K context window. Ollama defaults qwen2.5:7b to 32K. You will see:

Failed to initialize agent: Model qwen2.5:7b has a context window of 32,768 tokens,
which is below the minimum 64,000 required by Hermes Agent.

Create a custom Ollama model with extended context:

cat > /tmp/Modelfile << 'EOF'
FROM qwen2.5:7b
PARAMETER num_ctx 131072
EOF
ollama create qwen2.5:7b-128k -f /tmp/Modelfile

Step 6: Performance Reality Check

Here’s the honest truth.

Ollama runs on the 680M at 100% GPU. ROCm passthrough works. The GPU is doing the inference.

The problem is Hermes. Every single message — before generating one token of response — Hermes prefills roughly 6-10K tokens of system prompt: tool definitions, memory, personality, instructions. On an integrated GPU sharing system RAM, that prefill is slow. We saw response times of around 2 minutes for a simple question.

This is not an Ollama problem. Isolated Ollama inference on the 680M is fine. The issue is an agentic framework loading a massive context on every turn hitting a memory-constrained iGPU. Those are two things that don’t pair well.

For async use — fire a task, walk away, come back to the result — this is workable. For interactive conversation, 2-minute round trips are not.

The fix is in the next section.

Step 7: Security Considerations

Stop here for a second.

You’ve just given a software process the ability to read your files, run commands, browse the web, and send messages as you. That is not a minor thing. Most people treat the security section like the warranty card in the box. They know they should read it. They don’t.

I’ve been doing this long enough to watch people bolt security on at the end and call it done. That’s not security. That’s a patient with a sucking chest wound. Technically still alive. Not okay. An AI agent with no meaningful sandbox and API keys sitting in a plaintext file isn’t a productivity tool until you’ve thought through what happens when something goes sideways.

What Hermes Does Well

Hermes ships with a security scanner called Tirith that pattern-matches commands before execution. Secrets redaction is enabled by default. For a self-hosted single-user setup, this is a reasonable baseline.

Where the Gaps Are

No tool sandbox. Hermes runs tools as regular Python processes with full process permissions. If a tool is compromised or a prompt injection tricks the agent into running malicious code, it has access to everything the process can reach.

Compare this to IronClaw, which runs every tool inside a WebAssembly sandbox with explicit capability permissions per tool. A compromised IronClaw tool cannot make HTTP requests to hosts not on its allowlist, cannot read files outside its declared scope, and cannot exfiltrate secrets it was not explicitly given. Think of it this way: Hermes tools are contractors with a master key to the building. IronClaw tools are contractors who can only enter the specific rooms on the work order, and security checks their bag on the way out.

Secrets in plaintext. ~/.hermes/.env stores your API keys as a plaintext file. Mitigate this:

chmod 600 /root/.hermes/.env

For stronger protection, use systemd credentials to inject secrets at service start:

echo -n "sk-ant-..." | systemd-creds encrypt --name=anthropic-key -

[Service]
LoadCredential=anthropic-key:/path/to/encrypted.cred
ExecStart=/bin/sh -c 'ANTHROPIC_API_KEY=$(cat $CREDENTIALS_DIRECTORY/anthropic-key) hermes ...'

Tirith scanner degraded. On our install, Tirith flagged as unavailable. Check with hermes doctor.

Python attack surface. Hermes has a large dependency tree. Pin versions and update deliberately.

Practical Mitigations

Keep the agent in a dedicated LXC with no access to other containers
Lock down ~/.hermes/.env with chmod 600
Disable toolsets you are not using in config.yaml
Set approvals.mode: manual so destructive actions require confirmation
Do not expose the Hermes API port externally without authentication
Use Twingate or similar for remote access rather than opening ports

The Architecture That Actually Works

Part of this was a deliberate experiment: we wanted to know what $20 in API credits actually drives like in day-to-day use.

Given the hardware reality, running Haiku as the primary model and Ollama as fallback is the right call. Response times drop from 2 minutes to seconds. Output quality improves significantly. And the cost is lower than you’d expect.

Haiku 4.5 costs $0.80/MTok input and $4.00/MTok output. With Hermes’s system prompt overhead, a typical interactive session costs roughly $0.01. A $20 API credit load covers around 2,000 sessions — months of normal use.

Ollama still earns its place: compression, session search, and title generation all run locally for free. These are short background tasks where the iGPU overhead doesn’t matter.

Task	Model	Cost
Interactive chat, blog drafts	Haiku API (primary)	~$0.01/session
Compression, session search, title gen	Ollama qwen2.5:7b-128k	Free
Haiku unavailable	Ollama fallback	Free

Config in ~/.hermes/config.yaml:

model:
  default: claude-haiku-4-5-20251001
  provider: anthropic

fallback_model:
  provider: custom
  model: qwen2.5:7b-128k
  base_url: http://localhost:11434/v1
  api_mode: chat_completions
  context_length: 131072

auxiliary:
  compression:
    provider: custom
    model: qwen2.5:7b-128k
    base_url: http://localhost:11434/v1
    context_length: 131072
  session_search:
    provider: custom
    model: qwen2.5:7b-128k
    base_url: http://localhost:11434/v1
    context_length: 131072
  title_generation:
    provider: custom
    model: qwen2.5:7b-128k
    base_url: http://localhost:11434/v1
    context_length: 131072

Want Faster Local Inference?

If you want to keep inference local but remove the iGPU bottleneck, options ranked by effort:

Discrete GPU over LAN. If you already have an NVIDIA card in another machine on your LAN, install Ollama on it, expose it, and point Hermes at that IP. An RTX 4080 Super runs qwen2.5:7b at roughly 20x the speed of the 680M. No code changes beyond the base_url.

Groq API (free, fastest). Groq’s free tier gives you Llama 3.3 70B at 300+ tok/s with 14,400 requests per day. Better model than qwen2.5:7b, faster than any iGPU, zero hardware changes.

Cerebras. Similar free tier, ~2,000 tok/s on Llama models.

Secondhand RTX 3090. ~$350 used. 24GB VRAM, dedicated to the EQR6 via PCIe passthrough. Eliminates the LAN dependency.

Option	Approx Speed	Cost
Groq API	~300 tok/s	Free tier
RTX 4080 Super over LAN	~80-100 tok/s	$0 (existing hardware)
RTX 3090 (dedicated)	~60 tok/s	~$350
Radeon 680M iGPU	Slow under agent load	$0

What’s Next

Telegram gateway - control Hermes from your phone
Skills system - teach Hermes your specific workflows
MCP exposure - wire Hermes as an MCP server so other tools can delegate to it
Discrete GPU node - remove the iGPU bottleneck entirely

Key Takeaways

iGPU ROCm passthrough in Proxmox LXC works. The gfx1035 override is required for the Radeon 680M.
Ollama on the 680M runs at 100% GPU. It works. It’s not the bottleneck.
Hermes’s ~6-10K token system prompt prefill on every turn is what kills interactive performance on modest hardware.
Haiku as primary + Ollama as fallback solves the speed problem for ~$0.01 per session.
$20 in API credits = ~2,000 sessions. Months of normal use.
Groq’s free tier beats every iGPU option if you want fast free inference without hardware cost.
Hermes’s security model is fine for personal homelab use. For sensitive workloads, evaluate IronClaw’s WASM sandbox approach.