I run a Beelink EQR6 as my homelab server. It’s a compact AMD Ryzen 9 6900HX mini PC with 64GB of DDR5 RAM and a Radeon 680M integrated GPU. No discrete graphics card. Just a small, silent box running Proxmox that handles most of my self-hosted infrastructure.
Recently I decided to push it further. I wanted to run a local AI agent. Not just Ollama in isolation, but a full agentic framework that could draft content, manage tasks, respond to me on Telegram, and generally act like an AI employee I could delegate to.
This is the story of how that went. Spoiler: it worked. But the road had some sharp turns.
What We’re Building
Here’s the stack:
- Proxmox VE 9.1.7 - bare metal hypervisor on the EQR6
- Debian 12 Bookworm LXC - lightweight container (not a full VM)
- AMD ROCm 6.0 - GPU compute framework for the Radeon 680M
- Ollama - local LLM inference server
- qwen2.5:7b - the language model doing the actual thinking
- Hermes Agent - the agentic framework tying it all together
Why Hermes? It’s the fastest growing agent framework on GitHub right now. It has a compounding skills system — the more you use it, the smarter it gets at your specific workflows. Built-in Telegram/messaging gateway, MCP support, persistent memory, and a self-improvement loop. NetworkChuck covered it recently and it’s worth the attention.
Why a container instead of a VM? LXC containers share the host kernel. That means less overhead than a full VM, which matters when your “GPU” is actually integrated into the CPU and shares system memory. Every megabyte counts.
Step 1: Setting Up the Proxmox LXC
Create a Debian 12 Bookworm LXC with these specs:
- Cores: 8
- RAM: 32GB
- Disk: 40GB
- Nesting: enabled (required for systemd inside the container)
pct create 400 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
--hostname hermes-osms \
--cores 8 \
--memory 32768 \
--rootfs local-lvm:40 \
--features nesting=1
Why 32GB? The Radeon 680M is an integrated GPU. It carves its VRAM directly out of system memory. Running a 7B model means roughly 8GB goes to VRAM, leaving 24GB for the OS, Hermes, and everything else.
Step 2: GPU Passthrough (The Fun Part)
This is what separates GPU-accelerated inference from CPU crawl. We need the container to see the GPU.
The GPU lives on the host machine. By default, containers are sandboxed and cannot touch it. We need to poke specific holes in that sandbox so the container can talk to the GPU driver.
First, identify the GPU device nodes on the Proxmox host:
stat -c "%t %T" /dev/dri/card0 # 226:0
stat -c "%t %T" /dev/dri/renderD128 # 226:128
stat -c "%t %T" /dev/kfd # 234:0
Then add these lines to /etc/pve/lxc/400.conf on the host:
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.cgroup2.devices.allow: c 234:0 rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file
cgroup2 is the kernel’s resource control layer. We’re telling it to allow the container to access those specific device numbers with read/write/mknod permissions, then bind-mounting the actual device files into the container.
After restarting the container, verify inside:
ls /dev/dri # Should show card0 and renderD128
ls /dev/kfd # Should exist
Step 3: ROCm 6.0 Inside the Container
ROCm is AMD’s compute framework — the equivalent of NVIDIA’s CUDA. Without it, Ollama falls back to CPU inference.
Important: Install userspace libraries only. Do NOT install DKMS kernel modules. The container shares the host kernel and this will cause conflicts.
apt-get install -y rocm-hip-libraries rocm-opencl-runtime --no-install-recommends
usermod -aG render,video root
The systemd-logind Fix
Proxmox LXC drops certain kernel namespaces for isolation. Modern systemd does not like this and crashes into a loop (status=226/NAMESPACE), causing severe SSH latency.
Fix it permanently:
mkdir -p /etc/systemd/system/systemd-logind.service.d/
cat > /etc/systemd/system/systemd-logind.service.d/override.conf << 'EOF'
[Service]
PrivateUsers=no
RestrictNamespaces=no
EOF
systemctl daemon-reload && systemctl restart systemd-logind
Verify ROCm sees the GPU:
rocminfo | grep -A5 "Agent 2"
You should see gfx1030 or gfx1035 (the Radeon 680M identifier).
Step 4: Installing Ollama
curl -fsSL https://ollama.com/install.sh | sh
The installer detects ROCm and configures GPU acceleration automatically. But here’s where we hit the first real gotcha.
The gfx1035 Problem
When you check the Ollama logs, you might see:
error="runner crashed" detail="error: Cannot read TensileLibrary.dat: Illegal seek for GPU arch: gfx1035"
The Radeon 680M identifies itself as gfx1035 to the driver. Ollama’s ROCm library has precompiled GPU kernels for gfx1030 (RDNA2) but not specifically for gfx1035. It crashes looking for code that does not exist.
The fix: tell ROCm to pretend it’s a gfx1030. They’re close enough architecturally that this works fine.
mkdir -p /etc/systemd/system/ollama.service.d/
cat > /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
EOF
systemctl daemon-reload && systemctl restart ollama
Pull and test the model:
ollama pull qwen2.5:7b
ollama run qwen2.5:7b "respond with one word: hello"
ollama ps
You want to see 100% GPU in the PROCESSOR column. That confirms ROCm passthrough is working.
Step 5: Installing Hermes Agent
apt-get install -y python3 python3-pip pipx
pipx install hermes-agent
pipx ensurepath
Configure the provider. Hermes does not have a native local Ollama option in its wizard. Use Custom endpoint:
- URL:
http://localhost:11434/v1 - API mode: Chat Completions
- Model:
qwen2.5:7b - API key: none required
The Context Window Gotcha
Hermes requires a minimum 64K context window. Ollama defaults qwen2.5:7b to 32K. You will see:
Failed to initialize agent: Model qwen2.5:7b has a context window of 32,768 tokens,
which is below the minimum 64,000 required by Hermes Agent.
Create a custom Ollama model with extended context:
cat > /tmp/Modelfile << 'EOF'
FROM qwen2.5:7b
PARAMETER num_ctx 131072
EOF
ollama create qwen2.5:7b-128k -f /tmp/Modelfile
Step 6: Performance Reality Check
Here’s the honest truth.
Ollama runs on the 680M at 100% GPU. ROCm passthrough works. The GPU is doing the inference.
The problem is Hermes. Every single message — before generating one token of response — Hermes prefills roughly 6-10K tokens of system prompt: tool definitions, memory, personality, instructions. On an integrated GPU sharing system RAM, that prefill is slow. We saw response times of around 2 minutes for a simple question.
This is not an Ollama problem. Isolated Ollama inference on the 680M is fine. The issue is an agentic framework loading a massive context on every turn hitting a memory-constrained iGPU. Those are two things that don’t pair well.
For async use — fire a task, walk away, come back to the result — this is workable. For interactive conversation, 2-minute round trips are not.
The fix is in the next section.
Step 7: Security Considerations
Running an AI agent means giving software the ability to read files, execute commands, browse the web, and send messages on your behalf. That deserves a clear-eyed look at the attack surface.
What Hermes Does Well
Hermes ships with a security scanner called Tirith that pattern-matches commands before execution. Secrets redaction is enabled by default. For a self-hosted single-user setup, this is a reasonable baseline.
Where the Gaps Are
No tool sandbox. Hermes runs tools as regular Python processes with full process permissions. If a tool is compromised or a prompt injection tricks the agent into running malicious code, it has access to everything the process can reach.
Compare this to IronClaw, which runs every tool inside a WebAssembly sandbox with explicit capability permissions per tool. A compromised IronClaw tool cannot make HTTP requests to hosts not on its allowlist, cannot read files outside its declared scope, and cannot exfiltrate secrets it was not explicitly given. Think of it this way: Hermes tools are contractors with a master key to the building. IronClaw tools are contractors who can only enter the specific rooms on the work order, and security checks their bag on the way out.
Secrets in plaintext. ~/.hermes/.env stores your API keys as a plaintext file. Mitigate this:
chmod 600 /root/.hermes/.env
For stronger protection, use systemd credentials to inject secrets at service start:
echo -n "sk-ant-..." | systemd-creds encrypt --name=anthropic-key -
[Service]
LoadCredential=anthropic-key:/path/to/encrypted.cred
ExecStart=/bin/sh -c 'ANTHROPIC_API_KEY=$(cat $CREDENTIALS_DIRECTORY/anthropic-key) hermes ...'
Tirith scanner degraded. On our install, Tirith flagged as unavailable. Check with hermes doctor.
Python attack surface. Hermes has a large dependency tree. Pin versions and update deliberately.
Practical Mitigations
- Keep the agent in a dedicated LXC with no access to other containers
- Lock down
~/.hermes/.envwithchmod 600 - Disable toolsets you are not using in config.yaml
- Set
approvals.mode: manualso destructive actions require confirmation - Do not expose the Hermes API port externally without authentication
- Use Twingate or similar for remote access rather than opening ports
The Architecture That Actually Works
Part of this was a deliberate experiment: we wanted to know what $20 in API credits actually drives like in day-to-day use.
Given the hardware reality, running Haiku as the primary model and Ollama as fallback is the right call. Response times drop from 2 minutes to seconds. Output quality improves significantly. And the cost is lower than you’d expect.
Haiku 4.5 costs $0.80/MTok input and $4.00/MTok output. With Hermes’s system prompt overhead, a typical interactive session costs roughly $0.01. A $20 API credit load covers around 2,000 sessions — months of normal use.
Ollama still earns its place: compression, session search, and title generation all run locally for free. These are short background tasks where the iGPU overhead doesn’t matter.
| Task | Model | Cost |
|---|---|---|
| Interactive chat, blog drafts | Haiku API (primary) | ~$0.01/session |
| Compression, session search, title gen | Ollama qwen2.5:7b-128k | Free |
| Haiku unavailable | Ollama fallback | Free |
Config in ~/.hermes/config.yaml:
model:
default: claude-haiku-4-5-20251001
provider: anthropic
fallback_model:
provider: custom
model: qwen2.5:7b-128k
base_url: http://localhost:11434/v1
api_mode: chat_completions
context_length: 131072
auxiliary:
compression:
provider: custom
model: qwen2.5:7b-128k
base_url: http://localhost:11434/v1
context_length: 131072
session_search:
provider: custom
model: qwen2.5:7b-128k
base_url: http://localhost:11434/v1
context_length: 131072
title_generation:
provider: custom
model: qwen2.5:7b-128k
base_url: http://localhost:11434/v1
context_length: 131072
Want Faster Local Inference?
If you want to keep inference local but remove the iGPU bottleneck, options ranked by effort:
Discrete GPU over LAN. If you already have an NVIDIA card in another machine on your LAN, install Ollama on it, expose it, and point Hermes at that IP. An RTX 4080 Super runs qwen2.5:7b at roughly 20x the speed of the 680M. No code changes beyond the base_url.
Groq API (free, fastest). Groq’s free tier gives you Llama 3.3 70B at 300+ tok/s with 14,400 requests per day. Better model than qwen2.5:7b, faster than any iGPU, zero hardware changes.
Cerebras. Similar free tier, ~2,000 tok/s on Llama models.
Secondhand RTX 3090. ~$350 used. 24GB VRAM, dedicated to the EQR6 via PCIe passthrough. Eliminates the LAN dependency.
| Option | Approx Speed | Cost |
|---|---|---|
| Groq API | ~300 tok/s | Free tier |
| RTX 4080 Super over LAN | ~80-100 tok/s | $0 (existing hardware) |
| RTX 3090 (dedicated) | ~60 tok/s | ~$350 |
| Radeon 680M iGPU | Slow under agent load | $0 |
What’s Next
- Telegram gateway - control Hermes from your phone
- Skills system - teach Hermes your specific workflows
- MCP exposure - wire Hermes as an MCP server so other tools can delegate to it
- Discrete GPU node - remove the iGPU bottleneck entirely
Key Takeaways
- iGPU ROCm passthrough in Proxmox LXC works. The gfx1035 override is required for the Radeon 680M.
- Ollama on the 680M runs at 100% GPU. It works. It’s not the bottleneck.
- Hermes’s ~6-10K token system prompt prefill on every turn is what kills interactive performance on modest hardware.
- Haiku as primary + Ollama as fallback solves the speed problem for ~$0.01 per session.
- $20 in API credits = ~2,000 sessions. Months of normal use.
- Groq’s free tier beats every iGPU option if you want fast free inference without hardware cost.
- Hermes’s security model is fine for personal homelab use. For sensitive workloads, evaluate IronClaw’s WASM sandbox approach.