Unraveling the Mystery of the Proxmox OOM Killer: Lessons Learned from the Lab

Earlier this year I wrote about the day Hercules silently died twice before 4pm. The short version: the Linux OOM killer on the Proxmox host decided the 36GB QEMU process for my main VM was the most expendable thing on the machine, and acted accordingly. Twice.

That post covered the incident. This one covers what I actually changed, and what I’d do differently now.

What we changed immediately

The obvious fix was reducing the RAM allocated to Hercules. It had been sitting at 60GB — which is generous for a VM running about 25 containers, most of which are small web services. On the day of the incident I dropped it to 20GB. The VM runs fine. Container memory pressure is visible in Grafana but nothing has been killed.

The less obvious thing: I should have done this audit months earlier. I had just never looked at the Proxmox host’s total allocated vs physical RAM. When I finally did, the math was embarrassing. Total allocated across all VMs was well over physical capacity. The host was relying entirely on swap and luck.

What the logs actually showed

The forensics were straightforward once I knew where to look. The Proxmox host’s /var/log/syslog had clean evidence:

kernel: oom_kill_process: Kill process [pid] (qemu-system-x86) score [N] or sacrifice child
kernel: Killed process [pid] (qemu-system-x86) total-vm:60000000kB, anon-rss:36000000kB

It happened at 05:13 and again at 15:38. The first kill I slept through. The second one I noticed when my phone started pinging about services going down.

The VM itself had no warning. From inside Hercules there was nothing in the logs — the process just stopped. That’s the nature of a host-level OOM kill: the guest doesn’t know it’s happening.
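If you'd rather not eyeball syslog, the kill lines are regular enough to pull out with a regex. This is a minimal sketch that assumes the "Killed process" line has the shape quoted above; the PID in the sample is made up, and the kB-to-GB conversion is deliberately rough.

```python
import re

# Matches the kernel's "Killed process" line in the shape quoted above.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
    r".*?total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(lines):
    """Return one dict per OOM kill found in an iterable of log lines."""
    kills = []
    for line in lines:
        m = KILLED_RE.search(line)
        if m is None:
            continue
        kills.append({
            "pid": int(m.group("pid")),
            "comm": m.group("comm"),
            # Rough conversion, good enough for "which process was it".
            "total_vm_gb": int(m.group("total_vm_kb")) / 1_000_000,
            "anon_rss_gb": int(m.group("anon_rss_kb")) / 1_000_000,
        })
    return kills


# Hypothetical sample: PID 12345 is invented, sizes match the excerpt above.
sample = [
    "kernel: Killed process 12345 (qemu-system-x86) total-vm:60000000kB, anon-rss:36000000kB",
    "kernel: some unrelated message",
]
kills = find_oom_kills(sample)
```

On the host itself you would pass an open file instead of the sample list, for example the handle from open("/var/log/syslog", errors="replace").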

What I’d do differently

Memory ballooning. Proxmox supports KVM memory ballooning, which lets the hypervisor reclaim idle RAM from guests dynamically. A VM allocated 20GB doesn’t have to hold 20GB all the time — it can release pages back to the host when they’re not in use. I haven’t enabled this yet, but it’s the right long-term answer. Static allocation is simple to reason about, but it doesn’t reflect how memory is actually used.
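For what it's worth, here is roughly what enabling it would look like for Hercules (VM 102). As I understand the qm options, --memory is the ceiling and --balloon is the floor the guest can be squeezed down to, both in MiB. The 8GB floor is a number I picked for illustration, not something I've tested, and the guest needs the virtio balloon driver for any of this to work. The helper only builds the command; it doesn't run it.

```python
def balloon_command(vmid, max_mib, min_mib):
    """Build a `qm set` argv giving a VM a memory ceiling and a balloon floor (MiB)."""
    if not 0 < min_mib <= max_mib:
        raise ValueError("need 0 < min_mib <= max_mib")
    return [
        "qm", "set", str(vmid),
        "--memory", str(max_mib),   # maximum the guest can use
        "--balloon", str(min_mib),  # minimum the host may shrink it to
    ]


# Hypothetical values: 20GB ceiling (current allocation), 8GB floor (a guess).
cmd = balloon_command(102, max_mib=20 * 1024, min_mib=8 * 1024)
print(" ".join(cmd))  # prints "qm set 102 --memory 20480 --balloon 8192"
```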

OOM alerting. I now have node-exporter running on both purrbox and hera, with Grafana watching metrics. I don’t yet have an explicit OOM alert on the Proxmox host itself — which is a gap. A simple alert on node_vmstat_oom_kill incrementing would have told me about the 05:13 kill before I woke up.
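As far as I know, node_vmstat_oom_kill is just the oom_kill line from /proc/vmstat, a counter that only goes up until the next reboot. The alert logic is "did it go up since the last look". Here is a sketch of that check in plain Python, with made-up sample readings; older kernels may not expose the counter at all.

```python
def read_oom_kill_count(vmstat_text):
    """Extract the cumulative oom_kill counter from /proc/vmstat contents."""
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "oom_kill":
            return int(parts[1])
    raise KeyError("oom_kill not found in vmstat text")


def oom_kills_since(previous, current):
    """New kills between two readings; a lower value means the host rebooted."""
    return current if current < previous else current - previous


# Hypothetical readings taken a few minutes apart.
before = "nr_free_pages 1000\noom_kill 0\n"
after = "nr_free_pages 200\noom_kill 2\n"
new_kills = oom_kills_since(read_oom_kill_count(before), read_oom_kill_count(after))
```

In Prometheus terms the equivalent would be something like increase(node_vmstat_oom_kill[5m]) > 0, where the five-minute window is my guess at a sensible default rather than anything I've tuned.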

Overcommit audit as a routine. When you add a VM, it’s easy to be generous with the RAM slider. “I’ll give it 32GB, it probably won’t use that much.” Multiply that reasoning across five VMs and you’ve quietly built a system that only works until it doesn’t. A 10-minute audit of total allocated vs physical available, done every time you add a VM, prevents this class of problem entirely.
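The audit itself is just addition and one comparison. A sketch with placeholder numbers: only the 60GB for Hercules comes from my actual setup, while the other VM names and sizes, the 64GB of physical RAM, and the 4GB held back for the host are invented for the example.

```python
def overcommit_report(allocated_gb, physical_gb, host_reserve_gb=4):
    """Compare total VM allocation against RAM left after a reserve for the host."""
    total = sum(allocated_gb.values())
    usable = physical_gb - host_reserve_gb
    return {
        "total_allocated_gb": total,
        "usable_gb": usable,
        "ratio": round(total / usable, 2),
        "overcommitted": total > usable,
    }


# Hypothetical inventory: only hercules (60GB, pre-incident) is a real figure.
vms = {"hercules": 60, "vm-b": 32, "vm-c": 16, "vm-d": 16, "vm-e": 8}
report = overcommit_report(vms, physical_gb=64)
```

With these numbers the report shows 132GB allocated against 60GB usable, a ratio of 2.2. A ratio above 1.0 isn't automatically fatal, but it means the host only survives as long as the guests don't all use what they were promised.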

The recovery story

When Hercules goes down, the recovery path is: log into the Proxmox web UI, find VM 102, hit start. That’s it. All the containers come back up automatically because their restart policy is unless-stopped. The whole process takes about three minutes including Docker pulling itself together.

The inconvenient part is that it requires manual intervention. Nothing in my current setup restarts a VM that was killed by the host’s OOM killer, and as far as I can tell Proxmox doesn’t distinguish a VM that was killed unexpectedly from one that was shut down intentionally. I’ve thought about scripting a watchdog but haven’t gotten there yet. Adding it to the list.
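If I do get to the watchdog, the core would be something like this: ask qm for the VM's status and start it if it's stopped. This is an untested sketch. It assumes qm status prints "status: stopped" for a dead VM, and the run parameter exists only so the logic can be exercised without a real Proxmox host.

```python
import subprocess


def qm(*args):
    """Run a qm subcommand on the Proxmox host and return its stdout."""
    result = subprocess.run(
        ["qm", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def ensure_running(vmid, run=qm):
    """Start the VM if `qm status` reports it stopped.

    Returns True if a start was issued, False otherwise.
    """
    status = run("status", str(vmid)).strip()
    if status == "status: stopped":
        run("start", str(vmid))
        return True
    return False
```

Run from cron or a systemd timer on the host, ensure_running(102) would cover the 05:13 case. The catch is the one already mentioned: it can't tell a kill from a deliberate shutdown, so it would also restart a VM I stopped on purpose unless I add some kind of "leave it alone" flag.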

The actual lesson

Don’t overallocate RAM on a Proxmox host and then forget about it. The OOM killer doesn’t care how many services are depending on the process it chooses. It scores processes mostly by how much memory they hold, so on a hypervisor the victim is usually the biggest VM.

The fix was four keystrokes in the Proxmox UI. The root cause was months of not checking. 🐾