Unraveling the Mystery of the Proxmox OOM Killer: Lessons Learned from the Lab
Earlier this year I wrote about the day Hercules silently died twice before 4pm. The short version: the Linux OOM killer on the Proxmox host decided the 36GB QEMU process for my main VM was the most expendable thing on the machine, and acted accordingly. Twice.
That post covered the incident. This one covers what I actually changed, and what I'd do differently now.
What we changed immediately
The obvious fix was reducing the RAM allocated to Hercules. It had been sitting at 60GB, which is generous for a VM running about 25 containers, most of which are small web services. On the day of the incident I dropped it to 20GB. The VM runs fine. Container memory pressure is visible in Grafana but nothing has been killed.
The less obvious thing: I should have done this audit months earlier. I had just never looked at the Proxmox host's total allocated vs physical RAM. When I finally did, the math was embarrassing. Total allocated across all VMs was well over physical capacity. The host was relying entirely on swap and luck.
What the logs actually showed
The forensics were straightforward once I knew where to look. The Proxmox host's /var/log/syslog had clean evidence:
kernel: oom_kill_process: Kill process [pid] (qemu-system-x86) score [N] or sacrifice child
kernel: Killed process [pid] (qemu-system-x86) total-vm:60000000kB, anon-rss:36000000kB
It happened at 05:13 and again at 15:38. The first kill I slept through. The second one I noticed when my phone started pinging about services going down.
The VM itself had no warning. From inside Hercules there was nothing in the logs: the process just stopped. That's the nature of a host-level OOM kill: the guest doesn't know it's happening.
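The log search above can be sketched in a few lines of Python. This is a minimal sketch, not the exact procedure used during the incident: the function name `find_oom_kills` and the returned dict layout are made up for illustration, and the pattern assumes the kernel's usual "Killed process" line format with `total-vm` and `anon-rss` fields, as in the excerpt above.

```python
import re

# Matches the kernel's "Killed process" line. Newer kernels prefix it with
# "Out of memory: "; the pattern does not depend on that prefix.
KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?total-vm:(?P<total_vm_kb>\d+)kB"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def find_oom_kills(lines):
    """Return one dict per OOM kill found in an iterable of log lines."""
    kills = []
    for line in lines:
        match = KILL_RE.search(line)
        if match is None:
            continue
        kills.append({
            "pid": int(match["pid"]),
            "name": match["name"],
            # The kernel reports kB; convert to GiB for readability.
            "total_vm_gib": int(match["total_vm_kb"]) / 1024**2,
            "anon_rss_gib": int(match["anon_rss_kb"]) / 1024**2,
            "line": line.rstrip("\n"),
        })
    return kills
```

On the host, passing an open handle on /var/log/syslog (opened with `errors="replace"` in case of stray bytes) to `find_oom_kills` should list each kill along with its timestamped source line.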
What I'd do differently
Memory ballooning. Proxmox supports KVM memory ballooning, which lets the hypervisor reclaim idle RAM from guests dynamically. A VM allocated 20GB doesn't have to hold 20GB all the time; it can release pages back to the host when they're not in use. I haven't enabled this yet, but it's the right long-term answer. Static allocation is simple to reason about, but it doesn't reflect how memory is actually used.
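For reference, enabling ballooning comes down to giving the VM a ceiling and a floor. The sketch below builds the `qm set` command for that, assuming the `--memory` and `--balloon` options behave as the maximum and minimum memory fields in the web UI do. The helper name `balloon_command` and the 8192 MiB floor in the usage note are hypothetical, not values from the incident.

```python
def balloon_command(vmid, max_mib, min_mib):
    """Build the `qm set` argv that enables ballooning for one VM.

    --memory is the ceiling the guest can grow to; --balloon is the floor
    the host may shrink it down to. Both values are in MiB.
    """
    if not 0 < min_mib <= max_mib:
        raise ValueError("need 0 < min_mib <= max_mib")
    return ["qm", "set", str(vmid),
            "--memory", str(max_mib),
            "--balloon", str(min_mib)]
```

Running it would look like `subprocess.run(balloon_command(102, 20480, 8192), check=True)` on the Proxmox host. The guest also needs a working virtio balloon driver for the setting to have any effect.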
OOM alerting. I now have node-exporter running on both purrbox and hera, with Grafana watching metrics. I don't yet have an explicit OOM alert on the Proxmox host itself, which is a gap. A simple alert on node_vmstat_oom_kill incrementing would have told me about the 05:13 kill before I woke up.
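The node_vmstat_oom_kill metric is node-exporter's view of the `oom_kill` counter in /proc/vmstat, so the same check can be sketched directly against that file. This is an illustrative sketch with made-up function names, assuming a kernel recent enough to expose the counter. In Prometheus itself, an expression along the lines of `increase(node_vmstat_oom_kill[10m]) > 0` would be the likely equivalent.

```python
def read_oom_kill_count(vmstat_text):
    """Extract the cumulative oom_kill counter from /proc/vmstat contents."""
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "oom_kill":
            return int(fields[1])
    raise KeyError("oom_kill not found; kernel may predate the counter")


def new_oom_kills(previous_count, vmstat_text):
    """Return how many kills happened since previous_count was recorded."""
    current = read_oom_kill_count(vmstat_text)
    # The counter resets to zero on reboot, so a lower value means the host
    # restarted; treat everything since boot as new in that case.
    if current < previous_count:
        return current
    return current - previous_count
```

A cron job could store the last seen count in a file and send a notification whenever `new_oom_kills` returns something above zero.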
Overcommit audit as a routine. When you add a VM, it's easy to be generous with the RAM slider. "I'll give it 32GB, it probably won't use that much." Multiply that reasoning across five VMs and you've quietly built a system that only works until it doesn't. A 10-minute audit of total allocated vs physical available, done every time you add a VM, prevents this class of problem entirely.
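The audit itself is simple arithmetic: add up every VM's memory setting and divide by the host's physical RAM. Here is a rough sketch, assuming VM configs live under /etc/pve/qemu-server/ with a `memory:` line in MiB, and that a VM without one gets 512 MiB (treat that default as an assumption to verify). All function names are illustrative.

```python
import glob
import re

# Accepts both "memory: 20480" and the "memory: current=20480" form.
MEMORY_RE = re.compile(r"^memory:\s*(?:current=)?(\d+)", re.MULTILINE)


def vm_memory_mib(conf_text, default_mib=512):
    """Return the memory setting (MiB) from one VM config file's text."""
    # Snapshot sections start with a "[name]" header and repeat the memory
    # line, so only the part before the first header is the live config.
    live = re.split(r"^\[", conf_text, maxsplit=1, flags=re.MULTILINE)[0]
    match = MEMORY_RE.search(live)
    return int(match.group(1)) if match else default_mib


def host_memory_mib(meminfo_text):
    """Return MemTotal from /proc/meminfo contents, converted to MiB."""
    match = re.search(r"^MemTotal:\s+(\d+) kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found")
    return int(match.group(1)) // 1024


def audit(conf_texts, meminfo_text):
    """Return (allocated_mib, physical_mib, ratio) for the whole host."""
    allocated = sum(vm_memory_mib(text) for text in conf_texts)
    physical = host_memory_mib(meminfo_text)
    return allocated, physical, allocated / physical


def audit_host(conf_glob="/etc/pve/qemu-server/*.conf"):
    """Run the audit against the files on a Proxmox host."""
    texts = []
    for path in sorted(glob.glob(conf_glob)):
        with open(path) as f:
            texts.append(f.read())
    with open("/proc/meminfo") as f:
        return audit(texts, f.read())
```

A ratio above 1.0 means the VMs are promised more RAM than the machine has. The sketch counts stopped VMs too and ignores LXC containers and the host's own needs, so the real safe ceiling is somewhat below 1.0.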
The recovery story
When Hercules goes down, the recovery path is: log into the Proxmox web UI, find VM 102, hit start. That's it. All the containers come back up automatically because they're set to restart unless stopped. The whole process takes about three minutes including Docker pulling itself together.
The inconvenient part is that it requires manual intervention. There's no auto-restart for a VM that was killed by the hypervisor's OOM killer. Proxmox doesn't know the VM was killed unexpectedly vs shut down intentionally. I've thought about scripting a watchdog but haven't gotten there yet. Adding it to the list.
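A watchdog along those lines could be as small as the sketch below. It assumes `qm status <vmid>` prints a line such as `status: stopped`, and it is deliberately naive: it restarts the VM whenever it finds it stopped, so it cannot tell an OOM kill from an intentional shutdown. The function names are hypothetical, and the `run` parameter exists only so the logic can be exercised without a Proxmox host.

```python
import subprocess


def is_stopped(status_output):
    """True when `qm status` output reports the VM as stopped."""
    return status_output.strip() == "status: stopped"


def check_and_start(vmid, run=subprocess.run):
    """Start the VM if it is stopped. Returns True when a start was issued."""
    status = run(["qm", "status", str(vmid)],
                 capture_output=True, text=True, check=True)
    if not is_stopped(status.stdout):
        return False
    run(["qm", "start", str(vmid)], check=True)
    return True
```

Calling `check_and_start(102)` from a cron job or systemd timer every minute or two would cover the case described here. Pairing it with a check of the host's OOM kill counter would be one way to avoid restarting a VM that was shut down on purpose.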
The actual lesson
Don't overallocate RAM on a Proxmox host and then forget about it. The OOM killer doesn't care how many services are depending on the process it chooses. It just picks the biggest thing and kills it.
The fix was four keystrokes in the Proxmox UI. The root cause was months of not checking.