Running Hercules on a RAM Diet
Hercules runs on a Proxmox host. For a while it had 36GB of RAM allocated. That sounded fine, until the Proxmox host ran out of physical memory, the OOM killer looked around for the fattest process, and selected the QEMU process for Hercules. Gone. No warning, no graceful shutdown. Just gone.
This happened twice on the same day. 05:13 and 15:38. The second time was particularly rude.
What actually happened
The root cause isn't complicated: the Proxmox host had more RAM allocated to VMs than it had physical RAM to back them. When all the VMs decided to actually use their allocation at the same time, the host panicked and started killing things. Hercules, with its 36GB allocation, looked the most delicious.
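The arithmetic behind that is simple enough to sketch. The snippet below is a rough illustration, not my actual inventory: the host size, the other VMs, and the 4GB host reserve are all made-up numbers, and only Hercules's 36GB comes from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    allocated_gb: float


def overcommit_report(host_ram_gb, vms, host_reserve_gb=4.0):
    """Compare total VM allocation against what the host can actually back.

    host_reserve_gb is an assumed amount kept back for the hypervisor itself.
    """
    allocated = sum(vm.allocated_gb for vm in vms)
    usable = host_ram_gb - host_reserve_gb
    return {
        "allocated_gb": allocated,
        "usable_gb": usable,
        "ratio": allocated / usable,
        "overcommitted": allocated > usable,
    }


# Hypothetical host and neighbours; only Hercules's 36GB is from the post.
vms = [VM("hercules", 36), VM("other-vm-a", 16), VM("other-vm-b", 16)]
report = overcommit_report(host_ram_gb=64, vms=vms)
print(f"allocated {report['allocated_gb']}GB of {report['usable_gb']}GB usable, "
      f"overcommitted: {report['overcommitted']}")
```

With these invented numbers the VMs are promised 68GB against 60GB of usable RAM. Nothing breaks until they all try to collect at once.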
The host logs told the whole story:
oom_kill: Kill process [qemu-pid] (qemu-system-x86) score [high] or sacrifice child
Out of memory: Killed process [qemu-pid] (qemu-system-x86)
No subtlety. Just murder.
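If you'd rather not eyeball the kernel log for these, a few lines of Python can pull the kills out. This is a minimal sketch: the regex is written against the "Out of memory: Killed process" line shown above, the PID in the sample is invented, and real kernel lines carry extra fields after the process name that this simply ignores.

```python
import re

# Matches the kill line quoted above; anything after the closing paren is ignored.
OOM_KILL = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
)


def find_oom_kills(lines):
    """Return (pid, process_name) for every OOM kill line found."""
    hits = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match:
            hits.append((match.group("pid"), match.group("comm")))
    return hits


# Sample input; the PID is a placeholder, not a value from my logs.
sample = [
    "Out of memory: Killed process 12345 (qemu-system-x86)",
    "some unrelated kernel message",
]
print(find_oom_kills(sample))
```

Feeding it the output of `dmesg` or `journalctl -k` on the host, one line at a time, should work the same way as the sample list.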
The fix: eat less
The honest answer was to reduce Hercules's RAM allocation. It now runs on 20GB. That's 16 vCPUs and 20GB serving this stack:
- Ghost, Vikunja, Mealie, OwnCloud (with Redis), Emby, Audiobookshelf, Photoprism
- Postgres and MariaDB
- Piler + Manticore + Memcached (the email archiver requires an entourage)
- Outline wiki, Kavita, Soundboard, Homepage
- Alloy, cAdvisor, node-exporter for observability
- qBittorrent, RustDesk relay
That's roughly 25 containers, give or take the ones I forgot to count. All of it on 20GB.
Does it actually fit?
Mostly. The key insight is that most of these services aren't doing anything most of the time. OwnCloud sits idle. Photoprism wakes up when someone asks it to. Emby transcodes occasionally. The databases handle low-volume traffic. When nothing is actively happening, the actual RSS across all containers is well under 20GB.
The problem is that this holds "most of the time," not "all of the time." If Photoprism decides to re-index the library while Emby starts transcoding while someone syncs a large OwnCloud folder, things get uncomfortable. Hasn't caused an OOM kill at the VM level yet, but I'm watching.
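"Watching" can be as simple as checking how much memory the guest still considers available. The sketch below parses `/proc/meminfo`-style text and computes the available fraction. The sample numbers are illustrative (a 20GiB guest with 4GiB free to hand out), not a reading from Hercules, and any alert threshold you pick is your own call.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def available_fraction(meminfo):
    """Fraction of total memory the kernel estimates is available to new work."""
    return meminfo["MemAvailable"] / meminfo["MemTotal"]


# Illustrative sample, not real output from the VM.
sample_text = """MemTotal:       20971520 kB
MemFree:         1048576 kB
MemAvailable:    4194304 kB
"""
meminfo = parse_meminfo(sample_text)
print(f"{available_fraction(meminfo):.0%} of memory available")
```

On the real machine you would read the text with `open("/proc/meminfo").read()` instead of using the sample string.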
What would actually fix it
Three real options, in order of how much they cost:
Memory ballooning: configure QEMU's balloon driver so Proxmox can reclaim idle RAM from VMs dynamically. Hercules might have 20GB allocated but only hold 12GB when it's not busy, letting other VMs breathe. Zero hardware cost, requires guest driver support (which Linux has). I haven't enabled this yet. I should.
Trim the stack: some of these services genuinely compete with each other. Piler and its three dependencies exist because mail archiving felt like a good idea in 2024. It still might be. But if something has to go to buy headroom, it's a candidate.
More physical RAM in the Proxmox host: the honest long-term answer. The host is oversubscribed. Buying more RAM fixes the root cause instead of working around it.
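For the ballooning option, the change on the Proxmox side comes down to giving the VM a maximum and a lower balloon target. Here is a small sketch that builds the `qm set` command for the 20GB / 12GB split mentioned above. The VM ID of 100 is a placeholder, and treating 12GB as the balloon minimum is my assumption about how I'd configure it, not something I've tested yet.

```python
def balloon_command(vmid, max_gb, min_gb):
    """Build a `qm set` command with a memory ceiling and a balloon minimum.

    Proxmox takes both values in MiB. vmid is whatever ID the VM has on the host.
    """
    if not 0 < min_gb <= max_gb:
        raise ValueError("min_gb must be positive and no larger than max_gb")

    def to_mib(gb):
        return int(gb * 1024)

    return [
        "qm", "set", str(vmid),
        "--memory", str(to_mib(max_gb)),
        "--balloon", str(to_mib(min_gb)),
    ]


# VM ID 100 is hypothetical; 20GB and 12GB are the figures from this post.
command = balloon_command(vmid=100, max_gb=20, min_gb=12)
print(" ".join(command))
```

The function only prints the command rather than running it, so there's a chance to read it before anything touches a live VM.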
The lesson
RAM overcommit works until it doesn't, and when it doesn't, the OOM killer makes unilateral decisions you won't like. The correct response isn't to be smarter about overcommit; it's to not overcommit. Know what your host actually has, allocate less than that, and leave the Proxmox host enough room to breathe.
Twenty gigabytes for 25 containers is tight. It's working. I'm keeping one eye on it.