The Tale of the Troublesome Traefik: When Dynamic Configs Go Rogue

Traefik has one feature that makes it genuinely pleasant to work with: hot-reload. Edit a file in the dynamic/ directory, save it, and within seconds the new routes are live. No restart. No downtime. It’s the kind of thing that makes you feel like you have a proper setup.

Then you fat-finger a service name at 11pm and spend 20 minutes wondering why half your services are returning 502s.

How the setup works

My Traefik instance runs on hera, which handles all external routing for *.trexug.com. The static config (static.yml) barely changes — it defines entry points, certificates, and the path to watch for dynamic config. The dynamic config (dynamic/hercules.yml) is where all the routers and services live, and it’s what I actually edit when adding a new service.

The workflow looks like this:

  1. SSH into hera
  2. Edit /software/hera/traefik/dynamic/hercules.yml
  3. Traefik reloads within a few seconds
  4. Commit and push when it works

Step 4 is important. I’ve occasionally skipped it.

The rogue config problem

Here’s where it gets interesting. Traefik validates YAML syntax on load, but a clean parse says nothing about whether the routes make sense. A mismatched router-to-service reference, a wrong port number, a service pointing at a backend that doesn’t exist: these are all syntactically valid. Traefik loads the file, the reload looks like it worked, and the affected routes stop serving traffic.

The symptom is always the same: a domain that was working suddenly returns a 502 or 404. The Traefik dashboard (if you have it enabled) can still show the router as green, because the config loaded and the router itself is valid. The problem is downstream.

I’ve hit this pattern a few times:

  • Adding a new service and copying a block from another, then forgetting to update the loadBalancer.servers.url
  • Referencing a middleware by the wrong name (typo in the middleware definition vs the router)
  • Attaching the router for a new service to the wrong entry point

None of these stop the reload or announce themselves while you’re editing. They just quietly break.
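The mismatched router-to-service reference is mechanical enough to check before trusting a reload. Here is a rough sketch, not a real validator. It assumes the file uses the usual http / routers / services layout with two-space indentation and unquoted names. check_service_refs is a hypothetical helper, not something Traefik ships, and it would wrongly flag cross-provider references such as name@docker.

```shell
# Hypothetical helper: list router "service:" references that have no
# matching entry under http.services in the same file.
# Assumes two-space indentation and unquoted names (an assumption, not a rule).
check_service_refs() {
  awk '
    # A key at two-space indentation (routers, services, middlewares) starts a section.
    /^  [A-Za-z]+: *$/ { section = $1; sub(/:$/, "", section); next }
    # Keys at four-space indentation under "services" are service definitions.
    section == "services" && /^    [^ #][^:]*: *$/ {
      name = $1; sub(/:$/, "", name); defined[name] = 1; next
    }
    # Inside "routers", a "service: NAME" line is a reference to check.
    section == "routers" && $1 == "service:" { refs[$2] = 1 }
    END {
      for (r in refs) if (!(r in defined)) { print "undefined service: " r; bad = 1 }
      exit (bad ? 1 : 0)
    }
  ' "$1"
}

# Usage (path taken from the workflow described earlier):
#   check_service_refs /software/hera/traefik/dynamic/hercules.yml
```

It exits non-zero when any reference is unresolved, so it can sit in front of a commit. It does nothing for wrong ports or wrong backend URLs, which only show up when you test the route.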

The actual fix

The most useful thing I’ve done is check Traefik’s logs immediately after any config change:

docker logs --since 30s traefik 2>&1 | grep -iE 'error|warn|invalid'

If something is wrong with a service definition, it usually shows up there within seconds of the reload. Not always — but often enough that it’s worth the 10 seconds.
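If you want the same check to be scriptable, one option is to wrap it so it returns a failure status when anything matches. This is a minimal sketch. The function names are made up for illustration, the container name traefik and the 30s window are the ones from the one-liner above, and the patterns are the same loose keywords rather than an exhaustive list of what Traefik can log.

```shell
# Filter log text on stdin for likely problems (same keywords as the one-liner).
traefik_log_problems() {
  grep -iE 'error|warn|invalid'
}

# Hypothetical wrapper: print matching lines and return 1 if there were any.
# $1 = container name (default: traefik), $2 = time window (default: 30s)
check_traefik_logs() {
  container="${1:-traefik}"
  window="${2:-30s}"
  if docker logs --since "$window" "$container" 2>&1 | traefik_log_problems; then
    echo "Traefik logged problems in the last $window" >&2
    return 1
  fi
  echo "No errors or warnings in the last $window"
}

# Usage, right after saving the dynamic config:
#   check_traefik_logs
```

A clean result only means nothing matched in that window. It says nothing about whether the backend behind the route is the right one.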

The other thing that helps: keeping the dynamic config structured consistently. Every service block in my hercules.yml follows the same pattern. Routers at the top, services below, middlewares at the bottom. When something breaks, I can scan the file quickly because I know exactly where to look.

Hot-reload is still good, actually

I’m not arguing against dynamic config. The ability to add a new route without touching the Traefik process is genuinely useful, especially when you’re adding a service at midnight and don’t want to risk a restart affecting other things that are working.

The lesson is just: fast feedback loops (hot-reload) need fast verification loops. Edit, save, check logs, test the URL. That’s the whole cycle. Don’t skip the last two steps because the first two felt easy.
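The "test the URL" step can be a single command. This is a small sketch assuming curl is installed. probe_url and classify_status are hypothetical names, and the URL in the usage line is a placeholder rather than one of my real hosts.

```shell
# Map an HTTP status code to a rough verdict.
# 404 and 502 are the two symptoms a bad route tends to produce.
classify_status() {
  case "$1" in
    2??|3??) echo ok ;;
    404|502) echo broken ;;
    *)       echo unexpected ;;
  esac
}

# Request the URL, discard the body, and report the status code with a verdict.
# curl prints 000 as the code when the connection itself fails.
probe_url() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
  verdict=$(classify_status "$code")
  echo "$1 -> $code ($verdict)"
  [ "$verdict" = ok ]
}

# Usage:
#   probe_url https://service.example.com/
```

A route behind an auth middleware may legitimately answer 401 or 403, which this sketch reports as unexpected rather than broken.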

Commit when it works. Push before you close the terminal. I’ve lost config changes to this exact mistake and had to reconstruct a route from memory at an inconvenient hour.
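A cheap guard against that mistake is asking git whether anything is still uncommitted before logging out. This is a sketch assuming the config directory is a git repository. The function names are hypothetical and the path in the usage line is a guess at where the repository root might be. It does not check for commits that exist locally but have not been pushed.

```shell
# Read `git status --porcelain` output on stdin; succeed only if it is empty.
porcelain_is_clean() {
  ! grep -q .
}

# Return 0 if the repository at $1 has no modified or untracked files,
# 1 if it has some, 2 if git could not read the repository at all.
config_is_committed() {
  status=$(git -C "$1" status --porcelain) || return 2
  printf '%s' "$status" | porcelain_is_clean
}

# Usage (directory is an assumption):
#   config_is_committed /software/hera/traefik || echo "uncommitted Traefik config"
```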

The configs are fine now. Probably. I’ll find out next time I add something. 🐾