
June 30, 2026 · Sebastian

Holding a 100 ms SLA while talking to physical chargers

#nestjs #postgres #architecture #distributed-systems

A bank of EV chargers on one circuit shares a single amperage budget. A parking level wired to a 200 A breaker can't let every stall pull full power at once, so operators write rules — cap this charger, reserve that one, pause a partner's profile, let safety always win — and the control plane has to turn those rules into per-charger current limits and push them to the hardware, fast.

"Fast" was the whole project. The target: a rule write reaches the chargers in under 100 ms. Here's where that goal pushed back.

You can't own latency you don't control

The first real decision was the most important one, and it wasn't code. A rule change eventually becomes an OCPP message to a physical charger over a flaky field network. That last hop is wildly variable and entirely outside our reach. If the SLA included it, we'd be promising a number we could never hold.

So we drew the line at dispatch, not acknowledgement:

The 100 ms SLA is measured from rule write to the moment the downstream request is handed to the HTTP client — not when the charger acks. Everything past dispatch is explicitly out of scope.

That single boundary decision shaped the entire architecture. Once "done" means "request is on the wire," you're free to make everything upstream of it deliberately decoupled from the hardware.
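To make the boundary concrete, here is a minimal sketch of how the clock could be started at rule write and stopped at dispatch. `DispatchTimer`, its method names, and the injected `now` function are illustrative assumptions, not the production instrumentation.

```typescript
// Hypothetical helper: measures rule write -> dispatch, never the charger ack.
type Clock = () => number; // milliseconds, e.g. () => performance.now()

class DispatchTimer {
  private readonly startedAt = new Map<string, number>();
  private readonly now: Clock;
  private readonly budgetMs: number;

  constructor(now: Clock, budgetMs = 100) {
    this.now = now;
    this.budgetMs = budgetMs;
  }

  // Call when the rule write is accepted.
  ruleWritten(changeId: string): void {
    this.startedAt.set(changeId, this.now());
  }

  // Call immediately before the request is handed to the HTTP client.
  dispatched(changeId: string): { elapsedMs: number; withinSla: boolean } | undefined {
    const start = this.startedAt.get(changeId);
    if (start === undefined) return undefined; // unknown or already measured
    this.startedAt.delete(changeId);
    const elapsedMs = this.now() - start;
    return { elapsedMs, withinSla: elapsedMs < this.budgetMs };
  }
}
```

Nothing in this sketch waits on the charger: the measurement ends before the network call starts, which is the whole point of the boundary.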

Split the math from the hardware

The work divides cleanly into two jobs with opposite personalities:

  • Resolving is pure computation. Given the current rules and which chargers exist, what amperage should each charger get? Read-heavy, safe to recompute as often as you like, never touches the network.
  • Reconciling is all I/O. Take those target numbers and actually call the gateway. Write-heavy, slow, fallible.

We kept them apart. The resolver is a pure function — inputs in, a Map out, the clock passed as an argument so tests can pin time. No database, no HTTP, nothing to mock. That made the trickiest business logic (priority bands, even-split redistribution, time-windowed rules) the easy part to test.
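A minimal sketch of what a resolver with that shape can look like. The `Charger` and `Rule` types, the field names, and the water-filling split are assumptions for illustration; priority bands are left out to keep it short.

```typescript
interface Charger { id: string; maxAmperes: number; }

interface Rule {
  chargerId: string;
  kind: "cap" | "pause";
  amperes?: number;       // only meaningful for "cap"
  activeFromMs?: number;  // inclusive, epoch milliseconds
  activeUntilMs?: number; // exclusive, epoch milliseconds
}

function isActive(rule: Rule, now: Date): boolean {
  const t = now.getTime();
  if (rule.activeFromMs !== undefined && t < rule.activeFromMs) return false;
  if (rule.activeUntilMs !== undefined && t >= rule.activeUntilMs) return false;
  return true;
}

// Pure: no database, no HTTP, and the clock arrives as an argument.
function resolve(
  chargers: Charger[],
  rules: Rule[],
  budgetAmperes: number,
  now: Date,
): Map<string, number> {
  // 1. Each charger's ceiling is its hardware max, lowered by active rules.
  const ceiling = new Map<string, number>();
  for (const c of chargers) ceiling.set(c.id, c.maxAmperes);
  for (const r of rules) {
    if (!isActive(r, now)) continue;
    const current = ceiling.get(r.chargerId);
    if (current === undefined) continue; // rule targets an unknown charger
    const limit = r.kind === "pause" ? 0 : (r.amperes ?? current);
    ceiling.set(r.chargerId, Math.min(current, limit));
  }

  // 2. Even split with redistribution: visit the tightest ceilings first so
  //    whatever they cannot use flows to the chargers that can.
  const sorted = [...ceiling.entries()].sort((a, b) => a[1] - b[1]);
  const result = new Map<string, number>();
  let remaining = budgetAmperes;
  sorted.forEach(([id, cap], i) => {
    const share = remaining / (sorted.length - i);
    const grant = Math.min(cap, share);
    result.set(id, grant);
    remaining -= grant;
  });
  return result;
}
```

Because `now` is a parameter, a test for a time-windowed rule is just two calls with two different dates.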

The reconciler borrows the Kubernetes controller pattern: one tiny async worker per charger, each watching its own desired-vs-applied diff. A bad charger stalls its own worker and nobody else's.

Collapsing signals is a feature, not a bug

Per-charger workers gave us something we didn't fully appreciate until we watched it run. Each reconciler holds a single signaled flag and a promise it awaits. Fire three rule changes at one charger inside 50 ms and the worker still makes one downstream call — to the latest value.

That matches hardware reality. There's no point telling a charger to go to 32 A, then 10 A, then 16 A in a tight burst; it should just go to 16 A. The collapsing falls out of the design for free instead of needing a debounce layer bolted on.
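Here is a minimal sketch of a worker with that behaviour: one flag, one promise to park on, and a loop that only ever reads the latest desired value. The class name, the `Dispatch` signature, and the swallow-and-wait failure policy are assumptions for illustration.

```typescript
type Dispatch = (chargerId: string, amperes: number) => Promise<void>;

class ChargerReconciler {
  private readonly chargerId: string;
  private readonly dispatch: Dispatch;
  private desired: number | undefined;
  private applied: number | undefined;
  private signaled = false;
  private stopped = false;
  private wake: (() => void) | null = null;

  constructor(chargerId: string, dispatch: Dispatch) {
    this.chargerId = chargerId;
    this.dispatch = dispatch;
  }

  // Overwrites the target; older values that were never sent are dropped.
  signal(amperes: number): void {
    this.desired = amperes;
    this.signaled = true;
    this.wake?.();
  }

  stop(): void {
    this.stopped = true;
    this.wake?.();
  }

  async run(): Promise<void> {
    while (!this.stopped) {
      if (!this.signaled) {
        // Park until signal() or stop() resolves this promise.
        await new Promise<void>((resolve) => { this.wake = resolve; });
        this.wake = null;
        continue;
      }
      this.signaled = false;
      const target = this.desired;
      if (target === undefined || target === this.applied) continue; // no diff
      try {
        await this.dispatch(this.chargerId, target);
        this.applied = target;
      } catch {
        // Assumed policy: keep `applied` unchanged and wait for the next
        // signal. A failing charger only ever blocks its own loop.
      }
    }
  }
}
```

Signals that arrive while the worker is parked collapse into one call. Signals that arrive while a call is in flight collapse into a single follow-up call to whatever the latest value is.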

When your worker model mirrors the physical constraint, the optimization stops being something you implement and starts being something you can't avoid.

The timers we wanted to write, and didn't

Plenty of rules are time-bound: a scheduled cap that runs 18:00–22:00, a rule that expires at midnight. The obvious move is setTimeout.

It's also a trap. Process restarts vaporize in-memory timers, and a control plane that silently forgets to lift a cap after a deploy is worse than useless. So the source of wall-clock truth is Postgres, not the event loop. Every schedule boundary is a durable row; a sweeper polls for due rows in a transaction with FOR UPDATE SKIP LOCKED, emits an event for each, and deletes them in the same transaction.
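A minimal sketch of one sweeper pass. The `schedule_boundaries` table, its columns, and the `Queryable` interface (shaped like a node-postgres client's `query(text, values)`) are assumptions for illustration, not the real schema.

```typescript
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

// One polling pass: lock due rows, emit one event per row, delete, commit.
async function sweepDueBoundaries(
  db: Queryable,
  now: Date,
  emit: (chargerId: string) => void,
): Promise<number> {
  await db.query("BEGIN");
  try {
    // SKIP LOCKED lets a second sweeper instance pass over rows this
    // transaction already holds instead of blocking on them.
    const due = await db.query(
      `SELECT id, charger_id
         FROM schedule_boundaries
        WHERE due_at <= $1
        ORDER BY due_at
          FOR UPDATE SKIP LOCKED`,
      [now],
    );
    // Emitting before COMMIT means a failed commit re-emits on the next
    // pass: at-least-once, which is fine when the reconcile is idempotent.
    for (const row of due.rows) emit(String(row["charger_id"]));
    if (due.rows.length > 0) {
      await db.query(
        "DELETE FROM schedule_boundaries WHERE id = ANY($1)",
        [due.rows.map((row) => row["id"])],
      );
    }
    await db.query("COMMIT");
    return due.rows.length;
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}
```

Running this on an interval of about a second would give the polling behaviour described here; after a restart the rows are still in the table, so nothing is forgotten.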

The honest tradeoff is the polling interval: a time-driven reconcile can land up to ~1 s after its boundary. That is the price of durability without pulling in a managed workflow engine. Rule-driven changes (the ones the SLA covers) don't pay it; they're event-driven and land in tens of milliseconds.

A timer you can't recover after a crash isn't a feature, it's a latent incident. We traded a second of latency for never losing one.

Two subtleties Postgres handed us

A couple of things only became obvious once we leaned on the database as the backbone.

NOTIFY is transactional. We issue pg_notify inside the rule-write transaction. Postgres defers delivery until commit, so subscribers physically cannot observe a change that later rolls back. That's a correctness guarantee for free — but only if you resist the urge to move the notify "somewhere cleaner" outside the transaction.
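In sketch form, the write path looks roughly like this. The `rules` table, the `rules_changed` channel name, and the `Queryable` interface are assumptions for illustration; `pg_notify(channel, payload)` is the built-in Postgres function.

```typescript
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<{ rows: Array<Record<string, unknown>> }>;
}

interface NewRule { chargerId: string; kind: string; amperes: number | null; }

async function writeRule(db: Queryable, rule: NewRule): Promise<void> {
  await db.query("BEGIN");
  try {
    await db.query(
      "INSERT INTO rules (charger_id, kind, amperes) VALUES ($1, $2, $3)",
      [rule.chargerId, rule.kind, rule.amperes],
    );
    // Queued now, delivered only if the COMMIT below succeeds.
    await db.query("SELECT pg_notify($1, $2)", [
      "rules_changed",
      JSON.stringify({ chargerId: rule.chargerId }),
    ]);
    await db.query("COMMIT");
  } catch (err) {
    // On rollback Postgres discards the queued notification.
    await db.query("ROLLBACK");
    throw err;
  }
}
```

Moving the `pg_notify` call after `COMMIT` would still work most of the time, but it reopens the window where the write lands and the process dies before the notification is sent.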

A dropped socket is a silent gap. The LISTEN connection can die and reconnect, and any NOTIFY fired during that window is simply gone. Polling would mask it; we didn't want to poll. The fix is a backstop: on reconnect the listener emits a sentinel "reload everything" signal, and the resolver does a full recompute. Rare, cheap, and it closes the one hole event-driven systems love to pretend doesn't exist.
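A minimal sketch of that backstop, with the connection handling stripped away. `ListenerBridge`, the `Signal` shape, and the JSON payload format are assumptions for illustration; the real listener would call these two methods from its connect and notification handlers.

```typescript
type Signal =
  | { kind: "charger"; chargerId: string }
  | { kind: "reload-all" };

class ListenerBridge {
  private readonly emit: (signal: Signal) => void;
  private connectedBefore = false;

  constructor(emit: (signal: Signal) => void) {
    this.emit = emit;
  }

  // Call every time the LISTEN connection is (re)established.
  onConnected(): void {
    if (this.connectedBefore) {
      // Anything notified while the socket was down is gone for good,
      // so ask the resolver to recompute everything.
      this.emit({ kind: "reload-all" });
    }
    this.connectedBefore = true;
  }

  // Call for each NOTIFY payload received on the channel.
  onNotification(payload: string): void {
    const parsed = JSON.parse(payload) as { chargerId: string };
    this.emit({ kind: "charger", chargerId: parsed.chargerId });
  }
}
```

The first connect stays quiet in this sketch on the assumption that a cold start already does a full resolve; every later connect is treated as a possible gap.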

The boring bug that bites everyone

For the record, in case it saves someone an afternoon: with Postgres, TypeORM returns numeric (decimal) columns as JavaScript strings, not numbers, because the driver won't silently round them into floats.

// maxAmperes comes back as "32.00", a string
const total = chargers.reduce((sum, c) => sum + c.maxAmperes, 0);
// => "032.0016.00..." (string concatenation, not addition)

No type error, no crash — just a budget calculation that's quietly, catastrophically wrong. We convert at the boundary, every time, on purpose.
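One way to do that conversion, as a minimal sketch. `toAmperes` and `RawCharger` are hypothetical names, and failing loudly on garbage is an assumed policy.

```typescript
// Shape of a row as it actually arrives: numeric columns are strings.
interface RawCharger { id: string; maxAmperes: string | number; }

function toAmperes(raw: string | number): number {
  const value =
    typeof raw === "number" ? raw : raw.trim() === "" ? Number.NaN : Number(raw);
  if (!Number.isFinite(value)) {
    throw new Error(`not a numeric amperage: ${JSON.stringify(raw)}`);
  }
  return value;
}

function totalMaxAmperes(chargers: RawCharger[]): number {
  return chargers.reduce((sum, c) => sum + toAmperes(c.maxAmperes), 0);
}

const rows: RawCharger[] = [
  { id: "a", maxAmperes: "32.00" },
  { id: "b", maxAmperes: "16.00" },
];
console.log(totalMaxAmperes(rows)); // 48
```

TypeORM's column `transformer` option is one possible place to hang a converter like this, so the string never reaches business logic at all.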

What it actually does

End to end, against a freshly seeded database, the measured dispatch latency from rule write to gateway call landed at 19–33 ms across every operation we tested — add a cap, pause, resume, exclude a charger, patch a value. A 20-charger fleet converges from cold start in 21 ms wall-clock, first request to last. The remaining ~70 ms of headroom is deliberate: it absorbs Postgres hiccups, notify variance, and heavy resolver passes without ever threatening the SLA.

The lesson we keep relearning: decide where your responsibility ends before you write a line of code. Drawing the SLA at dispatch, and treating durable Postgres state as the source of truth instead of process memory, made everything after it simpler than it had any right to be.