Pilot vs. sim: where the 30% throughput surprise comes from

There is a number that comes up in conversations with people running their first AMR pilot: roughly 30 percent. It is the size of the gap, in either direction, between what their warehouse sim said throughput would be and what their pilot actually delivered. Sometimes the pilot is 30 percent worse. Less often, but it does happen, the pilot is 30 percent better. Either gap is uncomfortable. We have spent enough time looking at where that gap comes from that we can describe it without listing twenty possibilities. There are five that account for most of it.

This post walks through those five. WareMax was designed against the things this list reveals; we will note where in each section.

1. Service-time distribution, not service-time mean

The single most common cause of sim-vs-pilot divergence is using a mean service time at the pick station instead of a distribution. A vendor demo of a pod-to-person station shows one item every six seconds, and the spec says “10 seconds per pick.” So the sim uses 10. The pilot reports 14. People are confused.

Real pick service times are lognormal, with the long tail of “operator paused, item didn’t scan, replenishment was needed, the SKU was in a sub-bin.” The lognormal mean is higher than the median by an amount that depends on the variance, and the SLA-relevant tail is much heavier than an exponential approximation. WareMax models station service time as lognormal by default (distribution: lognormal, base: 12.0, per_item: 3.0 — the base is the per-pod overhead, the per_item is the per-line increment, both before the lognormal noise is applied). Calibrating these two parameters against pilot data closes a noticeable chunk of the gap on day one.

The asymmetric version of this is also worth knowing. If your sim used a too-wide distribution — say, an exponential where reality is a tight lognormal — the pilot will look better than the sim, because the simulated tail was pessimistic. Either direction, the fix is the same: measure the distribution, fit it, configure it.
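To make “measure the distribution, fit it” concrete, here is a minimal sketch of a log-moment lognormal fit using only the Python standard library. The function names and the synthetic pick times are ours, not part of WareMax, and how the fitted parameters map onto base and per_item depends on how WareMax applies the noise, which this sketch does not assume.

```python
import math
import random
import statistics


def fit_lognormal(samples):
    """Fit lognormal parameters as the mean and stdev of log(samples)."""
    logs = [math.log(s) for s in samples]
    return statistics.fmean(logs), statistics.stdev(logs)


def lognormal_summary(mu, sigma, tail_q=0.95):
    """Median, mean, and an upper quantile implied by lognormal(mu, sigma)."""
    z = statistics.NormalDist().inv_cdf(tail_q)
    return {
        "median": math.exp(mu),
        "mean": math.exp(mu + sigma ** 2 / 2),  # always above the median
        "tail": math.exp(mu + sigma * z),
    }


# Synthetic stand-in for pilot pick-cycle logs (seconds); replace with real data.
rng = random.Random(42)
pick_times = [rng.lognormvariate(math.log(10.0), 0.6) for _ in range(5000)]

mu, sigma = fit_lognormal(pick_times)
summary = lognormal_summary(mu, sigma)
print(summary)
```

With a log-scale standard deviation of 0.6, the mean sits about 20 percent above the median (a factor of exp(0.18)), which is the kind of gap a single “seconds per pick” figure hides.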

2. Order shape, not order rate

The second-most-common cause: matching the order rate (orders per minute) but not the order shape. Orders are bursty (Poisson is reasonable but you should check that hour-of-day variation isn’t being averaged out), lines per order are overdispersed (negative binomial, not Poisson), and SKU popularity is Zipfian (not uniform). The last one matters most for replica contention: if 70 percent of picks hit the top 10 SKUs, and those SKUs only have one replica each, you have created a destination-contention bottleneck that does not show up under a uniform-popularity assumption.

WareMax models all three:

orders:
  arrival_process: { type: poisson, rate_per_min: 1.0 }
  lines_per_order: { type: negative_binomial, mean: 2.0 }
  sku_popularity:  { type: zipf, alpha: 1.1 }

When a pilot underperforms by 30 percent and the sim used uniform popularity, two-thirds of the gap is usually here. The fix is to fit alpha from your real SKU-velocity data. (As a rule of thumb: a sharply skewed catalog has alpha between 1.2 and 1.5; a flatter catalog is closer to 1.0. An alpha of 0 would be fully uniform.)
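One rough way to get a first estimate of alpha is a least-squares fit of log(pick count) against log(rank). The sketch below is an illustration under that assumption, with hypothetical function names and idealized counts. A log-log regression is a crude estimator (the long tail of rarely picked SKUs is noisy), so treat the result as a starting point rather than a calibrated value.

```python
import math


def fit_zipf_alpha(pick_counts):
    """Estimate alpha as the negated slope of log(count) vs. log(rank)."""
    counts = sorted((c for c in pick_counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


def top_k_share(alpha, n_skus, k):
    """Fraction of picks that hit the k most popular SKUs under Zipf(alpha)."""
    weights = [rank ** -alpha for rank in range(1, n_skus + 1)]
    return sum(weights[:k]) / sum(weights)


# Idealized per-SKU pick counts generated with alpha = 1.2; use real velocity data.
counts = [round(100000 * rank ** -1.2) for rank in range(1, 501)]
alpha = fit_zipf_alpha(counts)
print(f"fitted alpha: {alpha:.3f}")
print(f"top-10 share at fitted alpha: {top_k_share(alpha, 500, 10):.2f}")
print(f"top-10 share if uniform:      {top_k_share(0.0, 500, 10):.2f}")
```

Comparing the top-10 share at the fitted alpha against the uniform case gives a quick read on how much destination contention a uniform-popularity sim was hiding.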

3. The thing the sim said you should not be congestion-bound on, you are

Here is a failure mode that is genuinely subtle. A traffic-naive sim — one that does not model node and edge capacities, or models them but uses unrealistically high defaults — will tell you that the throughput limit is service. The pilot, with real aisle widths and real traffic, will tell you that the throughput limit is congestion. The gap looks like a dispatching problem when it is actually a topology problem.

WareMax models congestion via the wait_at_node traffic policy with per-node and per-edge capacities (node_capacity_default: 4 and edge_capacity_default: 4 are the defaults; both should be tuned). Separately, when congestion_weight is set above zero, routing becomes occupancy-weighted: robots are nudged away from currently occupied corridors. These are two distinct mechanisms, and keeping them distinct is the right modeling choice: traffic in a discrete-event simulation (DES) is about blocking (the queueing semantics of a node), not about avoidance (the routing weight). Treating them as the same thing is what leads to sims that overpromise.

A diagnostic: in WareMax, the per-task delay decomposition has a congestion bucket. If your pilot is underperforming and this bucket is small in your sim runs, your sim does not believe traffic is the bottleneck — but the pilot is telling you it is. Calibrating capacities is the next step.
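The blocking-versus-avoidance distinction is easier to see in a toy model. The sketch below is not WareMax code: blocking_waits is a hypothetical first-come, first-served model of a capacity-limited node, and edge_cost is an assumed form of an occupancy-weighted routing cost, chosen only to show that the two mechanisms are separate knobs.

```python
import heapq


def blocking_waits(arrivals, capacity, traverse_s):
    """Wait time per robot at a node that holds at most `capacity` robots.

    Robots are admitted first-come, first-served and each occupies a slot
    for `traverse_s` seconds. This is the blocking (queueing) semantics.
    """
    release_times = []  # min-heap of times at which occupied slots free up
    waits = []
    for t in sorted(arrivals):
        if len(release_times) < capacity:
            start = t  # a slot has never been used yet
        else:
            start = max(t, heapq.heappop(release_times))
        heapq.heappush(release_times, start + traverse_s)
        waits.append(start - t)
    return waits


def edge_cost(length_m, occupancy, capacity, congestion_weight):
    """Assumed occupancy-weighted routing cost (avoidance), for illustration."""
    return length_m * (1.0 + congestion_weight * occupancy / capacity)


burst = [0.0] * 6  # six robots reach the same node at once
print(blocking_waits(burst, capacity=4, traverse_s=2.0))  # [0.0, 0.0, 0.0, 0.0, 2.0, 2.0]
print(blocking_waits(burst, capacity=2, traverse_s=2.0))  # [0.0, 0.0, 2.0, 2.0, 4.0, 4.0]
print(edge_cost(10.0, occupancy=2, capacity=4, congestion_weight=1.0))  # 15.0
```

Halving the capacity changes how long robots wait no matter what the routing weight is. A sim with generous capacity defaults reports a small congestion bucket even when the routing layer is congestion-aware.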

4. The dispatching policy you tested in sim is not the one running in the pilot

This sounds tautological but it is not. WMSes ship with dispatching logic that is sometimes documented, often partially configurable, and rarely an exact match for the heuristic you tested in sim. Your sim ran nearest_robot; your pilot’s WMS runs something the integrator calls “balanced assignment,” which is some workload-balanced variant with a tunable parameter you do not control. The throughput is different, and now you are debugging an integration problem in production.

The fix is on the simulator side: ship the obvious heuristics in the same box so you can A/B test them against each other and against whatever the WMS does. WareMax ships five (nearest_robot, least_busy, round_robin, auction, workload-balanced) and an rl_agent slot. The compare CLI subcommand runs multiple policies on identical seeds and reports the difference with confidence intervals:

waremax compare scenario.yaml \
    --param policies.task_allocation=nearest_robot \
    --param policies.task_allocation=least_busy \
    --param policies.task_allocation=round_robin

If your WMS exposes telemetry that lets you reconstruct its dispatching choices, you can also implement a TaskAllocationPolicy matching the WMS’s logic (one-method Rust trait, one-arm change in policy_factory.rs) and run that in sim. The integrator may not enjoy the conversation that follows, but at least it is grounded.
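To see why the choice of heuristic matters, consider two of the policies named above applied to the same fleet state. This is an illustrative Python sketch, not WareMax’s Rust implementation: the Robot fields, the Manhattan distance metric, and the tie-breaking keys are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Robot:
    robot_id: int
    x: float
    y: float
    queued_tasks: int


def manhattan(robot, task_xy):
    """Grid distance from a robot to a task location (assumed metric)."""
    return abs(robot.x - task_xy[0]) + abs(robot.y - task_xy[1])


def nearest_robot(robots, task_xy):
    """Pick the closest robot; ties go to the lowest robot_id."""
    return min(robots, key=lambda r: (manhattan(r, task_xy), r.robot_id))


def least_busy(robots, task_xy):
    """Pick the robot with the shortest queue, then the closest, then lowest id."""
    return min(
        robots,
        key=lambda r: (r.queued_tasks, manhattan(r, task_xy), r.robot_id),
    )


fleet = [
    Robot(robot_id=1, x=0.0, y=0.0, queued_tasks=3),
    Robot(robot_id=2, x=5.0, y=5.0, queued_tasks=0),
    Robot(robot_id=3, x=0.0, y=2.0, queued_tasks=3),
]
task = (0.0, 1.0)
print(nearest_robot(fleet, task).robot_id)  # 1 (tied with robot 3 on distance)
print(least_busy(fleet, task).robot_id)     # 2 (idle, but far away)
```

The same task goes to a different robot under each policy, so a sim that ran one heuristic says little about a pilot whose WMS runs the other.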

5. The sim was not actually reproducible, so your “validation” was noise

The honest one. If a previous run of your sim was different from the current run because the underlying RNG, hash-map iteration order, or tie-breaking is non-canonical, then any comparison you draw against the pilot is also noisy. You cannot tell whether the gap is a calibration problem or a determinism problem, because you cannot rerun the sim exactly.

This is what WareMax’s determinism guarantee is for. ChaCha8 RNG seeded from a u64. Id-based tie-breaking throughout. A strict ping-pong handshake between the agent and the sim in the RL loop. Tests in waremax-rl/tests/determinism.rs that fail on any drift. As part of getting there, the project fixed several HashMap-iteration-dependent bugs that had been silently making “seeded” results irreproducible. Reproducibility is the only way to take a pilot-vs-sim gap seriously: the sim has to be the same sim, every time, before the gap can be attributed to anything else.
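The shape of such a drift test is simple enough to sketch. The example below is a toy in Python, not the contents of waremax-rl/tests/determinism.rs: it uses Python’s seeded random.Random (a Mersenne Twister, not ChaCha8), a made-up three-robot sim, and an id-based tie-break, then compares a digest of the full event trace across runs.

```python
import hashlib
import random


def run_toy_sim(seed, n_tasks=200):
    """Assign tasks to whichever robot frees up first; ties go to the lowest id."""
    rng = random.Random(seed)
    # robot_id -> time at which the robot becomes free (deliberately unsorted)
    free_at = {3: 0.0, 1: 0.0, 2: 0.0}
    trace = []
    for task_id in range(n_tasks):
        duration = rng.lognormvariate(2.3, 0.5)
        # Canonical choice: including the id in the key means the result
        # never depends on container iteration order.
        rid = min(free_at, key=lambda r: (free_at[r], r))
        free_at[rid] += duration
        trace.append((task_id, rid, round(free_at[rid], 9)))
    return trace


def trace_digest(trace):
    """Hash the whole trace so that any drift, however small, changes the digest."""
    return hashlib.sha256(repr(trace).encode()).hexdigest()


first = trace_digest(run_toy_sim(seed=7))
second = trace_digest(run_toy_sim(seed=7))
print("reproducible:", first == second)
```

Hashing the event trace rather than a summary metric is the design choice that matters: two runs can agree on mean throughput while disagreeing on which robot took which task.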

What the workflow looks like with WareMax

When a pilot underperforms the sim and the team wants to close the gap, the WareMax workflow is roughly:

1. Fit the service-time distribution from pilot pick-cycle logs. Update service_time_s in the scenario YAML.
2. Fit order shape from pilot order data. Update arrival_process, lines_per_order, and sku_popularity. The Zipf alpha is usually the most impactful single parameter.
3. Calibrate node and edge capacities from observed traffic patterns. Look at the congestion bucket in the per-task delay decomposition.
4. Match the dispatching policy: either pick the heuristic that most closely matches the WMS, or implement the WMS’s policy explicitly.
5. Re-run the full multi-seed grid with waremax-testing’s A/B machinery and Welch’s t-test. Confirm the sim-vs-pilot gap is now within the confidence interval.
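For readers who have not used it, Welch’s t-test compares two sample means without assuming equal variances, which fits a comparison between multi-seed sim runs and pilot measurements. The sketch below computes the statistic and the Welch-Satterthwaite degrees of freedom with the standard library. It is not waremax-testing’s implementation, and the throughput numbers are placeholders made up for illustration.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Placeholder values: picks per hour from six sim seeds and five pilot shifts.
sim_per_seed = [412.0, 405.0, 418.0, 409.0, 415.0, 411.0]
pilot_per_shift = [398.0, 421.0, 407.0, 389.0, 414.0]

t, df = welch_t(sim_per_seed, pilot_per_shift)
print(f"t = {t:.3f}, df = {df:.1f}")
```

Compare |t| against the critical value of a t distribution with df degrees of freedom (from a table or a statistics library) to decide whether the remaining gap is distinguishable from seed-to-seed noise.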

You will not always close the gap to zero. Sometimes the gap is structural — the pilot warehouse has a constraint the sim cannot represent (a charging-station layout limit, an operator-skill distribution, a SKU receiving cadence). When that happens, WareMax will at least let you bound how much of the gap is not explained by the modeled effects, which tells you where to look next.

That — the ability to bound the unexplained gap — is the practical payoff of the rest of this post. It is what high-fidelity, reproducible, narrowly-scoped simulation buys you. Not certainty about the pilot, but a much shorter list of explanations for the difference between sim and pilot, with the most-likely ones already ruled in or out.