Interactive demo · belief MDP · subsidised RCTs

Optimizing Social Utility in Sequential Experiments

A regulator (the principal) subsidises a fraction ε of a developer's (the agent's) trial costs. The agent runs randomised trials sequentially, updating a Beta(α,β) belief, and stops once the principal's e-value test rejects H₀. Drag the subsidy and watch the agent's optimal policy — and the resulting social utility — respond.

Subsidy level ε: the fraction of the agent's total cost the principal covers upon approval (slider, initially 0.000, i.e. 0%)
Readouts: first trial size n₀ · approval prob. · opt-out prob. · agent utility · social utility · optimal ε★

① Optimal policy in belief space

For each belief (α,β), the agent's optimal trial size n. Clay = opt out (abandon); the sage frontier = enough evidence to approve.

Legend: opt out (n=0) · H₀ rejected · trial size n from 1 to nmax

② Optimal value function Vε(α,β)

The agent's expected profit ($M) from each belief under the optimal policy. Brighter = more valuable. Raising ε lifts the whole surface.

Legend: H₀ rejected · value from $0 to max

③ Social utility vs. subsidy

The principal's objective Ūˢ(ε) (solid) and the agent's approval probability (dashed). The vertical mark is your current ε; ★ is the optimum.

Legend: social utility ($M) · approval probability

④ Agent utility vs. subsidy

The agent's anticipated utility Ūᴬ as a function of ε is piecewise-linear and convex (Prop. 8). Each kink marks a switch in the agent's optimal policy.

Legend: agent utility ($M) · dotted = policy-switch points (partition 𝒫)
Model parameters  ·  change & recompute
Larger nmax or T ⇒ slower precompute (the cost grows roughly as nmax⁴·T).

What you're looking at

The agent holds a Beta(α,β) belief about efficacy θ*. Running a trial of size n draws X ~ Beta-Binomial and updates the belief to (α+X, β+n−X).
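The belief update above can be sketched in a few lines of Python. This is a minimal illustration rather than the demo's own code: the function names and the example numbers (a Beta(2,3) prior, n = 10, X = 7) are assumptions chosen for the example.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """log of the Beta function B(a, b), via log-gamma for numerical stability."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(x, n, alpha, beta):
    """P(X = x) when theta ~ Beta(alpha, beta) and X | theta ~ Binomial(n, theta)."""
    return comb(n, x) * exp(log_beta(alpha + x, beta + n - x) - log_beta(alpha, beta))


def update_belief(alpha, beta, n, x):
    """Posterior parameters after observing x successes in a trial of size n."""
    return alpha + x, beta + n - x


# Illustrative numbers only (not taken from the demo).
alpha, beta, n = 2.0, 3.0, 10
pmf = [beta_binomial_pmf(x, n, alpha, beta) for x in range(n + 1)]
total = sum(pmf)                               # a valid pmf sums to 1
posterior = update_belief(alpha, beta, n, 7)   # 7 successes, 3 failures
```

Because the posterior is again a Beta distribution, the pair (α,β) is a sufficient summary of everything the agent has seen, which is what lets the panels plot policies and values directly in belief space.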

The principal multiplies e-values into a test process; approval (rejecting H₀) happens once f(α,β) ≥ 1/κ, a linear frontier in belief space — the sage edge in panels ① & ②.
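One way to see why the frontier is linear: assume, purely for illustration, that each observation's e-value is the likelihood ratio of a fixed alternative θ₁ against the null value θ₀. The demo's actual f is not specified in this text, so `theta0`, `theta1`, the prior (`alpha0`, `beta0`) and `kappa` below are hypothetical. Under that assumption the log of the product is linear in the success and failure counts, so the threshold 1/κ cuts belief space along a straight line.

```python
from math import log


def log_e_process(alpha, beta, alpha0, beta0, theta0, theta1):
    """Log of the product of per-observation likelihood-ratio e-values.

    Successes and failures are recovered from the belief (alpha, beta)
    relative to the prior (alpha0, beta0). Assumed construction, not the demo's.
    """
    successes = alpha - alpha0
    failures = beta - beta0
    return (successes * log(theta1 / theta0)
            + failures * log((1 - theta1) / (1 - theta0)))


def rejects_h0(alpha, beta, *, alpha0=1.0, beta0=1.0,
               theta0=0.5, theta1=0.7, kappa=0.05):
    """Approve once the e-process reaches 1/kappa (compared in log space)."""
    log_e = log_e_process(alpha, beta, alpha0, beta0, theta0, theta1)
    return log_e >= log(1.0 / kappa)


at_prior = rejects_h0(1.0, 1.0)          # no data yet: e-process equals 1
after_streak = rejects_h0(21.0, 1.0)     # 20 successes, 0 failures
```

Stopping the first time such a process crosses 1/κ keeps the type-I error at or below κ (Ville's inequality), which is why the agent may stop as soon as the frontier is reached.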

The agent solves a finite-horizon belief MDP by backward induction (Alg. 2/3), choosing each trial size to maximise expected profit; it opts out when continuing isn't worth the cost.
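The backward induction can be sketched as a memoised recursion over (stage, α, β). This is a toy version under stated assumptions, not Alg. 2/3 themselves: the prior, horizon `T`, `N_MAX`, per-participant `COST`, approval `REWARD`, and the likelihood-ratio approval rule are all hypothetical, and the subsidy is modelled as a refund of ε times the cost spent so far, paid only on approval.

```python
from functools import lru_cache
from math import comb, exp, lgamma, log

A0, B0 = 1, 1                  # prior Beta(1, 1) (assumed)
T, N_MAX = 3, 6                # number of stages, max trial size (assumed)
COST, REWARD = 1.0, 40.0       # cost per participant, payoff on approval, $M (assumed)
THETA0, THETA1, KAPPA = 0.5, 0.75, 0.05   # assumed test parameters


def bb_pmf(x, n, a, b):
    """Beta-Binomial probability of x successes in n draws under Beta(a, b)."""
    def lb(p, q):
        return lgamma(p) + lgamma(q) - lgamma(p + q)
    return comb(n, x) * exp(lb(a + x, b + n - x) - lb(a, b))


def approved(a, b):
    """Assumed approval rule: likelihood-ratio e-process reaches 1/KAPPA."""
    s, f = a - A0, b - B0
    log_e = s * log(THETA1 / THETA0) + f * log((1 - THETA1) / (1 - THETA0))
    return log_e >= log(1.0 / KAPPA)


def solve(eps):
    """Return (expected profit, first trial size) at the prior for subsidy eps."""

    @lru_cache(maxsize=None)
    def value(t, a, b):
        if approved(a, b):
            spent = COST * ((a - A0) + (b - B0))
            return REWARD + eps * spent, 0      # payoff plus refund on approval
        if t == T:
            return 0.0, 0                       # horizon reached without approval
        best_v, best_n = 0.0, 0                 # n = 0 means opting out
        for n in range(1, N_MAX + 1):
            ev = -COST * n + sum(
                bb_pmf(x, n, a, b) * value(t + 1, a + x, b + n - x)[0]
                for x in range(n + 1)
            )
            if ev > best_v:
                best_v, best_n = ev, n
        return best_v, best_n

    return value(0, A0, B0)


v_none, n_none = solve(0.0)    # no subsidy
v_full, n_full = solve(1.0)    # full refund on approval
```

In this sketch, opting out is always available at value 0, so the value function is never negative, and a larger ε can only raise it since the refund is non-negative under every policy.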

The subsidy story

At ε = 0 the moonshot is too risky: the agent abandons the trial at the start, so society gets nothing. As ε rises, the de-risked agent begins experimenting — the clay opt-out region in panel ① recedes.

But every subsidy dollar is a transfer the principal pays only on approval, so Ūˢ(ε) trades off more approvals against higher cost. The interior peak ε★ (panel ③) is what Algorithm 1 finds exactly via divide-and-conquer.

Because the agent's utility is piecewise-linear and convex in ε (panel ④, Prop. 8), its optimal policy is fixed on each interval of the partition 𝒫. The social utility is therefore linear on each policy interval, so its maximum is attained at a partition endpoint.
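The endpoint argument can be checked on a toy example. The sketch below is a brute-force search, not Algorithm 1's divide-and-conquer. It assumes three hypothetical policies whose agent utility is a + b·ε and whose social utility is s − b·ε (the subsidy is a transfer of b·ε from principal to agent), and it assumes ties at a kink are broken in the principal's favour. All numbers are invented for illustration.

```python
from itertools import combinations

# Hypothetical policies: a = agent utility at eps = 0, b = expected subsidised
# cost on approval, s = social utility before the transfer. Invented numbers.
POLICIES = {
    "opt out":     {"a": 0.0,  "b": 0.0, "s": 0.0},
    "small trial": {"a": -1.0, "b": 4.0, "s": 6.0},
    "large trial": {"a": -4.0, "b": 9.0, "s": 12.0},
}


def agent_u(p, eps):
    return p["a"] + p["b"] * eps


def social_u(p, eps):
    return p["s"] - p["b"] * eps


def candidate_cuts(policies):
    """Points in [0, 1] where two policies give the agent equal utility.

    A superset of the true kinks; extra cuts are harmless because the
    agent's best policy is simply the same on both sides of them.
    """
    cuts = {0.0, 1.0}
    for p, q in combinations(policies.values(), 2):
        if p["b"] != q["b"]:
            eps = (q["a"] - p["a"]) / (p["b"] - q["b"])
            if 0.0 < eps < 1.0:
                cuts.add(eps)
    return sorted(cuts)


def best_subsidy(policies):
    """Evaluate social utility only at interval endpoints; return the best."""
    cuts = candidate_cuts(policies)
    best = (float("-inf"), None, None)
    for lo, hi in zip(cuts, cuts[1:]):
        mid = 0.5 * (lo + hi)
        # The agent's best response is constant on the open interval (lo, hi).
        name = max(policies, key=lambda k: agent_u(policies[k], mid))
        for eps in (lo, hi):
            cand = (social_u(policies[name], eps), eps, name)
            if cand[0] > best[0]:
                best = cand
    return best


value_star, eps_star, policy_star = best_subsidy(POLICIES)
```

With these invented numbers the optimum lands at roughly ε = 0.6, the kink where the agent switches from the small to the large trial: just enough subsidy to trigger the switch, and no more.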