Exploit the arm that looks best, or explore one that might be better — the choice every learner faces.

A row of slot machines, each with hidden odds. Every pull is both a payout and a data point — the reward tells you a little more about the arm you just tried, and nothing at all about the ones you didn’t. Bet only on the current favourite and you may never discover a better machine; explore too much and you squander pulls on the losers you’ve already ruled out. The art is the balance, and good rules — ones with the right kind of curiosity — make the cost of learning it vanish over time.


switch Policy and watch the regret curve bend: Thompson & UCB flatten toward the silver ln t while Greedy and Random stay straight · raise K to make the search harder · toggle Reveal to see the true rates

The maths · Thompson 1933 · Robbins 1952 · Auer 2002

Regret · cost of learning

R_T = T·μ* − 𝔼[ Σ_t μ_a(t) ]

What you lose by not knowing the best arm from the start — the gap between always playing the true best (mean μ*) and what your policy actually earned. A good policy makes this grow like ln T, not T.
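The formula above can be checked by hand on a toy case. The sketch below is illustrative only: the name pseudo_regret and the win rates are hypothetical, and it scores one fixed sequence of pulls rather than taking the expectation over a policy's own randomness.

```python
def pseudo_regret(true_means, pulls):
    """Return T*mu_star - sum_t mu_{a(t)} for one fixed sequence of pulls."""
    mu_star = max(true_means)                    # mean of the true best arm
    earned = sum(true_means[a] for a in pulls)   # expected payout of the arms actually played
    return len(pulls) * mu_star - earned


true_means = [0.2, 0.5, 0.7]        # hypothetical hidden win rates
always_worst = [0] * 100            # never learns: every pull costs the full gap
mostly_best = [0, 1] + [2] * 98     # two exploratory pulls, then the best arm

print(pseudo_regret(true_means, always_worst))
print(pseudo_regret(true_means, mostly_best))
```

The first sequence loses about 50 over 100 pulls and keeps losing at the same rate; the second loses about 0.7 in total and then stops paying, which is the difference between regret that grows like T and regret that flattens.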

Explore vs. exploit · ε-greedy

a = argmax_a Q(s,a) w.p. 1−ε, else random

Most of the time take the best action you know; a fraction ε of the time gamble on a random one — because the only way to find a better path is to risk a worse one. Too little ε and the agent locks onto the first route it finds; too much and it never commits.
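A minimal sketch of the ε-greedy rule on a one-state bandit, assuming Bernoulli (win/lose) arms. The function name, the seed and the running-mean update for Q are choices made for this illustration, not details taken from the page.

```python
import random


def epsilon_greedy(true_means, epsilon=0.10, steps=2000, seed=0):
    """Play a Bernoulli bandit with epsilon-greedy; return (pull counts, estimates)."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k        # n_a: pulls per arm
    values = [0.0] * k      # Q_a: running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                         # explore: uniform random arm
        else:
            arm = max(range(k), key=lambda a: values[a])   # exploit: ties go to the lowest index
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values


counts, values = epsilon_greedy([0.2, 0.5, 0.8], epsilon=0.10)
print(counts, [round(v, 2) for v in values])
```

Setting epsilon=0.0 in this sketch shows the lock-in failure directly: every estimate starts at zero, ties go to the first arm, and no other arm is ever tried.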

Upper confidence bound · optimism

a_t = argmax_a [ Q_a + c√(ln t / n_a) ]

Optimism in the face of uncertainty: judge each arm by the top of its confidence interval, not its mean. A rarely-pulled arm gets a big bonus, so it is tried; a well-known poor arm is dropped. No randomness, and provably logarithmic regret.
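The rule can be sketched in a few lines, again assuming Bernoulli arms. Two details here are assumptions rather than anything stated above: each arm is pulled once first so that n_a is never zero, and c defaults to √2, which gives the bonus used in the UCB1 variant of Auer et al. (2002). The seed only draws rewards; the arm choice itself is deterministic.

```python
import math
import random


def ucb1(true_means, c=math.sqrt(2), steps=2000, seed=0):
    """Play a Bernoulli bandit with an upper-confidence-bound rule."""
    rng = random.Random(seed)   # used for reward draws only
    k = len(true_means)
    counts = [0] * k            # n_a
    values = [0.0] * k          # Q_a
    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1         # initial sweep: try every arm once
        else:
            # score = estimate + bonus that shrinks as the arm is pulled more
            arm = max(
                range(k),
                key=lambda a: values[a] + c * math.sqrt(math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


counts, values = ucb1([0.2, 0.5, 0.8])
print(counts)
```

Because ln t keeps growing, a neglected arm's bonus slowly rises until the arm is retried, so no arm is abandoned forever, but poor arms are revisited ever more rarely.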

Thompson sampling · posterior sampling

θ_a ∼ Beta(α_a, β_a); play argmax_a θ_a

Keep a Bayesian belief (a Beta posterior) about each arm’s win rate, draw one sample from each, and play the arm whose sample is highest. Uncertainty makes wide posteriors likely to win a draw — exploration for free, from probability matching alone.
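A short sketch of the sampling loop for win/lose arms. It assumes a uniform Beta(1, 1) prior on every arm, which the text above does not specify, and the function name and seed are illustrative.

```python
import random


def thompson(true_means, steps=2000, seed=0):
    """Play a Bernoulli bandit with Thompson sampling; return the Beta parameters."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k    # 1 + wins per arm   (assumed Beta(1, 1) prior)
    beta = [1] * k     # 1 + losses per arm
    for _ in range(steps):
        # One draw from each arm's posterior; wide posteriors sometimes draw high.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        if rng.random() < true_means[arm]:
            alpha[arm] += 1    # win: posterior shifts up
        else:
            beta[arm] += 1     # loss: posterior shifts down
    return alpha, beta


alpha, beta = thompson([0.2, 0.5, 0.8])
pulls = [a + b - 2 for a, b in zip(alpha, beta)]
print(pulls)
```

There is no exploration parameter to tune in this sketch: as an arm's posterior narrows around a low rate, its draws stop winning and it falls out of play on its own.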

Every pull trades exploitation (play the best-looking arm) against exploration (learn whether another arm is secretly better). The Maze (INST·42) faces the same trade-off at every state of a full reinforcement-learning problem, of which the bandit is the degenerate one-state case. The Anneal (INST·34) solves a structurally similar trade-off with a cooling schedule instead of a belief update: explore freely while hot, exploit as it cools. Thompson sampling’s posterior draw uses the same Bayesian-belief machinery as The Lens (INST·25): a Kalman filter narrowing a continuous estimate and a Beta posterior narrowing a discrete choice wear different maths for the same idea, which is to represent uncertainty explicitly and let it shrink exactly where evidence arrives.

