A row of slot machines, each with hidden odds. Every pull is both a payout and a data point: the reward tells you a little more about the arm you just tried, and nothing at all about the ones you didn’t. Bet only on the current favourite and you may never discover a better machine; explore too much and you squander pulls on losers you’ve already ruled out. The art is the balance, and good rules, ones with the right kind of curiosity, make the average cost of learning per pull shrink toward zero over time.
Regret is what you lose by not knowing the best arm from the start: the gap between what always playing the true best arm (mean μ*) would have earned over T pulls and what your policy actually earned. A good policy makes this grow like ln T, not T.
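Written out, the definition above looks roughly like this. Only μ* and T come from the text; the symbols r_t (reward at pull t), μ_a (true mean of arm a), Δ_a (its gap to the best arm) and N_a(T) (how often arm a was pulled in T rounds) are notation assumed here for illustration, following the usual textbook convention.

```latex
R_T \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right]
    \;=\; \sum_{a} \Delta_a \,\mathbb{E}\!\left[N_a(T)\right],
\qquad \Delta_a \;=\; \mu^{*} - \mu_a .
```

The second form shows where regret comes from: every pull of a suboptimal arm a costs Δ_a on average, so keeping regret near ln T means pulling each poor arm only about ln T times.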
The ε-greedy rule: most of the time take the best action you know; a fraction ε of the time gamble on a random one, because the only way to find a better path is to risk a worse one. Too little ε and the agent locks onto the first route it finds; too much and it never commits.
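The rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes Bernoulli arms (each pull pays 1 or 0), and the function name `epsilon_greedy`, its parameters and the incremental-mean update are choices made here for the example.

```python
import random


def epsilon_greedy(true_means, epsilon=0.1, pulls=5000, seed=0):
    """Play Bernoulli arms with an epsilon-greedy rule.

    true_means is hypothetical test data: the hidden win rate of each arm.
    Returns (pull counts per arm, estimated mean per arm).
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: pick any arm at random
        else:
            # exploit: pick the best-looking arm (ties go to the lowest index)
            arm = max(range(k), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # incremental update of the running average reward for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


if __name__ == "__main__":
    print(epsilon_greedy([0.2, 0.5, 0.8], epsilon=0.1))
    # With no exploration at all, the agent never leaves the first arm it tried:
    print(epsilon_greedy([0.0, 1.0], epsilon=0.0, pulls=100))
```

The second call shows the lock-in failure in its purest form: with ε = 0 the agent tries arm 0, sees nothing better, and never learns that arm 1 always pays.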
Optimism in the face of uncertainty: judge each arm by the top of its confidence interval (its upper confidence bound, or UCB), not its mean. A rarely pulled arm gets a big bonus, so it is tried; a well-known poor arm’s bonus shrinks, so it is pulled less and less often. The choice itself involves no randomness, and regret is provably logarithmic.
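One common way to turn this idea into code is the UCB1 rule, which adds a bonus of sqrt(2 ln t / n) to each arm’s average. The text does not say which variant it has in mind, so treat the bonus formula, the Bernoulli rewards and the name `ucb1` as assumptions of this sketch.

```python
import math
import random


def ucb1(true_means, pulls=5000, seed=0):
    """Play Bernoulli arms with the UCB1 rule.

    true_means is hypothetical test data: the hidden win rate of each arm.
    The seed only affects the simulated rewards; arm selection is deterministic.
    Returns (pull counts per arm, estimated mean per arm).
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    for t in range(1, pulls + 1):
        if t <= k:
            arm = t - 1  # pull every arm once so each has an estimate
        else:
            # score = average reward + exploration bonus; the bonus is large
            # for rarely pulled arms and shrinks as an arm's count grows
            arm = max(
                range(k),
                key=lambda a: estimates[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


if __name__ == "__main__":
    print(ucb1([0.2, 0.5, 0.8]))
```

Note that a poor arm is never abandoned outright: its bonus keeps growing slowly with ln t, so it is revisited occasionally, just rarely enough to keep regret logarithmic.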
Keep a Bayesian belief (a Beta posterior) about each arm’s win rate, draw one sample from each, and play the arm whose sample is highest. An uncertain arm has a wide posterior, which now and then produces the highest draw, so it still gets tried: exploration for free, from probability matching alone.
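The sample-and-play loop described above might look like the following. It is a sketch under stated assumptions: win/lose (Bernoulli) rewards, a uniform Beta(1, 1) prior on every arm, and an illustrative function name `thompson_sampling`; none of these details are specified in the text.

```python
import random


def thompson_sampling(true_means, pulls=5000, seed=0):
    """Play Bernoulli arms with Thompson sampling.

    true_means is hypothetical test data: the hidden win rate of each arm.
    Each arm's belief is Beta(alpha, beta), starting from the uniform
    prior Beta(1, 1). Returns (alpha per arm, beta per arm).
    """
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k  # 1 + number of wins observed
    beta = [1] * k   # 1 + number of losses observed
    for _ in range(pulls):
        # draw one plausible win rate per arm from its current posterior
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        if rng.random() < true_means[arm]:
            alpha[arm] += 1  # a win sharpens belief toward higher rates
        else:
            beta[arm] += 1   # a loss sharpens belief toward lower rates
    return alpha, beta


if __name__ == "__main__":
    alpha, beta = thompson_sampling([0.2, 0.5, 0.8])
    # pulls per arm = observations added on top of the Beta(1, 1) prior
    print([a + b - 2 for a, b in zip(alpha, beta)])
```

There is no exploration parameter to tune here: as an arm accumulates evidence its posterior narrows, and a narrow posterior centred on a low win rate simply stops producing winning samples.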
Every pull trades exploitation (play the best-looking arm) against exploration (learn whether another arm is secretly better). The Maze (INST·42) faces the same trade-off at every state of a full reinforcement-learning problem, of which the bandit is the degenerate one-state case. The Anneal (INST·34) solves a structurally similar trade-off with a cooling schedule instead of a belief update: explore freely while hot, exploit as it cools. Thompson sampling’s posterior draw uses the same Bayesian-belief machinery as The Lens (INST·25): a Kalman filter narrowing a continuous estimate and a Beta posterior narrowing a belief about each arm’s win rate wear different maths for the same idea, which is to represent uncertainty explicitly and let it shrink exactly where evidence arrives.