ACCURACY AND CALIBRATION

J Robert G Williams & Richard Pettigrew

When we’re uncertain about something, we often assign probabilities to the different possibilities. The meteorologist, uncertain what weather tomorrow will bring, might say they’re 70% confident it will rain heavily, and 40% confident temperatures will exceed 15°C. The epidemiologist, uncertain of how fast a virus is spreading in a particular community, might be 20% confident that its basic reproduction number (R number) is 0.9, 30% confident it’s 1.0, and 50% confident it’s 1.1.

When is one assignment of probabilities better than another? According to the answer we explore, it's better when it's more accurate. If it does rain heavily tomorrow, the meteorologist's probability that it will rain heavily is more accurate the higher that probability is, and less accurate the lower it is; and if it doesn't rain heavily, the probability is more accurate the lower it is, and less accurate the higher it is. But these basic constraints are compatible with many ways of measuring accuracy. In our article, we give a new argument that any legitimate measure of the accuracy of probabilities is generated by what statisticians call a strictly proper scoring rule. So, first: what is a strictly proper scoring rule?

A strictly proper scoring rule measures the accuracy of individual probabilities, such as the meteorologist’s probability of 70% that it will rain heavily tomorrow, or the epidemiologist’s probability of 20% that the R number is less than one. The scoring rule takes such a probability, along with a specification of whether the possibility to which it is assigned is true or false, and returns a non-positive real number, or −∞, that measures how accurate the probability is when the possibility has the specified truth value. A scoring rule is strictly proper if it is a continuous function of the probability whose accuracy it measures, and if each probability expects itself to be most accurate. So, for instance, a probability of 70% expects any probability other than 70% to be less accurate than it expects itself to be. A measure of the accuracy of a whole assignment of probabilities is then generated from the strictly proper scoring rule by adding together the accuracy of the individual probabilities.
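To make this concrete, here is a minimal Python sketch, which is our own illustration rather than anything drawn from the article. It uses the quadratic, Brier-style rule for a single probability, on which the accuracy of probability p is −(v − p)², where v is 1 if the possibility is true and 0 if it is false. The check at the end exhibits the strict-propriety property just described: a probability of 70% expects no report other than 70% to do as well as it expects itself to do.

```python
# A minimal sketch (not from the article): the quadratic, Brier-style
# scoring rule for a single probability, and a numerical check that a
# probability expects itself to be most accurate.

def brier_accuracy(p, v):
    """Accuracy of probability p when the possibility's truth value is v (1 or 0)."""
    return -(v - p) ** 2

def expected_accuracy(report, q):
    """Expected accuracy of reporting `report`, by the lights of probability q."""
    return q * brier_accuracy(report, 1) + (1 - q) * brier_accuracy(report, 0)

q = 0.70  # the meteorologist's probability of heavy rain
reports = [i / 100 for i in range(101)]
best = max(reports, key=lambda r: expected_accuracy(r, q))
print(best)  # 0.7 -- no other report has higher expected accuracy
```

Swapping in any other value of q picks out q itself (to the resolution of the grid), which is exactly what strict propriety requires.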

Strictly proper scoring rules have many nice features, so it’d be significant if we had a good argument that they are the right measures of accuracy to use. Before we see that argument, let’s have a look at those nice features.

First, a foundational consequence: A crucial question in the foundations of statistics is why probabilities that represent confidence in different possibilities must obey certain laws or axioms. Why should some aspect of our psychology that encodes our uncertainty about the world satisfy the axioms of the probability calculus that Kolmogorov formulated? Here's a specific demand of these axioms: they say we should not be more confident that the R number of a virus is greater than 1.2 than we are that it is greater than 1.1, since it's always greater than 1.1 if it's greater than 1.2. What's the basis for this demand? Statisticians advocate Bayes's rule as the way to update our beliefs when we learn new evidence, but why is that the rule we should use? And if I'm unsure of the prevalence of a disease in a community, but I know it's either 1 in 1000 or 1 in 100, why should my probability that a randomly selected member of the community is infected lie between 0.1% and 1%?

It turns out that if we think assignments of probabilities are better or worse the more or less accurate they are, and if we measure accuracy using strictly proper scoring rules, we can explain why these laws for reasoning under uncertainty hold. To see this, pick any way of measuring accuracy that is generated by a strictly proper scoring rule. Then we have three mathematical results. First, if our probabilities don't satisfy Kolmogorov's axioms, there are alternative probabilities that do satisfy those axioms and that are guaranteed, by the lights of that measure, to be more accurate however the world turns out. This seems a good reason to satisfy those axioms! Second, our prior probabilities expect our posterior probabilities to be most accurate just in case the posteriors are obtained from our priors using Bayes's rule. A good reason to update in this way! Third, when we are uncertain of the objective probabilities, such as the true prevalence of a disease in a community, if our subjective probabilities are not our expectations of the objective probabilities, then there are alternative subjective probabilities that every possible objective probability expects to be more accurate. So this is the first nice feature: strictly proper scoring rules furnish powerful arguments in the foundations of statistics and the theory of reasoning under uncertainty.
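The first of these results can be seen in miniature with a toy Python check, again our own example using the Brier-style rule from the earlier sketch rather than the article's general proof. Credences of 0.6 in rain and 0.6 in no rain violate Kolmogorov's axioms, since they sum to 1.2; the coherent assignment of 0.5 and 0.5 is strictly more accurate whichever way the weather turns out.

```python
# A toy check of the first result (our illustration, using the Brier-style
# rule): incoherent credences are accuracy-dominated by coherent ones.

def brier_accuracy(p, v):
    return -(v - p) ** 2

def total_accuracy(assignment, world):
    # Sum the accuracy of each individual probability, where `world` gives
    # the truth value (1 or 0) of each possibility in the same order.
    return sum(brier_accuracy(p, v) for p, v in zip(assignment, world))

incoherent = [0.6, 0.6]  # credence 0.6 in rain and 0.6 in no rain: sums to 1.2
coherent = [0.5, 0.5]    # a genuine probability assignment

for world in ([1, 0], [0, 1]):  # it rains / it does not rain
    print(world, total_accuracy(incoherent, world), total_accuracy(coherent, world))
# In both worlds: -0.52 for the incoherent credences, -0.5 for the coherent ones.
```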

Strictly proper scoring rules also have a more applied role. For instance, we often wish to identify expert reasoners or successful statistical models, and looking for a track record of accuracy is the obvious thought. Performance evaluation by measuring accuracy incentivizes our predictor to report those probabilities they expect to be most accurate. This will, in general, only reflect their own levels of confidence if the accuracy measure is strictly proper. It is no coincidence that one of the most popular strictly proper scoring rules, known as the Brier score, was developed as a way of measuring the success of meteorological predictions! This is the second nice feature of strictly proper scoring rules: they are the way to measure a predictor’s track record of accuracy without distorting the predictor’s incentives.
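Here is a small sketch of why propriety matters for incentives; it is a toy example we are adding, not the article's. Under the Brier score, a forecaster who is 70% confident of rain maximizes expected score by reporting 70%; under an improper "linear" rule that scores a probability by its negative absolute error, the same forecaster does best in expectation by exaggerating to 100%.

```python
# A toy contrast (ours, not the article's): the Brier score rewards honest
# reporting, while an improper "linear" score rewards exaggeration.

def brier(p, v):
    return -(v - p) ** 2

def linear(p, v):
    return -abs(v - p)  # improper: negative absolute error

def expected(score, report, belief):
    return belief * score(report, 1) + (1 - belief) * score(report, 0)

belief = 0.70
reports = [i / 100 for i in range(101)]

print(max(reports, key=lambda r: expected(brier, r, belief)))   # 0.7: honesty is optimal
print(max(reports, key=lambda r: expected(linear, r, belief)))  # 1.0: exaggeration pays
```

It is this kind of failure in improper rules that makes strictly proper scoring rules the natural choice when accuracy is used to evaluate or reward predictors.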

The third and final nice feature of strictly proper scoring rules is their rich connection to the wider theoretical landscape of properties of probability functions. Here are two examples: First, they furnish us with a measure of the distance from one assignment of probabilities to another. We say that the distance your assignment of probabilities lies from mine is how much accuracy I would expect to lose were I to abandon my probabilities and adopt yours. If accuracy is measured by a strictly proper scoring rule, then the measure of distance generated in this way belongs to a family known as the Bregman divergences, which have a range of appealing properties, and include the well-known mean squared error and the Kullback–Leibler divergence. Second, each strictly proper scoring rule generates its own measure of the entropy of a probability function. That entropy is defined to be the probability function’s expectation of its own inaccuracy. Very opinionated probability functions that heap lots of probability on a single possibility will expect themselves to be very accurate, because they assign a lot of probability to a possibility in which they are very accurate, and so their entropy will be low; less opinionated probability functions will expect themselves to be more inaccurate, and so their entropy will be higher.
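The following sketch spells out these two notions under assumptions we are supplying for illustration, using the Brier and logarithmic rules; the article's claims hold for any strictly proper rule. The divergence from my probabilities to yours is the accuracy I expect of my own probabilities minus the accuracy I expect of yours, and my entropy is my expectation of my own inaccuracy. For the Brier score the divergence comes out as squared Euclidean distance, and for the logarithmic score the entropy comes out as Shannon entropy.

```python
import math

# Illustrative definitions we are supplying (the article's claims hold for
# any strictly proper rule): accuracy of an assignment q over mutually
# exclusive, exhaustive possibilities, when possibility i is the true one.

def brier(q, i):
    return -sum(((1.0 if j == i else 0.0) - q[j]) ** 2 for j in range(len(q)))

def log_score(q, i):
    return math.log(q[i])  # log probability of the truth; at most 0

def expected(score, assignment, belief):
    # Belief-weighted average of the accuracy of `assignment`.
    return sum(belief[i] * score(assignment, i) for i in range(len(belief)))

def divergence(score, p, q):
    # Accuracy p expects to lose by abandoning itself and adopting q.
    return expected(score, p, p) - expected(score, q, p)

def entropy(score, p):
    # p's expectation of its own inaccuracy.
    return -expected(score, p, p)

p = [0.2, 0.3, 0.5]   # the epidemiologist's probabilities over R = 0.9, 1.0, 1.1
q = [1/3, 1/3, 1/3]   # a less opinionated assignment

print(divergence(brier, p, q))                  # Brier divergence ...
print(sum((a - b) ** 2 for a, b in zip(p, q)))  # ... equals squared Euclidean distance
print(entropy(log_score, q), math.log(3))       # log-score entropy of the uniform = Shannon entropy
```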

Given all these attractive features, what we want is a principled, independent reason to think that the right way to characterize accuracy is via some strictly proper measure. That reason would underpin the accuracy foundations for statistical reasoning, show us why performance evaluation by accuracy is not only desirable but appropriate, and give us an indirect grip on probabilistic distance and entropy. This is where the characterization result of our article comes in.

Our characterization of strictly proper scoring rules involves two widely assumed formal features, and one new axiom that we lift from Frank P. Ramsey's hugely influential 1926 paper, 'Truth and Probability', which also gave us what is now called the Dutch book argument and the first representation theorem for rational preferences. The first formal feature is that the accuracy of a whole assignment of probabilities is generated by a measure of the accuracy of individual probabilities by summing them up—we call this 'additivity'. The second formal feature is that the accuracy of an individual probability is a continuous function of that probability—we call this requirement 'continuity'. And here is the idea we take from Ramsey: Imagine we are faced with a large tray filled with toadstools—there are m of them. Some are wholesome, some are not—k of them are wholesome. Knowing nothing that will help us distinguish the wholesome from the rest, we commit to assigning, for each toadstool, the same probability that it is wholesome. In this case, Ramsey contends, the best probability we can assign to each is k/m, that is, the proportion of wholesome toadstools among all toadstools on the tray. This might not be the best probability assignment there is; but it is the best among those that assign, for each toadstool, the same probability that it is wholesome. When a measure of accuracy renders this assignment best among all the assignments that give every toadstool the same probability of being wholesome, we say it passes the calibration test. Our central result is this: the measures of accuracy that both satisfy the requirements of additivity and continuity and also pass the calibration test are exactly those generated by the strictly proper scoring rules.
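Here is a small numerical check of Ramsey's idea under the Brier-style rule used in our earlier sketches; it is an illustration we are adding, whereas the article's result concerns which measures of accuracy pass the test in general. With m toadstools of which k are wholesome, the total accuracy of giving every toadstool the same probability p of being wholesome is maximized precisely at p = k/m.

```python
# A numerical check of Ramsey's toadstool example under the Brier-style rule
# (our illustration; the article's result covers all strictly proper rules).

def total_accuracy(p, k, m):
    # k wholesome toadstools scored against truth value 1, the remaining
    # m - k scored against truth value 0, all assigned the same probability p.
    return k * -((1 - p) ** 2) + (m - k) * -(p ** 2)

k, m = 3, 10
candidates = [i / 1000 for i in range(1001)]
best = max(candidates, key=lambda p: total_accuracy(p, k, m))
print(best, k / m)  # 0.3 0.3 -- the calibrated probability k/m does best
```

An accuracy measure that located the maximum anywhere other than k/m would fail the calibration test.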

FULL ARTICLE

Williams, J. R. G. and Pettigrew, R. [2026]: 'Consequences of Calibration', The British Journal for the Philosophy of Science, 77, <doi.org/10.1086/725097>.

J Robert G Williams
University of Leeds
j.r.g.williams@leeds.ac.uk

Richard Pettigrew
University of Bristol
richard.pettigrew@bristol.ac.uk

© The Authors (2024)
