Ranging farther afield¶
Today I'll be taking advantage of my stated intention to pull back from the stream of recent papers and look at some papers for their impact or fundamental importance as I see it. So I'm doing something unusual: highlighting a paper not from last week, but from four years ago, and not directly from AI, but from the field of probability theory: Cox's Theorem and the Jaynesian Interpretation of Probability.
I've been reading a book by E. T. Jaynes, called Probability Theory: The Logic of Science, a brilliant and practical exposition of the Bayesian view of probability theory, partially on the recommendation of another AI researcher. The thoughts of an ideal reasoner would have Bayesian structure, so I am both personally and professionally interested in mastering the concepts.
Overview¶
Cox's theorem is an attempt to derive probability theory from a small, common-sense set of uncontroversial desiderata, and to demonstrate its uniqueness as an extension of two-valued (true/false) logic to degrees of belief. That's a big deal. As today's paper mentions, Peter Cheeseman has called Cox's theorem the "strongest argument for the use of standard (Bayesian) probability theory". But Cox's theorem is non-rigorous as originally formulated, and many people have patched up the holes for use in their various fields. Today, if someone refers to "Cox's theorem", they usually mean one of these fixed-up versions.
Jaynes' version unfortunately contains a mistake, and today's paper fixes it by replacing some of the axioms with the simple requirement that probability theory remain consistent with respect to repeated events.
It may be difficult without reading the book to see why this paper is important to AI, so perhaps in the near future I'll discuss that at greater length. For today, however, I'll simply be explaining each of the axioms, and setting you up to read the paper more easily. It is certainly worth a close reading, to ground your confidence in the interpretation of probability theory as a logical system that extends true-false logic to handle uncertainty, so you can reap the associated benefits.
Abstract¶
There are multiple proposed interpretations of probability theory: one such interpretation is true-false logic under uncertainty. Cox's Theorem is a representation theorem that states, under a certain set of axioms describing the meaning of uncertainty, that every true-false logic under uncertainty is isomorphic to conditional probability theory. This result was used by Jaynes to develop a philosophical framework in which statistical inference under uncertainty should be conducted through the use of probability, via Bayes' Rule. Unfortunately, most existing correct proofs of Cox's Theorem require restrictive assumptions: for instance, many do not apply even to the simple example of rolling a pair of fair dice. We offer a new axiomatization by replacing various technical conditions with an axiom stating that our theory must be consistent with respect to repeated events. We discuss the implications of our results, both for the philosophy of probability and for the philosophy of statistics.
Axioms $\newcommand{\P}{\mathbb{P}} \newcommand{\F}{\mathscr{F}}$¶
This paper proposes a new axiomatization of probability theory, with five axioms. As a variant of Cox's theorem, these axioms are supposed to represent a set of "common sense" desiderata for a logical system under uncertainty. That is, each of these axioms is something we naturally want to be true of any logical system under uncertainty. Cox's original axioms were more intuitively essential to me, however, so I'll also try to give justifications for demanding each of the following axioms, as well as explaining them technically.
Remember the ultimate goal is to build probability theory up from a minimal set of absolute requirements for any logical system. The punchline is that probability theory as described historically by greats like Kolmogorov turns out to be the unique extension of true-false logic under uncertainty, and we can derive it from "common sense".
To emphasize the point that, while we're writing these axioms, we don't yet have probability, following Jaynes I'll refer to our measure of certainty/uncertainty as "plausibility".
1. Plausibility must be representable by a real number¶
Let $\Omega$ be a set and $\mathscr{F}$ be a $\sigma$-algebra on $\Omega$.
Let $\P: \F \times (\F \setminus \{\emptyset\}) \rightarrow R \subseteq \mathbb{R}$ be a function, written using the notation $\P(A|B)$.
It makes intuitive sense that we should be able to measure our uncertainty on a smooth, finite scale, so it makes sense to demand that our plausibility scale be chosen from some definite subset of the reals.
$\F$ being "a $\sigma$-algebra on $\Omega$" means that it is the set of every subset of $\Omega$ (including $\Omega$ and $\emptyset$), is closed under complement, and is closed under countable unions. (Being "closed under" some operation means that taking that operation on any element in the set yields an element that's also defined to be in the set.) The idea is that $\Omega$ comprises all primitive events, and $\F$ therefore includes every possible logical combination of these primitive events, in a way that makes it eqivalent to a Boolean algebra.
I found it clarifying that $\P(\Omega)=1$. That's what made it click for me that a set in $\F$ represents a disjunction of primitive events, and $\Omega$ contains all primitive events, so $\P(\Omega)$ is the probability that anything happens.
$\P(A|B)$ is a function of two arguments $A,B \in \mathscr{F}$, where the second argument $B$ cannot be empty. The interpretation is, "The probability of some event $A$, given that event $B$ is true." Jaynes often describes the second argument as "the background information", which includes everything else known (such as the rules of probability themselves, and the number of penguins in Antarctica).
The arguments of $\P$ are sets, but as the paper mentions, "by Stone's Representation Theorem, every Boolean algebra is isomorphic to an algebra of sets".
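To make these objects concrete, here is a minimal Python sketch (my own illustration, not from the paper) for a single fair die. The counting-ratio definition of `plausibility` is just one familiar function with the signature the axiom demands; the axiom itself only requires that *some* real-valued function of two set arguments exist.

```python
from fractions import Fraction

# Primitive events for one roll of a fair die: Omega = {1, ..., 6}.
# In this finite case we can take the sigma-algebra F to be every subset of Omega.
OMEGA = frozenset(range(1, 7))

def plausibility(A, B):
    """One concrete instance of P(A|B): a real number assigned to each pair of
    events, with B nonempty. Here it is the familiar counting ratio for a fair
    die; the axiom only demands *some* real-valued function of (A, B)."""
    if not B:
        raise ValueError("the conditioning event B must be nonempty")
    return Fraction(len(A & B), len(B))

evens = frozenset({2, 4, 6})
at_most_3 = frozenset({1, 2, 3})

print(plausibility(evens, OMEGA))      # 1/2
print(plausibility(evens, at_most_3))  # 1/3: "even, given the roll is at most 3"
print(plausibility(OMEGA, at_most_3))  # 1: the plausibility that anything at all happens
```

I'll reuse this `plausibility` function and `OMEGA` in the sketches for the remaining axioms.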
2. Sequential continuity¶
We have that $$A_1 \subseteq A_2 \subseteq A_3 \subseteq\ldots \text{ such that } A_i \nearrow A \text{ implies } \P (A_i | B)\nearrow \P(A | B )$$ for all $A, A_i, B$.
Another intuitive requirement for a system of logical inference is that arbitrarily small changes in a proposition should produce arbitrarily small changes in its plausibility. This requirement is also known as "continuity".
If you arrange a sequence of events (sets) so that each event is included in the next ($A_1 \subseteq A_2 \subseteq A_3 \subseteq \ldots$), and the union of the whole sequence is $A$, then the sequence "increases to" $A$. In the notation of the paper, $A_i \nearrow A$.
What this axiom is saying is that whenever a sequence of propositions increases to a limiting proposition, the corresponding plausibilities increase to the plausibility of that limit. This formalizes our requirement for continuity. Also notice that if $\P (A_i | B)\nearrow \P(A | B )$ then $\P (A_i | B) \leq \P(A | B )$, because a non-decreasing sequence never exceeds its limit. This will be useful when reading the proof.
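Continuing the die sketch from the first axiom (again, purely illustrative), here is what an increasing sequence of events and its plausibilities look like. In a finite example the sequence stabilizes after a few steps; the axiom's real force is for infinite sequences.

```python
# An increasing sequence A_1 ⊆ A_2 ⊆ ... whose union is A. The axiom demands
# that the plausibilities increase to P(A | B) as well.
A = frozenset({1, 2, 3, 4})
increasing = [frozenset({1}), frozenset({1, 2}), frozenset({1, 2, 3}), A]

for A_i in increasing:
    print(plausibility(A_i, OMEGA))  # 1/6, 1/3, 1/2, 2/3 -- non-decreasing, limit P(A | OMEGA) = 2/3
```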
3. Decomposability¶
$\P(AB | C )$ can be written as $$\P(A | C ) \circ \P(B | AC)$$ for some function $\circ : (R \times R) \rightarrow R$.
This is the first axiom that I had trouble seeing as intuitive, and in fact I thought it was a bit question-begging at first because it looks like the product rule. It represents the demand that the plausibility of a compound proposition be decomposable into the plausibilities of its constituents, and that the decomposition take a particular form. It's the demand that it follow a particular form that seemed somewhat arbitrary to me at first. Of course we would want to be able to decompose compound uncertainty into more fundamental elements, or else probability theory wouldn't be very useful. But why should it take the form described by $\circ$?
The answer is that this form is minimal for decomposability. That is, it's the weakest statement that could be made about the details of decomposition. In English: "The plausibility of A and B is a function of the plausibility of one of those (say, $A$), and the plausibility of the other ($B$) once we can assume $A$ is true."
Note that logical conjunctions are commutative ($AB = BA$), so by this axiom $\P(AB | C )$ can also be written as $\P(B | C ) \circ \P(A | BC)$. They later prove that $\circ$ is itself commutative, but that is not assumed in the axioms.
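For a concrete check, here is the decomposition in the die sketch from above, with $\circ$ taken to be ordinary multiplication. That choice is the standard probability instantiation (the one the theorem ultimately singles out, up to isomorphism); the axiom itself only posits that *some* such $\circ$ exists.

```python
# In the standard probability instantiation, the decomposition function turns
# out to be ordinary multiplication.
A = frozenset({2, 4, 6})    # "the roll is even"
B = frozenset({4, 5, 6})    # "the roll is at least 4"
C = OMEGA                   # background information: a fair die was rolled

lhs = plausibility(A & B, C)                        # P(AB | C)
rhs = plausibility(A, C) * plausibility(B, A & C)   # P(A | C) o P(B | AC), with o = multiplication
print(lhs, rhs, lhs == rhs)                         # 1/3 1/3 True
```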
4. Negation¶
There exists a function $N : R \rightarrow R$ such that $$ \P(A^c | B)= N[ \P(A | B)] $$ for all $A,B$.
This axiom also seemed a bit question-begging to me, because it looks like the sum rule of probability theory, and because it seemed arbitrary to demand uniquely determined plausibilities for the negations of propositions.
Upon further reflection, however, this seems like a reasonable demand for consistency with two-valued logic. Every proposition $A$ in true-false logic has a unique proposition $A^c$ representing its negation (the superscript complement notation emphasizes the representation of propositions as sets, but is equivalent to $\bar A$, $\neg A$, etc.), so it makes sense that an extension of true-false logic to uncertainty would also include a method of determining the plausibility of the opposite.
In actual fact, this may be the most controversial axiom, since there are logics other than true-false logic that don't require the "law of the excluded middle" (they allow "maybe"). But if you are willing to accept that all well-formed propositions are either true or false, and our system of plausibility represents levels of certainty about their truth or falsehood, then this axiom represents a reasonable and necessary demand.
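In the standard probability instantiation, the negation function is simply $N(x) = 1 - x$. Here is a quick check in the die sketch from above; as before, the axiom only demands that *some* function $N$ relate $\P(A^c|B)$ to $\P(A|B)$.

```python
# The standard instantiation of the negation axiom: N(x) = 1 - x.
A = frozenset({2, 4, 6})    # "the roll is even"
B = frozenset({1, 2, 3})    # background: "the roll is at most 3"

print(plausibility(OMEGA - A, B))   # 2/3, the plausibility of A^c given B
print(1 - plausibility(A, B))       # 2/3, i.e. N(x) = 1 - x recovers it
```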
5. Consistency under extension¶
If $(\Omega, \mathscr{F}, \P)$ satisfies the axioms above, then $(\Omega \times \Omega, \mathscr{F} \otimes \mathscr{F}, \P \operatorname{\circ} \P)$ must as well, i.e., the definition $\P(A \times B | C \times D) = \P(A | C) \circ \P(B | D)$ is consistent.
This axiom represents the core of the authors' contribution. Although there were many correct variants of Cox's theorem, and many ways to axiomatize probability theory, they all either had disappointingly narrow scope or lost their intuitive nature in the formalization. The authors of our paper replace several technical axioms from other axiomatizations with this one demand: that the rules be consistent under extension to repeated events.
In English, this axiom is, "If the rules apply to a single trial (e.g., a single coinflip), then they also apply to a system of two independent trials (e.g., two coinflips)." To me, that's obviously intuitive, so it's delightful to find that it covers so much ground.
Let's examine their formal expression using the coinflip example, with $A$ meaning "heads on the first coinflip" and $B$ meaning "tails on the second coinflip":
$\P(A \times B | C \times D)$ means "the plausibility of heads-then-tails given two piles of background information $C$ and $D$". The axiom states this must equal $\P(A | C) \circ \P(B | D)$, meaning that the plausibility of a pair of coinflips coming up heads-tails is equal to the plausibility of a single coinflip coming up heads (given background information $C$), composed (using $\circ$) with another coinflip coming up tails (given background information $D$).
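Here is the coinflip example in the same counting sketch, reusing `plausibility` from the first axiom and taking $\circ$ to be multiplication as in the standard instantiation. The coin events and product construction are my own illustration of the axiom, not code from the paper.

```python
from itertools import product

# The extension axiom for two independent coinflips on the product space.
COIN = frozenset({"H", "T"})

A = frozenset({"H"})        # heads on the first flip
B = frozenset({"T"})        # tails on the second flip
C = D = COIN                # background information for each flip

A_x_B = frozenset(product(A, B))    # the joint event "heads, then tails"
C_x_D = frozenset(product(C, D))    # the product background: all four outcomes

lhs = plausibility(A_x_B, C_x_D)                 # P(A x B | C x D)
rhs = plausibility(A, C) * plausibility(B, D)    # P(A | C) o P(B | D)
print(lhs, rhs, lhs == rhs)                      # 1/4 1/4 True
```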
Parting thoughts¶
- I hope this exposition of the axioms helps you read the paper yourself, though I realize I may not have provided sufficient motivation to do so yet. That would make it a bit like my post deriving something surprising about Boltzmann machines without first explaining what Boltzmann machines are. I intend to rectify this in the future for both posts.
- I could make this a lot clearer for people with less set theory, group theory, or probability theory background. If that would be helpful to you, please leave me a comment on what specifically didn't make sense so I can get a feel for my audience.
- To memorize these and make reading the proof easier, I labeled each of the five axioms with a relevant symbol, and combined them into a mnemonic. In case that helps you too, here it is: $\mathbb{R}$ $\nearrow$ $\circ$ $N$ $\times$.