Question to author:
Your point about probability and decision trees is well taken and
I am in agreement with what you say here; a point that I had not
appreciated before. Thank you.
Let me rephrase the argument to see whether we are in agreement.
In handling a decision tree it is easy to see what probabilities
are needed to solve the problem. It is not so easy to see how
these might be assessed numerically.
However, I do not follow how causality completely resolves the issue.
Nor, of course, does exchangeability.
Author's reply:
I am very glad that we have narrowed the problem down to
simple and concrete issues: (1) how to assess the probabilities
needed for a decision tree, (2) where those probabilities
come from, (3) how those probabilities can be encoded
economically, and perhaps even (4) whether those
probabilities must comply with certain rules of internal coherence,
especially when we construct several decision trees
involving the same set of variables.
The reason I am so glad at this narrowing of the problem
is that it would greatly facilitate my argument for
an explicit distinction between causal and probabilistic
relationships. In general, I have found Bayesian
statisticians to be the hardest breed of
statisticians to convince of the necessity of this distinction.
Why? Because whereas
classical statisticians
are constantly on watch against assumptions that cannot be
substantiated by hard data, Bayesians are more permissive
in this regard, and rightly so. However, by licensing human
judgment as a legitimate source of information,
Bayesians have become less meticulous in keeping track of
the character and origin of that information.
Your earlier statement is typical of the Bayesian philosophy:
What I do not understand at the moment is the relevance of
this [i.e., causal thinking] to decision trees. At a decision
node, one conditions on the quantities known at the
time of the decision. At a random node, one includes
all relevant uncertain quantities under known conditions.
Nothing more than the joint distributions
(and utility considerations) are needed.
As Newcomb's paradox teaches us (see Section 4.1), it is not
exactly true that "at a decision
node, one conditions on the quantities known at the
time of the decision". If this were the case, then
all decision trees would turn into a
joke; "patients should avoid going to the doctor 'to reduce
the probability that one is seriously ill' (Skyrms 1980,
p. 130); workers should never hurry to work, to reduce the
probability of having overslept; students should not prepare
for exams, lest this would
prove them behind in their studies; and so on.
In short, all remedial actions should be banished
lest they increase the probability that a remedy is
indeed needed." [Causality, Chapter 4, page 108]
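To make the "conditioning trap" concrete, here is a minimal numerical sketch in Python (the variable names and numbers are hypothetical, chosen only for illustration): when serious illness influences the decision to visit a doctor, conditioning on the visit raises the probability of illness, whereas fixing the visit by choice, which is what a decision node represents, leaves that probability untouched.

    # Hypothetical numbers for the "going to the doctor" example.
    # Causal story: Illness -> Visit (sick people are more likely to go).
    p_ill = 0.10                  # prior probability of serious illness
    p_visit_given_ill = 0.90      # P(visit | ill)
    p_visit_given_well = 0.20     # P(visit | not ill)

    # Evidential reading: observing a visit is evidence of illness.
    p_visit = p_ill * p_visit_given_ill + (1 - p_ill) * p_visit_given_well
    p_ill_given_visit = p_ill * p_visit_given_ill / p_visit
    print(f"P(ill | visit)     = {p_ill_given_visit:.3f}")   # 0.333, up from 0.100

    # Interventional reading: choosing to visit does not change whether one
    # is already ill, so P(ill | do(visit)) stays at the prior.
    print(f"P(ill | do(visit)) = {p_ill:.3f}")                # 0.100, unchanged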
But even after escaping this "conditioning" trap,
the Bayesian philosopher does not see any
difference between assessing probabilities
for branches emanating from decision nodes and those
emanating from chance nodes. For a Bayesian, both
assessments are probability assessments. That the
former involves a mental simulation of a hypothetical
experiment while the latter involves the mental envisioning
of passive observations is irrelevant, because Bayesians are
preoccupied with defending a different distinction:
The Bayesian paradigm makes a sharp distinction between
probability as belief and probability as frequency,
calling the latter, chance. [Lindley, June 24, mesg].
This preoccupation renders them less sensitive to the
fact that beliefs come in a variety of shades and colors
and that the mind processes beliefs about outcomes of
experiments differently than it processes beliefs about
outcomes of passive observations.
So far, I have found two effective ways to win the hearts of Bayesians, one involving the notion of "economy" (see my discussion with Nimrod Megiddo, posted on this page), the other the notion of "coherence".
Given a set of n variables of interest, there is a huge number of decision trees that can conceivably be constructed from these variables, each corresponding to a different choice of temporal ordering and a different choice of decision nodes and chance nodes from those variables. The question naturally arises: how can a decision maker ensure that the probability assessments for all these decision trees are reproducible? Surely we cannot assume that humans explicitly store all these potential decision trees in their heads. For reproducibility, we must assume that all these assessments are derived from some economical representation of knowledge about decisions and chance events. Causal relationships can thus be viewed as the economical representation from which decision trees are constructed. Indeed, as I wrote to N. Megiddo, if we were in need of instructing a robot to construct such decision trees upon demand, in accordance with our knowledge and beliefs, our best approach would be to feed the robot a pair of inputs (G, P), where G is a causal graph and P is our joint distribution over the variables of interest (a subjective distribution, if we are Bayesian). With the help of this pair of objects, the robot should be able to construct consistently all the decision trees required, for any partition of the variables into decision and chance nodes, and replicate precisely our construction. This is one way a Bayesian could appreciate causality without offending the traditional stance that "it is nothing more than the joint distributions..."
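As a rough illustration of what such a robot might do (a sketch only, with hypothetical variable names and numbers, not code from the book): store G as parent sets and P as conditional probability tables that factorize over G; then, for any partition of the variables into decision and chance nodes, compute each branch probability from the mutilated model in which the factors of the decision variables are removed and their values fixed.

    from itertools import product

    # Hypothetical (G, P) pair: G as parent sets, P as CPTs factorizing over G.
    G = {"U": [], "X": ["U"], "Y": ["U", "X"]}          # U -> X, U -> Y, X -> Y
    CPT = {
        "U": {(): {1: 0.5}},
        "X": {(0,): {1: 0.2}, (1,): {1: 0.8}},           # P(X=1 | U)
        "Y": {(0, 0): {1: 0.1}, (0, 1): {1: 0.3},        # P(Y=1 | U, X)
              (1, 0): {1: 0.5}, (1, 1): {1: 0.7}},
    }

    def factor(var, assignment):
        """Probability of var's value given its parents' values in `assignment`."""
        key = tuple(assignment[p] for p in G[var])
        p1 = CPT[var][key][1]
        return p1 if assignment[var] == 1 else 1 - p1

    def mutilated_prob(assignment, decisions):
        """Joint probability in the mutilated model: factors of decision
        variables are removed and their values are fixed by do()."""
        p = 1.0
        for var in G:
            if var in decisions:
                if assignment[var] != decisions[var]:
                    return 0.0
            else:
                p *= factor(var, assignment)
        return p

    def branch_probability(var, value, history, decisions):
        """Branch probability P(var=value | history) in a decision tree whose
        decision nodes (entered via do) are listed in `decisions`."""
        def marginal(fixed):
            free = [v for v in G if v not in fixed]
            return sum(mutilated_prob({**fixed, **dict(zip(free, vals))}, decisions)
                       for vals in product([0, 1], repeat=len(free)))
        return marginal({**history, var: value}) / marginal(history)

    # Tree 1: X is a decision node -> P(Y=1 | do(X=1)) = 0.50
    print(branch_probability("Y", 1, {"X": 1}, decisions={"X": 1}))
    # Tree 2: X is a chance node   -> P(Y=1 | X=1)     = 0.62
    print(branch_probability("Y", 1, {"X": 1}, decisions={}))

Note that the same pair (G, P) yields different branch probabilities for the two trees here precisely because U confounds X and Y; with no confounding the two numbers would coincide.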
The second approach involves "coherence". Coherence is something Bayesians are very proud of, because De Finetti, Savage and others have labored so hard to construct qualitative axioms that prevent probability judgments from being totally whimsical, and that compel beliefs to conform to the calculus of probability.
We can ask the Bayesian philosopher to tell us whether judgments about joint probabilities, say P(x,y), should in some way cohere with judgments about decision-based probabilities, say P(y|do(x)). If the Bayesian claims that he/she does not understand what P(y|do(x)) means, we can help by equating P(y|do(x)) with the probability assigned to the outcome Y=y on the branch that follows the choice X=x at a decision node with two alternatives, X=x and X=x'. We can then ask the Bayesian whether these probabilities should bear any connection to the usual conditional probabilities, P(y|x), namely the probability assessed for the outcome Y=y that emanates (in some other decision tree) from a chance event X=x.
I believe it will not be too hard to convince our Bayesian that these two assessments cannot be totally arbitrary, but must obey some restrictions of coherence. For example, the inequality P(y|do(x)) >= P(y,x) should be obeyed for all events x and y. The next step is to impress our Bayesian with the fact that the do(*) operator, as defined in Chapter 3 of the book, ensures that coherence restrictions of this kind are automatically satisfied whenever P(y|do(x)) is derived from a causal network according to the rules of Chapter 3.
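To illustrate, using the same hypothetical numbers as in the sketch above, the following check computes P(Y=1|do(X=1)) by adjusting for the confounder U and compares it with the joint probability P(Y=1, X=1); the coherence inequality holds, as it must whenever P(y|do(x)) is derived from the causal network.

    # Coherence check on the hypothetical model above (same numbers).
    p_u1 = 0.5
    p_x1_given_u = {0: 0.2, 1: 0.8}
    p_y1_given_ux = {(0, 1): 0.3, (1, 1): 0.7}

    # Adjustment over U: P(Y=1 | do(X=1)) = sum_u P(u) * P(Y=1 | u, X=1)
    p_y1_do_x1 = sum((p_u1 if u else 1 - p_u1) * p_y1_given_ux[(u, 1)]
                     for u in (0, 1))

    # Joint probability: P(Y=1, X=1) = sum_u P(u) * P(X=1 | u) * P(Y=1 | u, X=1)
    p_y1_x1 = sum((p_u1 if u else 1 - p_u1) * p_x1_given_u[u] * p_y1_given_ux[(u, 1)]
                  for u in (0, 1))

    print(p_y1_do_x1, p_y1_x1)            # 0.5 and 0.31
    assert p_y1_do_x1 >= p_y1_x1          # the coherence restriction holds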
These two arguments should invite a Bayesian to start
drawing mathematical benefits from causal calculus,
while maintaining caution and skepticism, of course.
But, as they say in the Talmud:
"From benefits comes understanding"
(a free translation of "mitoch shelo lishma, ba lishma").
Bayesians will eventually embrace causal vocabulary, I have no doubt.