From: Dennis Lindley

Subject: On causality and decision trees (cont.)

**Question to author:**

Your point about probability and decision trees is well taken and
I am in agreement with what you say here; a point that I had not
appreciated before. Thank you.
Let me rephrase the argument to see whether we are in agreement.
In handling a decision tree it is easy to see what probabilities
are needed to solve the problem. It is not so easy to see how
these might be assessed numerically.
However, I do not follow how causality completely resolves the issue.
Nor, of course, does exchangeability.

**Author's reply:**

I am very glad that we have narrowed the problem down to
simple and concrete issues: (1) how to assess the probabilities
needed for a decision tree, (2) where those probabilities
come from, (3) how those probabilities can be encoded
economically, and perhaps even (4) whether those
probabilities must comply with certain rules of internal coherence,
especially when we construct several decision trees,
involving the same set of variables.

The reason I am so glad at this narrowing of the problem
is that it greatly facilitates my argument for an
explicit distinction between causal and probabilistic
relationships. In general, I have found Bayesian
statisticians to be the hardest breed of
statisticians to convince of the necessity of this distinction.
Why? Because whereas classical statisticians
are constantly on watch against assumptions that cannot be
substantiated by hard data, Bayesians are more permissive
in this regard, and rightly so. However, by licensing human
judgment as a legitimate source of information,
Bayesians have become less meticulous in keeping track of
the character and origin of that information.
Your earlier statement is typical of the Bayesian philosophy:

"What I do not understand at the moment is the relevance of
this [i.e., causal thinking] to decision trees. At a decision
node, one conditions on the quantities known at the
time of the decision. At a random node, one includes
all relevant uncertain quantities under known conditions.
Nothing more than the joint distributions
(and utility considerations) are needed."

As Newcomb's paradox teaches us (see Section 4.1), it is not
exactly true that "at a decision
node, one conditions on the quantities known at the
time of the decision". If this were the case, then
all decision trees would turn into a
joke: "patients should avoid going to the doctor 'to reduce
the probability that one is seriously ill' (Skyrms 1980,
p. 130); workers should never hurry to work, to reduce the
probability of having overslept; students should not prepare
for exams, lest this would
prove them behind in their studies; and so on.
In short, all remedial actions should be banished
lest they increase the probability that a remedy is
indeed needed." [*Causality,* Chapter 4, page 108]
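
To make the trap concrete, here is a minimal sketch of the doctor example. The graph (illness causes visits, not the other way around) and all numbers are assumptions chosen only for illustration; they do not come from the book.

```python
# Hypothetical model: Ill -> Visit. Visiting has no causal effect on being ill.
P_ILL = 0.1                              # assumed P(Ill = 1)
P_VISIT_GIVEN_ILL = {1: 0.8, 0: 0.1}     # assumed P(Visit = 1 | Ill)


def p_joint(ill, visit):
    """Observational joint P(Ill = ill, Visit = visit)."""
    p_i = P_ILL if ill else 1 - P_ILL
    p_v = P_VISIT_GIVEN_ILL[ill] if visit else 1 - P_VISIT_GIVEN_ILL[ill]
    return p_i * p_v


def p_ill_given_visit(visit):
    """P(Ill = 1 | Visit = visit): what conditioning on the action yields."""
    return p_joint(1, visit) / (p_joint(1, visit) + p_joint(0, visit))


def p_ill_do_visit(visit):
    """P(Ill = 1 | do(Visit = visit)).

    Setting Visit by decision deletes the factor P(visit | ill) from the
    joint, so what remains for Ill is just its prior.
    """
    return P_ILL


seen_visit = p_ill_given_visit(1)   # illness is more likely among visitors
seen_stay = p_ill_given_visit(0)    # and less likely among non-visitors
print(seen_visit > P_ILL > seen_stay)            # True
print(p_ill_do_visit(1) == p_ill_do_visit(0))    # True
```

Conditioning on the act makes staying home look beneficial, whereas treating the act as a decision shows that, in this assumed model, it changes nothing about the illness.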

But even after escaping this "conditioning" trap,
the Bayesian philosopher does not see any
difference between assessing probabilities
for branches emanating from decision nodes and those
emanating from chance nodes. For a Bayesian, both
assessments are probability assessments. That the
former involves a mental simulation of a hypothetical
experiment while the latter involves the mental envisioning
of passive observations is irrelevant, because Bayesians are
preoccupied with defending a different distinction:

"The Bayesian paradigm makes a sharp distinction between
probability as belief and probability as frequency,
calling the latter, chance." [Lindley, June 24, mesg].

This preoccupation renders them less sensitive to the
fact that beliefs come in a variety of shades and colors
and that the mind processes beliefs about outcomes of
experiments differently than it processes beliefs about
outcomes of passive observations.

So far, I have found two effective ways to win the hearts of Bayesians, one involving the notion of "economy" (see my discussion with Nimrod Megiddo, posted on this page), the other the notion of "coherence".

Given a set of *n* variables of interest, there is a huge
number of decision trees that can conceivably be constructed from
these variables, each corresponding to a different
choice of temporal ordering and different choice of
decision nodes and chance nodes from those variables.
The question naturally arises: how can a decision maker
ensure that probability assessments
for all these decision trees are *reproducible*?
Surely we cannot assume that humans explicitly store all these
potential decision trees in their heads.
For reproducibility, we must assume that
all these assessments are derived from some economical
representation of knowledge about decisions and chance
events. Causal relationships can thus be viewed as
the economical representation from which decision trees are
constructed. Indeed, as I wrote to N. Megiddo, if we were
in need of instructing a robot to construct such decision
trees upon demand, in accordance with our knowledge and
belief, our best approach would be to feed
the robot a pair of inputs (*G, P*) where *G* is a causal
graph and *P* is our joint distribution over the variables
of interest (subjective distribution, if we were Bayesian).
With the help of this pair of objects, the
robot should be able to construct consistently all the
decision trees required, for any partition of the variables
into decision and chance nodes, and replicate precisely
our construction.
This is one way a Bayesian could appreciate causality
without offending the traditional stance that "it is nothing
more than the joint distributions..."
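
As a rough illustration of what such a robot might do, the sketch below takes a pair (*G, P*) and returns the branch probability for *Y* when *X* is treated as a chance node, *P*(*y*|*x*), and when it is treated as a decision node, *P*(*y*|*do*(*x*)). The three-variable graph, the numbers, and the function names are hypothetical. The decision-node case uses the truncated factorization: each world consistent with *X=x* is weighted by its joint probability divided by *P*(*x* | parents of *X*).

```python
from itertools import product

ORDER = ("Z", "X", "Y")                      # variable order inside a world tuple
G = {"Z": [], "X": ["Z"], "Y": ["X", "Z"]}   # assumed causal graph: parents of each variable

# Assumed numbers: probability that a variable equals 1 given its parents' values.
CPT = {
    "Z": {(): 0.5},
    "X": {(0,): 0.2, (1,): 0.8},                                  # keyed by (z,)
    "Y": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.9},    # keyed by (x, z)
}


def build_joint():
    """Joint distribution P over all worlds (z, x, y)."""
    joint = {}
    for world in product((0, 1), repeat=len(ORDER)):
        w = dict(zip(ORDER, world))
        p = 1.0
        for var in ORDER:
            p1 = CPT[var][tuple(w[pa] for pa in G[var])]
            p *= p1 if w[var] == 1 else 1.0 - p1
        joint[world] = p
    return joint


def prob(P, event):
    """Total probability of the worlds that agree with event (dict: var -> value)."""
    return sum(p for world, p in P.items()
               if all(world[ORDER.index(v)] == val for v, val in event.items()))


def chance_branch(P, y, x):
    """P(Y=y | X=x): branch probability below a chance node X."""
    return prob(P, {"X": x, "Y": y}) / prob(P, {"X": x})


def decision_branch(G, P, y, x):
    """P(Y=y | do(X=x)): branch probability below a decision node X."""
    total = 0.0
    for world, p in P.items():
        w = dict(zip(ORDER, world))
        if w["X"] != x or w["Y"] != y:
            continue
        parents = {v: w[v] for v in G["X"]}
        p_x_given_parents = prob(P, {**parents, "X": x}) / prob(P, parents)
        total += p / p_x_given_parents
    return total


P = build_joint()
print(round(chance_branch(P, y=1, x=1), 3))        # 0.78
print(round(decision_branch(G, P, y=1, x=1), 3))   # 0.6
```

The same joint *P* yields different branch probabilities depending on whether *X* is observed or chosen, and it is the graph *G* that tells the robot how to compute the second from the first.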

The second approach involves "coherence". Coherence is something Bayesians are very proud of, because De Finetti, Savage and others have labored so hard to construct qualitative axioms that prevent probability judgments from being totally whimsical, and that compel beliefs to conform to the calculus of probability.

We can ask the Bayesian philosopher to tell us whether
judgments about joint probabilities, say *P*(*x,y*), should
in some way cohere with judgments about decision-based
probabilities, say *P*(*y*|*do*(*x*)).
If the Bayesian claims that he/she does not
understand what *P*(*y*|*do*(*x*)) means,
we can help by equating *P*(*y*|*do*(*x*)) with the
probabilities associated with the outcomes *Y=y* and *Y=y'*
that emanate from a decision node with two alternatives
*X=x* and *X=x'*.
We can then ask the Bayesian whether these probabilities
should bear any connection to the usual conditional
probabilities, *P*(*y*|*x*), namely the probability assessed
for outcome *Y=y* that emanates (in some other decision
tree) from a chance event *X =x*.

I believe it will not be too hard to convince our
Bayesian that these two assessments could not
be totally arbitrary, but must obey some
restrictions of coherence. For example, the inequality
*P*(*y*|*do*(*x*)) >= *P*(*y, x*)
should be obeyed for all events *x* and *y*.
The next step is to impress our Bayesian with the fact that
the *do*(·) operator, as defined in Chapter 3 of the book,
ensures that coherence restrictions of this kind
are automatically satisfied whenever *P*(*y*|*do*(*x*))
is derived from a causal network according to the rules of Chapter 3.
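
One way to see why the inequality holds, in the special case where a single variable *Z* is the only parent of *X*: adjusting for the parents gives *P*(*y*|*do*(*x*)) as the sum over *z* of *P*(*y*|*x,z*)*P*(*z*), while *P*(*y, x*) is the same sum with each term multiplied by *P*(*x*|*z*), which is at most 1. The sketch below checks this numerically on randomly drawn binary models; the graph *Z* → *X*, *Z* → *Y*, *X* → *Y* and the helper names are assumptions made for illustration.

```python
import random


def smallest_gap_in_random_model(rng):
    """Draw one random binary model and return the smallest value of
    P(y | do(x)) - P(y, x) over all four (x, y) pairs."""
    p_z1 = rng.random()                                    # P(Z=1)
    p_x1 = {z: rng.random() for z in (0, 1)}               # P(X=1 | z)
    p_y1 = {(x, z): rng.random()                           # P(Y=1 | x, z)
            for x in (0, 1) for z in (0, 1)}

    def pz(z):
        return p_z1 if z else 1 - p_z1

    def px(x, z):
        return p_x1[z] if x else 1 - p_x1[z]

    def py(y, x, z):
        return p_y1[(x, z)] if y else 1 - p_y1[(x, z)]

    gaps = []
    for x in (0, 1):
        for y in (0, 1):
            # Adjustment for the parent Z of X.
            p_do = sum(py(y, x, z) * pz(z) for z in (0, 1))
            # Ordinary joint probability of the same two events.
            p_xy = sum(py(y, x, z) * px(x, z) * pz(z) for z in (0, 1))
            gaps.append(p_do - p_xy)
    return min(gaps)


rng = random.Random(0)
smallest_gap = min(smallest_gap_in_random_model(rng) for _ in range(10_000))
print(smallest_gap >= -1e-12)   # True
```

A numerical check of this kind is no substitute for the argument in Chapter 3, but it shows the restriction being satisfied without ever being imposed by hand.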

These two arguments should invite
a Bayesian to start drawing mathematical benefits from
causal calculus, while maintaining caution and skepticism,
of course, but, as they say in the Talmud:

"From benefits comes understanding"

(free translation of "mitoch shelo lishma, ba lishma").

Bayesians will eventually embrace causal vocabulary, I have no doubt.
