Question to author:
Your point about probability and decision trees is well taken and
I am in agreement with what you say here; a point that I had not
appreciated before. Thank you.
Let me rephrase the argument to see whether we are in agreement.
In handling a decision tree it is easy to see what probabilities
are needed to solve the problem. It is not so easy to see how
these might be assessed numerically.
However, I do not follow how causality completely resolves the issue.
Nor, of course, does exchangeability.
Author's reply:
I am very glad that we have narrowed the problem down to
simple and concrete issues: (1) how to assess the probabilities
needed for a decision tree, (2) where those probabilities
come from, (3) how those probabilities can be encoded
economically, and perhaps even (4) whether those
probabilities must comply with certain rules of internal coherence,
especially when we construct several decision trees
involving the same set of variables.
The reason I am so glad at this narrowing of the problem
is that it would greatly facilitate my argument for
an explicit distinction between causal and probabilistic
relationships. In general, I have found Bayesian
statisticians to be the hardest breed of
statisticians to convince of the necessity of this distinction.
Why? Because whereas
classical statisticians
are constantly on watch against assumptions that cannot be
substantiated by hard data, Bayesians are more permissive
in this regard, and rightly so. However, by licensing human
judgment as a legitimate source of information,
Bayesians have become less meticulous in keeping track of
the character and origin of that information.
Your earlier statement is typical of the Bayesian philosophy:
What I do not understand at the moment is the relevance of
this [i.e., causal thinking] to decision trees. At a decision
node, one conditions on the quantities known at the
time of the decision. At a random node, one includes
all relevant uncertain quantities under known conditions.
Nothing more than the joint distributions
(and utility considerations) are needed.
As Newcomb's paradox teaches us (see Section 4.1), it is not
exactly true that "at a decision
node, one conditions on the quantities known at the
time of the decision". If this were the case, then
all decision trees would turn into a
joke; "patients should avoid going to the doctor 'to reduce
the probability that one is seriously ill' (Skyrms 1980,
p. 130); workers should never hurry to work, to reduce the
probability of having overslept; students should not prepare
for exams, lest this would
prove them behind in their studies; and so on.
In short, all remedial actions should be banished
lest they increase the probability that a remedy is
indeed needed." [Causality, Chapter 4, page 108]
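To make the "conditioning trap" concrete, here is a minimal numerical sketch in Python (the variable names and numbers are hypothetical, chosen only for illustration): when serious illness influences the decision to visit a doctor, conditioning on the visit raises the probability of illness, whereas fixing the visit by choice, which is what a decision node represents, leaves that probability untouched.

    # Hypothetical numbers for the "going to the doctor" example.
    # Causal story: Illness -> Visit (sick people are more likely to go).
    p_ill = 0.10                  # prior probability of serious illness
    p_visit_given_ill = 0.90      # P(visit | ill)
    p_visit_given_well = 0.20     # P(visit | not ill)

    # Evidential reading: observing a visit is evidence of illness.
    p_visit = p_ill * p_visit_given_ill + (1 - p_ill) * p_visit_given_well
    p_ill_given_visit = p_ill * p_visit_given_ill / p_visit
    print(f"P(ill | visit)     = {p_ill_given_visit:.3f}")   # 0.333, up from 0.100

    # Interventional reading: choosing to visit does not change whether one
    # is already ill, so P(ill | do(visit)) stays at the prior.
    print(f"P(ill | do(visit)) = {p_ill:.3f}")                # 0.100, unchanged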
But even after escaping this "conditioning" trap,
the Bayesian philosopher does not see any
difference between assessing probabilities
for branches emanating from decision nodes and those
emanating from chance nodes. For a Bayesian, both
assessments are probability assessments. That the
former involves a mental simulation of a hypothetical
experiment while the latter involves the mental envisioning
of passive observations is irrelevant, because Bayesians are
preoccupied with defending a different distinction:
The Bayesian paradigm makes a sharp distinction between
probability as belief and probability as frequency,
calling the latter, chance. [Lindley, June 24, mesg].
This preoccupation renders them less sensitive to the
fact that beliefs come in a variety of shades and colors
and that the mind processes beliefs about outcomes of
experiments differently than it processes beliefs about
outcomes of passive observations.
So far, I have found two effective ways to win the hearts of Bayesians, one involving the notion of "economy" (see my discussion with Nimrod Megiddo, posted on this page), the other the notion of "coherence".
Given a set of n variables of interest, there is a huge number of decision trees that can conceivably be constructed from these variables, each corresponding to a different choice of temporal ordering and a different choice of decision nodes and chance nodes from those variables. The question naturally arises: how can a decision maker ensure that the probability assessments for all these decision trees are reproducible? Surely we cannot assume that humans explicitly store all these potential decision trees in their heads. For reproducibility, we must assume that all these assessments are derived from some economical representation of knowledge about decisions and chance events. Causal relationships can thus be viewed as the economical representation from which decision trees are constructed. Indeed, as I wrote to N. Megiddo, if we were in need of instructing a robot to construct such decision trees upon demand, in accordance with our knowledge and beliefs, our best approach would be to feed the robot a pair of inputs (G, P), where G is a causal graph and P is our joint distribution over the variables of interest (a subjective distribution, if we are Bayesian). With the help of this pair of objects, the robot should be able to construct consistently all the decision trees required, for any partition of the variables into decision and chance nodes, and replicate precisely our construction. This is one way a Bayesian could appreciate causality without offending the traditional stance that "it is nothing more than the joint distributions..."
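As a rough illustration of what such a robot might do (a sketch only, with hypothetical variable names and numbers, not code from the book): store G as parent sets and P as conditional probability tables that factorize over G; then, for any partition of the variables into decision and chance nodes, compute each branch probability from the mutilated model in which the factors of the decision variables are removed and their values fixed.

    from itertools import product

    # Hypothetical (G, P) pair: G as parent sets, P as CPTs factorizing over G.
    G = {"U": [], "X": ["U"], "Y": ["U", "X"]}          # U -> X, U -> Y, X -> Y
    CPT = {
        "U": {(): {1: 0.5}},
        "X": {(0,): {1: 0.2}, (1,): {1: 0.8}},           # P(X=1 | U)
        "Y": {(0, 0): {1: 0.1}, (0, 1): {1: 0.3},        # P(Y=1 | U, X)
              (1, 0): {1: 0.5}, (1, 1): {1: 0.7}},
    }

    def factor(var, assignment):
        """Probability of var's value given its parents' values in `assignment`."""
        key = tuple(assignment[p] for p in G[var])
        p1 = CPT[var][key][1]
        return p1 if assignment[var] == 1 else 1 - p1

    def mutilated_prob(assignment, decisions):
        """Joint probability in the mutilated model: factors of decision
        variables are removed and their values are fixed by do()."""
        p = 1.0
        for var in G:
            if var in decisions:
                if assignment[var] != decisions[var]:
                    return 0.0
            else:
                p *= factor(var, assignment)
        return p

    def branch_probability(var, value, history, decisions):
        """Branch probability P(var=value | history) in a decision tree whose
        decision nodes (entered via do) are listed in `decisions`."""
        def marginal(fixed):
            free = [v for v in G if v not in fixed]
            return sum(mutilated_prob({**fixed, **dict(zip(free, vals))}, decisions)
                       for vals in product([0, 1], repeat=len(free)))
        return marginal({**history, var: value}) / marginal(history)

    # Tree 1: X is a decision node -> P(Y=1 | do(X=1)) = 0.50
    print(branch_probability("Y", 1, {"X": 1}, decisions={"X": 1}))
    # Tree 2: X is a chance node   -> P(Y=1 | X=1)     = 0.62
    print(branch_probability("Y", 1, {"X": 1}, decisions={}))

Note that the same pair (G, P) yields different branch probabilities for the two trees here precisely because U confounds X and Y; with no confounding the two numbers would coincide.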
The second approach involves "coherence". Coherence is something Bayesians are very proud of, because De Finetti, Savage and others have labored so hard to construct qualitative axioms that prevent probability judgments from being totally whimsical, and that compel beliefs to conform to the calculus of probability.
We can ask the Bayesian philosopher to tell us whether judgments about joint probabilities, say P(x,y), should in some way cohere with judgments about decision-based probabilities, say P(y|do(x)). If the Bayesian claims that he/she does not understand what P(y|do(x)) means, we can help by equating P(y|do(x)) with the probability assigned to the outcome Y=y on the branch that follows the choice X=x at a decision node with two alternatives, X=x and X=x'. We can then ask the Bayesian whether these probabilities should bear any connection to the usual conditional probabilities, P(y|x), namely the probability assessed for the outcome Y=y that emanates (in some other decision tree) from a chance event X=x.
I believe it will not be too hard to convince our Bayesian that these two assessments cannot be totally arbitrary, but must obey some restrictions of coherence. For example, the inequality P(y|do(x)) >= P(y,x) should be obeyed for all events x and y. The next step is to impress our Bayesian with the fact that the do(*) operator, as defined in Chapter 3 of the book, ensures that coherence restrictions of this kind are automatically satisfied whenever P(y|do(x)) is derived from a causal network according to the rules of Chapter 3.
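To illustrate, using the same hypothetical numbers as in the sketch above, the following check computes P(Y=1|do(X=1)) by adjusting for the confounder U and compares it with the joint probability P(Y=1, X=1); the coherence inequality holds, as it must whenever P(y|do(x)) is derived from the causal network.

    # Coherence check on the hypothetical model above (same numbers).
    p_u1 = 0.5
    p_x1_given_u = {0: 0.2, 1: 0.8}
    p_y1_given_ux = {(0, 1): 0.3, (1, 1): 0.7}

    # Adjustment over U: P(Y=1 | do(X=1)) = sum_u P(u) * P(Y=1 | u, X=1)
    p_y1_do_x1 = sum((p_u1 if u else 1 - p_u1) * p_y1_given_ux[(u, 1)]
                     for u in (0, 1))

    # Joint probability: P(Y=1, X=1) = sum_u P(u) * P(X=1 | u) * P(Y=1 | u, X=1)
    p_y1_x1 = sum((p_u1 if u else 1 - p_u1) * p_x1_given_u[u] * p_y1_given_ux[(u, 1)]
                  for u in (0, 1))

    print(p_y1_do_x1, p_y1_x1)            # 0.5 and 0.31
    assert p_y1_do_x1 >= p_y1_x1          # the coherence restriction holds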
These two arguments should invite a Bayesian to start
drawing mathematical benefits from causal calculus,
while maintaining caution and skepticism, of course.
But, as they say in the Talmud:
"From benefits comes understanding"
(a free translation of "mitoch shelo lishma, ba lishma").
Bayesians will eventually embrace causal vocabulary, I have no doubt.