I do not agree that "causality" is the key to resolving the paradox (though this is also a matter of definition), or that tools for looking at it did not exist twenty years ago. Coming from game theory, I think the issue is not difficult for people who like to draw decision trees with "decision" nodes distinguished from "chance" nodes.
I drew two such trees in the attached Word document, which I think clarify the correct decision in different circumstances.
Click here to view the trees.
The fact that you have constructed two different decision trees for the same input tables implies that the key to the construction was not in the data, but in some information you obtained from the story behind the data. What is that information?
The literature on decision-tree analysis has indeed existed for at least fifty years but, to the best of my knowledge, it has not dealt seriously with the problem posed above: what information do we use to guide us in setting up the correct decision tree?
We agree that giving a robot the frequency tables ALONE would not be sufficient for the job. But what else would Mr. Robot (or a statistician) need? Changing the story from F = "female" to F = "blood pressure" seems to be enough for people, because people understand informally the distinct roles that gender and blood pressure play in the scheme of things. Can we characterize these roles formally, so that our robot would be able to construct the correct decision tree?
My proposal: give the robot (or a statistician or a decision-tree expert) a pair (T, G), where T is the set of frequency tables and G is a causal graph and, lo and behold, the robot would be able to set up the correct decision tree automatically. This is what I meant by saying that the resolution of the paradox lies in causal considerations. Moreover, one can go further and argue: "if the information in (T, G) is sufficient, why not skip the construction of a decision tree altogether, and get the right answer directly from (T, G)?" This is the gist of chapters 3-4 in the book, which can be a topic for a separate discussion: Would the rich literature on decision tree analysis benefit from conversion to the more economical encoding of decision problems in the syntax of (T, G)? The introduction of influence diagrams (in 1981) was a step in this direction and, as Section 4.1.2 indicates, the second step might not be too far off.
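The proposal above can be illustrated with a minimal sketch (not the book's algorithm, and with illustrative counts and edge sets I have made up): the same frequency tables T, paired with two different causal graphs G, yield opposite treatment recommendations. When F is a parent of the treatment C (a confounder, as when F = gender), the robot compares recovery rates within each F stratum; when F is a descendant of C (as when F = blood pressure, affected by the drug), it compares the aggregate rates.

```python
# T: counts[(f, c)] = (recovered, total), classic Simpson-type numbers
# in which treatment wins in both strata yet loses in the aggregate.
counts = {
    ('F=0', 'treat'):   (81, 87),     # 93% recovery
    ('F=0', 'control'): (234, 270),   # 87%
    ('F=1', 'treat'):   (192, 263),   # 73%
    ('F=1', 'control'): (55, 80),     # 69%
}

def rate(pairs):
    """Pooled recovery rate over a list of (recovered, total) pairs."""
    return sum(r for r, n in pairs) / sum(n for r, n in pairs)

def recommend(counts, graph):
    """Pick 'treat' or 'control' from the pair (T, G).

    graph is a set of directed edges over {C, E, F}.  The only feature
    consulted is whether F points into C (stratify) or C points into F
    (aggregate) -- a crude stand-in for a full adjustment criterion.
    """
    strata = ('F=0', 'F=1')
    if ('F', 'C') in graph:   # F -> C: F is pre-treatment, stratify on it
        wins = sum(rate([counts[(f, 'treat')]]) > rate([counts[(f, 'control')]])
                   for f in strata)
        return 'treat' if wins == len(strata) else 'control'
    else:                     # C -> F: F is post-treatment, aggregate over it
        t = rate([counts[(f, 'treat')] for f in strata])
        c = rate([counts[(f, 'control')] for f in strata])
        return 'treat' if t > c else 'control'

confounder_graph = {('F', 'C'), ('F', 'E'), ('C', 'E')}  # F = gender
mediator_graph   = {('C', 'F'), ('F', 'E'), ('C', 'E')}  # F = blood pressure

print(recommend(counts, confounder_graph))  # -> treat
print(recommend(counts, mediator_graph))    # -> control
```

The table T is identical in both calls; only G changes, which is the sense in which the answer is "not in the data" but in the graph supplied with it.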
From: Nimrod Megiddo (IBM Almaden)
Subject: Simpson's paradox and decision trees (cont.)
My point remains simply the following. The term "causality" introduces into the problem issues that do not have to be there, such as determinism, free will, cause and effect, etc. What does matter is a specification that, in the outcome-fixing process, fixing the value of variable X occurs before fixing the value of a variable Y, and Y depends on X. You like to call this situation a causality relation. Of course, in a mathematical theory you can choose any name you like, but then people are initially tempted to develop some intuition, which may be wrong due to the external meaning of the name you choose. The interpretation of this intuition outside the mathematical model often has real-life implications that may be wrong, for example, that X really causes Y. The decision tree is simply a way to display this additional chronological information, and simple directed graphs can of course encode that information more concisely. When you have to define precisely what these graphs mean, you refer back to a fuller description like the trees.
So, in summary, my only objection is to the use of the word "causality," and I never had doubts that chronological-order information is crucial to correct decision making based on past data.
As a thought experiment, imagine that we wish to write a program that automatically constructs decision trees from stories like those in Fig. 6.2(a)-(b)-(c). The program is given the empirical frequency tables and is allowed to ask us questions about chronological and dependence relationships among C, E, and F, but is not allowed to use any causal vocabulary. Would the program be able to distinguish between (a) and (c)? Note that all statistical-dependence information can be obtained from the frequency tables and, moreover, that dependence information relative to manipulating the control variable (which Section 1.5 defines as "causal" information) would not, in itself, be sufficient. See Section 6.3 for a discussion of why the program will fail.
Next Discussion (Kenny: Causality and the mystical error terms)