
Causal Models to Address Confounding
As usual, we need to specify what we are trying to find out and why. Let’s continue with an earlier example, and imagine that we are investigating the effect of mothers’ education on child nutritional status, perhaps as part of an analysis of broad policies that could affect nutrition in the long term. We are using cross-sectional data, derived from a DHS or MICS survey (this actual example comes from a DHS survey). The research question is:
Does maternal education level have an association with child wt/age independent of socioeconomic and environmental variables?
Of course, an immediate issue is whether we have variables that are of any use in measuring these factors; let’s assume so, for the moment.
We need to make explicit the relevant relationships between the variables we have in terms of likely causal pathways. This also reminds us of what we would like to measure and what the variables mean. In this case, for example, we might think that one route by which maternal education affects child nutrition is via improved caring practices, but we don’t have a good variable for those. A fairly simple conceptual model (or ‘spider’ as we fondly refer to such creatures) can be quickly drawn: try to draw this Multiway Spider and CLICK HERE to compare. The variables that we are going to refer to are: maternal education, housing (roof), toilet, water supply, and a derived proxy for income (‘vincome’ for ‘virtual income’; note that this is not usually available and has been included here to give a continuous variable for practice purposes).
From one- and two-way analyses, we know that all these variables are associated to some degree with wt/age. Formally, to act as confounders they would also need to be associated with education. It may be useful to examine the bivariate associations between all these variables by running the Correlation Matrix (CLICK HERE). On the other hand, this is not essential, and given the temptation to fish indiscriminately for associations, some analysts advise restricting the number of correlations run.
We are examining the association between low education level of the mother and malnutrition (low wt/age) in the child. A simple regression gives the following results:
| Dependent Variable | Independent Variable | Constant | Coefficient * | t (P) | N | Adjusted R square |
|---|---|---|---|---|---|---|
| WAZ, wt/age z-score | DLOWEDN, low education level | -1.025 | -0.448 | -4.887 (0.000) | 698 | 0.032 |
(*Note that the unstandardized coefficient is always used for these purposes)
Those with low education have on average a wt/age z-score 0.448 units lower than the other group (there are only two) with higher education; the difference is highly significant. The equation can be written: wt/age = -1.025 - 0.448 (DLOWEDN), where DLOWEDN = 1 when education level is incomplete primary or less, and 0 otherwise.
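As a small worked example, the equation can be evaluated for the two education groups. This is only a sketch: the constant and coefficient are taken as negative, which is what ‘0.448 units less’ and a typical survey mean z-score imply, and the function name `predicted_waz` is ours, not part of any package.

```python
# Values from the simple regression of WAZ on DLOWEDN reported above.
# The negative signs are an assumption consistent with the text.
CONSTANT = -1.025       # predicted WAZ when DLOWEDN = 0 (higher education)
COEF_DLOWEDN = -0.448   # change in WAZ when DLOWEDN = 1 (low education)


def predicted_waz(dlowedn: int) -> float:
    """Predicted wt/age z-score for a 0/1 value of DLOWEDN."""
    return CONSTANT + COEF_DLOWEDN * dlowedn


higher_education = predicted_waz(0)
low_education = predicted_waz(1)
print(f"DLOWEDN = 0: predicted WAZ {higher_education:.3f}")
print(f"DLOWEDN = 1: predicted WAZ {low_education:.3f}")
```

With a single 0/1 independent variable, the constant is simply the mean for the group coded 0, and the constant plus the coefficient is the mean for the group coded 1.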
It’s not intuitive whether nearly 0.5 units of wt/age is a lot (it is), so the equivalent difference in prevalence might be calculated. Using the 0/1 prevalence variable for WAZ as the dependent variable in the regression equation results in:
| Dependent Variable | Independent Variable | Constant | Coefficient | t (P) | N | Adjusted R square |
|---|---|---|---|---|---|---|
| WAPREV, prev WAZ < -2 SD | DLOWEDN, low education level | 0.181 | 0.195 | 5.675 (0.000) | 698 | 0.043 |
The change in prevalence equivalent to the 0.448 change in z-score is 19.5 percentage points, against an average prevalence of around 30%. Exactly the same result would be obtained using the ‘compare means’ routine, GLM, ANOVA, or others.
There is a strong and highly significant association between low education of the mother and nutritional status of the child. Is it spurious, due to better-educated people being richer, or is there a direct effect? We can study this by progressively entering the potential confounding variables and seeing what happens to the associations.
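The idea of entering a potential confounder and watching the coefficient can be sketched as follows. Everything here is an assumption made for illustration: the data are synthetic, the effect sizes are invented, and `ols` is a small helper written for this sketch (least squares via the normal equations), not a routine from SPSS or any library.

```python
import random


def ols(predictors, y):
    """Least-squares coefficients [constant, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + [col[i] for col in predictors] for i in range(len(y))]
    k = len(rows[0])
    # Build X'X augmented with X'y as the last column.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [x - factor * z for x, z in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]


random.seed(1)
n = 2000
# Synthetic data: income tends to be lower when education is low,
# and both education and income affect the outcome.
low_edu = [random.randint(0, 1) for _ in range(n)]
income = [random.gauss(1.0 - 0.8 * e, 1.0) for e in low_edu]
waz = [random.gauss(-1.0 - 0.3 * e + 0.2 * v, 1.0)
       for e, v in zip(low_edu, income)]

crude = ols([low_edu], waz)[1]              # education alone
adjusted = ols([low_edu, income], waz)[1]   # education, controlling for income

print(f"education coefficient, alone:             {crude:.3f}")
print(f"education coefficient, income controlled: {adjusted:.3f}")
```

In this invented example part of the crude association runs through income, so the education coefficient moves towards zero once income is in the model, but does not vanish. That is the pattern to look for in the summary table.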
In this step, the computing produces reams of printout, and extracting from these onto summary tables is a good idea for convenience and for ease of interpretation. A suitable format is to extend the tables by adding columns, but swapping columns and rows gives more space for testing several variables and their combinations.
For the details of the Multiway Regression computing steps that are being summarized CLICK HERE.
The results look like this:
Dependent variable = WAZ. Entries by model number: coefficient (t, p).

| Independent Variable | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| DLOWEDN | -0.448 (-4.887, 0.000) | | -0.420 (-4.456, 0.000) | -0.441 (-4.325, 0.000) | -0.386 (-4.019, 0.000) | | -0.321 (-3.210, 0.001) |
| DBADROOF | | 0.253 (2.594, 0.010) | 0.158 (1.608, 0.108) | 0.137 (1.337, 0.182) | 0.124 (1.208, 0.227) | | 0.066 (0.616, 0.538) |
| NOTOILET | | | | 0.102 (0.777, 0.477) | 0.075 (0.558, 0.577) | | 0.013 (0.089, 0.929) |
| DPIPED | | | | | 0.240 (1.659, 0.098) | | 0.120 (0.688, 0.492) |
| DWELL | | | | | 0.036 (0.338, 0.735) | | 0.017 (0.152, 0.879) |
| VINCOME | | | | | | 0.449 (5.416, 0.000) | 0.383 (4.476, 0.000) |
| N | 698 | 694 | 694 | 694 | 694 | 637 | 633 |
| Adj R sq | 0.032 | 0.008 | 0.034 | 0.034 | 0.035 | 0.043 | 0.057 |
This tells us quite a lot of what we need to know. For a start, the answer to whether education has an independent effect on wt/age appears to be ‘Yes’: the coefficient remains highly significant and shrinks only moderately in size (from 0.448 in model 1 to 0.321 in model 7) when controlling for a set of potentially important confounders. At least two, roofing and income, are by themselves significantly associated with wt/age (models 2 and 6), and thus are acting as useful controls. Although the others do not have significant associations, they are nonetheless clearly not alternative explanations for the association of education with wt/age. As to the size of the effect: taking account of other factors, improved education now appears to lead to an increase of around 0.32 units in wt/age z-score. This can be translated into prevalence terms. Rerunning model 7 with low wt/age prevalence would give a reasonable estimate of the size of the predicted prevalence change. We would expect a difference of around 14 to 15 percentage points, based on the numbers given earlier. Why not rerun model 7 with low wt/age (waprev) as the dependent variable and see what it comes to? CLICK HERE to see Multiway Regression Prevalence results.
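The expected prevalence difference can be roughed out from the numbers already given. This is back-of-envelope arithmetic only: it assumes the prevalence difference shrinks in the same proportion as the z-score coefficient, which is a simplification and no substitute for rerunning model 7.

```python
# Sizes of coefficients reported earlier (magnitudes only).
crude_waz = 0.448      # DLOWEDN on WAZ, model 1
adjusted_waz = 0.321   # DLOWEDN on WAZ, model 7
crude_prev = 0.195     # DLOWEDN on WAPREV, simple regression

# Assumption: prevalence difference scales in proportion to the z-score coefficient.
retained = adjusted_waz / crude_waz
expected_prev = crude_prev * retained

print(f"share of crude coefficient retained: {retained:.1%}")
print(f"expected adjusted prevalence difference: {expected_prev:.3f}")
```

Under that assumption the adjusted coefficient keeps a little over 70% of its crude size, and the corresponding prevalence difference comes to about 0.14.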
There is some multicollinearity evident, as expected: for example, the coefficient for roofing falls substantially, and becomes insignificant, when it is in the model (no. 3) with education.
With 6 independent variables (and there are several hundred in a typical dataset) there are already a large number of possible combinations that could be tried, so it is essential to know what one is trying to investigate when specifying a model. Any number of combinations can be run quickly with a modern computer; the constraints are in interpreting the results and finding useful answers to relevant questions.
A related approach with similar variables is taken for other research questions. While education is open to change by long-term policies, we may be interested in more immediate program decisions, for example studying possible effects of water and sanitation interventions on wt/age. Here the research question might be stated as:
Do water supply and/or sanitation have likely effects on malnutrition, accounting for (or independent from) their association with improved education and other socioeconomic factors?
From the table above, we know that water and sanitation variables are not highly associated with underweight in the regression equation when education is also in the equation (‘model’ and ‘equation’ are here, as is usual, used interchangeably). Model 5 in the table above tells us that ‘DPIPED’ has a p-value of about 0.1, representing some degree of association. We also would expect, on principle, that those with better education would have improved water and sanitation. This is supported by the reduction in the coefficient for ‘DLOWEDN’ in model 5 (0.386) compared to model 1 (0.448), which is a sign of multicollinearity.
We can investigate the multicollinearity directly by looking at the bivariate correlation coefficients. Strictly, more appropriate methods for examining associations between categorical variables should be used, chi-squared for instance. But in practical analysis, matrices of Pearson’s (not Roger) correlation coefficients are commonly used as a rough check on collinearity.
In this case (CLICK HERE for the Full Matrix) the correlation coefficients (r) are: DLOWEDN with piped water, r = -0.152 (p = 0.000); with well water, r = 0.038 (0.265); with no toilet access, r = 0.169 (0.000). In interpreting correlation coefficients, a first question is whether the sign is in the expected direction. In this case, they all are: for example, the negative sign between low education and piped water means that when there is lower education (DLOWEDN = 1) piped water tends not to be present (DPIPED = 0). Both the estimated size of the coefficient and its significance vary with sample size, so it is necessary to interpret with caution when sample sizes are low (say < 50); equally, very small and relatively unimportant coefficients can be significant for large sample sizes (in the thousands). For these and other reasons, such as sensitivity to outliers, the correlation coefficients are not a primary tool for exploring associations and are seldom used in robust practice for describing these. But as used here, they do warn us that there is indeed enough collinearity between the independent variables to potentially interfere with the analysis.
In any event, we would first investigate the relationship of wt/age (waz) with the sanitation and water variables by regression, beginning with the bivariate relationships, as shown in models 1 and 2 in the table below. These show that each has a significant effect in the expected direction, that of piped water being stronger than that of toilet. At this point we would check that we are not dealing with small cell sizes (that there is a sufficient number with piped water, for example); this and other details of the full analysis summarized here are in the subfile reached by CLICKING HERE. By including both water and sanitation in the same model (3) we see that they do interfere with each other: the toilet coefficient is no longer significant, while piped water remains significant. We might keep both in the model, for now, to move on to looking at the likely confounding effect of education level. In model 4 DLOWEDN is now entered, and indeed the coefficient size and significance of both water and sanitation fall substantially.
Dependent variable = WAZ. Entries by model number: coefficient (t, p).

| Independent Variable | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Cases included | All | All | All | All | DLOWEDN=1 | DLOWEDN=0 | DLOWEDN=0 |
| NOTOILET | 0.255 (2.020, 0.044) | | 0.209 (1.641, 0.101) | 0.123 (0.965, 0.335) | | | |
| DPIPED | | 0.369 (2.683, 0.007) | 0.335 (2.408, 0.016) | 0.237 (1.708, 0.088) | 0.096 (0.415, | 0.353 (2.121, 0.035) | 0.344 (2.048, 0.042) |
| DLOWEDN | | | | -0.409 (-4.335, 0.000) | NA | NA | NA |
| DBADROOF | | | | | | | 0.100 (0.615, 0.546) |
| N | 697 | 697 | 697 | 697 | 411 | 285 | 283 |
| Adj R sq | 0.004 | 0.009 | 0.011 | 0.036 | 0.002 | 0.012 | 0.010 |
The interpretation of model 4 is that we can no longer be sure whether or not piped water has an effect on wt/age when education is taken into account, because education is correlated both with water supply and with wt/age. Had it remained highly significant, that would be good evidence for an independent effect; but the converse is less certain.
But all is not lost. We can continue by examining the relation of water supply to wt/age within category of DLOWEDN. This is done in models 5 and 6. (SPSS allows the selection of cases included in a model by values of another variable, DLOWEDN in this case; otherwise a ‘select if’ or equivalent condition can be imposed.)
Models 5 and 6 tell us that the effect of piped water on wt/age only occurs for the better-educated group (DLOWEDN = 0, model 6). There is no significant effect for the less educated group (DLOWEDN = 1, model 5). This can also be studied by testing the interaction term in the full model (you can try this out later), which will give an equivalent result. But the result probably goes far enough for the present research question as applied to program planning. This is a common way of dealing with multicollinearity under these types of conditions.
The last thing one might want to test is if the strong relationship in model 6 is itself confounded by other socioeconomic factors. For this, one might want to enter the variable for housing quality, DBADROOF, and see what happens to the water supply coefficient. As model 7 shows, it scarcely changes. With some more investigation, for example trying other possible confounders, and checking that we are not getting weird results because of a few outliers, or very small cell sizes (‘compare means’ in SPSS is good for this), we are now well on the way to establishing a possible causal relation between water supply and wt/age.