Regression is used:
-- with a continuous outcome (dependent) variable
-- with continuous and categorical (usually 0-1) determining (independent) variables
-- to look for interactions
-- to check findings by tabulation, and vice versa
Results are usually presented in tables and graphs, converted to prevalence to illustrate meaning.
This section introduces simple regression using two independent variables. It will provide the fundamentals of regression analysis. Regression is especially powerful for exploring more than two independent variables, and it will be most useful for multi-way analysis in the following module.
The following graph shows a simple (one independent variable) regression equation and where each of the variables is represented.
DEVELOPING A REGRESSION EQUATION
To start, it is a good idea to look at the associations between the independent variables and the outcome, using scatterplots that relate to your research question. Here are the two questions we will use for analysis in this section:
What is the relationship, at district level, between water and sanitation access and nutrition status in Bangladesh?
Does education have an association with nutrition status independent of SES in Kenya at the individual level?
As a first step in the analysis, use scatterplots to show one-way associations between the independent variables (water, sanitation, education, SES) and the outcome variables (nutritional status: either prevalence of low arm circumference in Bangladesh or mean weight-for-age z-scores in Kenya). Note that in district-level datasets such as Bangladesh, the individual data have been aggregated so that prevalence is a continuous variable; in individual-level datasets such as Kenya, the weight-for-age z-score itself is given for each individual.
CLICK HERE to see an exercise in Scatterplotting.
One type of scatterplot result: Prevalence of low AC by water source in Bangladesh, by district
There is a positive association between the prevalence of low arm circumference and the household's use of other (unsafe) water. This is shown by the slope of the fitted line; we will check in a moment whether it is significant. This positive slope is reasonable, considering that contaminated water can cause illness and therefore loss of growth.
Here, the fitted line shows the association between the outcome and the independent variable. It is calculated by minimizing the sum of the squared vertical distances between each observation and the line (hence, the least squares line). In other words, the line is placed so that the squared deviations from it, added up over all points, are the smallest possible for that set of points. The deviations from the line are also used to calculate other features of the relationship between the outcome and the independent variables. One of these is the coefficient of determination (R2), which tells the proportion of the variability in the outcome that is explained by the independent variable (here, the share of the variation in malnutrition across districts that is accounted for by water source). The scatterplot for low arm circumference and water source has an associated regression equation that looks like the one that follows (remember, the dependent variable goes on the left-hand side and the independent variable on the right-hand side):
EQUATION: Prevalence of low arm circumference = A + B (water)
or with real numbers using the categorized group for good and bad water source:
Prevalence of low arm circumference = 8.719 + 2.810 (bad water)
Sig. for water = 0.035
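The least squares calculation described above can be sketched in a few lines. This is only an illustration using the standard library: the function name `least_squares_line` and the six districts are hypothetical, not the Bangladesh data. It also shows that when the independent variable is coded 0-1, the slope B is simply the difference between the two group means.

```python
from statistics import mean

def least_squares_line(x, y):
    """Return (intercept, slope, r_squared) for the line y = A + B*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R2 = 1 - (squared deviations from the line) / (squared deviations from the mean)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical districts: bad_water is 1 where most households use unsafe water.
bad_water = [0, 0, 0, 1, 1, 1]
low_ac = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]  # prevalence of low AC, in %

A, B, r2 = least_squares_line(bad_water, low_ac)
print(A, B)  # intercept = mean of the good-water group, slope = gap between groups
```

With these made-up numbers the good-water districts average 9% and the bad-water districts 12%, so the fitted line has A = 9 and B = 3.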
BANGLADESH REGRESSION WITH WATER AND SANITATION
Putting both water and sanitation into the equation for prevalence of low arm circumference gives an equation like this:
Prevalence of low arm circumference = A + B1 (water) + B2 (sanitation)
Actual regression output data for prevalence of low arm circumference as the OUTCOME and water and sanitation as the INDEPENDENT VARIABLES is produced below.
Regression Output for Prevalence of Low Arm Circumference
with Water and Sanitation in Bangladesh
INTERPRETATION: In the equation both water and latrine have a significant effect on nutritional status. With both variables categorized as 0-1, each coefficient is a difference in prevalence between the two groups: safe latrine use is associated with a prevalence of low arm circumference 3.558 percentage points lower, and use of other (unsafe) water with a prevalence 3.296 percentage points higher. Calculating relationships only in this way is, however, unrealistic. It assumes that the effects of water and latrine are simply additive. The four cells in the two-by-two table are easy to calculate (shown below), but we have 'forced' the relationship to allow only for additive effects. In more technical (but also more usual) language, the model may have been inadequately specified by not testing for an interaction term. What if you add a variable that tests whether the effect of water source depends on latrine use, and so controls for their joint effect in the model? This would be called an INTERACTION term.
Interaction terms are not something to throw around, though. They are included only when there is a strong argument that the two variables might have a relationship with each other as well as with the outcome, or when there is strong evidence in the literature that the variables have an interactive effect.
Put the interaction term (=variable) in the equation with the original variables as well; if it is significant (usually p<0.05, unless it is a highly interesting variable) keep it in; if not, drop it and rerun the regression equation without it. The coefficients of the variables will be affected by the interaction term because of multicollinearity.
If interaction exists, the lines for the two groups will have significantly different slopes. So, plot the results as graphs; if necessary, set the other variables to their mean for the calculation.
In this case it is important to run a variable in the regression equation to account for interaction. An interaction variable is the product of the two individual variables, for example latrine times water = interaction of water and latrine (wat_lat).
The Rule is: If the interaction term is significant (usually at p<0.05), it should remain in the final equation. Sometimes this leads to a lowering of the individual coefficients in the model due to the collinearity of the variables, which will be dealt with later and cannot be helped. If the interaction term is not significant, it is dropped and the model is rerun without it.
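As a rough sketch of what "running the interaction variable in the equation" involves, the example below builds the product term wat_lat and fits the coefficients by ordinary least squares using only the standard library. The helper names `solve` and `fit_ols` and the eight districts are hypothetical, and the sketch stops at the coefficients: the significance test in the rule above would come from a statistics package.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical districts: (bad_water, safe_latrine, prevalence of low AC in %)
districts = [(1, 0, 15.0), (1, 0, 14.0), (1, 1, 9.0), (1, 1, 8.0),
             (0, 0, 8.5), (0, 0, 7.5), (0, 1, 9.0), (0, 1, 8.0)]

# Each design row: constant, water, latrine, and their product (wat_lat)
X = [[1, w, l, w * l] for w, l, _ in districts]
y = [p for _, _, p in districts]

A, B1, B2, B3 = fit_ols(X, y)
print([round(v, 3) for v in (A, B1, B2, B3)])
```

Because water and latrine are both 0-1 here, wat_lat equals 1 only in districts that have bad water and a safe latrine, and B3 measures how far that cell departs from the purely additive prediction A + B1 + B2.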
BANGLADESH REGRESSION EXAMPLE CONTINUES (with an Interaction term):
Prevalence of Low Arm Circumference by Latrine Use and Water supply
(parentheses represent number of districts)
|Water source||Latrine Bad||Latrine Good||TOTAL|
|Bad||14.5% low AC (19)||8.5% low AC (19)||11.5% low AC (38)|
|Good||8.2% low AC (13)||8.4% low AC (12)||8.3% low AC (25)|
|TOTAL||12.0% low AC (32)||8.5% low AC (31)||10.2% low AC (63)|
There are 19 districts with both poor latrine and poor water access that have an overall higher prevalence of low AC (14.5%) compared to other areas. Those areas with either better access to latrine (>45% of the people in the district), better access to safe water (<50% use unsafe water), or both show a lower prevalence of malnutrition (all approximately 8 to 8.5%). It is clear from this presentation that there is not a noteworthy difference between the areas that have good water or good latrine and those that have both. This would lead us to believe that there is an interactive effect between water and sanitation. Take a look at the graph.
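One way to put a number on this impression is to compare the effect of bad water within each latrine group, using the cell values from the table. The snippet below is a simple arithmetic check, not a significance test; the variable names are made up for illustration.

```python
# Prevalence of low AC (%) from the table, keyed by (water, latrine)
low_ac = {("bad", "bad"): 14.5, ("bad", "good"): 8.5,
          ("good", "bad"): 8.2, ("good", "good"): 8.4}

# Effect of bad water within each latrine stratum
water_effect_bad_latrine = low_ac[("bad", "bad")] - low_ac[("good", "bad")]
water_effect_good_latrine = low_ac[("bad", "good")] - low_ac[("good", "good")]

# If the effects were purely additive, these two differences would be equal.
interaction_contrast = water_effect_good_latrine - water_effect_bad_latrine

print(round(water_effect_bad_latrine, 1))   # 6.3
print(round(water_effect_good_latrine, 1))  # 0.1
print(round(interaction_contrast, 1))       # -6.2
```

Bad water is associated with about 6.3 points more low AC where latrines are bad, but almost none where latrines are good. That gap between the two differences is what an interaction term is meant to capture.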
We can allow the lines to have different slopes, which is the same thing as allowing the effects to be greater or less than additive, by creating an interaction term (multiplying the water and latrine variables together). This gives the equation, or model:
Prevalence of low arm circumference = A + B1 (water) + B2 (sanitation) + B3 (water*sanitation)
When we run this model, the results are:
Prevalence of low AC = 8.238 + 6.283 (bad water) + 0.137 (safe latrine) - 6.121 (water & latrine interaction)
This equation includes an interaction term for water and latrine (wat_lat). The term for bad water is associated with a significant increase in the prevalence of malnutrition (p=0.000), but the term for latrine becomes insignificant (p=0.939) when the interaction term is added to the model, due to collinearity between the water and latrine variables and their product. The interaction term therefore needs to be retained in the model to control for this effect; leaving it out would lead to the inappropriate assumption that water had a lesser effect on nutritional status. In the previous output, without the interaction term, bad water was associated with a 3.3 percentage point higher prevalence of low arm circumference; with the interaction term the figure is 6.3 points. For latrine, the coefficient was larger without the interaction term (3.6 points) than when controlling for the interaction (0.14 points). Note that once the interaction term is in the model, each of these coefficients describes the effect of one variable where the other is coded 0.
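To see what the equation means in terms of prevalence, the coefficients can be plugged in for each combination of water and latrine. The sketch below assumes the coding suggested by the labels in the equation (bad_water = 1 for unsafe water, safe_latrine = 1 for a safe latrine, and the interaction as their product); the function name is hypothetical.

```python
def predicted_low_ac(bad_water, safe_latrine):
    """Prevalence of low AC (%) from the reported interaction model."""
    return (8.238 + 6.283 * bad_water + 0.137 * safe_latrine
            - 6.121 * bad_water * safe_latrine)

# Predicted prevalence for all four combinations, keyed by (bad_water, safe_latrine)
cells = {(w, l): round(predicted_low_ac(w, l), 1) for w in (0, 1) for l in (0, 1)}

print(cells[(1, 0)])  # bad water, bad latrine   -> 14.5
print(cells[(1, 1)])  # bad water, good latrine  -> 8.5
print(cells[(0, 0)])  # good water, bad latrine  -> 8.2
print(cells[(0, 1)])  # good water, good latrine -> 8.4
```

Under that coding, the four predicted values match the four cells of the two-by-two table, as they should for a model with two 0-1 variables and their interaction.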
TAKE A LOOK at a Regression model with an INTERACTION term for the Bangladesh dataset.
KENYA'S REGRESSION WITH EDUCATION AND ROOFING:
Now we will address the research question for Kenya:
Does Education have an effect on Nutritional status independent of SES?
EQUATION: WAZ Score = y-intercept + B1 (EDUCATION)+ B2 (ROOF)
When you run the regression, test for interaction and rerun without the interaction term if it is not significant. In this case, the interaction was dropped (not significant) and the results are:
WAZ Score = - 0.987 - 0.420 (LOW EDUCATION) - 0.158 (BAD ROOF)
INTERPRETATION: The final regression results for education and roofing, with WAZ score as the outcome, are shown above. The regression coefficient for education (B = -0.420) is still significant (p=0.000) even when roofing, which appears in the output as DBADRO (dummy for bad roof), is added. As you will see when you run the next exercise, roofing is not significant when run with education (model 4), although it was in a model on its own (model 2); it is kept in the model to control for the contribution of SES. The coefficient for education changes only slightly (from -0.448 to -0.420) when roofing is added, and it remains significant. The interaction between roofing and education was not significant, so the model was simplified to include roofing and education individually. Our question, whether education has an influence on nutritional status above and beyond socio-economic status (estimated by roofing), has now been answered, or at least we have made a start. We can also look within strata of roofing to see if the education effect still exists, which will be covered under confounding. The routine of graphing and tabulating prevalence also helps to "see" the relationships here.
Mean WAZ Score by Education and Roofing Quality
(parentheses represent the number of individuals in the category)
|Educational Attainment (respondent)||Roof Quality Bad (grass/ thatch)||Roof Quality Good (corrugated iron)||TOTAL|
|Bad (less than primary)||-1.57 (170)||-1.40 (241)||-1.47 (411)|
|Good (primary +)||-1.28 (59)||-0.99 (225)||-1.02 (284)|
|TOTAL||-1.46 (229)||-1.20 (466)||-1.29 (695)|
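Looking within strata of roofing can be done directly from the table above. The sketch below computes the education gap in mean WAZ within each roof group and, for comparison, the value the reported additive equation gives for one cell. The names are hypothetical, and the coding (low_education = 1, bad_roof = 1) is assumed from the labels in the equation.

```python
# Mean WAZ by (education, roof), copied from the table above
mean_waz = {("low", "bad"): -1.57, ("low", "good"): -1.40,
            ("high", "bad"): -1.28, ("high", "good"): -0.99}

# Education effect (low minus high) within each roofing stratum
education_gap = {roof: round(mean_waz[("low", roof)] - mean_waz[("high", roof)], 2)
                 for roof in ("bad", "good")}

def predicted_waz(low_education, bad_roof):
    """Mean WAZ from the reported additive model (no interaction term)."""
    return -0.987 - 0.420 * low_education - 0.158 * bad_roof

print(education_gap)                   # {'bad': -0.29, 'good': -0.41}
print(round(predicted_waz(1, 1), 3))   # low education, bad roof
```

The education gap is negative in both roof groups, which is the pattern the regression summarizes with a single coefficient. An additive model is not expected to reproduce every cell mean exactly, so small differences between the table and the equation are normal.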
The graph concurs with the results of the regression analysis: it shows no significant difference in the slopes of the lines (and therefore no interaction between the educational attainment of the respondent and the roofing quality of the home). The graph does illustrate a meaningful association between malnutrition and educational attainment: the difference in nutritional status between the low education and high education groups is large regardless of roofing quality. You can also see a sizeable difference between the roofing subgroups. Although it is not significant, it is still worth noting that those with a poor roof are worse off overall than those with a better roof.
CLICK HERE to see the programming for Regression analysis with Interaction - WAZ by education and roofing in Kenya.
ASKING QUESTIONS TO INVESTIGATE CAUSAL RELATIONSHIPS
Although regression provides a very powerful tool for investigating causal relationships, the results must meet the test of PLAUSIBILITY and usually should show STATISTICAL SIGNIFICANCE (exceptions are made, depending on sample size). Always ask a question that relates to causality for program design.
This might include:
|Socio-economic and health factors|
|Nutritional factors for direct intervention, for example:|
Always pose a question first and stick with it till the answer is reached. If you get lost, repeat the question.
THE QUESTION: Does measles immunization improve nutritional status? How does age affect this relationship?
It is a good idea to start with a question and then draw up a hypothesis based on a logical framework. Spiders (diagrams of the relationships between variables) are useful for expressing the relationships to explore in a regression analysis. Look for factors that might affect nutritional status and that are on the same level, but not on the same causal pathway.
For example, look at socio-economic status and water supply in the model to predict nutritional status, but do not look at water supply and diarrhea because they are on the same causal pathway. Poor water can lead to diarrhea, which leads to malnutrition. It can be helpful to start the process of formulating a question by drawing out the relationships between the variables of interest.
SPIDERS
Here is a small spider for the measles immunization and malnutrition model. It might not seem terribly obvious, but when considering relationships between variables, age could affect both measles immunization status, since more of the older children are immunized, and nutritional status, since it is known that many children in developing countries falter significantly in growth during the first few years of life. This spider is a very simple drawing of the relationships between an outcome variable and two independent variables that could themselves be associated.
The first step after drawing a spider is scatterplotting for measles immunization and WAZ score in Kenya. This will show the visual picture of any linear relationship between measles and nutritional status.
CLICK HERE for instructions on how to run a Scatterplot for Measles immunization and Age.
How do the results look? The fit line in this plot slopes downward, so that as % measles immunization increases, the mean WAZ score decreases. This is not what one would expect. Something in the relationship between measles immunization, age, and the outcome is obviously causing an unexpected result. This is an example of unexpected confounding, which will be described in more detail as we continue.
REGRESSION WITH MEASLES AND AGE
Try running a regression analysis to look a little more in depth at how the variables are behaving with one another. Also, this is an opportunity to test interaction between measles and age. What happens?
The regression with measles and age (to predict the nutritional status of a child, the WAZ score) shows a strong interaction between age and measles. Now that the interaction has been detected and entered into the model, it is necessary to test for confounding, which is covered in the next section. Confounding is suspected here because, even after accounting for the interaction, there is still an unexpected association between measles immunization and nutritional status. The relationship expected between immunization and nutritional status is positive, but what was found was negative (the regression coefficient for measles immunization is -0.875) even with the interaction variable (measles and age) in the model, so there must be something that is not accounted for. It has previously been documented that age could have a confounding effect on measles and nutritional status; our next endeavor is therefore to learn what confounding is and how it crept up on us unexpectedly.
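When reading the measles coefficient, it helps to write the model out. Assuming it has the same form as the Bangladesh interaction model (an assumption, since only the measles coefficient is quoted here), with age entered as measured and the interaction as the product of measles and age:

```latex
\begin{align*}
\text{WAZ} &= A + B_1\,(\text{measles}) + B_2\,(\text{age}) + B_3\,(\text{measles} \times \text{age}) \\
\text{difference in WAZ, immunized vs. not, at a given age} &= B_1 + B_3 \cdot \text{age}
\end{align*}
```

Under that assumption, B1 = -0.875 on its own is the immunized versus non-immunized difference at age 0; at any other age the interaction coefficient B3 has to be taken into account as well.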
CLICK HERE to see the programming details of the Regression with Measles.
Find more on Confounding in the following page in this chapter: