One-Way Analysis
Targeting
Two-Way Analysis
Regression
Most analyses use associations between variables in one way or another.
This applies both to targeting and to the study of possible causality,
and it therefore influences intervention design. The two major reasons for conducting this
type of analysis are
1) targeting, and
2) identifying possible associations for causality.
Associations can first be studied as one-on-one (or one-way), referring to the association
of a dependent (or outcome)
variable with an independent (or determining/classifying) variable. At this stage we can
begin to get a feel for the structure of the data using simple tabulations. More than
that, valid associations usually show up for the first time with such tabulations. If
significant associations do not show up, then they are unlikely to appear magically at a
later stage (although that can happen, and is interesting when it does).
Let's use the following question to start with our one-way investigation.
What is the relationship, at the household level, of irrigation and
cultivation to
food sufficiency in Somalia?


There are many situations in which transformations of the dependent or independent variables are helpful in regression.
It should be remembered that it may not be possible to find a set of transformations that will satisfy all objectives. For example, a transformation on the dependent variable to simplify a nonlinear relationship will destroy both homogeneous variances and normality if the original variable satisfied these. Transformations for homogeneity of variances and normality generally go together, but given the choice, variance is usually given precedence over normality. (J. Rawlings, S. Pantula, D. Dickey, 1998).
Here are a few suggestions for transformations for particular distributions:
As can be seen in the above histograms, the distributions of both variables
are strongly skewed: most values are bunched at the left (low) end of the scale, with a
long tail to the right. Because of this we need to perform logarithmic
transformations in order to use both variables in a regression. After
the transformation our new variables are much closer to a normal
distribution, or bell-shaped curve, than the non-transformed variables.


Notice that the "n" for logirr99 dropped from 432 to 335. The reason is that the logarithm of zero is undefined, so all cases with a value of zero are eliminated. This is important because the analysis now looks only at households that irrigated, leaving out the data for those that did not. The transformation must be done in order to carry out the linear analysis; an alternative would be to look for a better variable to transform, one where you would lose fewer or no cases. We are now ready to perform the regression analysis.
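The loss of cases described above can be reproduced with a small sketch. The function name `log_transform` and the six household values are hypothetical, chosen only for illustration; they are not taken from the Somalia data set.

```python
import math

def log_transform(values):
    """Return natural logs of the strictly positive values.

    Zero (and negative) values have no logarithm, so they are dropped,
    which is why "n" shrinks after the transformation.
    """
    return [math.log(v) for v in values if v > 0]

# Hypothetical irrigated areas for six households; two did not irrigate.
irr99 = [0.0, 1.5, 0.0, 3.0, 0.5, 12.0]
logirr99 = log_transform(irr99)

n_before = len(irr99)     # all households
n_after = len(logirr99)   # only households with a non-zero value
print(f"n before: {n_before}, n after: {n_after}")
```

Comparing `n_before` with `n_after` before running a regression is a quick way to see how many cases the transformation has excluded.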
Here is an example of a scatterplot:

There is a positive association between the level of food sufficiency and the area of land
under cultivation for the household.
This is shown by the slope of the fitted line (we will check in a moment to see if this is
significant). This positive slope is reasonable, considering that generally the more land
under cultivation the greater the amount of food produced.
Here, the fitted line shows the association between the outcome, food sufficiency, and
the independent variable, area under cultivation. It is calculated by minimizing the
sum of the squared vertical distances between each observation and the line
(hence the name, the least squares line). In other words, the line is placed among the points marked
on the graph so that the squared deviations from the line, added up over all points, are the
smallest possible for that set of points. The deviations from the line are also used to estimate
other features of the relationship between outcome and independent variables. One of these
is the coefficient of determination (R-squared), which tells the proportion of variability in
the outcome that is explained by the independent variable (here, the share of the variation in food sufficiency
explained by the area cultivated). An R-squared of
0.0031 tells us that very little of the variability in food sufficiency (about 0.3%) is explained
by cultivation.
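The least squares line and R-squared described above can be sketched in a few lines. This is a minimal illustration using only the standard formulas; the function name `least_squares` and the five data points are hypothetical and are not the Somalia data.

```python
def least_squares(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares: the quantity the fitted line minimizes.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variability of the outcome around its mean.
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)

    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical values: area cultivated (x) and a food sufficiency score (y).
area = [1, 2, 3, 4, 5]
sufficiency = [2, 4, 5, 4, 5]
slope, intercept, r_squared = least_squares(area, sufficiency)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R-squared={r_squared:.2f}")
```

In this made-up example the points lie fairly close to the line, so R-squared is high; a value as low as 0.0031 would mean the points are scattered almost independently of the line.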
More complex analysis is mostly concerned with further investigating the validity of simple associations: what modifies them, when and how they occur, and so on. (In most cases the associations have to be there in the first place.) Experience shows that a good place to start is looking at associations between underweight (weight-for-age below -2 standard deviations, the outcome variable) and variables such as water supply and sanitation, maternal education, housing quality, and literacy. These variables are often associated with anthropometric data. If no association exists, there may be doubts about the data set (but first make sure the data set is clean before proceeding!). Keep in mind that exploring simple associations is the indispensable first step in analysis, but don't rush to conclusions on the basis of raw associations: there is a long way to go before you can make confident inferences for program design (or evaluation).
Here are two examples: (figures in brackets refer to # with access to health services or < -2sd/total # in cell)
| Access to Health Services, by Literacy | | Prevalence <-2 sd WAZ, by Water | |
|---|---|---|---|
| Good* | Bad | Good* | Bad |
| 15.7% | 13.0% | 22.1% | 32.8% |
| p = 0.283 | | p = 0.004 | |
| * able to read at predetermined level | | * tap or well water | |
Inside the cells on the left is the outcome variable, delivery at a public or
private health center, which is a proxy (or substitute) for access to health
services; on the right is the prevalence of WAZ below -2 standard
deviations. The two classifying variables, literacy and water, are dichotomized (or split
in two) in order to look for differences in health service access or prevalence of
malnutrition between high and low levels of literacy and water source quality. Water and
literacy are split in two according to what is considered the mean (or average) level of
that particular variable. All values above the mean are considered good/high, and all
others bad/low (as shown, good literacy = able to read at predetermined level; for water,
tap and well are considered best).
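The splitting step described above can be sketched as follows. This is only an illustration of dichotomizing at the mean and tabulating an outcome by group; the helper names `dichotomize` and `prevalence_by_group`, the scores, and the outcome flags are all hypothetical.

```python
from statistics import mean

def dichotomize(values):
    """Label each value 'good' if it is above the mean, otherwise 'bad'."""
    cutoff = mean(values)
    return ["good" if v > cutoff else "bad" for v in values]

def prevalence_by_group(groups, outcome):
    """Return the percentage of cases with the outcome within each group."""
    counts = {}
    for group, has_outcome in zip(groups, outcome):
        events, total = counts.get(group, (0, 0))
        counts[group] = (events + (1 if has_outcome else 0), total + 1)
    return {g: 100 * events / total for g, (events, total) in counts.items()}

# Hypothetical scores for a classifying variable and an underweight flag.
scores = [1, 2, 3, 4, 5, 6]
underweight = [True, True, False, False, True, False]

groups = dichotomize(scores)
prevalence = prevalence_by_group(groups, underweight)
print(groups)
print(prevalence)
```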
The percentages in the
cells represent those who have access to health services, or WAZ < -2 sd, within each
category of literacy and water. The percentages are derived from the numbers in brackets
beneath them: 534 people are considered to have good literacy, and 84 of those 534 have
access to health services, equaling 15.7%. The difference in health service access between
good and bad literacy levels is small (2.7 percentage points) and not statistically
significant (p = .283, above the .05 threshold), but for
the prevalence of malnutrition by water source the difference is quite large (10.7 percentage
points) and significant (p = .004). What does this mean? Both variables show differences in the
expected directions: people with good levels of literacy have greater access to health
services (probably related to SES), and for those with a good water source the
prevalence of malnutrition is lower.
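The arithmetic behind a cell percentage and its p-value can be sketched as below. The 84 out of 534 comes from the text; the counts for the comparison group (65 out of 500) are hypothetical, since only the 13.0% is reported, so the p-value printed here will not reproduce the .283 in the table. The test used is a simple two-proportion z-test (equivalent to a 2x2 chi-square test without continuity correction), which is an assumption: the text does not say which test produced the reported p-values.

```python
import math

def compare_proportions(events_a, total_a, events_b, total_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (proportion_a, proportion_b, p_value).
    """
    p_a = events_a / total_a
    p_b = events_b / total_b

    # Pooled proportion under the null hypothesis of no difference.
    pooled = (events_a + events_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))

    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, p_value

# 84 of 534 is from the text; 65 of 500 is a hypothetical stand-in for 13.0%.
p_good, p_bad, p_value = compare_proportions(84, 534, 65, 500)
print(f"good: {100 * p_good:.1f}%, bad: {100 * p_bad:.1f}%, p = {p_value:.3f}")
```

A p-value above .05 here would be read the same way as in the table: the observed difference could plausibly be due to chance.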
Does this mean that
higher literacy leads to better health care access? Possibly, but the difference might be
a product of where people live, as in rural vs. urban: those living in rural areas
probably have less access to education and health care. It might also be related to
socio-economic levels: people who have a higher income or status level may have higher
literacy, better education opportunities, and also better access to health services (a
next step might then be to compare high and low SES levels by water source, health care
access, literacy, etc.). The compare means
routine can be used to analyze these questions.


These two tables show the difference in the percentage below -2 sd between safe and unsafe sources of water. The overall mean is 29.5%, but the difference between the two water-source groups is 10.7 percentage points, and we know that this difference is statistically significant because the p-value, at .001, is below .05. The findings suggest that further investigation should be conducted, for example regression analysis that tests water further while controlling for other significantly associated variables.