

Most analyses rely on associations between variables in one way or another. Associations inform both targeting and the search for possible causes, and therefore influence intervention design. Thus the two major reasons for conducting this type of analysis are
                            1) targeting, and
                            2) identifying possible associations for causality.

Associations can first be studied as one-on-one (or one-way), referring to the association of a dependent (or outcome)
variable with an independent (or determining/classifying) variable. At this stage we can begin to get a feel for the structure of the data using simple tabulations. More than that, valid associations usually show up for the first time with such tabulations. If significant associations do not show up, then they are unlikely to appear magically at a later stage (although that can happen, and is interesting when it does). 


Let's use the following question to start with our one-way investigation.

How are irrigation and cultivation related to food sufficiency
at the household level in Somalia?

[Figures: culthisto.jpg and irr99histo.jpg, histograms of the cultivation and irrigation variables]

 

Data Transformations

There are many situations in which transformations of the dependent or independent variables are helpful in regression.

It should be remembered that it may not be possible to find a set of transformations that will satisfy all objectives. For example, a transformation on the dependent variable to simplify a nonlinear relationship will destroy both homogeneous variances and normality if the original variable satisfied these. Transformations for homogeneity of variances and normality generally go together, but given the choice, variance is usually given precedence over normality. (J. Rawlings, S. Pantula, D. Dickey, 1998).

Here are a few suggestions for transformations for particular distributions:

Notice that the "n" for logirr99 dropped from 432 to 335. The logarithm of zero is undefined, so every case with a zero value is eliminated. This matters because the analysis now covers only the households that irrigated and leaves out those that did not. The transformation is needed for the linear analysis, but one option is to look for a better variable to transform, one that loses fewer or no cases. We are now ready to perform the regression analysis.
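The loss of cases under a log transformation can be sketched in a few lines of Python. The values below are made up for illustration (they are not the survey data), and the variable names simply mirror irr99 and logirr99 from the text.

```python
import math

# Hypothetical irrigated-area values (NOT the survey data).
# Zeros stand for households that did not irrigate.
irr99 = [0.0, 0.5, 2.0, 0.0, 1.0, 4.0, 0.0, 0.25]

# The natural log is undefined at zero, so only positive values
# can be transformed; the zero cases drop out of the analysis.
logirr99 = [math.log(x) for x in irr99 if x > 0]

n_before = len(irr99)
n_after = len(logirr99)
n_dropped = n_before - n_after

print(f"n before: {n_before}, n after log: {n_after}, dropped: {n_dropped}")
```

Counting the dropped cases before running the regression makes it clear how much of the sample the transformed variable still represents.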

Here is an example of a scatterplot:

[Figure: FScult.jpg, scatterplot of food sufficiency against area under cultivation, with fitted line]

There is a positive association between the level of food sufficiency and the area of land the household has under cultivation.
This is shown by the slope of the fitted line (we will check in a moment whether it is significant). A positive slope is reasonable, since in general the more land under cultivation, the greater the amount of food produced.

Here, the fitted line shows the association between the outcome, food sufficiency, and the independent variable, area under cultivation. It is calculated by minimizing the sum of the squared vertical distances between each observation and the line (hence, the least squares line). In other words, the line is placed so that the squared deviations from it, added up over all the points marked on the graph, are the smallest possible for that set of points. The deviations from the line are also used to estimate other features of the relationship between the outcome and independent variables. One of these is the coefficient of determination (R-squared), which gives the proportion of the variability in the outcome that is explained by the independent variable (here, the share of the variation in food sufficiency accounted for by the area cultivated). An R-squared of 0.0031 tells us that very little of the variability in food sufficiency (about 0.3%) is explained by cultivation.
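As a minimal sketch of how a least squares line and its R-squared are obtained, the Python below fits a line to five made-up points. The data and variable names are illustrative assumptions, not the Somalia survey values.

```python
# Hypothetical data (NOT the survey data):
# x = area under cultivation, y = food sufficiency score.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products around the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Slope and intercept that minimize the sum of squared vertical deviations.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R-squared: share of the total variation in y explained by the line.
fitted = [intercept + slope * xi for xi in x]
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R-squared={r_squared:.2f}")
```

With these made-up points the line explains a large share of the variation. In the scatterplot above the share is only 0.0031, so the points are scattered almost independently of the fitted line.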


More complex analysis is mostly concerned with further investigating the validity of simple associations: what modifies them, when and how they occur, and so on. (In most cases the associations have to be there in the first place.) Experience shows that a good place to start is with associations between underweight (the outcome variable, defined as weight-for-age below -2 standard deviations) and variables such as water supply and sanitation, maternal education, housing quality, and literacy. These variables are often associated with anthropometric data. If no association exists, there may be doubts about the data set (but first make sure the data set is clean before proceeding!). Keep in mind that exploring simple associations is the indispensable first step in analysis, but don't rush to conclusions on the basis of raw associations: there is a long way to go before you can make confident inferences for program design (or evaluation).

Here are two examples (figures in brackets are the number with access to health services, or the number < -2 sd, over the total number in the cell):

                 Access to Health Services      Prevalence < -2 sd WAZ
                 by Literacy                    by Water
                 Good*          Bad             Good**         Bad
                 15.7%          13.0%           22.1%          32.8%
                 (84/534)       (42/322)        (47/213)       (159/485)
                 p = 0.283                      p = 0.004

* able to read at predetermined level   ** tap or well water

The cells on the left show the outcome variable ‘delivery at a public or private health center’, which is a proxy (or substitute) for access to health services; the cells on the right show ‘prevalence of WAZ < -2 standard deviations’. The two classifying variables, literacy and water, are dichotomized (split in two) in order to look for differences in health service access or prevalence of malnutrition between high and low levels of literacy and water source quality. Each is split according to what is considered the mean (or average) level of that variable: all values above the mean are considered good/high, and all others bad/low (as shown, good literacy = able to read at a predetermined level; for water, tap and well are considered best).
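Dichotomizing at the mean can be sketched as follows. The scores are hypothetical and only illustrate the rule described above (above the mean is "good", everything else is "bad").

```python
# Hypothetical scores for a classifying variable (NOT the survey data).
scores = [2, 5, 7, 3, 9, 4]

mean_score = sum(scores) / len(scores)

# Values strictly above the mean are "good"; all others are "bad".
labels = ["good" if s > mean_score else "bad" for s in scores]

n_good = labels.count("good")
n_bad = labels.count("bad")

print(f"mean={mean_score}, good={n_good}, bad={n_bad}")
```

A value exactly equal to the mean falls in the "bad" group under this rule, so it is worth checking how many cases sit right at the cut point.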
            The percentages in the cells represent those who have access to health services, or WAZ < -2 sd, within each category of literacy and water. They are derived from the numbers in brackets beneath them: 534 people are considered to have good literacy, and 84 of those 534 have access to health services, or 15.7%. The difference in health service access between good and bad literacy levels is small (2.7 percentage points) and not statistically significant (p = .283, above .05), but the difference in the prevalence of malnutrition by water source is quite large (10.7 percentage points) and significant (p = .004). What does this mean? Both variables show differences in the expected directions: people with good levels of literacy have greater access to health services (probably related to SES), and the prevalence of malnutrition is lower for those with a good water source.
            Does this mean that higher literacy leads to better health care access? Possibly, but the difference might be a product of where people live, as in rural vs. urban: those living in rural areas probably have less access to education and health care. It might also be related to socio-economic levels: people who have a higher income or status level may have higher literacy, better education opportunities, and also better access to health services (a next step might then be to compare high and low SES levels by water source, health care access, literacy, etc.).  The compare means routine can be used to analyze these questions.
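The percentages in the cells, and p-values of the same order as those reported, can be reproduced from the bracketed counts. The sketch below assumes a Pearson chi-square test on each 2x2 table without continuity correction; the text does not say which test produced its p-values, so treat that choice (and the helper name chi_square_2x2) as an assumption.

```python
import math

def chi_square_2x2(cases_a, total_a, cases_b, total_b):
    """Pearson chi-square (1 df, no continuity correction) for two proportions."""
    a, b = cases_a, total_a - cases_a
    c, d = cases_b, total_b - cases_b
    n = total_a + total_b
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts taken from the bracketed figures in the text.
pct_lit_good = 100 * 84 / 534      # access to health services, good literacy
pct_lit_bad = 100 * 42 / 322       # access to health services, bad literacy
pct_water_good = 100 * 47 / 213    # prevalence < -2 sd WAZ, good water
pct_water_bad = 100 * 159 / 485    # prevalence < -2 sd WAZ, bad water

chi2_lit, p_lit = chi_square_2x2(84, 534, 42, 322)
chi2_water, p_water = chi_square_2x2(47, 213, 159, 485)

print(f"literacy: {pct_lit_good:.1f}% vs {pct_lit_bad:.1f}%, p = {p_lit:.3f}")
print(f"water:    {pct_water_good:.1f}% vs {pct_water_bad:.1f}%, p = {p_water:.3f}")
```

Under this assumption the p-values come out at roughly 0.28 for literacy and 0.004 for water, in line with the conclusion that only the water difference is statistically significant.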


These two tables show the difference in % < -2 sd between safe and unsafe sources of water. The overall mean is 29.5%, but the difference between the two water source categories is 10.7 percentage points, and we know that this difference is statistically significant because the p-value, at .001, is < .05.  The findings suggest that further investigation is warranted, for example regression analysis that tests water further while controlling for other significantly associated variables.