
When and Why?

To recap, nutritional data is analyzed in the present context to contribute to policy and program design, analysis, and evaluation. In practice -- and as focused on here -- this means addressing questions like:

needs for targeting and coverage of programs;
analyzing how programs are actually reaching those who need them;
examining likely causal factors for malnutrition, whether contextual, open to long-term policy change, or to be addressed within specific programs;
evaluating whether program activities appear to be causing nutritional improvement -- this last question has not yet been looked at explicitly, but is exactly the same in principle as treating the activity as a potential causal factor.

In earlier chapters we have gone quite some way in determining what associations may exist in a cross-sectional dataset. If these associations are genuine, they are good candidates for causal factors and, if correctly chosen, may thus be suitable targets for change by policies or programs. We have also looked at interactions which may make interventions more or less effective. One main purpose of extending the analysis to look at a number of variables at one time is to try to decide which factors are in practice most likely to be causing malnutrition. To return to the example used earlier, we might be interested in the effect of education, and need to take account of the likelihood that better educated people are also better off in terms of income, housing, environment, and so on.

Multi-way analysis, as used in this manner, still examines the association of one dependent variable with a set of independent, determining or classifying variables. If these are categorical (male-female, good-bad housing, etc) and the sample size is large enough, multi-way tables are effective and quite easy to envisage: 2 dimensions are common, as columns and rows in a spreadsheet, for instance; 3 dimensions are now used in spreadsheet programs, adding layers or sheets; these and even more dimensions can be envisaged on a 2 dimensional sheet by progressive sub-division of columns or rows. The same does not apply for continuous variables. For these the standard presentation is a graph or scatterplot, which only easily accommodates two variables (in 2 dimensions); three variables are sometimes displayed, as a rather clumsy 3 dimensional graph on a 2 dimensional page or screen. More than 3 dimensions boggles the mind if one tries to envisage it -- yet we have much less problem thinking about using 4 variables in a much-subdivided table. This is one of the difficulties in picturing multi-way analysis when some of the variables are continuous. Nonetheless, regression can be expanded seamlessly from one to many variables.
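As a minimal sketch of expanding regression from one to several independent variables, the Python example below fits the outcome first on education alone and then on education and income together. The data are invented for illustration (the outcome is constructed to depend on income only, while income rises with education), and `ols` is a small hypothetical helper written for this sketch, not a library function.

```python
def ols(columns, y):
    """Return [intercept, b1, b2, ...] from least squares (normal equations)."""
    n = len(y)
    X = [[1.0] + [float(col[r]) for col in columns] for r in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Invented data: years of schooling, an income score, height-for-age z-score
education = [0, 0, 2, 2, 4, 4, 6, 6]
income = [1, 3, 3, 5, 5, 7, 7, 9]
haz = [-3.0 + 0.5 * inc for inc in income]   # outcome depends on income only

alone = ols([education], haz)
together = ols([education, income], haz)
print("education alone:       slope = %.3f" % alone[1])
print("education with income: slope = %.3f, income slope = %.3f"
      % (together[1], together[2]))
```

In this constructed case education appears to have an effect when entered alone, and the apparent effect vanishes once income is included. Real data will rarely separate so cleanly.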

Although hard to envisage, multiple regression is such a convenient and usually robust technique that it is widely used for this application. In fact, although we could use multi-way tables and the associated ANOVAs, regression for nutritional data is generally used for categorical (often ‘0-1’ or dichotomous) variables, as well as for continuous variables (the most powerful method), or a mixture of the two. Graphical examination, and displaying or investigating by tables, remain crucial parts of the approach, but regression more and more forms the core, and is the major technique used here.

Three applications are common, to study:

1. Whether bivariate associations are likely to be genuine, or are confounded by other associations; this applies particularly to making judgements on causality to suggest program content (‘causal models’); confounding is almost always present, and may be:

expected -- in the sense that a number of causes, like environment and socio-economic status, tend to be bundled, but program design may need to try to isolate the effects of one or a few factors in the bundle;
unexpected -- alternative explanations that you might not think of straight off, but which may invalidate your conclusions (and waste weeks of your work, or even lead to a flawed program design); the association of measles immunization with age and nutritional status is an example.

2. Interactions:

with other causal factors, affecting program design -- bottle feeding with sanitation, or water supply with latrines, are demonstrated examples;
with who people are (this is also an interaction if it affects malnutrition and its relation to causes), used primarily for targeting decisions -- age, gender, and location are examples.
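To make the idea of an interaction concrete, the Python sketch below uses an invented two-way table of mean height-for-age z-scores by bottle feeding and sanitation. The numbers are hypothetical, chosen only so that the effect of bottle feeding differs between sanitation groups. With two 0-1 variables, the coefficient on their product in a regression that also includes both variables equals this difference of differences.

```python
# Invented mean height-for-age z-scores, keyed by (bottle_fed, good_sanitation)
mean_haz = {
    (0, 0): -1.8,   # not bottle fed, poor sanitation
    (1, 0): -2.6,   # bottle fed, poor sanitation
    (0, 1): -1.2,   # not bottle fed, good sanitation
    (1, 1): -1.4,   # bottle fed, good sanitation
}

# Effect of bottle feeding within each sanitation group
effect_poor = mean_haz[(1, 0)] - mean_haz[(0, 0)]
effect_good = mean_haz[(1, 1)] - mean_haz[(0, 1)]

# Interaction: how much the bottle-feeding effect changes with sanitation
interaction = effect_good - effect_poor

print("bottle-feeding effect, poor sanitation: %.1f" % effect_poor)
print("bottle-feeding effect, good sanitation: %.1f" % effect_good)
print("interaction (difference of differences): %.1f" % interaction)
```

An interaction of zero would mean the bottle-feeding effect is the same whatever the sanitation; a non-zero value is what would affect program design or targeting.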



All types of INDEPENDENT VARIABLES -- continuous, quasi-continuous, categorical, dichotomous, dummy (0-1) variables -- may be used in regression analysis. Regression is the method of choice when the variables are continuous, and GLM/ANOVA is an alternative when the variables are categorical -- the two reduce to the same answers anyway.

A guiding concern is that the relationship of the independent variable with the dependent (the latter assumed to be continuous) should be inherently as linear as possible. This means that the continuous variable may need to be transformed (see the Transformations page). There are several approaches, discussed here for the independent variable when continuous.

First, for example with income or GDP, the logarithm (natural or base-10) is taken -- this allows for the common skewed distribution and the lesser effect of higher incomes; where the real relation should be with a simple power of the variable (including the reciprocal) but this is not known, the logarithm can again be used, rather than transforming by the power itself.
Second, especially with prevalence, a logit is calculated: logit P = ln(P/(1-P)), where P is the prevalence expressed as a proportion. This allows for extremes of high or low prevalence to have less weight.
Third (this is not exactly a transformation, but has the same effect and intent), where a more complex curve is expected, a second variable may be created and included in the equation. This is the common approach for using weight or height as the dependent variable, instead of weight-for-age, when AGE and AGE-SQUARED are used on the right hand side of the equation. In theory, but not often in practice, a cubed term could also be included.
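The three approaches above can be sketched in Python as follows. The values are invented for illustration, the variable names are hypothetical, and the logit uses the usual log-odds definition ln(p/(1-p)) with prevalence expressed as a proportion between 0 and 1.

```python
import math

def logit(p):
    """Log-odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Invented example values
income = [200, 400, 800, 1600]          # doubling each time
prevalence = [0.05, 0.20, 0.50, 0.80]   # proportions, not percentages
age_months = [6, 12, 24, 36]

# Natural log; base-10 logs differ only by a constant factor
log_income = [math.log(v) for v in income]
logit_prev = [logit(p) for p in prevalence]
# Age squared is entered in the equation alongside age itself
age_squared = [a ** 2 for a in age_months]

for row in zip(income, log_income):
    print("income %5d  ln(income) %.3f" % row)
for row in zip(prevalence, logit_prev):
    print("prevalence %.2f  logit %.3f" % row)
print("age, age squared:", list(zip(age_months, age_squared)))
```

Each doubling of income adds the same amount to ln(income), which is one way of seeing how the logarithm gives higher incomes a lesser effect.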

A process of trying out transformations and scatterplotting is appropriate for investigating the best approach. Better still is to calculate the residual, and plot this against the predicted value of the dependent variable. What you are looking for is a nice even scatter of residuals around the horizontal y = 0 line.
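A minimal Python sketch of this residual check follows, without the plotting itself. The data are invented so that the outcome rises with the logarithm of income, and `fit_line` and `residuals` are hypothetical helpers written for the example. In practice the (predicted, residual) pairs would be passed to whatever plotting tool is at hand.

```python
import math

def fit_line(x, y):
    """Simple least-squares line: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def residuals(x, y):
    """Return (predicted, residual) pairs from a straight-line fit of y on x."""
    a, b = fit_line(x, y)
    return [(a + b * xi, yi - (a + b * xi)) for xi, yi in zip(x, y)]

# Invented data in which the outcome rises with the logarithm of income
income = [100, 200, 400, 800, 1600, 3200]
outcome = [-4.0 + 0.5 * math.log(v) for v in income]

raw = residuals(income, outcome)
logged = residuals([math.log(v) for v in income], outcome)

print("predicted  residual (untransformed income)")
for pred, res in raw:
    print("%9.3f  %+.3f" % (pred, res))
print("largest residual after log transform: %.6f"
      % max(abs(res) for _, res in logged))
```

With untransformed income the residuals are negative at both ends and positive in the middle, a curved pattern rather than an even scatter. After the log transform they are essentially zero, because the invented data were built that way.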

When the variable is ‘quasi-continuous’ it may be tempting to treat it as continuous. Years of schooling is an example, as a measure of education; or numbers of children in the household. Here a similar question of linearity of the relationship applies as discussed above, and can be dealt with in a similar way.

A related but somewhat different issue arises when a few categories are artificially created and it’s not necessarily clear that the categories all are of a similar type, nor that the ‘distance between them’ has a similar meaning. For example, scales are sometimes created based on number of assets owned, or number of different food types eaten, and so on; while 5 is probably ‘better’ than 4, is it one unit ‘worse’ than 6, and so on: are the differences comparable? Sometimes a difference might mean buying a car, others a radio. Under these conditions the safe approach is:

create a few categories (say 3 -- e.g. 0-3, 4-6, >6);
don’t ascribe values of 1, 2, 3 to these in a single new variable, but create three ‘0-1’ dummy variables, one for each group. (Don’t forget you’ll exclude one of these dummies in regression. But the third dummy will give you flexibility in the analysis, and you might as well create it now in case it’s useful later.)

This makes the minimum of assumptions, but preserves much of the information (if the sample size was big enough and it was important, you could always create a dummy for each value 0 through 6 or more; but that’s going a bit far).
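The grouping and dummy-variable steps can be sketched in Python as below. The asset counts are invented, the cut-points follow the 0-3, 4-6, >6 example in the text, and the function and variable names are hypothetical.

```python
def asset_group(n_assets):
    """Assign an asset count to one of three groups (cut-points from the text)."""
    if n_assets <= 3:
        return "0-3"
    if n_assets <= 6:
        return "4-6"
    return ">6"

def make_dummies(values, groups):
    """Return one 0-1 list per group, keyed by group label."""
    return {g: [1 if v == g else 0 for v in values] for g in groups}

# Invented asset counts for eight households
assets = [0, 2, 3, 4, 5, 6, 7, 9]
groups = ["0-3", "4-6", ">6"]

grouped = [asset_group(a) for a in assets]
dummies = make_dummies(grouped, groups)

for g in groups:
    print("asset_%s: %s" % (g, dummies[g]))

# For regression, leave one dummy out as the reference group, here "0-3"
regression_columns = [dummies[g] for g in groups if g != "0-3"]
```

All three dummies are created and kept, but only two enter the regression. Which group is left out as the reference is a choice of convenience; the coefficients on the others are then differences from that group.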

There is then a gradual shading into categories that are likely to be ordered in their relation to outcomes of interest, and those that are probably not ordered. For example, piped water is probably better than well water, in turn probably better than river; but it’s better not to assume that, and to treat them as unordered categories. Even less sure would be (say) occupation: experience hints that professionals are better off than farmers, but we certainly shouldn’t assume that. Finally, where the category is location -- urban-rural, or district, perhaps -- then usually we would have no basis for ordering anyway. All these are best treated the same, as unordered categories: they should be grouped if necessary and ‘0-1’ dummies created for each, just as for ‘quasi-continuous’ variables.

See the page on how to Re-categorize and Create Dummy Variables.

Finally, inherently dichotomous variables -- gender is one example, alive-dead or pregnant-non-pregnant are others -- behave exactly as dummy variables. It’s probably worth coding them as 0-1 rather than, say, 1-2, as it will be slightly easier to interpret the meaning of the intercept and slope in regression results -- but only slightly easier and not worth any argument over, for example, whether M or F is coded 0 or 1, or 1 or 2! (But numbers DO have to be assigned...)
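The effect of the coding choice can be seen in a small Python sketch. The z-scores are invented, which gender is coded 0 is left arbitrary, and `fit_line` is a hypothetical helper for a simple least-squares line.

```python
def fit_line(x, y):
    """Simple least-squares line: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Invented height-for-age z-scores for six children
haz = [-1.0, -1.4, -1.2, -2.0, -1.6, -1.8]
sex_01 = [0, 0, 0, 1, 1, 1]          # same children, coded 0-1
sex_12 = [s + 1 for s in sex_01]     # same children, coded 1-2

a01, b01 = fit_line(sex_01, haz)
a12, b12 = fit_line(sex_12, haz)

print("0-1 coding: intercept %.2f, slope %.2f" % (a01, b01))
print("1-2 coding: intercept %.2f, slope %.2f" % (a12, b12))
```

The slope, which is the difference between the two group means, is the same under either coding. With 0-1 coding the intercept is the mean of the group coded 0; with 1-2 coding it is the fitted value at a code of 0, which corresponds to neither group.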
