Descriptive Analysis

Descriptive statistics are used throughout data analysis in a number of different ways. Simply stated, they summarize one variable at a time: its mean, its range, and its number of valid cases.

First, descriptive statistics are important in data cleaning, as seen in Chapter 2. They are regularly used -- generated or reviewed from hard copy -- during analysis to keep an eye on the variables being used, especially when a considerable number are being studied. It is very important to monitor the 'N' (number of valid cases) for each variable. If the 'N' differs greatly between variables, consider this an early warning of problems that may arise when the variables are examined together later on. You would then want to track down why cases are being lost between variables. At the same time, look for out-of-range values, unexpectedly high (or low) means and standard deviations, and oddities in other simple parameters. This is illustrated in the examples later in this section.
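Outside SPSS, the check on 'N' can be sketched in a few lines of Python. This is only an illustration: the records, the variable names, and the 80% tolerance used to decide that an 'N' "differs greatly" are assumptions made for the example, not values from the learning package.

```python
# Hypothetical records; None marks a missing value.
# Variable names and values are illustrative only.
records = [
    {"haz": -1.2, "waz": -0.8, "income": 120.0},
    {"haz": -2.5, "waz": None, "income": None},
    {"haz": 0.3, "waz": -1.1, "income": None},
    {"haz": None, "waz": -1.9, "income": 95.0},
    {"haz": -0.7, "waz": -0.4, "income": None},
]


def valid_n(rows, variable):
    """Count the cases that have a non-missing value for one variable."""
    return sum(1 for row in rows if row.get(variable) is not None)


def flag_low_n(rows, variables, tolerance=0.8):
    """Return each variable's valid N, plus the variables whose N falls
    below `tolerance` times the largest N (an arbitrary threshold)."""
    counts = {v: valid_n(rows, v) for v in variables}
    largest = max(counts.values())
    flagged = [v for v in variables if counts[v] < tolerance * largest]
    return counts, flagged


counts, flagged = flag_low_n(records, ["haz", "waz", "income"])
print(counts)   # valid N per variable
print(flagged)  # variables losing many cases relative to the others
```

A flagged variable is not necessarily wrong; it is a prompt to find out why its cases are missing before it is analysed together with the others.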

Second, a typical use of descriptive analysis is to produce a situation analysis, which usually consists of national or sub-national level information such as land size, population, income, health expenditure, illness, malnutrition, mortality, access to safe water, sanitation, and agricultural production levels, to name a few. These data provide a snapshot of the situation under study. Often the information for the situation analysis comes from a variety of sources such as Multiple Indicator Cluster Surveys, national statistics, and compilations done by organizations such as UNICEF and WHO.

Look at an example of the SITUATION ANALYSIS of the Tanzania mainland.

1.  Producing Descriptive Summary Data Using SPSS

It is fairly simple to produce these types of information from the data sets available in this learning package. Remember that descriptive analysis can often be presented more accurately for continuous variables than for categorical variables, because information is lost when a continuous variable is collapsed into categories. The linked exercises present continuous variables only; categorical analysis will follow. This output shows a possible presentation of a Descriptive Statistics exercise. The descriptive statistics chosen include: N, Minimum, Maximum, Mean, and Standard Deviation. There are numerous other optional descriptives to choose from in the SPSS program.
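For readers who want to see what these five statistics involve, the following Python sketch computes them for one variable. It is not the SPSS procedure itself; the z-scores are hypothetical, and the standard deviation uses the sample (n - 1) formula, which is assumed here to be the one wanted.

```python
import statistics


def describe(values):
    """Return N, Minimum, Maximum, Mean and sample Standard Deviation,
    ignoring missing values (represented here as None)."""
    valid = [v for v in values if v is not None]
    return {
        "N": len(valid),
        "Minimum": min(valid),
        "Maximum": max(valid),
        "Mean": statistics.mean(valid),
        "Std. Deviation": statistics.stdev(valid),  # n - 1 denominator
    }


# Hypothetical weight-for-age z-scores, not taken from the Bangladesh data set.
waz = [-1.0, -2.0, None, 0.0, -3.0, -1.5]

summary = describe(waz)
for name, value in summary.items():
    print(f"{name:>15}: {value}")
```

Note that the missing value lowers 'N' to 5 without affecting the other statistics, which is why 'N' should always be reported alongside them.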

Try a couple of exercises to create and interpret the SUMMARY DESCRIPTIVE TABLES and FREQUENCIES using Descriptive Statistics and Frequencies for several variables in the Bangladesh data set.

2. Producing Histograms and Examining Frequency Distributions

Histograms show the distribution of a quantitative variable as the relative frequency of data points in each interval (in this case the intervals are standard deviations on a z-score distribution). The histogram function in SPSS gives the option of overlaying a normal curve, which helps determine how well HAZ, WAZ, and WHZ fit the normal distribution, or bell curve. A strong deviation from the normal distribution, or a distribution mean far from 0, may point to problems with the data (e.g. measurement or calculation errors might lead to a non-normal distribution) or to severe nutrition problems in the population (e.g. a normal distribution with a mean at -1.8 SD). Histograms are most useful for continuous data, to see whether the distribution is normal and whether data cleaning was adequate (are there outliers?). Histograms are also useful for choosing cut-points for creating categories from continuous data (e.g. creating three levels of income by visually dividing the income distribution into three equal parts using the histogram).
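The counting behind a histogram can be sketched as follows. This is a rough text-only illustration in Python, assuming intervals one standard deviation wide and a handful of hypothetical z-scores; it does not draw the normal curve that SPSS can overlay.

```python
import math
from collections import Counter


def histogram_counts(z_scores, width=1.0):
    """Count values per interval [lower, lower + width),
    keyed by each interval's lower edge."""
    counts = Counter(math.floor(z / width) * width for z in z_scores)
    return dict(sorted(counts.items()))


# Hypothetical z-scores for illustration only.
z_scores = [-3.4, -2.6, -2.1, -1.9, -1.5, -1.2, -1.1, -0.8, -0.3, 0.4, 1.2]

bins = histogram_counts(z_scores)
total = len(z_scores)
for lower, count in bins.items():
    share = count / total  # relative frequency of the interval
    print(f"{lower:5.1f} to {lower + 1.0:5.1f} | {'#' * count} ({share:.0%})")
```

In this made-up example the tallest bar sits in the interval from -2 to -1 SD rather than around 0, which is the kind of downward shift the text describes.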

Try to produce a HISTOGRAM for continuous anthropometric data.

INTERPRETATION:

This histogram does approximate a normal or Gaussian distribution, as expected for a z-score. The WAZ variable has also evidently been cleaned (all values above +5 SD or below -5 SD were eliminated in the data cleaning exercise in Chapter 2) and shows a shift in the mean down the scale to -1.29 SD. This would indicate that the average child under 5 years (the age group in which the measurements were taken) is at least mildly underweight in comparison with a normal population (and that the population likely has a malnutrition problem).

Now try running an exercise to create a HISTOGRAM for the continuous outcome variables HAZ and WHZ in Eastern Kenya.

If you started with continuous outcome data, it is often useful to convert them into categories and report the percentages of children under 5 years of age who are underweight, stunted and/or wasted. As you remember, the WHO cut-point for being classified as at least moderately malnourished is -2.00 standard deviations: children with z-scores below it are counted as malnourished.
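The conversion from z-scores to a prevalence can be sketched in Python as below. The z-scores are hypothetical (they are not from keast4j.sav), and treating a value of exactly -2.00 as not malnourished is an assumption of this sketch.

```python
CUT_OFF = -2.0  # cut-point cited in the text


def prevalence(z_scores, cut_off=CUT_OFF):
    """Percentage of valid z-scores below the cut-off.
    Missing values (None) are left out of numerator and denominator."""
    valid = [z for z in z_scores if z is not None]
    below = sum(1 for z in valid if z < cut_off)
    return 100.0 * below / len(valid)


# Hypothetical z-scores for illustration only.
indicators = {
    "Underweight (WAZ)": [-2.4, -1.1, -0.6, -2.9, None, -1.8, 0.2, -1.5, -0.3],
    "Stunting (HAZ)": [-3.1, -2.2, -1.0, -2.5, -0.4, -2.8, -1.9, 0.1],
    "Wasting (WHZ)": [-0.5, -2.3, 0.4, -1.0, 0.8, -0.2, -1.4, 0.1, -0.9, 0.3],
}

results = {name: prevalence(z) for name, z in indicators.items()}
for name, percent in results.items():
    print(f"{name}: {percent:.1f}% below {CUT_OFF} SD")
```

Because missing values are dropped from the denominator, each prevalence is a percentage of valid cases, which matches how a "valid percent" column is usually read in a frequency table.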

We will use the keast4j.sav data set (DHS of the Eastern Region of Kenya) to:

Produce and interpret descriptives of the continuous anthropometric data.

Run frequency tables for the categorical variables showing the PREVALENCE of underweight, stunting, and wasting.

Try running an exercise to RUN FREQUENCY TABLES for stunting, wasting and underweight.

3.  Scatterplotting

Although scatterplotting is normally used to examine the association of two continuous variables, it has a more restricted practical use here: examining descriptively the scatter of values when the sample size is quite small -- for example, prevalences by a few divisions in a national dataset. This gives a first idea of the differences between areas, and of whether they are due to a few extreme values. For the whole national dataset (or one area at a time) this is better done with the histogram of frequency distributions we saw in the previous section. But scatterplotting is so easy and quick that it is convenient even when the classifying variable is categorical and unordered, like administrative divisions. An example from the Bangladesh district data looks like this (the point labels are the district numbers from the data set).

INTERPRETATION:  The scatterplot shows quite a lot of variability within the divisions: the district prevalences span a broad range. For instance, Division 1 has 14 districts that range from approximately 8% to almost 30% low arm circumference in children under 5 years. This makes it clear that it is important to look within the divisions to find the areas that need more attention. The case label (in this example, the district number) is provided to indicate which districts are worse off in each division.
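The same reading of the scatterplot, the spread within each division and the districts at its extremes, can be approximated numerically. The Python sketch below assumes hypothetical division numbers, district numbers, and prevalences; none of them come from the Bangladesh data set.

```python
from collections import defaultdict

# Hypothetical rows: (division, district number, prevalence in %).
rows = [
    (1, 101, 8.5), (1, 102, 29.0), (1, 103, 17.2),
    (2, 201, 12.0), (2, 202, 15.5),
    (3, 301, 21.0), (3, 302, 9.8), (3, 303, 24.4),
]


def range_by_division(rows):
    """For each division, report the districts with the lowest and highest
    prevalence, and the spread between them in percentage points."""
    grouped = defaultdict(list)
    for division, district, prevalence in rows:
        grouped[division].append((prevalence, district))

    summary = {}
    for division, values in sorted(grouped.items()):
        low_value, low_district = min(values)
        high_value, high_district = max(values)
        summary[division] = {
            "lowest": (low_district, low_value),
            "highest": (high_district, high_value),
            "spread": round(high_value - low_value, 1),
        }
    return summary


summary = range_by_division(rows)
for division, info in summary.items():
    print(f"Division {division}: {info}")
```

A large spread points to a division whose average hides districts in very different situations; the "highest" entry plays the role of the point label on the scatterplot.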

Try running an exercise to create this SCATTERPLOT with the Bangladesh data set.