**Descriptive Statistics with Data Errors**

**A. District-level data** (using **bdeshd1.sav** in this example)

These variables are examined:

- ACPRVF - % females under 5 years by district with low arm circumference (<13.5 cm).
- ACPRVM - % males under 5 years by district with low arm circumference.
- ACPRVFM - % all children under 5 years by district with low arm circumference.

**Using SPSS to produce descriptive statistics and graphs**

1. After starting SPSS, open the data set called **bdeshd1.sav** (the "d" at the end indicates that this is the Bangladesh data set that has not been cleaned, or is "dirty").
2. Click on the **Statistics** menu, then **Summarize**, and then **Descriptives**.
3. Select ACPRVF and ACPRVM and place them in the Variables box with the arrow.
4. Click on **Options** to select Mean, Std. Deviation, Minimum and Maximum, and then **Continue**.
5. Click on **OK** to see the results.
6. For a more visual display, try these graphics. Under **Graphs**, click on **Histogram**, select each variable (ACPRVF and then ACPRVM), select the box **Display normal curve**, then press **OK**. (You will have to run it twice, once for each variable.)
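For readers who want to check the same four statistics outside SPSS, the following is a minimal Python sketch using only the standard library. The values in `acprvf` are hypothetical and are not read from bdeshd1.sav; only the 64.3 mirrors the out-of-range maximum discussed in this section. The function name `describe` is an illustrative choice, and the sketch assumes the sample (n - 1) standard deviation is the one wanted.

```python
import statistics

# Hypothetical district-level prevalences (%), NOT taken from bdeshd1.sav.
# The last value imitates an out-of-range entry.
acprvf = [8.2, 10.5, 9.1, 11.3, 7.8, 64.3]


def describe(values):
    """Return the count plus mean, standard deviation, minimum and maximum."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "minimum": min(values),
        "maximum": max(values),
    }


summary = describe(acprvf)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A maximum far above the other values, as here, is the same warning sign that the Descriptives output gives.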

The descriptive statistics are:

The maximum values are clearly out of range for both the males (99.9) and the females (64.3). Try looking at the graphs below to see how these outliers appear in a histogram.

The prevalence of 64.3% in females (as you saw in the maximum value for females in descriptives) is obviously out of range. This value should be eliminated or set to the mean to avoid skewing during further analysis.


Here, the value of 99.9% in males in one district is also out of range. This must be cleaned before proceeding with analysis. Again, either eliminating the value by setting it to missing or setting the value to the mean will take care of the skew that this might cause if it remained in the dataset.
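The two cleaning options described here (set the value to missing, or set it to the mean) can be sketched in Python as follows. This is an illustration only: the data are hypothetical apart from the 99.9, the `upper_limit` of 50.0 is an assumed threshold rather than one given in this guide, and `clean_out_of_range` is a made-up helper name.

```python
import statistics


def clean_out_of_range(values, upper_limit, method="missing"):
    """Replace values above upper_limit with None or with the mean of the valid values."""
    valid = [v for v in values if v is not None and v <= upper_limit]
    if method == "mean":
        replacement = statistics.mean(valid)
    elif method == "missing":
        replacement = None
    else:
        raise ValueError("method must be 'missing' or 'mean'")

    cleaned = []
    for v in values:
        if v is not None and v > upper_limit:
            cleaned.append(replacement)  # out-of-range value is replaced
        else:
            cleaned.append(v)  # valid or already-missing value is kept
    return cleaned


# Hypothetical male prevalences (%); 99.9 imitates the out-of-range district.
acprvm = [9.0, 12.0, 10.5, 99.9, 11.5]

as_missing = clean_out_of_range(acprvm, upper_limit=50.0, method="missing")
as_mean = clean_out_of_range(acprvm, upper_limit=50.0, method="mean")
print(as_missing)
print(as_mean)
```

Note that the replacement mean is computed from the valid values only, so the outlier does not contaminate the value that replaces it.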

**B. Child-level datasets (such as from Kenya)**

In child-level datasets, essentially the same approach is used. This example is from the child-level file **keat4j.sav**. You can do the same on the working copy of the file and check that you get the same results. Let’s examine height, age, and the height-for-age z-score variable (HAZ).

**1. Scatterplot height vs age (in SPSS: Graphs/Scatter/Define). This is the output:**

Here cases have been identified by their numbers. To do this, double-click inside the chart, which opens a new menu, then click on Chart/Options and turn Case Labels on. Note that this gives case numbers according to the current sort order, which should normally be the original order (see CASENUM for the labeling method). Labeling by CASENUM means that you can make a note of outlying values without worrying about the case numbers differing, which they otherwise will if you have sorted the file somewhere along the way.
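The idea of a case number that survives sorting can be illustrated with a short Python sketch. The records are hypothetical, and the field name `casenum` is borrowed from the CASENUM variable mentioned above only as an illustration of the principle, not as a description of how SPSS stores it.

```python
# Hypothetical child records as (age in months, height in cm); not real data.
records = [(36, 92.0), (54, 20.0), (12, 74.5)]

# Attach a permanent case number (starting at 1) in the original file order.
cases = [
    {"casenum": i, "age_months": age, "height_cm": height}
    for i, (age, height) in enumerate(records, start=1)
]

# Sorting changes row positions but not the stored case numbers.
by_height = sorted(cases, key=lambda c: c["height_cm"])
print([c["casenum"] for c in by_height])
```

Because the number is stored with each record, a note such as "case 2 is an outlier" still points at the same child after any later sort.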

In the scatterplot, cases 32 and 78 are far too short for children of 3 and 4 1/2 years (case 78, for example, is recorded as about 20 cm tall); cases 22 and 392 are also unlikely. In a large dataset like this it is best to take note of such cases and set them to missing later, according to the case number that is out of range.

**2. Scatterplotting height and weight together is also helpful, as shown here**

This shows some other highly unlikely combinations, e.g. case 78. This case shows a 20-centimeter-high child who weighs almost 15 kilograms, which is impossible. Such cases might not have been identified without scatterplotting, so the plot helps find errors that can easily be removed before they cause difficulty in data interpretation.
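A scatterplot finds such cases by eye; a simple range check can list them by case number. The Python sketch below is a rough illustration under stated assumptions: the plausibility limits in `HEIGHT_CM_RANGE` and `WEIGHT_KG_RANGE` are assumed values chosen for the example, not limits recommended in this guide, and the records are hypothetical except that case 78 mirrors the combination described above.

```python
# Assumed plausibility limits for children under 5; illustrative only.
HEIGHT_CM_RANGE = (40.0, 130.0)
WEIGHT_KG_RANGE = (1.5, 35.0)


def flag_implausible(cases):
    """Return case numbers whose height or weight falls outside the assumed ranges."""
    flagged = []
    for case in cases:
        height_ok = HEIGHT_CM_RANGE[0] <= case["height_cm"] <= HEIGHT_CM_RANGE[1]
        weight_ok = WEIGHT_KG_RANGE[0] <= case["weight_kg"] <= WEIGHT_KG_RANGE[1]
        if not (height_ok and weight_ok):
            flagged.append(case["casenum"])
    return flagged


cases = [
    {"casenum": 77, "height_cm": 98.0, "weight_kg": 14.2},
    {"casenum": 78, "height_cm": 20.0, "weight_kg": 15.0},  # mirrors case 78 in the text
    {"casenum": 79, "height_cm": 85.5, "weight_kg": 11.0},
]

print(flag_implausible(cases))
```

A check like this complements the plot rather than replacing it, since a pair of values can each be in range and still form an unlikely combination.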

**3. The height/age z-score has already been calculated in this file. Let’s see how it looks. First, a histogram gives a picture of the overall distribution.**

1. Open **keast4j.sav**.
2. Click on **Graphs**, **Histogram**, select **Variable**, and enter HAZ.
3. Click on the button for **Display normal curve** and click on **OK**.

Does the histogram look like this?

It’s immediately clear that the missing values have already been specified, by putting a range on the acceptable values from +5 to -5 SD. This was done in the data editor, by going to ‘define variable’ for HAZ and setting missing values to (say) >5.0. The distribution looks much as expected; the outliers seen earlier in the height-by-age scatterplot would certainly have appeared out of range had values beyond +5 and -5 not been set as missing. However, there are still some rather large values at each end. Missing values should probably be set at >+4.00 at the top. At the bottom it’s a bit trickier: the cut-off here was already set at -5.0 standard deviations because the distribution had a long trailing tail. Usually the cut-off is either -4.0 or -5.0 standard deviations, depending on the shape of the distribution. If there is a long trailing tail, then -4.0 may be too conservative, as it was in this case. Even with these cut-points, not all of the extreme values have been removed, but the majority of them have.
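The effect of the cut-offs discussed here (-5.0 at the bottom, +4.0 at the top) can be sketched in Python. The z-scores are hypothetical, `apply_haz_cutoffs` is an illustrative helper name, and `None` stands in for an SPSS missing value.

```python
def apply_haz_cutoffs(haz_values, lower=-5.0, upper=4.0):
    """Set z-scores outside [lower, upper] to None; keep existing missing values as None."""
    return [
        z if (z is not None and lower <= z <= upper) else None
        for z in haz_values
    ]


# Hypothetical height-for-age z-scores; not taken from the Kenya file.
haz = [-1.2, -5.6, 0.3, 4.4, -4.8, None]

cleaned = apply_haz_cutoffs(haz)
print(cleaned)
```

With a -4.0 lower cut-off instead, the value -4.8 would also be dropped, which is the kind of loss the text warns about when the distribution has a long lower tail.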