Descriptive Statistics with Data Errors



A. District-level data (using bdeshd1.sav in this example)

These variables are examined: ACPRVF (prevalence in females, %) and ACPRVM (prevalence in males, %).

 

Using SPSS to produce descriptive statistics and graphs

1.  After starting SPSS, open the data set called bdeshd1.sav (the "d" at the end indicates that this is the Bangladesh data set that has not been cleaned, i.e., it is "dirty").

2.  Click on the Statistics menu, then Summarize, and then Descriptives.

3.  Select ACPRVF and ACPRVM and place them in the Variables box with the arrow.

4.  Click on Options to select Mean, Std. Deviation, Minimum and Maximum, and then click Continue.

5.  Click on OK to see the results.

6.  For a more visual display, try these graphs. Under Graphs, click on Histogram, select ACPRVF, check the Display normal curve box, then press OK. Repeat for ACPRVM. (You will have to run it twice, once for each variable.)

The descriptive statistics are:

ex1_1.jpg (15383 bytes)

The maximum values are clearly out of range for both males (99.9) and females (64.3). Look at the graphs below to see how these outliers appear in a histogram.
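For readers who want to reproduce the same four statistics outside SPSS, here is a minimal Python sketch using only the standard library. The list of prevalences is hypothetical and is not the content of bdeshd1.sav; only the out-of-range maximum of 64.3 is taken from this example. The function name `describe` is likewise an illustrative choice.

```python
import statistics

def describe(values):
    """Return N, mean, sample standard deviation, minimum and maximum,
    skipping missing values (represented here as None)."""
    valid = [v for v in values if v is not None]
    return {
        "n": len(valid),
        "mean": statistics.mean(valid),
        "std_dev": statistics.stdev(valid),  # sample SD (n - 1 denominator)
        "minimum": min(valid),
        "maximum": max(valid),
    }

# Hypothetical district prevalences (%); only the 64.3 maximum comes from the text.
acprvf = [3.1, 4.8, 2.5, None, 5.2, 64.3, 3.9]
summary = describe(acprvf)
print(summary)
```

A maximum that sits far above the mean, as it does here, is the same warning sign that the SPSS Descriptives output gives.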

ex1_3.jpg (26740 bytes)

The prevalence of 64.3% in females (the maximum value for females in the descriptives) is obviously out of range. This value should be set to missing or replaced with the mean so that it does not skew further analysis.

   

ex1_4.jpg (26185 bytes)

Here, the value of 99.9% for males in one district is also out of range. This must be cleaned before proceeding with analysis. Again, either setting the value to missing or replacing it with the mean removes the skew that it would cause if it remained in the dataset.
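The two cleaning options described above (set to missing, or set to the mean) can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPSS procedure: the `upper_limit` of 50.0 is an arbitrary plausibility threshold chosen for the example, the data are hypothetical apart from the 99.9 value, and the replacement mean is computed from the in-range values only so that the outlier does not contaminate it.

```python
import statistics

def clean_out_of_range(values, upper_limit, strategy="missing"):
    """Replace values above upper_limit with None ("missing") or with the
    mean of the remaining in-range values ("mean")."""
    if strategy not in ("missing", "mean"):
        raise ValueError("strategy must be 'missing' or 'mean'")
    in_range = [v for v in values if v is not None and v <= upper_limit]
    fill = statistics.mean(in_range) if strategy == "mean" else None
    return [fill if (v is not None and v > upper_limit) else v for v in values]

# Hypothetical male prevalences (%); only 99.9 comes from the example.
acprvm = [2.0, 4.0, 99.9, 3.0]
as_missing = clean_out_of_range(acprvm, upper_limit=50.0)  # assumed threshold
as_mean = clean_out_of_range(acprvm, upper_limit=50.0, strategy="mean")
print(as_missing)  # [2.0, 4.0, None, 3.0]
print(as_mean)     # [2.0, 4.0, 3.0, 3.0]
```

Setting the value to missing reduces the number of valid cases by one, while mean replacement keeps the case but slightly understates the variability of the variable.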


B. Child-level datasets (such as from Kenya)

In child-level datasets, essentially the same approach is used. This example is from the child-level file keat4j.sav. You can follow the same steps on your working copy of the file and check that you get the same results. Let’s examine height, age, and the height-for-age z-score variable (HAZ).

 

1. Scatterplot height vs age (in SPSS: Graphs/Scatter/Define). This is the output:

                            wpe7.jpg (28172 bytes)

Here cases have been identified by their numbers. To do this, double-click inside the chart, which opens a new menu, then click on Chart/Options and turn Case Labels on. Note that this labels cases according to the current sort order, which should normally be the original order. See CASENUM for the labeling method: with it you can make a note of outlying values without worrying about case numbers differing later, which they otherwise will if you have sorted the file somewhere along the way, causing confusion.
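The reason for storing a case number as its own variable can be shown with a short Python sketch. The records and field names below are hypothetical and are not taken from the Kenya file; the point is only that a stored number travels with the record when the data are re-sorted, whereas a row position does not.

```python
# Hypothetical child records (age in months, height in cm); not from the data file.
records = [
    {"age_months": 36, "height_cm": 93.0},
    {"age_months": 12, "height_cm": 74.5},
    {"age_months": 54, "height_cm": 20.0},
]

# Assign a permanent case number in the original file order, starting at 1.
for casenum, record in enumerate(records, start=1):
    record["casenum"] = casenum

# Sorting changes each record's position but not its stored case number.
by_height = sorted(records, key=lambda r: r["height_cm"])
casenums = [r["casenum"] for r in by_height]
print(casenums)  # [3, 2, 1]
```

After sorting by height, the suspicious 20 cm record sits in the first row, but it is still identifiable as case 3.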

In the scatterplot, cases 32 and 78 are far too short for children aged 3 and 4 1/2 years (less than 20 cm tall); cases 22 and 392 are also unlikely. In a large dataset like this, it is best to take note of such cases by case number and set the out-of-range values to missing later.

 

2. Scatterplotting height and weight together is also helpful, as shown here:

                              wpe9.jpg (22362 bytes)

This shows some other highly unlikely combinations, e.g. case 78, which shows a 20-centimeter-tall child who weighs almost 15 kilograms. This is impossible. These cases might not have been identified without scatterplotting, which can therefore reveal errors that are easily removed before they cause difficulty in data interpretation.
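As a complement to inspecting the scatterplot by eye, implausible cases can also be listed by case number with a simple range check. The Python sketch below is a swapped-in technique, not what SPSS does when it draws the plot. The plausibility limits are assumptions chosen for illustration and should be adapted to the survey population; the records are hypothetical, with case 78 made to resemble the child described above.

```python
# Assumed plausibility limits for children under five; adjust for the survey.
HEIGHT_CM_RANGE = (40.0, 130.0)
WEIGHT_KG_RANGE = (1.5, 35.0)

def flag_implausible(cases):
    """Return the case numbers whose height or weight lies outside the limits."""
    flagged = []
    for case in cases:
        height_ok = HEIGHT_CM_RANGE[0] <= case["height_cm"] <= HEIGHT_CM_RANGE[1]
        weight_ok = WEIGHT_KG_RANGE[0] <= case["weight_kg"] <= WEIGHT_KG_RANGE[1]
        if not (height_ok and weight_ok):
            flagged.append(case["casenum"])
    return flagged

# Hypothetical records; case 78 mimics the 20 cm, almost 15 kg child in the text.
cases = [
    {"casenum": 77, "height_cm": 98.0, "weight_kg": 14.2},
    {"casenum": 78, "height_cm": 20.0, "weight_kg": 14.8},
    {"casenum": 79, "height_cm": 85.5, "weight_kg": 11.0},
]
flagged = flag_implausible(cases)
print(flagged)  # [78]
```

A range check on each variable separately will miss combinations that are individually plausible but jointly unlikely, which is why the scatterplot remains the more informative tool.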

 

3. The height/age z-score has already been calculated in this file. Let’s see how it looks. First, a histogram gives a picture of the overall distribution.

1. Open keast4j.sav

2. Click on Graphs, then Histogram, and enter HAZ as the Variable.

3. Click on the button for Display normal curve and click on OK.

Does the histogram look like this?

 

ex1_2.jpg (33995 bytes)

It’s immediately clear that missing values have already been specified by putting a range on the acceptable values, from -5 to +5 SD. This was done in the data editor by going to ‘define variable’ for HAZ and setting missing values to (say) >5.0. The distribution looks much as expected, whereas the outliers seen earlier in the height-by-age scatterplot would certainly have appeared here, far outside this range, had values beyond +5 and -5 not been set to missing.

However, there are still some rather large values at each end. At the top, missing values should probably be set at >+4.00. At the bottom it’s a bit trickier: the cut-off here was already set at -5.0 standard deviations because the distribution had a long trailing tail. Usually the cut-off is either -4.0 or -5.0 standard deviations, depending on the shape of the distribution. If there is a long trailing tail, then -4.0 may be too conservative, as it was in this case. Even with these cut-points, not all of the extreme values have been removed, but most of them have.
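The cut-off rule discussed above can be expressed as a small Python function. This is a sketch, not the SPSS missing-value definition itself: the defaults of -5.0 and +4.0 follow the values suggested in the text, the function name is illustrative, and the HAZ values are hypothetical.

```python
def apply_haz_cutoffs(haz_values, lower=-5.0, upper=4.0):
    """Set z-scores outside [lower, upper] to None (missing).
    Values that are already missing stay missing."""
    return [
        z if (z is not None and lower <= z <= upper) else None
        for z in haz_values
    ]

# Hypothetical HAZ values, not taken from the Kenya file.
haz = [-1.2, -5.6, 0.3, 4.7, None, -4.8]
cleaned = apply_haz_cutoffs(haz)
print(cleaned)  # [-1.2, None, 0.3, None, None, -4.8]
```

Passing `lower=-4.0` instead would also drop the -4.8 value, which illustrates why the stricter cut-off can remove genuine observations when the distribution has a long lower tail.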
