Detection of Common Errors Likeliness of a Value Expected Association between Variables
Bivariate Outliers Logic Checks Correction of Errors
Test Yourself (page 2)
Every dataset contains some errors, and every analyst experiences a rite of passage in wasting days drawing wrong conclusions because the errors have not been first rooted out. Up to half of the time needed for analysis is typically spent in "cleaning" the data. This time is also, typically, underestimated. Often, once a clean dataset is achieved, the analysis itself is quite straightforward.
Unless the dataset is small (i.e., less than 100 cases and 10 variables), cleaning is done in several stages. To begin with, the key variables are examined and corrected. For nutrition, this usually means the anthropometric and related variables (e.g., age, sex), and important independent variables like location, socioeconomic factors, and feeding practices. In a large dataset, especially one in which nutrition is one of several modules, variables that become of interest as analysis proceeds need to be cleaned as you go along.
The correct thing is to not forget the necessity of data cleaning! Discipline is needed. Never introduce new variables into an analysis without first checking for errors and making corrections.
DATA CLEANING - a two step process including DETECTION and then CORRECTION of errors in a data set.
Common sources of error are:
|missing data coded as "999"|
|'not applicable' or 'blank' coded as "0"|
|typing errors on data entry|
|COLUMN SHIFT - data for one variable column was entered under the adjacent column|
|fabricated data- 'made up' or contrived|
|MEASUREMENT AND INTERVIEW ERRORS|
Most errors will be detected using three procedures:
The routines are described for Statistical Package for the Social Sciences (SPSS) and some of the transformations are performed using EPI-INFO created by the Centers for Disease Control and Prevention.
Descriptive statistics by variable
In the SPSS Data Editor menu there are: Graphs/Scatterplots; Statistics/Descriptives (or Descriptives/Frequencies for variables with few categories); and Graphs/Histograms.
TAKE A LOOK at an example of Descriptive Statistics with Data Errors in them.
Descriptive statistics show the mean prevalence of a variable, for example low arm circumference, as well as the median, the standard deviations, and both the minimum and maximum values. What you are looking for in data cleaning are:
1. Minimum prevalence of zero. It is unlikely that the rates of low arm circumference would be zero. Checking back with original data would show that no such value exists. Thus the cases with a value for this variable must be an error.
2. Maximum prevalence of 99.90 and 90.00. It is unlikely that the prevalence of malnutrition in any district would be 90 or above. The 99.90 could be indicative of missing information. Further, since the histograms for ACPRVM (acute prevalence male) and ACPRVF (female) show that there is a big gap above 30%, it is likely that values of 90% and above are wrong. 50% and 60% in ACPRVM are also probably wrong.
3. Standard deviation higher than the mean value. This indicates some extreme values, since the variable cannot be less than zero. If the problems of the minimum and maximum values are corrected, the standard deviation will be smaller.
TAKE A LOOK at the Histogram
of Cleaned Arm Circumference
(once the outliers are cleaned from the district level dataset).
Frequencies help to locate the 'dirty' data among the entered variables. For instance, anthropometry (weight-for-age, arm circumference, body mass index - BMI) and many biological and social variables are expected to be distributed fairly normally (a bell curve with approximately 68% of the population lies within one standard deviation of its mean of 0; about 95% of the population lies within two standard deviations of 0; and about 99.7% of the population lies within three standard deviations of 0) It should not be biphasic (having two phases or 'humps').Errors can be detected by running frequencies of weight-for-age categories (mild, moderate, severe) which will show the percentages lying in each. A histogram is probably an easier way to detect errors in distribution.
Frequencies are also useful in detecting unequal distribution in categories such as age, sex, or village. One common problem with nutritional data in developing countries is that most often a child's age is determined by asking the mother or caretaker of the child. There are risks of 'contamination' of the data due to recall bias, where the mother isn't exactly sure of the child's age and rounds to the nearest 6 or 12 month interval. This leads to a phenomenon called age heaping. Rounding up of a child's age can result in the child being misclassified as malnourished and rounding down can lead to missed cases of malnourished children. This age heaping can substantially affect calculations of WAZ and HAZ.
INTERPRETATION: The effect of age heaping is outstanding. There are obvious peaks at the one year (12 month) and half year (6 month) intervals. This effect is most pronounced after one year of age, as would be expected. More mothers are aware of the number of months of age when a child is under one year of age and then begins rounding when a child surpasses the first year.
TAKE A LOOK at an exercise on Age Heaping in Bangladesh and learn how to make a graph.
You can often detect errors in data simply by seeing if the responses are logical. For example, you would expect to see 100% of responses, not 110%. Another example would be if a question were asked about current pregnancy and the reply is marked 'yes' but you notice that the respondent is coded as MALE! The main idea is to enjoy the process but be thorough when looking for these errors because it would be quite embarrassing for you to report that 10% of the men in your sample are pregnant. Here is an example of keying error...more than 100%?
INTERPRETATION: This result is actually nothing to be worried about since it could have resulted from rounding when all of the groups receiving capsules were combined by districts and then into a total within each district. They should be changed to 100% since it is not possible to have more than 100% actual coverage. These logic checks are important to watch though, since it is not effective to present data that shows its errors (even if it is minor and only due to rounding). Keep your eyes open and be sure to check that it is only rounding and then adjust for these small over-estimations.
TAKE A LOOK to run this exercise for vitamin A capsule distribution to do a Logic Check.
Some data errors only appear when two variables are compared. Therefore, we are looking for outliers, or values of a variable that are far different from the expected values.
INTERPRETATION: Notice the association between the two variables - as the prevalence of low arm circumference for females increases (y-axis), the prevalence of low arm circumference for males also increases - in general. However, you can see the outliers - one in particular which do not fall into the general pattern of association between the variables.
TAKE A LOOK at Outliers from the case summaries of low arm circumference prevalence.
To investigate which of the cases may be 'dirty' you can use the Case Summaries command in the Statistics, Summarize menu. This will show which cases are causing the unusual results.
TAKE A LOOK at the results from using the Case Summaries command
Expected Associations Between Variables
Another way to detect errors using bivariate associations and scatterplot graphs is to check for expected associations between variables. In nutrition, it is known that weight-for height and height-for-age are NOT correlated in an individual.
Using the keast4j.sav data set, plot the association between haz and whz and the association between haz and waz.
INTERPRETATION: Notice that there is no real pattern here - just random scattering of the values. This is what you would expect when plotting these two variables against each other. Usually haz is used to detect stunting in a population (long term malnutrition) whereas whz is used to detect wasting (acute malnutrition).
TAKE A LOOK at how to Scatterplot the Association of WHZ and HAZ
(weight-for-height z-scores and
height-for-age z-scores using the Kenya Dataset).
However, weight-for-age and height-for-age ARE correlated.
INTERPRETATION: Notice how you could draw a slanted line from the bottom left up and to the right. This indicates a linear association between height-for-age and weight-for-age, where the lower the height for the age, the lower the weight for the age. There is an expected association between these two variables. In fact, it is expected that weight for age will be associated to some extent with both height for age and weight for height, since it is supposed to be estimating both chronic and acute malnutrition in one variable (although it is not a perfect estimation).
TAKE A LOOK at how to Scatterplot the Association of WAZ and HAZ
(weight-for-age z-scores and height-for-age z-scores using the Kenya East Dataset.)
To further illustrate a relationship between two variables we can look at the bdeshc.sav data set. One would expect a clear relationship between low arm circumference in females and low arm circumference in males.
INTERPRETATION: Here the relationship looks like it is a positive linear relationship, as expected. It should be that as male arm circumference increases, so does females...or where males are malnourished females are also malnourished. There are not any areas that are so clearly out of the expected pattern that they might causes skewing of the results. The rule of thumb is that outliers should neither mask nor drive a relationship.
TAKE A LOOK at the scatterplot from the Bangladesh clean
data set showing the
Association of Low AC in Males and Females.
CORRECTION of Data Errors and Coping with Errors
Once errors are detected, it is important to know how to handle them appropriately so the data can be analyzed without losing their integrity or robustness. There are slightly different ways to deal with error in DEPENDENT and INDEPENDENT variables.
When there are a minimal number of errors, the values are generally recoded to "missing".
TAKE A LOOK at the procedure for Recoding a variable.
What this means is that the suspicious values are counted as missing data since they are not within an acceptable range.
If there are many error values, then check to see if some of the values of the independents are the same for missing and non-missing. If so, then there is less chance of bias in the analysis. If not, then it is possible that the data is not good and that variable should be used with caution.
RECALL How to Identify Suspect Cases that are out of range
scatterplot against case number sort and eyeball declare missing
If there are few data errors, set them to 'missing' using the recoding scheme as above. Another option for defining missing values is through the Utilities menu.
MISSING VALUES can be defined through the Utilities menu if you CLICK on the button.
However, it is good to use caution when setting many values to 'missing,' especially when you will be doing multiple variable analysis. (A term used in this instance to define any statistical method that uses many explanatory or independent variables to predict a single outcome or dependent variable.)
If necessary you can set the error values to the data set mean or the group mean (maybe by age group, for example). An example is by using the Bangladesh data and variable about household size. In the 'dirty' data set, there are several districts that are reporting average household size as being less than 2 persons.
TAKE A LOOK at the graph and see the procedure to detect Improper Values in the dirty Bangladesh data.
Perhaps the best way to deal with this particular error in the data is to set these extreme variables to the mean value. When this is done, the histogram looks much better, with a more normal distribution of values. An exercise is given below to practice this concept and then compare your results to the results shown by PANDA. The easiest way to set the extreme values to the average is to use the Recode option under the Transform menu.
Graph the data set now with the new values and it should look like the one set to the mean that is shown below .
A LOOK at how to Set Values to the MEAN
and to graph the results.
Page Two Test Yourself
Return to Top