Every dataset contains some errors, and every analyst goes through the rite of passage of wasting days drawing wrong conclusions because the errors were not rooted out first. Up to half of the time needed for analysis is typically spent "cleaning" the data, and this time is typically underestimated. Often, once a clean dataset is achieved, the analysis itself is quite straightforward.

Unless the dataset is small (i.e., fewer than 100 cases and 10 variables), cleaning is done in several stages. To begin with, the key variables are examined and corrected. For nutrition, this usually means the anthropometric and related variables (e.g., age, sex), and important independent variables like location, socioeconomic factors, and feeding practices. In a large dataset, especially one in which nutrition is one of several modules, variables that become of interest as analysis proceeds need to be cleaned as you go along. The important thing is never to skip data cleaning. Discipline is needed: never introduce new variables into an analysis without first checking them for errors and dealing with those errors.
Definition
Data cleaning: a two-step process of detecting and then correcting errors in a data set.

Common sources of error are:
Measurement and Interview Error
Mistakes originating in the question, in the measurement, or in incorrect questionnaire responses are difficult to detect unless the values are out of range. This is similar to the situation for coding errors. Such mistakes may be readily apparent as outliers when the researcher examines bivariate associations with scatterplots.

Most errors will be detected using three procedures: descriptive statistics, scatterplots, and histograms. The routines are described for SPSS and Epi-Info.
Detection of data errors

Error Detection - Descriptive statistics by variable
In the SPSS DataEditor menu there are: Graphs/Scatterplots; Statistics/Descriptives (or Descriptives/Frequencies for variables with few categories); and Graphs/Histograms.
A preview of how to handle errors once they are found is given under 'Coping With Errors'. Take a look at an example of descriptive statistics with data errors in them. Descriptive statistics show the mean of a variable (for example, the mean prevalence of low arm circumference), as well as the median, the standard deviation, and both the minimum and maximum values. What you are looking for in data cleaning are:
Exercise
Using the bdeshc.sav data set, produce your own set of descriptives and graphs using the ACPRVM, F, and FM variables. (This is the clean data set: the dirty data set has the letter d after bdesh and the clean set has the letter c.) To run descriptive statistics, click on Statistics, then Summarize, then Descriptives, and choose the variable ACPRVM. Compare the results with those shown here; they should be the same. Take a look at descriptive statistics and a histogram of ACPRVM.
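For readers working outside SPSS, the same kind of check can be sketched in Python. This is only an illustration under stated assumptions: the prevalence values below are made up (they are not taken from bdeshc.sav or bdeshd.sav), and the helper names `describe` and `out_of_range` are hypothetical. It computes the descriptive statistics discussed above and lists the cases that fall outside the range a percentage can take.

```python
import statistics


def describe(values):
    """Return the summary statistics reviewed during data cleaning."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "sd": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }


def out_of_range(values, low, high):
    """Return (case_number, value) pairs that fall outside [low, high]."""
    return [
        (case, v)
        for case, v in enumerate(values, start=1)
        if v is not None and not (low <= v <= high)
    ]


# Hypothetical prevalences (%) of low arm circumference, one per case.
# None marks a missing value; 130.0 and -3.0 are planted errors.
prevalence = [12.5, 9.8, 14.1, None, 11.0, 130.0, 10.4, -3.0]

summary = describe(prevalence)
suspects = out_of_range(prevalence, 0, 100)  # a percentage must be 0-100

print(summary)
print(suspects)
```

The minimum and maximum in the summary are usually the quickest signal: a prevalence below 0 or above 100 cannot be real, so those cases are the first to investigate.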
Error Detection - Frequencies
Frequencies help to locate the 'dirty' data among the entered variables. For instance, anthropometry (weight-for-age, arm circumference, body mass index, or BMI) and many biological and social variables are expected to be distributed fairly normally. A normal distribution is a bell curve; for z-scores, which have a mean of 0, approximately 68% of the population lies within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three. The distribution should not be biphasic (having two phases or 'humps'). Errors can be detected by running frequencies of weight-for-age categories (mild, moderate, severe), which will show the percentage lying in each. A histogram is probably an easier way to detect errors in distribution. Frequencies are also useful in detecting unequal distribution in categories such as age, sex, or village.

One common problem with nutritional data in developing countries is that a child's age is most often determined by asking the mother or caretaker. There is a risk of 'contamination' of the data through recall bias, where the mother is not exactly sure of the child's age and rounds to the nearest 6- or 12-month interval. This leads to a phenomenon called age heaping or clumping. Rounding a child's age up can result in the child being misclassified as malnourished, and rounding down can lead to missed cases of malnourished children. Age heaping can substantially affect calculations of WAZ and HAZ. Take a look at an example of age heaping and learn how to make a graph.

Error Detection - Logic Checks
You can often detect errors in data simply by seeing whether the responses are logical. For example, the percentages of responses to a question should total 100%, not 110%. Another example would be a question about current pregnancy where the reply is marked 'yes' but the respondent is coded as a man!
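The age heaping and logic checks described above can be sketched in a few lines of Python. This is a rough illustration, not a standard index: the heaping ratio below simply assumes that, without rounding, about one age in six would fall on a multiple of 6 months. The sample ages, the record layout, and all function names are hypothetical.

```python
from collections import Counter


def heaping_ratio(ages_months, interval=6):
    """Observed share of ages on multiples of `interval`, divided by the
    share expected (1/interval) if ages were spread evenly over months."""
    on_multiple = sum(1 for age in ages_months if age % interval == 0)
    observed_share = on_multiple / len(ages_months)
    return observed_share * interval


def percentages_total_100(percentages, tolerance=0.5):
    """Logic check: category percentages should add up to about 100."""
    return abs(sum(percentages) - 100) <= tolerance


def pregnant_men(records):
    """Logic check: return ids of cases coded male and currently pregnant."""
    return [r["id"] for r in records
            if r["sex"] == "M" and r["pregnant"] == "yes"]


# Hypothetical reported ages in months; most sit on 6-month multiples.
ages = [6, 12, 12, 18, 24, 24, 24, 7, 13, 36, 30, 12]
age_counts = Counter(ages)          # frequency table of reported ages
ratio = heaping_ratio(ages)         # values well above 1 suggest heaping

# Hypothetical survey records for the pregnancy logic check.
records = [
    {"id": 1, "sex": "F", "pregnant": "yes"},
    {"id": 2, "sex": "M", "pregnant": "yes"},
    {"id": 3, "sex": "M", "pregnant": "no"},
]

print(age_counts.most_common(3))
print(round(ratio, 2))
print(percentages_total_100([45.0, 35.0, 30.0]))
print(pregnant_men(records))
```

A ratio near 1 means reported ages are spread roughly evenly across months; the larger the ratio, the more the ages cluster on the rounded values.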
The main idea is to have fun looking for these errors, because it would be quite embarrassing to report that 10% of the men in your sample are pregnant!

Error Detection - Bivariate Outliers
Some data errors only appear when two variables are compared. Here we are looking for outliers: values of a variable that are far from the usual values in the data. Take a look at outliers from the case summaries of low arm circumference prevalence.

Exercise
To produce a scatter plot graph, open the bdeshd.sav file.
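Outside SPSS, one possible way to pick out bivariate outliers numerically is to fit a straight line to the two variables and list the cases that lie far from it. The sketch below does this with an ordinary least-squares line and a cut-off of two residual standard deviations; both the cut-off and the data are assumptions made for illustration (the values are invented, not taken from bdeshd.sav), and `bivariate_outliers` is a hypothetical helper.

```python
import math


def bivariate_outliers(x, y, k=2.0):
    """Fit y = a + b*x by least squares and return the case numbers whose
    residual is more than k residual standard deviations from the line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    resid_sd = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [case for case, r in enumerate(residuals, start=1)
            if abs(r) > k * resid_sd]


# Hypothetical prevalences (%) of low arm circumference by district.
female = [10, 12, 14, 16, 18, 20, 22, 24]
male = [9, 11, 13, 50, 17, 19, 21, 23]   # case 4 is a planted error

flagged = bivariate_outliers(female, male)
print(flagged)
```

A large error also pulls the fitted line toward itself, so this is a screening aid rather than a substitute for looking at the scatterplot.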
Case Summaries Command
To investigate which of the cases may be 'dirty', you can use the Case Summaries command in the Statistics, Summarize menu. This will show which cases are causing the unusual results. Take a look at the results from using the Case Summaries command.

Expected Associations Between Variables
Another way to detect errors using bivariate associations and scatterplots is to check for expected associations between variables. In nutrition, it is known that weight-for-height and height-for-age are NOT correlated in an individual. (Take a look at a scatterplot of weight-for-height z-scores and height-for-age z-scores.)
However, weight-for-age and height-for-age are correlated. (Take a look at a scatterplot of weight-for-age z-scores and height-for-age z-scores.)

Exercise
Using the camb.sav data set, plot the association between haz and whz and the association between haz and waz.

Another Example
To further illustrate a relationship between two variables, we can look at the bdeshc.sav data set. One would expect a clear relationship between low arm circumference in females and low arm circumference in males. Take a look at the scatterplot from that clean data set showing these two variables.

Coping With Errors
Once errors are detected, it is important to know how to handle them appropriately so the data can be analyzed without losing their integrity or robustness. Errors in dependent and independent variables are dealt with in slightly different ways.

Coping with Errors - Dependent Variables
When there are only a few errors, the values are generally recoded to "missing". Take a look at the procedure for recoding a variable. This means that the suspicious values are counted as missing data, since they are not within an acceptable range. If there are many error values, check whether the values of the independent variables are similar for the cases with missing and non-missing values. If so, there is less chance of bias in the analysis. If not, the data may not be good, and that variable should be used with caution.

Coping with Errors - Identifying the suspect cases
Out-of-range cases:
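As a rough illustration of handling out-of-range values in a dependent variable, the Python sketch below recodes them to missing and then compares an independent variable between the cases that ended up missing and those that did not. Everything in it is an assumption made for the example: the z-scores and household sizes are invented, the plausible range of -6 to +5 is chosen for illustration, and the function names are hypothetical.

```python
import statistics


def recode_out_of_range(values, low, high):
    """Return a copy in which values outside [low, high] become None."""
    return [v if v is not None and low <= v <= high else None
            for v in values]


def compare_by_missingness(dependent, independent):
    """Return (mean where dependent is missing, mean where it is present)
    for one independent variable; None if a group is empty."""
    missing = [x for d, x in zip(dependent, independent) if d is None]
    present = [x for d, x in zip(dependent, independent) if d is not None]
    return (statistics.mean(missing) if missing else None,
            statistics.mean(present) if present else None)


# Hypothetical weight-for-age z-scores with two implausible values.
waz = [-1.2, -0.5, 9.9, -2.1, -7.5, 0.3]
household_size = [5, 6, 4, 7, 5, 6]

# Assumed plausible range for this example only.
waz_clean = recode_out_of_range(waz, -6, 5)
mean_missing, mean_present = compare_by_missingness(waz_clean, household_size)

print(waz_clean)
print(mean_missing, mean_present)
```

If the two means are close, dropping the suspect cases is less likely to bias the analysis; a large gap is a reason to treat the variable with caution.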
Coping with Errors - Independent Variables
If there are few data errors, set them to 'missing' using the recoding scheme as above. Another option for defining missing values is through the Utilities menu. Take a look at the procedure for defining missing values through the Utilities menu. However, be cautious about setting many values to 'missing', especially when you will be doing multivariate analysis, since cases with a missing value on any included variable are usually dropped from the analysis. (Multivariate analysis here means any statistical method that uses many explanatory, or independent, variables to predict a single outcome, or dependent, variable.) If necessary, you can set the error values to the data set mean or to a group mean (by age group, for example).

An example uses the Bangladesh data and the variable for household size. In the 'dirty' data set, several districts report an average household size of less than 2 persons. Take a look at the graph and see the procedure to create it. Perhaps the best way to deal with this particular error is to set these extreme values to the mean value. When this is done, the histogram looks much better, with a more normal distribution of values. Take a look at the graph.

Procedure
The easiest way to set the extreme values to the average is to use the Recode option under the Transform menu.
These changes will be made to that variable in the data set.

Exercise
Graph the data set now with the new values; it should look like the one linked to above.
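The same mean substitution can be sketched in Python for readers not using SPSS. This is a minimal illustration, assuming made-up district averages for household size and treating anything under 2 persons as an error, as in the example above; the upper bound of 15 and the helper name `replace_with_mean` are assumptions for this sketch only.

```python
import statistics


def replace_with_mean(values, low, high):
    """Replace values outside [low, high] with the mean of the values
    that are inside the range; in-range values are left unchanged."""
    valid = [v for v in values if low <= v <= high]
    valid_mean = statistics.mean(valid)
    return [v if low <= v <= high else valid_mean for v in values]


# Hypothetical district averages for household size; 1.5 and 1.0 are
# the implausible values (less than 2 persons).
household_size = [5.0, 1.5, 4.5, 1.0, 5.5, 6.0]

cleaned = replace_with_mean(household_size, low=2, high=15)
print(cleaned)
```

Note that the mean is taken from the valid values only, so the errors do not pull it down. Substituting the mean also makes the variable look less spread out than it really is, which is one reason to use this option only when few values are affected.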