The first step to working with data is getting to know the data. It is not usually the case that the data analyst personally developed the survey instrument, collected the data, and performed the data entry. For this reason, becoming familiar with the data set and what it represents is a good starting point.
The easiest way to visualize the data is to print out a codebook, something you can mark up and use to identify the variables that will be most useful for the analysis. In SPSS, making a codebook is simple: use the File menu option to display data info. It might be a bulky printout, but it is usually worthwhile to have on hand throughout the analysis.
Through the codebook, the characteristics of the data set start to emerge. The codebook allows you to classify the variables collected, so that it is clear what information is available and what is not. This might be the time to start planning how the analysis will use the collected data, which might include:
Collection of the data might vary slightly in sampling style and specific questions used, but surveys will often request information on all or most of these 8 categories. For the dietary surveys, there sometimes is a desire to analyze the nutrient quantity or quality of the consumed food that was recorded. The PANDA package does not address this level of nutrient analysis, mainly because this type of package is complex and several already exist. To see a list of computerized dietary analysis systems (which include food catalogues and nutrient values), follow the link.
The purpose of data cleaning is to determine what biases (errors) have been introduced into the data and how you can eliminate or control for some of them. The biases might be inherent in the survey instrument (either the question itself or interviewer bias), in the collection of the data (measurement error), or in data entry (keying errors). Usually, once you are at the stage of analysis, you can do little more than identify that there was a problem with the question used or the measurements taken in the survey and make note of it. With clinical and sub-clinical measures, you might sometimes have enough information to correct for a systematic instrument error, but this is unlikely. One unusual case where a measurement can be corrected is hemoglobin, when no adjustment was made for altitude. If the altitude is known for the areas of collection, the correction can be made at this stage of analysis. This is a very specific example, though, and usually these types of biases cannot be remedied. A section on Data Cleaning referring to anthropometry is in the Analysis Module of PANDA.
The problems that can often be alleviated at the analyst's hands are data entry errors. These might be caught through several methods (if you need a review of how to run these routines, use the link at the bottom of the page on Basics of SPSS or Basics of EPI-Info):
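As an illustration of the altitude correction mentioned above, here is a minimal Python sketch. It assumes a quadratic adjustment of the kind published by CDC and reproduced in WHO guidance (coefficients in g/dl, applied only from 1000 m upward); the formula, the function names, and the sample records are assumptions for illustration and are not taken from this module. Check the coefficients against the reference used for your own survey before applying them.

```python
def altitude_adjustment_g_dl(altitude_m):
    """Amount (g/dl) to subtract from a measured hemoglobin value.

    Assumption: CDC-style quadratic adjustment, applied only at 1000 m
    or higher. Verify the coefficients against your own reference.
    """
    if altitude_m < 1000:
        return 0.0
    k = altitude_m * 0.0033  # altitude expressed in thousands of feet (approx.)
    return -0.032 * k + 0.022 * k ** 2


def adjust_hemoglobin(hb_g_dl, altitude_m):
    """Return the sea-level-equivalent hemoglobin value (g/dl)."""
    return hb_g_dl - altitude_adjustment_g_dl(altitude_m)


# Hypothetical records: (measured hemoglobin in g/dl, cluster altitude in m)
records = [(11.4, 200), (12.1, 1500), (12.8, 2500)]

for hb, altitude in records:
    print(altitude, hb, round(adjust_hemoglobin(hb, altitude), 2))
```

The adjusted value, not the measured one, would then be compared with the cut-points.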
Running Frequencies - to detect improper coding/labeling or invalid/out-of-range entries
Running Descriptives - to show out-of-range entries (minimum and maximum) and differences between mean and median values (which suggest skewing)
Scatterplotting - to check for known associations and the outliers that might appear when running them (e.g. general malnutrition and anemia)
Running Histograms - to check the distribution of a variable (especially if it is known to be normally distributed; a normal curve can be superimposed for comparison) and to look for outliers
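The module runs these checks in SPSS or EPI-Info. For readers who want to see the logic behind the first two checks, the following is a rough Python sketch using only the standard library. The variable names, the codes, the plausible range of 4 to 18 g/dl, and the sample values are all hypothetical.

```python
from collections import Counter
from statistics import mean, median


def frequencies(values):
    """Count how often each code occurs, so invalid codes stand out."""
    return Counter(values)


def describe(values):
    """Minimum, maximum, mean and median of the non-missing values."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


def out_of_range(values, low, high):
    """Return (row index, value) pairs that fall outside [low, high]."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if v is not None and not (low <= v <= high)
    ]


# Hypothetical data: the codebook allows sex = 1 or 2, so 3 is a coding error;
# 112.0 looks like a misplaced decimal point in a hemoglobin value (g/dl).
sex = [1, 2, 2, 1, 3, 2]
hb = [11.2, 10.8, 112.0, 9.9, 12.1, None]

print(frequencies(sex))
print(describe(hb))
print(out_of_range(hb, 4.0, 18.0))
```

A large gap between the mean and the median in the descriptives, as the keying error produces here, is the same signal of skewing described in the list above.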
Data cleaning and characterizing involves more than just error detection and correction, though; it involves TRANSFORMING as well. By this time (after using ANALYSIS PANDA), transforming data has become routine, so making binary or other categorical variables from many familiar variables (such as water source, toilet use, and access to health care) will not be revisited. What will be covered are the transformations specific to the micronutrient variables in use.
Transforming micronutrient variables (sub-clinical specifically) will involve using the indicator tables for each micronutrient in section 1 and applying the established cut-points to create categorical variables for the levels of malnutrition (e.g. mild, moderate, severe). Once the variable is categorized using established cut-points (often developed by WHO), these indicators will give the prevalence and numbers of individuals affected by the deficiency. A benefit of the indicators for sub-clinical deficiencies (hemoglobin, serum retinol, or urinary iodine) is that they fall on a continuum. For example, see the distribution of hemoglobin scores in children 0-5 years (fig. 1).
Figure 1. Hemoglobin scores (g/dl) in children 0-5 years
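The cut-point transformation described above can be sketched in Python as follows. The thresholds used here (11.0, 10.0 and 7.0 g/dl) are an assumption based on commonly cited WHO cut-points for anemia in children under five; use the indicator tables in section 1 for the actual values. The function names and sample values are hypothetical.

```python
from collections import Counter

LEVELS = ("none", "mild", "moderate", "severe")


def classify_anemia(hb_g_dl):
    """Map a hemoglobin value (g/dl) to a category.

    Assumed cut-points for children under five; replace them with the
    values from the indicator tables in section 1.
    """
    if hb_g_dl < 7.0:
        return "severe"
    if hb_g_dl < 10.0:
        return "moderate"
    if hb_g_dl < 11.0:
        return "mild"
    return "none"


def prevalence(values):
    """Proportion of individuals falling in each category."""
    counts = Counter(classify_anemia(v) for v in values)
    n = len(values)
    return {level: counts[level] / n for level in LEVELS}


# Hypothetical hemoglobin values (g/dl), already cleaned and adjusted
hb = [12.3, 10.5, 9.1, 6.8, 11.0, 10.9, 11.7, 8.4]

for level, share in prevalence(hb).items():
    print(level, share)
```

Keeping the continuous variable alongside the new categorical one leaves both available for later analysis.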

Because the variable is continuous and normally distributed, you can use it as the outcome in linear regression, which is a powerful method of analysis. The analysis will be similar to the one in the ANALYSIS PANDA that uses weight-for-age z-score as the outcome.
When cleaning and transforming specific micronutrient outcome variables, each one must be considered individually. Exercises and further characterization are provided for the process and outcome variables commonly collected for VAD, IDA, and IDD; follow the hyperlinks in red at the top of the page.
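To make the idea concrete, here is a minimal sketch of a one-predictor least-squares fit with hemoglobin as the outcome. It is not the SPSS procedure used in this module; the predictor (age in months), the function name, and the data are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: age in months and hemoglobin in g/dl
age_months = [6, 12, 24, 36, 48, 59]
hb_g_dl = [10.1, 10.4, 10.9, 11.3, 11.6, 12.0]

slope, intercept = fit_line(age_months, hb_g_dl)
print("change in hemoglobin (g/dl) per month of age:", round(slope, 3))
print("intercept (g/dl):", round(intercept, 2))
```

A statistical package would also report standard errors and allow several predictors, which this sketch leaves out.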