
File Structure


The main points on file structure are:

It is important to get the file structure right; otherwise it may not be possible to get the answers the data could give.
The most important point is the level of analysis, or the definition of a case; in spreadsheet terms, it is the definition of the rows.
Always use the most disaggregated structure you can. Try to set up the file at the individual level: for anthropometry, each case is a child, and the household data are simply repeated for each child.
If you don't control the survey, it is very common to find that the file is set up at household (or parent) level, with a separate set of variables for each child within the household file (usually three or more). The row is then a household, with one set of columns for child 1, another set for child 2, and so on. This is a problem: child level analysis cannot be run directly on such a file, and information is potentially wasted.
The best solution in this case is to restructure the file from household (or respondent) level to child level; this has to be done before any child level analysis.
This can be done in SPSS (although SAS handles it better); see the steps below.
A related issue concerns the coding of information on a topic where there is a range of answers that are not mutually exclusive (this is similar to issues of transformations, but arises nearer the data capture end of things). Examples might be sources of income (and the proportion from each), or treatment of diarrhea (water, ORS, rice water, ...). The principle is to use as many variables as needed, rather than to compress the data by coding combinations of answers into fewer variables: for diarrhea treatment, code 'water' as one variable, 'ORS' as another, and so on. In spreadsheet terms, the first principle is to use as many rows as possible; the second is to use as many columns (which are variables) as needed.
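
The last point, one variable per possible answer, can be sketched in a few lines of Python. This is only an illustration, not part of the SPSS exercise; the variable names (treat_water and so on) and the sample answers are invented for the sketch.

```python
# Hypothetical answer categories; the text names only water, ORS and rice water.
TREATMENTS = ["water", "ors", "rice_water"]


def to_indicators(answers):
    """Return one 0/1 variable per treatment for a single child."""
    given = set(answers)
    return {f"treat_{name}": int(name in given) for name in TREATMENTS}


# A child who was given both water and ORS gets two variables set to 1.
row = to_indicators(["water", "ors"])
print(row)
```

Each treatment can then be tabulated on its own, and a child with several treatments loses no information.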



The choice of file structure in any survey that contains both household and child level information depends on two major factors: the set-up of the questionnaire, and the subsequent analytical procedures. Data collection and entry must minimize errors; ideally, therefore, information is entered into the computer in exactly the same order as it appears on the questionnaire. In an interview, it may be easier to add data on children onto the same page as the household information if the same person is being asked for both. In this case, it would be easiest to keep the data at the level of the caretaker (or household, which is the same here). For the final analyses, however, files should always be at the level of aggregation required by the analysis. Most child level analyses cannot be done on a household level file, i.e. one in which the child information is stored as variables rather than as cases. In this instance, the child data have to be restructured.

Data Base Management packages and statistical software available today can easily manipulate file structures, although moving cases around is much easier than manipulating variables. Creating a child level file from a household level file involves changing variables. Some software packages handle this better than others (e.g. SAS is much better than SPSS in this respect).

Because the size of the file is no longer a major issue, lengthy "composite" variables are no longer necessary, and they can cause major problems in restructuring files. A variable of this type cannot be analyzed unless it is first broken up into its separate parts, which often creates problems, especially if information is missing for only part of the variable. Missing data can cause column shifting, and thus the loss of answers for all the variables included in the composite variable. It is always best to keep all variables separate, even when only a few cases have the information in question.
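
The column shifting problem can be illustrated with a small Python sketch. The composite layout used here (three two-character parts packed into one code) and the sample codes are assumptions made for the illustration, not taken from the exercise file.

```python
def split_composite(code, width=2, parts=3):
    """Cut a fixed-width composite code into its parts by position."""
    return [code[i * width:(i + 1) * width] for i in range(parts)]


# All three parts present: the positions line up as intended.
complete = split_composite("120345")

# Middle part left out at data entry: the third part slides into
# the second position and the third position comes back empty.
shifted = split_composite("1245")

# Keeping the parts as separate variables leaves the gap explicit.
separate = {"part1": "12", "part2": None, "part3": "45"}

print(complete, shifted, separate)
```

With separate variables the missing answer stays missing, and the answers on either side of it remain where they belong.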

Although microcomputer software has become sophisticated enough to handle these file structure issues, it is always best to keep file manipulation to a minimum in order to avoid introducing errors into the data. If possible, child level information should be kept in a separate file from household information, with a unique identifier for merging the two sources of data.
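
As a rough sketch of how two such files can be brought together, the Python below attaches household data to each child through a shared identifier. The identifier name HHID, the other variable names, and all values are hypothetical.

```python
def attach_household(children, households, key="HHID"):
    """Repeat each household's data on every child from that household."""
    lookup = {}
    for hh in households:
        if hh[key] in lookup:
            raise ValueError(f"identifier {hh[key]} is not unique")
        lookup[hh[key]] = hh
    merged = []
    for child in children:
        hh = lookup[child[key]]  # raises KeyError if a child has no household
        merged.append({**hh, **child})
    return merged


# Invented example data: two households, three children.
households = [
    {"HHID": 1, "water_source": "well"},
    {"HHID": 2, "water_source": "tap"},
]
children = [
    {"HHID": 1, "CHILD": 1, "age_months": 14},
    {"HHID": 1, "CHILD": 2, "age_months": 40},
    {"HHID": 2, "CHILD": 1, "age_months": 27},
]

merged = attach_household(children, households)
print(merged[0])
```

The merge fails loudly if the identifier is duplicated in the household file or missing for a child, which are the two ways an identifier can fail to be usable for merging.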




For this exercise, the household file EX100.SAV will be used as an example. It contains 100 households with information on up to six children for each case (i.e. household). In restructuring, only the following variables will be selected from this file: V001 - V012, and HW1$x - HW19$x (where "x" is the number of the child). Out of a possible six children per household, only information on three children will be chosen.

The procedure used involves first creating three separate child level files, and then merging these into one file. The exact methods used in SPSSPC will depend on the size and structure of the household file. In this exercise, for example, the variables to be included in the child file are selected from the household file; this is a good method when the variables needed are fewer than the total number of variables in the household file. If most of the variables in the household file are needed in the child file, it may be more efficient to open the household file, save it under a new name, and then simply delete variables and cases to reach the final child level file.
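
The logic of this procedure can be sketched outside SPSS. The Python below is an illustration only: it uses two invented households and just two of the child variables, adds a hypothetical CHILD variable to record the child's position, and assumes that a child slot with no data means no child.

```python
def make_child_file(households, x, hh_vars, child_vars):
    """Build the child level file for child number x from household rows."""
    rows = []
    for hh in households:
        child = {name: hh.get(f"{name}${x}") for name in child_vars}
        if all(value is None for value in child.values()):
            continue  # delete cases with no child in this slot
        row = {name: hh[name] for name in hh_vars}  # household data repeated
        row["CHILD"] = x                            # hypothetical child index
        row.update(child)                           # e.g. HW1$2 becomes HW1
        rows.append(row)
    return rows


# Invented values; the second household has only one child.
households = [
    {"V001": 1, "V002": 7, "HW1$1": 14, "HW2$1": 98, "HW1$2": 40, "HW2$2": 135},
    {"V001": 2, "V002": 3, "HW1$1": 27, "HW2$1": 112},
]

# One file per child number, then stack the three files into one.
child_files = [
    make_child_file(households, x, ["V001", "V002"], ["HW1", "HW2"])
    for x in (1, 2, 3)
]
child_level = [row for rows in child_files for row in rows]
print(len(child_level))
```

Renaming HW1$x to HW1 in every file is what allows the three files to be stacked: after renaming, the same variable name means the same measurement in each.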

CLICK HERE for an exercise on Creating Child Level Files.



Before merging these three files, make sure all variables are identical in name, type, and format. Since the names were changed, check carefully that they are the same in all three child files. The format and type of each variable should have been copied from the household file, and so should already be identical.
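
A check of this kind can be sketched as follows. The Python below compares variable names and value types across files; it does not look at display formats, and the three tiny files with their deliberate mistakes are invented for the illustration.

```python
def variable_layout(rows):
    """Map each variable name to the set of value types seen in a file."""
    layout = {}
    for row in rows:
        for name, value in row.items():
            types = layout.setdefault(name, set())
            if value is not None:
                types.add(type(value).__name__)
    return layout


def layout_mismatches(files):
    """Compare every file with the first; return (file number, variable, problem)."""
    reference = variable_layout(files[0])
    problems = []
    for number, rows in enumerate(files[1:], start=2):
        layout = variable_layout(rows)
        for name in sorted(set(reference) | set(layout)):
            if name not in reference:
                problems.append((number, name, "extra"))
            elif name not in layout:
                problems.append((number, name, "missing"))
            elif reference[name] and layout[name] and reference[name] != layout[name]:
                problems.append((number, name, "type"))
    return problems


# File 2 stores HW1 as text; file 3 has a misspelled variable name.
file1 = [{"V001": 1, "HW1": 14}]
file2 = [{"V001": 2, "HW1": "40"}]
file3 = [{"V001": 3, "HW01": 22}]

problems = layout_mismatches([file1, file2, file3])
print(problems)
```

An empty list means the files agree on names and types and can be merged.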

CLICK HERE for an exercise on Merging child level files.



It is always necessary to check the results of a merge. There are several different ways this can be done. Consistency checking of this sort will depend upon the information available in the file.
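
Three checks that need nothing beyond the files themselves are sketched below in Python: the number of cases, the uniqueness of each child's identifier, and the consistency of the repeated household data. Treating V001 as the household identifier and CHILD as the child number is an assumption made for the sketch, and the data are invented.

```python
from collections import Counter


def check_child_file(merged, parts, hh_vars, key="V001"):
    """Return a list of problems found in a merged child level file."""
    problems = []
    # 1. The merged file should hold every case from the separate files.
    if len(merged) != sum(len(part) for part in parts):
        problems.append("case count differs from the sum of the child files")
    # 2. Each (household, child number) pair should occur exactly once.
    counts = Counter((row[key], row["CHILD"]) for row in merged)
    duplicates = sorted(pair for pair, n in counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate child identifiers: {duplicates}")
    # 3. Household variables should be the same for all children of a household.
    seen = {}
    for row in merged:
        values = tuple(row[name] for name in hh_vars)
        if seen.setdefault(row[key], values) != values:
            problems.append(f"household {row[key]} has inconsistent household data")
    return problems


part1 = [{"V001": 1, "V002": 7, "CHILD": 1}, {"V001": 2, "V002": 3, "CHILD": 1}]
part2 = [{"V001": 1, "V002": 7, "CHILD": 2}]
good = part1 + part2
bad = part1 + [{"V001": 1, "V002": 8, "CHILD": 1}]  # wrong child number and V002

print(check_child_file(good, [part1, part2], ["V002"]))
print(check_child_file(bad, [part1, part2], ["V002"]))
```

Further checks, such as comparing frequencies of a variable before and after the merge, depend on what information the file contains.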

CLICK HERE for an exercise in Checking the child level file.
