
Confidence Intervals

A confidence interval is a range, calculated from a sample, that is used to estimate a characteristic of the survey population.

If you're not familiar with biostatistics notation, please open this page while looking at the example.  Suppose we have a group of 20 HIV+ patients.  We want to know what the average age is, but we don't have time to ask every person in the group.  We randomly select 5 patients, represented by the blue numbers (21, 25, 24, 31, and 26).  This is probability sampling.  Every patient had a 5 in 20, or 25%, chance of being selected.

Ages of Patients

21, 32, 26, 24, 30, 22, 25, 24, 31, 27, 32, 29, 20, 22, 26, 27, 28, 24, 30, 23

What is n?  What is N?  (Answer: n = 5, the size of the sample; N = 20, the size of the whole group.)
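The random selection described above can be sketched in a few lines of Python.  This is only an illustration: the seed value is an arbitrary assumption made so the draw can be repeated, and the patients it picks will generally differ from the five used in the rest of this example.

```python
import random

# The 20 patient ages listed above (the whole group, so N = 20).
ages = [21, 32, 26, 24, 30, 22, 25, 24, 31, 27,
        32, 29, 20, 22, 26, 27, 28, 24, 30, 23]

N = len(ages)  # population size
n = 5          # sample size

# Sample without replacement, so no patient can be picked twice.
rng = random.Random(0)  # arbitrary fixed seed, assumed for repeatability
sample = rng.sample(ages, n)

# Under simple random sampling every patient has the same chance: n / N.
selection_probability = n / N

print("sample:", sample)
print("chance of selection:", selection_probability)
```

Because every patient has the same known chance of selection, this is a simple form of probability sampling.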

Now let's go back to the original question.  What is the average age of the group?  In order to get an exact number, we would have to add together the age of every patient and divide by N.  In this case it is possible, but in many other instances it would not be feasible.  Imagine trying to figure out the percentage of underweight children in a country and only having two weeks to do it.  Instead of getting an exact number, we are going to use our sample (of size n) to construct a range of plausible values for the mean.  This is a confidence interval.  To calculate the interval we must first know the sample mean, variance, and standard error.

 

Sample Mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The above formula shows that to calculate the sample mean $\bar{x}$, we add the sample ages and divide by n.

The patients we selected (those in blue):  21 + 25 + 24 + 31 + 26 = 127

127/n = 127/5 = 25.4

$\bar{x}$ = 25.4
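The same arithmetic can be checked with a short Python sketch.  The variable names are chosen here for illustration only.

```python
# Ages of the five sampled patients.
sample = [21, 25, 24, 31, 26]

n = len(sample)       # sample size
total = sum(sample)   # sum of the sampled ages
sample_mean = total / n

print(total, sample_mean)
```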

 

Variance:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

The formula for variance ($s^2$) incorporates the sample mean $\bar{x}$, which must be known before calculating variance.  We have $\bar{x}$ = 25.4.  Now we must subtract this from each patient's age and square the result.

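| Patient | Age ($x_i$) | Mean ($\bar{x}$) | Difference squared $(x_i - \bar{x})^2$ |
|---|---|---|---|
| 1 | 21 | 25.4 | 19.36 |
| 2 | 25 | 25.4 | 0.16 |
| 3 | 24 | 25.4 | 1.96 |
| 4 | 31 | 25.4 | 31.36 |
| 5 | 26 | 25.4 | 0.36 |
| SUM | 127 | | 53.20 |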

Note that because we are squaring, the result will always be positive.  The next step is to take the sum and divide by n-1.

53.20/(n-1) = 53.20/4 = 13.3

$s^2$ = 13.3
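As a check on the hand calculation, here is a small Python sketch that computes each squared difference for the five sampled ages, sums them, and divides by n-1.  The standard library function `statistics.variance` uses the same n-1 divisor, so it is included as a cross-check.

```python
import statistics

sample = [21, 25, 24, 31, 26]
n = len(sample)
sample_mean = sum(sample) / n

# Squared difference between each age and the sample mean.
squared_diffs = [(age - sample_mean) ** 2 for age in sample]
for age, sq in zip(sample, squared_diffs):
    print(age, round(sq, 2))

sum_of_squares = sum(squared_diffs)
sample_variance = sum_of_squares / (n - 1)

print(round(sum_of_squares, 2), round(sample_variance, 2))

# Cross-check against the standard library (also divides by n - 1).
library_variance = statistics.variance(sample)
```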

 

Standard Error (mean):

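Now that we know the variance of our sample we can use it to calculate the standard error.

$$se(\bar{x}) = \sqrt{(1-f)\frac{s^2}{n}}$$

Here f = n/N is the fraction of the group that was sampled.  For our sample this is 5/20 = .25.  (The term 1-f is a correction factor and is only important if n is more than 5-10% of N.  For large populations it can be left out of the formula.)

$$se(\bar{x}) = \sqrt{(1-.25)\frac{13.3}{5}} = \sqrt{1.995} = 1.412$$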

We have our standard error and can now move on to constructing a confidence interval.
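The sketch below works through the standard error in Python.  It assumes the formula is the square root of (1-f) times the variance divided by n, with f = n/N; that assumption reproduces the value of 1.412 used in the next section.

```python
import math

n = 5                    # sample size
N = 20                   # size of the whole group
sample_variance = 13.3   # from the variance step

# Sampling fraction: the share of the group that was sampled.
f = n / N

# Standard error of the mean, with the (1 - f) correction
# (assumed form of the formula, see the note above).
standard_error = math.sqrt((1 - f) * sample_variance / n)

# For comparison: the value if the correction were left out.
standard_error_uncorrected = math.sqrt(sample_variance / n)

print(f, round(standard_error, 3), round(standard_error_uncorrected, 3))
```

With a quarter of the group sampled, leaving out the correction would noticeably overstate the standard error.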

 

Confidence Interval (mean)

$$\bar{x} \pm z \cdot se(\bar{x})$$

This formula has three components: the mean ($\bar{x}$), a z score ($z$), and the standard error ($se(\bar{x})$).  We already have a mean of 25.4 and standard error of 1.412.  That leaves us with the z score.  For now we'll look at a z score as the amount of risk we're willing to take (it is covered in depth elsewhere).  If we want to be 95% sure about our confidence interval, we set alpha at 5% or .05.  This will give us a z score of 1.96.  We can now fill in the formula.

25.4 ± (1.96)(1.412)

Adding and subtracting (1.96)(1.412) ≈ 2.8 from 25.4 gives us our range.

= (22.6, 28.2)

We are 95% confident that the average age of the group is between 22.6 and 28.2.
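The interval can also be computed in Python.  This sketch gets the z score from the standard normal distribution with `statistics.NormalDist` (Python 3.8 or later) instead of a printed table, and it assumes the standard error includes the 1-f correction with f = n/N.

```python
import math
from statistics import NormalDist

sample = [21, 25, 24, 31, 26]
N = 20                                   # size of the whole group
n = len(sample)

mean = sum(sample) / n
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
se = math.sqrt((1 - n / N) * variance / n)   # assumed form, with 1 - f

alpha = 0.05                             # 95% confidence
z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided z score, about 1.96

margin = z * se
lower, upper = mean - margin, mean + margin

print(round(z, 2), round(margin, 1), round(lower, 1), round(upper, 1))
```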

What is the exact average of the group (add together the ages of all of the patients and divide by N)?

Does our confidence interval contain the true population mean?

 


Proportions

 

Ages of Patients

21, 32, 26, 24, 30, 22, 25, 24, 31, 27, 32, 29, 20, 22, 26, 27, 28, 24, 30, 23

Starting with a different question, the formulas for confidence intervals change slightly.  What percentage of the group is over 25 years of age?  The answer to this question is a proportion, not a mean.  In this example it is possible to answer the question without taking a sample.  11 out of the 20 patients, or 55%, are over 25.  Again, imagine you are searching for the answer to a much larger question, one that encompasses an entire nation.  Let's select our sample and build a confidence interval.

This time we have selected 6 patients (21, 25, 28, 24, 31, 26).  What is n/N?  3 of the 6, or 50%, are over 25 years old.  Now it's time to calculate the standard error.

Standard Error (proportion)

$$se(p) = \sqrt{(1-f)\frac{pq}{n-1}}$$

Compare this formula to the previous one, $se(\bar{x}) = \sqrt{(1-f)\frac{s^2}{n}}$.

In our proportion formula we have two new letters, p and q.  p is the proportion we're looking at: in this case, patients over the age of 25.  In our sample p equals 50%, or .50.  q is the proportion that doesn't fit our criteria, that is, patients 25 years old or younger.  This will always be 1-p.  In this example it is .50.

Another difference in the formula is that we are subtracting 1 from n.  In the previous example we did the same when calculating the variance, so it was already done by the time we reached our standard error.  Subtracting 1 is a correction factor that makes very little difference when dealing with large samples.  In this case it is important because we only have 6 people.  Let's calculate the standard error.  Remember that f = n/N = 6/20 = .30.

$$se(p) = \sqrt{(1-.30)\frac{(.5)(.5)}{6-1}} = \sqrt{.035} = .187$$
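Here is a short Python sketch of the standard error for a proportion.  It assumes the formula is the square root of (1-f) times pq divided by n-1, with f = n/N, which reproduces the .187 used in the confidence interval that follows.

```python
import math

sample = [21, 25, 28, 24, 31, 26]  # the six sampled ages
N = 20                             # size of the whole group
n = len(sample)

# p: share of the sample over 25; q: everyone else in the sample.
over_25 = sum(1 for age in sample if age > 25)
p = over_25 / n
q = 1 - p

f = n / N  # sampling fraction

# Standard error of the proportion (assumed form, see the note above).
se_p = math.sqrt((1 - f) * p * q / (n - 1))

print(over_25, p, q, f, round(se_p, 3))
```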

 

Confidence Interval (proportion)

Again we will set alpha at 5%, giving us a z score of 1.96.  All that's left is to plug in p and the standard error that we just calculated.

.5±(1.96)(.187)

.5±.37

(.13, .87)
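The proportion interval can be reproduced with the Python sketch below.  As before, the z score comes from `statistics.NormalDist`, and the standard error formula (with the 1-f correction and the n-1 divisor) is an assumption that matches the .187 used above.

```python
import math
from statistics import NormalDist

n, N = 6, 20   # sample size and size of the whole group
p = 3 / 6      # sampled patients over 25
q = 1 - p

se_p = math.sqrt((1 - n / N) * p * q / (n - 1))  # assumed form

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)          # about 1.96

margin = z * se_p
lower, upper = p - margin, p + margin

print(round(margin, 2), round(lower, 2), round(upper, 2))
```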

We are 95% confident that the true proportion of patients over 25 is between .13 and .87.  We know this is correct because we already have the proportion of patients over 25 (55%) for the entire group, but a confidence interval this large is not very helpful.  It would not be easy to plan targeting when you only know that between 13 and 87 percent of HIV+ people are over 25 years old.  The statistical theories used to generate a useful confidence interval are covered in the sample size page of chapter 2.  For now, take a look at these rules.

 

  • As the sample size (n) increases, the size of the confidence interval decreases.

  • As the standard error (se) increases, the size of the confidence interval increases.

  • As tolerable error (α) increases, the size of the confidence interval decreases.

Which of these would be the easiest to change? 

Which would be the best to change?
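The rules above can be explored with a small Python sketch.  The helper `ci_width` is a hypothetical function written for this illustration, and the variance is held fixed at 13.3 as a simplifying assumption; in practice a new sample would give a new variance.

```python
import math
from statistics import NormalDist


def ci_width(n, N, variance, alpha):
    """Full width of the interval for a mean: 2 * z * standard error."""
    f = n / N                                # sampling fraction
    se = math.sqrt((1 - f) * variance / n)   # assumed form, with 1 - f
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided z score
    return 2 * z * se


N = 20
variance = 13.3  # held fixed (assumption for this illustration)

# Rule 1: a larger sample gives a narrower interval.
for n in (5, 10, 15):
    print("n =", n, "width =", round(ci_width(n, N, variance, 0.05), 2))

# Rule 3: a larger alpha (more tolerable error) gives a narrower interval.
for alpha in (0.01, 0.05, 0.10):
    print("alpha =", alpha, "width =", round(ci_width(5, N, variance, alpha), 2))
```

Rule 2 follows directly from the helper: the width is 2 times z times the standard error, so a larger standard error always means a wider interval.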

