Confidence Intervals
A confidence interval is a range, calculated from a sample, that is used to estimate a characteristic of the survey population.
If you're not familiar with biostatistics notation, please open this page while looking at the example. Suppose we have a group of 20 HIV+ patients. We want to know what the average age is, but we don't have time to ask every person in the group. We randomly select 5 patients (highlighted in blue in the original table; their ages are 21, 25, 24, 31, and 26). This is probability sampling: every patient had a 5/20 = 25% chance of being selected.
Ages of Patients
| 21 | 32 | 26 | 24 | 30 |
| 22 | 25 | 24 | 31 | 27 |
| 32 | 29 | 20 | 22 | 26 |
| 27 | 28 | 24 | 30 | 23 |
What is n? What is N? (Answer: n = 5, the sample size; N = 20, the size of the whole group.)
Now let's go back to the original question. What is the average age of the group? In order to get an exact number, we would have to add together the age of every patient and divide by N. In this case it is possible, but in many other instances it would not be feasible. Imagine trying to figure out the percentage of underweight children in a country and only having two weeks to do it. Instead of getting an exact number, we are going to use our sample of n patients to construct a range of plausible values for the mean. This is a confidence interval. To calculate the interval we must first know the sample mean, variance, and standard error.
Sample Mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The above formula shows that to calculate the sample mean, $\bar{x}$, we add the sample ages and divide by n.
The patients we selected (those in blue): 21 + 25 + 24 + 31 + 26 = 127
127/n = 127/5 = 25.4
$\bar{x}$ = 25.4
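The same arithmetic can be sketched in a few lines of Python. This is only an illustration; the names `sample_ages` and `sample_mean` are made up for this example and do not come from the original page.

```python
# Ages of the five sampled patients, taken from the example above.
sample_ages = [21, 25, 24, 31, 26]


def sample_mean(values):
    """Return the sum of the values divided by how many there are (n)."""
    return sum(values) / len(values)


mean_age = sample_mean(sample_ages)
print(mean_age)  # 25.4
```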
Variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
The formula for variance ($s^2$) incorporates the sample mean $\bar{x}$, which must be known before calculating variance. We have $\bar{x}$ = 25.4.
Now we must subtract this from each patient's age and square the result.
| Patient | Age ($x_i$) | Mean ($\bar{x}$) | Squared difference $(x_i - \bar{x})^2$ |
| 1 | 21 | 25.4 | 19.36 |
| 2 | 25 | 25.4 | 0.16 |
| 3 | 24 | 25.4 | 1.96 |
| 4 | 31 | 25.4 | 31.36 |
| 5 | 26 | 25.4 | 0.36 |
| SUM | 127 | | 53.20 |
Note that because we are squaring, the result is never negative. The next step is to take the sum and divide by n - 1.
53.20/(n - 1) = 53.20/4 = 13.3
$s^2$ = 13.3
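As a rough check on the table, the squared differences and the variance can be recomputed in Python. This is a minimal sketch: the function name `sample_variance` is an assumption for this example, and the standard library's `statistics.variance` (which also divides by n - 1) is used only as a cross-check.

```python
import statistics

# Ages of the five sampled patients from the example.
sample_ages = [21, 25, 24, 31, 26]


def sample_variance(values):
    """Sum of squared differences from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    squared_diffs = [(v - mean) ** 2 for v in values]
    return sum(squared_diffs) / (len(values) - 1)


s2 = sample_variance(sample_ages)
print(round(s2, 2))  # 13.3

# The standard library uses the same n - 1 denominator.
print(round(statistics.variance(sample_ages), 2))
```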
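Standard Error (mean):
$$se(\bar{x}) = \sqrt{(1 - f)\,\frac{s^2}{n}}$$
Now that we know the variance of our sample we can use it to calculate the standard error.
$$f = \frac{n}{N}$$
For our sample this is 5/20 = .25 (f is the fraction of the population that was sampled; the correction 1 - f is only important if n is more than 5-10% of N. For large populations it can be left out of the formula.)
$$se(\bar{x}) = \sqrt{(1 - .25)\,\frac{13.3}{5}} = \sqrt{1.995}$$
$$se(\bar{x}) = 1.412$$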
We have our standard error and can now move on to constructing a confidence interval.
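To see how much the correction factor matters here, the standard error can be computed with and without it. The sketch below assumes a helper named `standard_error_mean` (not part of the original page); passing `N=None` simply drops the 1 - f term, as one might for a very large population.

```python
import math


def standard_error_mean(variance, n, N=None):
    """Standard error of a sample mean.

    If the population size N is given, apply the correction 1 - n/N;
    otherwise leave it out.
    """
    correction = 1.0 if N is None else 1.0 - n / N
    return math.sqrt(correction * variance / n)


se = standard_error_mean(13.3, n=5, N=20)
print(round(se, 3))  # 1.412

# Without the correction the standard error is noticeably larger,
# because our sample is 25% of the group.
print(round(standard_error_mean(13.3, n=5), 3))  # 1.631
```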
Confidence Interval (mean)
$$\bar{x} \pm z \cdot se(\bar{x})$$
This formula has three components: the mean ($\bar{x}$), a z score ($z$), and the standard error ($se(\bar{x})$). We already have a mean of 25.4 and standard error of 1.412. That leaves us with the z score. For now we'll look at a z score as the amount of risk we're willing to take (it is covered in depth here). If we want to be 95% sure about our confidence interval, we set alpha at 5% or .05. This will give us a z score of 1.96. We can now fill in the formula.
25.4 ± (1.96)(1.412)
(1.96)(1.412) = 2.77, or about 2.8. Adding and subtracting 2.8 from 25.4 gives us our range.
= (22.6, 28.2)
We are 95% confident that the average age of the group is between 22.6 and 28.2.
What is the exact average of the group (add together all of the patients and divide by N)?
Does our confidence interval contain the true population mean?
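One way to check the two questions above is to run the whole calculation in Python and compare the interval with the exact group mean. This is a sketch under stated assumptions: the function name `mean_confidence_interval` is invented for this example, and the z score is taken from the standard library's `statistics.NormalDist` (Python 3.8+) rather than a printed table, so it is 1.95996... instead of the rounded 1.96.

```python
import math
from statistics import NormalDist

# All 20 ages from the table, row by row, and the 5 sampled ages.
population = [21, 32, 26, 24, 30,
              22, 25, 24, 31, 27,
              32, 29, 20, 22, 26,
              27, 28, 24, 30, 23]
sample = [21, 25, 24, 31, 26]


def mean_confidence_interval(sample, N, confidence=0.95):
    """Return (low, high) for the mean, using the formulas in the text."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    se = math.sqrt((1 - n / N) * variance / n)
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    return mean - z * se, mean + z * se


low, high = mean_confidence_interval(sample, N=len(population))
true_mean = sum(population) / len(population)

print(round(low, 1), round(high, 1))  # 22.6 28.2
print(true_mean, low <= true_mean <= high)  # 26.15 True
```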
Proportions
Ages of Patients
| 21 | 32 | 26 | 24 | 30 |
| 22 | 25 | 24 | 31 | 27 |
| 32 | 29 | 20 | 22 | 26 |
| 27 | 28 | 24 | 30 | 23 |
Starting with a different question, the formulas for confidence intervals change slightly. What percentage of the group is over 25 years of age? The answer to this question is a proportion, not a mean. In this example it is possible to answer the question without taking a sample. 11 out of the 20 patients, or 55%, are over 25. Again, imagine you are searching for the answer to a much larger question, one that encompasses an entire nation. Let's select our sample and build a confidence interval.
This time we have selected 6 patients (21, 25, 28, 24, 31, 26). What is n/N? 3 of the 6, or 50%, are over 25 years old. Now it's time to calculate the standard error.
Standard Error (proportion)
$$se(p) = \sqrt{(1 - f)\,\frac{pq}{n - 1}}$$
Compare this formula to the previous one.
$$se(\bar{x}) = \sqrt{(1 - f)\,\frac{s^2}{n}}$$
In our proportion formula we have two new letters, p and q. p is the proportion we're looking at, in this case patients over the age of 25. In our sample p equals 50%, or .50. q is the proportion that doesn't fit our criteria, that is, patients 25 years old or younger. This will always be 1 - p. In this example it is .50.
Another difference in the formula is that we are subtracting 1 from n. In the previous example we did the same when calculating the variance, so it was already done by the time we reached our standard error. Subtracting 1 is a correction factor that makes very little difference when dealing with large samples. In this case it is important because we only have 6 people. Let's calculate the standard error. Remember $f = n/N$.
$$f = \frac{6}{20} = .30$$
$$se(p) = \sqrt{(1 - .30)\,\frac{(.50)(.50)}{6 - 1}}$$
$$se(p) = \sqrt{.035}$$
$$se(p) = .187$$
Confidence Interval (proportion)
$$p \pm z \cdot se(p)$$
Again we will set alpha at 5 %, giving us a z score of 1.96. All that's left is to plug in p and the standard error that we just calculated.
.5±(1.96)(.187)
.5±.37
(.13, .87)
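The proportion calculation above can be reproduced step by step in Python. This is a minimal sketch; the function name `proportion_confidence_interval` and its `cutoff` parameter are assumptions made for this example, and z is fixed at the rounded value 1.96 used in the text.

```python
import math

# The six sampled ages from the example and the size of the whole group.
sample = [21, 25, 28, 24, 31, 26]
N = 20


def proportion_confidence_interval(sample, N, cutoff, z=1.96):
    """Return (p, se, low, high) for the proportion of values above cutoff."""
    n = len(sample)
    p = sum(1 for age in sample if age > cutoff) / n
    q = 1 - p
    f = n / N  # sampling fraction
    se = math.sqrt((1 - f) * p * q / (n - 1))
    return p, se, p - z * se, p + z * se


p, se, low, high = proportion_confidence_interval(sample, N, cutoff=25)
print(p)  # 0.5
print(round(se, 3))  # 0.187
print(round(low, 2), round(high, 2))  # 0.13 0.87
```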
We are 95% confident that the true proportion of patients over 25 is between .13 and .87. We know the interval contains the true value because we already have the proportion of patients over 25 (55%) for the entire group, but a confidence interval this wide is not very helpful. It would not be easy to plan targeting when you only know that between 13 and 87 percent of HIV+ people are over 25 years old. The statistical theories used to generate a useful confidence interval are covered in the sample size page of chapter 2. For now, take a look at these rules.
As the sample size (n) increases, the size of the confidence interval decreases.
As the standard error (se) increases, the size of the confidence interval increases.
As tolerable error (α) increases, the confidence interval decreases.
Which of these would be the easiest to change?
Which would be the best to change?
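The rules above can be explored numerically. The sketch below reuses the proportion formulas from this page with hypothetical what-if values: p is held at .50 and N at 20, while n and alpha are varied. The function name `proportion_ci_width` is an assumption for this example, and the z score comes from `statistics.NormalDist` (Python 3.8+). Since the width is 2 × z × se, a larger standard error widens the interval directly.

```python
import math
from statistics import NormalDist


def proportion_ci_width(p, n, N, alpha=0.05):
    """Full width (high minus low) of the confidence interval for p."""
    se = math.sqrt((1 - n / N) * p * (1 - p) / (n - 1))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * z * se


# Rule 1: a larger sample gives a narrower interval (hypothetical n values).
widths_by_n = {n: proportion_ci_width(0.5, n, N=20) for n in (6, 10, 15)}
for n, width in widths_by_n.items():
    print("n =", n, "width =", round(width, 2))

# Rule 3: a larger alpha (more tolerable error) gives a narrower interval.
widths_by_alpha = {a: proportion_ci_width(0.5, 6, N=20, alpha=a)
                   for a in (0.01, 0.05, 0.10)}
for a, width in widths_by_alpha.items():
    print("alpha =", a, "width =", round(width, 2))
```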