Descriptive Statistics

Frequency Distribution

     After collecting data, we must organize it.  The first step
in this process is the creation of a frequency distribution.  For
example, if we collected data from 100 adults between the ages of
21 and 70, we would count the number of males and the number of
females and generate the following table.


                     Gender   Frequency   Percent
                     Male        40         40.0
                     Female      60         60.0
                     Total      100        100.0


We generate similar tables for other variables.


Level of Education                     Frequency     Percent
Less than High School                      2            2.0
Some High School                          10           10.0
High School Graduate                      25           25.0
Some College or Technical School          26           26.0
College Graduate                          23           23.0
Post Secondary Training                   14           14.0
Total                                    100          100.0

                      Age    Frequency   Percent
                      21-30     16        16.0
                      31-40     19        19.0
                      41-50     27        27.0
                      51-60     24        24.0
                      61-70     14        14.0
                      Total    100       100.0

     Gender is a binary categorical variable.  Level of education
is a categorical variable with order (an ordinal variable), and
age is a continuous variable.  Note that ages are grouped into
ten-year intervals.  We do this to limit the number of categories
in the table.
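
     The counting behind these tables can be sketched in a few
lines of Python.  This is only a minimal sketch: the raw list of
responses is hypothetical (built to reproduce the gender table),
and the helper names frequency_table and age_group are assumptions
made for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, frequency, percent) rows in first-seen order."""
    counts = Counter(values)
    n = len(values)
    return [(category, f, 100.0 * f / n) for category, f in counts.items()]

def age_group(age, start=21, width=10):
    """Map an age to its ten-year group label, e.g. 34 -> '31-40'."""
    lower = start + ((age - start) // width) * width
    return f"{lower}-{lower + width - 1}"

# Hypothetical raw responses that reproduce the gender table above.
genders = ["Male"] * 40 + ["Female"] * 60
gender_rows = frequency_table(genders)

for category, f, percent in gender_rows:
    print(f"{category:<8}{f:>6}{percent:>8.1f}")
```

Grouping ages first with age_group and then passing the labels to
frequency_table would give a table in the same form as the age
table.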

Measures of Central Tendency

     Frequency distributions are very useful in describing data.
We would, however, like a single number that describes the data.
The best single number is one that is near the middle of the
distribution.  It is called a measure of central tendency.  There
are three common measures of central tendency:  the mean, median
and mode.  The mean is the arithmetic average, the sum of all the
values divided by the number of values.  The median is the middle
score, and the mode is the most frequent score.   Use the mode
when the data is categorical without order.  Use the median when
the data is categorical with order.  Usually use the mean when
the data is continuous.


1.  Mode

     The mode is the most frequent score.  It is the only
appropriate measure of central tendency for categorical data
without order.  The mode of gender in the table shown above is
Female because there are more females than males.  The mode of
education is Some College or Technical School, the category with
the highest frequency (26).  Note that we do not need to assign
numerical values to the categories of a variable in order to
compute the mode.  That is why it is appropriate for variables
without order.
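
     As a minimal sketch, the mode can be read directly off a
frequency table.  The frequencies below are copied from the tables
above; the function name mode is an assumption for illustration.
If two categories tie, this sketch returns only the first one.

```python
# Frequencies copied from the gender and education tables.
gender = {"Male": 40, "Female": 60}
education = {
    "Less than High School": 2,
    "Some High School": 10,
    "High School Graduate": 25,
    "Some College or Technical School": 26,
    "College Graduate": 23,
    "Post Secondary Training": 14,
}

def mode(frequencies):
    """Return the category with the largest frequency."""
    return max(frequencies, key=frequencies.get)

print(mode(gender))
print(mode(education))
```

Note that the categories are only compared by their counts, never
by their labels, so no ordering of the categories is needed.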


2.  Median

     The median is the middle value of a distribution once the
values have been ranked.  The median of the numbers 1,2,3,4,5
is three.  If the sample contains an even number of observations,
the median is the average of the middle two numbers.  The median
of the numbers 1,2,3,4,5,6 is (3+4)/2 = 3.5.
     We cannot compute a median for gender because gender is a
categorical variable without order.
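
     The rule for odd and even sample sizes can be sketched as
follows.  The function name median is chosen for illustration;
Python's standard statistics module offers an equivalent function
for numeric data.

```python
def median(values):
    """Middle value of the ranked data; average the two middle
    values when the number of observations is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 2, 3, 4, 5]))     # odd n: the single middle value
print(median([1, 2, 3, 4, 5, 6]))  # even n: (3 + 4) / 2
```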
     We can compute a median from a frequency distribution.  Use
the education data as an example.  Add a column containing the
cumulative percents.  High School Graduate and below accounts for
37% of the distribution.  Some College or Technical School and
below accounts for 63% of the distribution.  The middle score,
which occurs at 50%, is in this category.  Thus, the median is
Some College or Technical School.  Note that we did not need to
assign numeric values to the categories, but we did need to put
the categories in order.

                                              Cumulative
Level of Education    Frequency     Percent      Percent
Less than High School      2            2.0        2.0
Some High School          10           10.0       12.0
High School Graduate      25           25.0       37.0
Some College or
   Technical School       26           26.0       63.0
College Graduate          23           23.0       86.0
Post Secondary Training   14           14.0      100.0
Total                    100          100.0
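
     The cumulative percent method can be sketched as below.  The
function names are assumptions for illustration, and the sketch
makes one simplifying choice: if the cumulative percent reaches
exactly 50% at the end of a category, it reports that category
rather than a value between two categories.

```python
from itertools import accumulate

# Categories must be listed in their natural order.
education = [
    ("Less than High School", 2),
    ("Some High School", 10),
    ("High School Graduate", 25),
    ("Some College or Technical School", 26),
    ("College Graduate", 23),
    ("Post Secondary Training", 14),
]

def cumulative_percents(ordered_freqs):
    """Running total of the frequencies, expressed as percents."""
    total = sum(f for _, f in ordered_freqs)
    running = accumulate(f for _, f in ordered_freqs)
    return [100.0 * c / total for c in running]

def median_category(ordered_freqs):
    """First category whose cumulative frequency reaches half of n."""
    total = sum(f for _, f in ordered_freqs)
    cumulative = 0
    for category, f in ordered_freqs:
        cumulative += f
        if 2 * cumulative >= total:
            return category
    return None  # reached only when the list is empty

print(cumulative_percents(education))
print(median_category(education))
```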

3.  Mean

     The common symbol for the mean computed from a sample is x
with a bar over it.  In this document we call it xbar.
The mean of a population is denoted by the Greek letter mu.  In
this document we call it mu.  The number of observations in a
sample is denoted n, and the number of observations in a
population is labelled N.  The mean is the sum of the
observations in the sample or population divided by the number of
observations.  It is the arithmetic average.  For example, the
mean of the integers one through five is (1+2+3+4+5)/5 = 15/5 =
3.0.
     We can compute a mean from a frequency distribution.
Suppose 100 people have rated their satisfaction with their job
on a scale from one to five with five indicating greatest
satisfaction.  The data is shown below.  We could list all one
hundred values.  The list would include ten ones, thirteen twos,
and so on.  An easier approach is to compute a weighted sum.
Multiply each score by the frequency with which it occurs.  Sum
these values, and divide by n.  The mean equals 330/100 = 3.3.

Satisfaction  Score   Frequency   Score x f
Very low        1        10        1 x 10 =  10
Somewhat low    2        13        2 x 13 =  26
Average         3        31        3 x 31 =  93
Somewhat high   4        29        4 x 29 = 116
Very high       5        17        5 x 17 =  85
Total                   100                 330
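
     The weighted sum in the table can be checked with a short
sketch.  The frequencies are copied from the table; the function
name grouped_mean is an assumption for illustration.

```python
# Score -> frequency, copied from the satisfaction table.
satisfaction = {1: 10, 2: 13, 3: 31, 4: 29, 5: 17}

def grouped_mean(frequencies):
    """Mean from a frequency table: sum of score x f, divided by n."""
    n = sum(frequencies.values())
    weighted_sum = sum(score * f for score, f in frequencies.items())
    return weighted_sum / n

weighted_sum = sum(score * f for score, f in satisfaction.items())
n = sum(satisfaction.values())
xbar = grouped_mean(satisfaction)
print(weighted_sum, n, xbar)
```

Listing all one hundred values and averaging them gives the same
answer; the weighted sum only saves the work of writing them out.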


     Usually use the mean for continuous data.  There is one
exception to this rule, however.  If there are a few extremely
high or extremely low values in the data, they will pull the mean
toward them.  Such data is called skewed.  Annual family income
is usually positively skewed.  There are many people with
relatively low incomes and a small number of people with
extremely high incomes.  These few high values will make mean
family income too high to indicate where the middle of the
distribution is.  The median is a better indicator of central
tendency in this case.  A small example of positively skewed data
is 1,3,4,5,6,20.  The median is 4.5, but the mean is 39/6 = 6.5.
When continuous data is symmetric and has a single peak, the
mean, median and mode are all equal.
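
     The pull of a single extreme value can be seen in a short
sketch using Python's standard statistics module.  The second
list, with the value 20 replaced by 7, is a hypothetical
comparison added for illustration.

```python
import statistics

skewed = [1, 3, 4, 5, 6, 20]
skewed_mean = statistics.mean(skewed)      # 39 / 6
skewed_median = statistics.median(skewed)  # (4 + 5) / 2

# Hypothetical comparison: replace the extreme value 20 with 7.
milder = [1, 3, 4, 5, 6, 7]
milder_mean = statistics.mean(milder)
milder_median = statistics.median(milder)

print(skewed_mean, skewed_median)
print(milder_mean, milder_median)
```

Changing the largest value moves the mean but leaves the median
at 4.5, which is why the median is preferred for skewed data.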

Measures of Variability

     In addition to a single number indicating where the center
of a distribution lies, we like to describe the variability of a
distribution with a single number.  There are three common
measures of variability that are appropriate for continuous data.
These are the range, the variance and the standard deviation.

1.  Range

     The range is easy to compute.  It is the largest value minus
the smallest value.  The range for the satisfaction data is 5 - 1
= 4.
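
     As a minimal sketch, the range needs only the largest and
smallest observed values.  The function is named value_range here
(an assumed name) to avoid clashing with Python's built-in range.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

# The satisfaction scores that actually occur in the data.
satisfaction_scores = [1, 2, 3, 4, 5]
print(value_range(satisfaction_scores))
```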

2.  Variance

     The variance is the sum of the squared deviations of each
value from the mean, divided by N or by n-1, where N is the size
of a population, and n is the size of a sample.  The variance of
a sample is labelled s squared, and the variance of a population
is denoted by sigma squared.
     The mean of the first five integers is three.  Calculation
of their variance is shown below.

      Score  Score - Mean     (Score - Mean) Squared
       1       1-3 = -2             4
       2       2-3 = -1             1
       3       3-3 =  0             0
       4       4-3 =  1             1
       5       5-3 =  2             4
    Total             0            10

Assume this is a sample.  The variance equals 10/4 = 2.5.
     Note that the sum of the deviations from the mean is zero.
This will always be the case.
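
     The steps in the table can be sketched in Python.  The
variable names are assumptions for illustration.  The sketch
computes both divisors so that the sample (n-1) and population (N)
versions can be compared.

```python
scores = [1, 2, 3, 4, 5]
n = len(scores)
mean = sum(scores) / n

# Deviation of each score from the mean, then its square.
deviations = [x - mean for x in scores]
squared = [d ** 2 for d in deviations]

sum_of_deviations = sum(deviations)   # always zero
sum_of_squares = sum(squared)

sample_variance = sum_of_squares / (n - 1)   # treat data as a sample
population_variance = sum_of_squares / n     # treat data as a population

print(sum_of_deviations, sum_of_squares)
print(sample_variance, population_variance)
```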
     We can also compute a variance for grouped data.  Use the
satisfaction data, whose mean is 3.3.  Each deviation is the
score minus 3.3.

Score   Frequency   Deviation   Dev Squared   Dev sq x f
  1        10         -2.3          5.29         52.90
  2        13         -1.3          1.69         21.97
  3        31         -0.3          0.09          2.79
  4        29          0.7          0.49         14.21
  5        17          1.7          2.89         49.13
Total     100                                   141.00

The variance equals 141.00/99 = 1.42.
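
     The grouped calculation can be checked with a short sketch.
The frequencies are the satisfaction data; the function name
grouped_sample_variance is an assumption for illustration.  With
a mean of 3.3, the deviations for the scores one through five are
-2.3, -1.3, -0.3, 0.7 and 1.7.

```python
# Score -> frequency for the satisfaction data.
satisfaction = {1: 10, 2: 13, 3: 31, 4: 29, 5: 17}

def grouped_sample_variance(frequencies):
    """Sum of f x (score - mean) squared, divided by n - 1."""
    n = sum(frequencies.values())
    mean = sum(score * f for score, f in frequencies.items()) / n
    sum_sq = sum(f * (score - mean) ** 2 for score, f in frequencies.items())
    return sum_sq / (n - 1)

variance = grouped_sample_variance(satisfaction)
print(round(variance, 2))
```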


3.  Standard deviation

     One limitation of the variance as a measure of variability
is that it is not in the units of the thing measured but in their
units squared.  If the variable is feet, the variance is
expressed in terms of feet squared.  We can return the measure to
the original units by taking its square root.  The square root of
the variance is called the standard deviation.  The standard
deviation of the numbers 1,2,3,4,5 is sqrt(2.5) = 1.58.  The
satisfaction data has a sample variance of 141/99 = 1.42, so its
standard deviation is sqrt(1.42) = 1.19.  The sample standard
deviation is labelled s, and the population standard deviation is
labelled sigma.
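
     As a final sketch, the standard deviation is the square root
of the variance.  The code uses Python's standard math and
statistics modules; expanding the satisfaction table into a list
of one hundred scores is a choice made here for illustration.

```python
import math
import statistics

scores = [1, 2, 3, 4, 5]
s = math.sqrt(statistics.variance(scores))        # sample: divides by n-1
sigma = math.sqrt(statistics.pvariance(scores))   # population: divides by N

# Expand the satisfaction frequency table into one hundred raw scores.
satisfaction = {1: 10, 2: 13, 3: 31, 4: 29, 5: 17}
raw = [score for score, f in satisfaction.items() for _ in range(f)]
s_satisfaction = statistics.stdev(raw)

print(round(s, 2), round(sigma, 2), round(s_satisfaction, 2))
```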


© J.Rice