Descriptive Statistics

Frequency Distribution

After collecting data, we must organize it. The first step in this process is the creation of a frequency distribution. For example, if we collected data from 100 adults between the ages of 21 and 70 we would count the number of males and the number of females and generate the following table.

Gender     Frequency   Percent
Male            40        40.0
Female          60        60.0
Total          100       100.0

We generate similar tables for other variables.

Level of Education                  Frequency   Percent
Less than High School                    2         2.0
Some High School                        10        10.0
High School Graduate                    25        25.0
Some College or Technical School        26        26.0
College Graduate                        23        23.0
Post Secondary Training                 14        14.0
Total                                  100       100.0

Age        Frequency   Percent
21-30           16        16.0
31-40           19        19.0
41-50           27        27.0
51-60           24        24.0
61-70           14        14.0
Total          100       100.0

Gender is a binary categorical variable. Level of education is a categorical variable with order, and age is a continuous variable. Note that ages are grouped in ten-year clumps. We do this to limit the number of categories in the table.

Measures of Central Tendency

Frequency distributions are very useful in describing data. We would, however, like a single number that describes the data. The best single number is one that is near the middle of the distribution. It is called a measure of central tendency. There are three common measures of central tendency: the mean, median and mode. The mean is the arithmetic average, the sum of all the values divided by the number of values. The median is the middle score, and the mode is the most frequent score. Use the mode when the data is categorical without order. Use the median when the data is categorical with order. Usually use the mean when the data is continuous.

1. Mode

The mode is the most frequent score. It is the only appropriate measure of central tendency for categorical data without order. The mode of gender in the table shown above is Female because there are more females than males.
The mode of education is Some College or Technical School. Note that we do not need to assign numerical values to the categories of a variable in order to compute the mode. That is why it is appropriate for variables without order.

2. Median

The median is the middle value of a distribution once the values have been ranked. The median of the numbers 1,2,3,4,5 is three. If the sample contains an even number of observations, the median is the average of the middle two numbers. The median of the numbers 1,2,3,4,5,6 is (3+4)/2 = 3.5. We cannot compute a median for gender because gender is a categorical variable without order.

We can compute a median from a frequency distribution. Use the education data as an example. Add a column containing the cumulative percents. High School Graduate and below accounts for 37% of the distribution. Some College or Technical School and below accounts for 63% of the distribution. The middle score, which occurs at 50%, is in this category. Thus, the median is Some College or Technical School. Note that we did not need to assign numeric values to the categories, but we did need to put the categories in order.

Level of Education                  Frequency   Percent   Cumulative Percent
Less than High School                    2         2.0            2.0
Some High School                        10        10.0           12.0
High School Graduate                    25        25.0           37.0
Some College or Technical School        26        26.0           63.0
College Graduate                        23        23.0           86.0
Post Secondary Training                 14        14.0          100.0
Total                                  100       100.0

3. Mean

The common symbol for the mean computed from a sample is x with a bar over it. In this document we call it xbar. The mean of a population is denoted by the Greek letter mu. In this document we call it mu. The number of observations in a sample is denoted n, and the number of observations in a population is labelled N. The mean is the sum of the observations in the sample or population divided by the number of observations. It is the arithmetic average. For example, the mean of the integers one through five is (1+2+3+4+5)/5 = 15/5 = 3.0.
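As a rough illustration, the three measures described so far can be computed in Python. This is a minimal sketch using only the standard library, not a definitive implementation. The names education, mode_of and median_category are invented for this example and do not come from the original material.

```python
from statistics import median

# Education frequency table from above, listed in rank order.
education = [
    ("Less than High School", 2),
    ("Some High School", 10),
    ("High School Graduate", 25),
    ("Some College or Technical School", 26),
    ("College Graduate", 23),
    ("Post Secondary Training", 14),
]

def mode_of(table):
    """Return the category with the largest frequency (no ordering needed)."""
    return max(table, key=lambda row: row[1])[0]

def median_category(table):
    """Return the first category whose cumulative percent reaches 50.

    Assumes the rows are already in rank order.
    """
    total = sum(freq for _, freq in table)
    cumulative = 0
    for category, freq in table:
        cumulative += freq
        if 100 * cumulative / total >= 50:
            return category

values = [1, 2, 3, 4, 5]
mean_value = sum(values) / len(values)  # sum of observations divided by n

print(mode_of(education))          # Some College or Technical School
print(median_category(education))  # Some College or Technical School
print(median([1, 2, 3, 4, 5, 6]))  # 3.5
print(mean_value)                  # 3.0
```

Note that mode_of never looks at the order of the rows, while median_category depends on it, which mirrors the distinction between unordered and ordered categorical data made above.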
We can compute a mean from a frequency distribution. Suppose 100 people have rated their satisfaction with their job on a scale from one to five with five indicating greatest satisfaction. The data is shown below. We could list all one hundred values. The list would include ten ones, thirteen twos, and so on. An easier approach is to compute a weighted sum. Multiply each score by the frequency with which it occurs. Sum these values, and divide by n. The mean equals 330/100 = 3.3.

Satisfaction     Score   Frequency   Score x f
Very low           1        10       1 x 10 = 10
Somewhat low       2        13       2 x 13 = 26
Average            3        31       3 x 31 = 93
Somewhat high      4        29       4 x 29 = 116
Very high          5        17       5 x 17 = 85
Total                      100       330

Usually use the mean for continuous data. There is one exception to this rule, however. If there are a few extremely high or extremely low values in the data, they will pull the mean toward them. This is skewed data. Annual family income is usually positively skewed. There are many people with relatively low incomes and a small number of people with extremely high incomes. These few high values will make mean family income too high to indicate where the middle of the distribution is. The median is a better indicator of central tendency in this case. A small example of positively skewed data is 1,3,4,5,6,20. The median is 4.5, but the mean is 39/6 = 6.5. When continuous data is symmetric with a single peak, the mean, median and mode are all equal.

Measures of Variability

In addition to a single number indicating where the center of a distribution lies, we like to describe the variability of a distribution with a single number. There are three common measures of variability that are appropriate for continuous data. These are the range, the variance and the standard deviation.

1. Range

The range is easy to compute. It is the largest value minus the smallest value. The range for the satisfaction data is 5 - 1 = 4.

2. Variance

The variance is the sum of the squared deviations of each value from the mean, divided by N or by n-1, where N is the size of a population and n is the size of a sample. The variance of a sample is labelled s squared, and the variance of a population is denoted by sigma squared. The mean of the first five integers is three. Calculation of their variance is shown below.

Score    Score - Mean    (Score - Mean) Squared
1        1-3 = -2                 4
2        2-3 = -1                 1
3        3-3 =  0                 0
4        4-3 =  1                 1
5        5-3 =  2                 4
Total           0                10

Assume this is a sample. The variance equals 10/4 = 2.5. Note that the sum of the deviations from the mean is zero. This will always be the case.

We can also compute a variance for grouped data. Use the satisfaction data, whose mean is 3.3.

Score    Frequency    Deviation    Dev Squared    Dev sq x f
1           10          -2.3          5.29           52.90
2           13          -1.3          1.69           21.97
3           31          -0.3          0.09            2.79
4           29           0.7          0.49           14.21
5           17           1.7          2.89           49.13
Total      100                                      141.00

The variance equals 141.00/99 = 1.42.

3. Standard deviation

One limitation of the variance as a measure of variability is that it is not in the units of the thing measured but in their units squared. If the variable is measured in feet, the variance is expressed in terms of feet squared. We can return the measure to the original units by taking its square root. The square root of the variance is called the standard deviation. The standard deviation of the numbers 1,2,3,4,5 is sqrt(2.5) = 1.58. The standard deviation of the satisfaction data is sqrt(1.42) = 1.19. The sample standard deviation is labelled s, and the population standard deviation is labelled sigma.
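The variance and standard deviation calculations can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation. The function name grouped_variance and its sample flag are hypothetical names introduced for this example. The function divides by n-1 when the data is treated as a sample and by N when it is treated as a population.

```python
import math

def grouped_variance(table, sample=True):
    """Variance of grouped data given as (score, frequency) pairs.

    Divides by n-1 for a sample, or by N for a population.
    """
    n = sum(f for _, f in table)
    mean = sum(x * f for x, f in table) / n
    # Sum of squared deviations from the mean, weighted by frequency.
    ss = sum(f * (x - mean) ** 2 for x, f in table)
    return ss / (n - 1 if sample else n)

# The integers one through five, each occurring once.
first_five = [(x, 1) for x in [1, 2, 3, 4, 5]]
var = grouped_variance(first_five)
print(var)                       # 2.5
print(round(math.sqrt(var), 2))  # 1.58

# Satisfaction scores and their frequencies from the table above.
satisfaction = [(1, 10), (2, 13), (3, 31), (4, 29), (5, 17)]
var_sat = grouped_variance(satisfaction)
sd_sat = math.sqrt(var_sat)
print(round(var_sat, 2), round(sd_sat, 2))
```

Treating ungrouped data as a table in which every frequency is one lets the same function handle both of the worked examples.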
© J.Rice