When we collect numerical data (such as test scores, heights, temperatures, or wait times), we often end up with long lists of numbers. While each number tells us something, it's hard to understand what the entire dataset is saying just by looking at a list. Summarizing quantitative data means using mathematical tools to describe the most important features of a dataset with just a few numbers. These summaries help us answer questions like: What's typical? How spread out are the values? Are there any unusual observations? In this chapter, you'll learn how to calculate and interpret measures that capture the center, spread, and shape of datasets, so you can make sense of numbers in a clear and organized way.
The center of a dataset tells us what a typical or representative value might be. There are three common ways to measure the center: the mean, the median, and the mode. Each has its own strengths and is useful in different situations.
The mean, often called the average, is found by adding all the values in a dataset and dividing by the number of values. The mean is the most commonly used measure of center because it takes every data point into account.
The formula for the mean is:
\[ \text{Mean} = \bar{x} = \frac{\sum x}{n} \]
Here, \( \bar{x} \) (read as "x-bar") represents the mean, \( \sum x \) means "the sum of all the data values," and \( n \) is the number of values in the dataset.
Example: A student records the number of hours she studied each day for one week:
3, 5, 2, 4, 6, 3, 5.
What is the mean number of hours studied per day?
Solution:
First, add all the values:
3 + 5 + 2 + 4 + 6 + 3 + 5 = 28
Count the number of days: \( n = 7 \)
Divide the sum by the number of values:
\( \bar{x} = \frac{28}{7} = 4 \)
The mean number of hours studied per day is 4 hours.
The mean is sensitive to outliers, which are extreme values that are much higher or lower than the rest of the data. A single outlier can pull the mean in its direction, making it less representative of the typical value.
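The calculation above can be sketched in a few lines of Python. The helper name `mean` and the extra 20-hour value are illustrative assumptions, not part of the original example; the extra value is there only to show how a single outlier can pull the mean upward.

```python
# Study hours from the worked example.
hours = [3, 5, 2, 4, 6, 3, 5]

def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)

x_bar = mean(hours)
print(x_bar)  # 4.0

# Hypothetical outlier (not in the original data): one 20-hour day.
hours_with_outlier = hours + [20]
print(mean(hours_with_outlier))  # 6.0
```

With one unusual day added, the mean moves from 4 to 6 hours, even though seven of the eight days had 6 hours or fewer.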
The median is the middle value when the data are arranged in order from least to greatest. If there is an odd number of values, the median is the exact middle value. If there is an even number of values, the median is the mean of the two middle values.
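To find the median, arrange the values in order from least to greatest and count them. If the number of values \( n \) is odd, the median is the value at position \( \frac{n+1}{2} \). If \( n \) is even, the median is the mean of the values at positions \( \frac{n}{2} \) and \( \frac{n}{2} + 1 \).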
Example: The ages of seven participants in a workshop are:
22, 25, 23, 29, 24, 22, 35.
Find the median age.
Solution:
First, arrange the ages in order:
22, 22, 23, 24, 25, 29, 35
Count the number of values: \( n = 7 \) (odd number)
The middle position is \( \frac{7+1}{2} = 4 \)
The fourth value in the ordered list is 24.
The median age is 24 years.
Example: The monthly rainfall amounts (in inches) for six months are:
2.5, 3.1, 2.8, 4.0, 3.5, 2.9.
What is the median rainfall?
Solution:
Arrange the values in order:
2.5, 2.8, 2.9, 3.1, 3.5, 4.0
Count the values: \( n = 6 \) (even number)
The two middle positions are 3 and 4.
The third value is 2.9 and the fourth value is 3.1.
Calculate the average of these two values:
\( \text{Median} = \frac{2.9 + 3.1}{2} = \frac{6.0}{2} = 3.0 \)
The median rainfall is 3.0 inches.
The median is resistant to outliers, meaning extreme values don't affect it much. This makes the median a better measure of center when data are skewed or contain outliers.
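Both median cases (odd and even counts) can be handled by one small function. The following is a minimal sketch; the function name `median` is an illustrative choice, and Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Return the middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

ages = [22, 25, 23, 29, 24, 22, 35]
rainfall = [2.5, 3.1, 2.8, 4.0, 3.5, 2.9]

print(median(ages))      # 24
print(median(rainfall))  # 3.0
```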
The mode is the value that appears most frequently in a dataset. A dataset can have one mode, more than one mode, or no mode at all.
Example: A teacher records the number of questions students asked during eight class sessions:
5, 7, 5, 8, 5, 6, 7, 9.
What is the mode?
Solution:
Count how many times each value appears:
5 appears 3 times
6 appears 1 time
7 appears 2 times
8 appears 1 time
9 appears 1 time
The value 5 appears most frequently.
The mode is 5 questions.
The mode is particularly useful for categorical data (like favorite colors or types of pets) and for understanding which value is most common in a dataset.
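Counting frequencies by hand can be mirrored with `collections.Counter` from Python's standard library. This sketch assumes one common convention: if no value repeats, the dataset is treated as having no mode. The function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of the most frequent values (empty if nothing repeats)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        # Assumed convention: every value appears once, so there is no mode.
        return []
    return sorted(v for v, c in counts.items() if c == highest)

questions = [5, 7, 5, 8, 5, 6, 7, 9]
print(modes(questions))      # [5]
print(modes([1, 1, 2, 2]))   # [1, 2]  (two modes)
print(modes([1, 2, 3]))      # []      (no mode)
```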
While measures of center tell us where data tend to cluster, measures of spread (also called measures of variability) tell us how much the data values differ from each other. Two datasets can have the same mean but very different spreads.
The range is the simplest measure of spread. It is the difference between the maximum value and the minimum value in the dataset:
\[ \text{Range} = \text{Maximum} - \text{Minimum} \]
Example: The daily high temperatures (in °F) for one week were:
68, 72, 75, 70, 73, 69, 74.
Find the range of temperatures.
Solution:
Identify the maximum temperature: 75°F
Identify the minimum temperature: 68°F
Calculate the range:
Range = 75 - 68 = 7
The range is 7°F.
The range is easy to calculate, but it only considers two values and can be greatly affected by outliers.
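A short sketch of the range calculation follows. The 95°F reading is a hypothetical value added only to illustrate how strongly one outlier can change the range; the function name `data_range` is an illustrative choice.

```python
def data_range(values):
    """Return the difference between the largest and smallest values."""
    return max(values) - min(values)

temps = [68, 72, 75, 70, 73, 69, 74]
print(data_range(temps))  # 7

# Hypothetical heat-wave reading (not in the original data).
temps_with_outlier = temps + [95]
print(data_range(temps_with_outlier))  # 27
```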
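The interquartile range, abbreviated as IQR, measures the spread of the middle 50% of the data. It is calculated by dividing the ordered dataset into four equal parts using quartiles: the first quartile \( Q_1 \) is the median of the lower half of the data, the second quartile \( Q_2 \) is the median of the whole dataset, and the third quartile \( Q_3 \) is the median of the upper half.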
The interquartile range is:
\[ \text{IQR} = Q_3 - Q_1 \]
Example: The test scores for nine students are:
65, 70, 72, 75, 78, 80, 82, 85, 90.
Find the interquartile range.
Solution:
The data are already in order. Find the median (\( Q_2 \)):
The middle value (5th position) is 78.
Find \( Q_1 \), the median of the lower half (65, 70, 72, 75):
\( Q_1 = \frac{70 + 72}{2} = 71 \)
Find \( Q_3 \), the median of the upper half (80, 82, 85, 90):
\( Q_3 = \frac{82 + 85}{2} = 83.5 \)
Calculate the IQR:
IQR = 83.5 - 71 = 12.5
The interquartile range is 12.5 points.
The IQR is resistant to outliers because it focuses only on the middle portion of the data. It's especially useful when data are skewed.
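The worked example uses the "median of each half" method, leaving the overall median out of both halves when the count is odd. The sketch below follows that same convention; the helper names are illustrative. Software packages often use other quartile conventions, so their results may differ slightly from hand calculations.

```python
def _median_sorted(ordered):
    """Median of an already-sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90]
q1, q2, q3 = quartiles(scores)
iqr = q3 - q1
print(q1, q2, q3)  # 71.0 78 83.5
print(iqr)         # 12.5
```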
The standard deviation measures how far data values typically are from the mean. A small standard deviation means data are clustered closely around the mean, while a large standard deviation means data are more spread out.
The variance is approximately the average of the squared differences from the mean. The standard deviation is the square root of the variance.
For a sample (a subset of a population), the sample variance is:
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
And the sample standard deviation is:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Here, \( x \) represents each data value, \( \bar{x} \) is the mean, and \( n \) is the number of values. We divide by \( n - 1 \) (not \( n \)) when working with a sample to get a better estimate of the population variance.
Example: Five students recorded their commute times (in minutes) to school:
10, 12, 15, 13, 10.
Calculate the standard deviation.
Solution:
First, find the mean:
\( \bar{x} = \frac{10 + 12 + 15 + 13 + 10}{5} = \frac{60}{5} = 12 \)
Calculate each deviation from the mean and square it:
\( (10 - 12)^2 = (-2)^2 = 4 \)
\( (12 - 12)^2 = 0^2 = 0 \)
\( (15 - 12)^2 = 3^2 = 9 \)
\( (13 - 12)^2 = 1^2 = 1 \)
\( (10 - 12)^2 = (-2)^2 = 4 \)
Sum the squared deviations:
4 + 0 + 9 + 1 + 4 = 18
Divide by \( n - 1 = 5 - 1 = 4 \):
\( s^2 = \frac{18}{4} = 4.5 \)
Take the square root to find the standard deviation:
\( s = \sqrt{4.5} \approx 2.12 \)
The standard deviation is approximately 2.12 minutes.
Standard deviation is widely used because it has the same units as the original data and works well with many statistical methods. However, like the mean, it is sensitive to outliers.
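The steps of the commute-time example translate directly into code. This is a minimal sketch with illustrative variable names; the last line cross-checks the hand-built result against `statistics.stdev` from Python's standard library, which also divides by \( n - 1 \).

```python
import math
import statistics

times = [10, 12, 15, 13, 10]

n = len(times)
x_bar = sum(times) / n                              # mean: 12.0
squared_devs = [(x - x_bar) ** 2 for x in times]    # 4, 0, 9, 1, 4
variance = sum(squared_devs) / (n - 1)              # sample variance: 18 / 4
s = math.sqrt(variance)                             # sample standard deviation

print(variance)      # 4.5
print(round(s, 2))   # 2.12

# Cross-check against the standard library's sample standard deviation.
print(math.isclose(s, statistics.stdev(times)))  # True
```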
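The five-number summary provides a quick snapshot of a dataset by listing five key values: the minimum, the first quartile (\( Q_1 \)), the median (\( Q_2 \)), the third quartile (\( Q_3 \)), and the maximum.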
A boxplot (also called a box-and-whisker plot) is a visual representation of the five-number summary. The box shows the IQR (from \( Q_1 \) to \( Q_3 \)), with a line inside marking the median. The whiskers extend from the box to the minimum and maximum values. When outliers are present, they are usually plotted as individual points, and the whiskers extend only to the most extreme values that are not outliers.
Example: The ages of ten volunteers at a community event are:
18, 20, 22, 23, 25, 27, 30, 32, 35, 40.
Create the five-number summary.
Solution:
Minimum = 18
The median is the average of the 5th and 6th values:
\( Q_2 = \frac{25 + 27}{2} = 26 \)
\( Q_1 \) is the median of the lower half (18, 20, 22, 23, 25):
\( Q_1 = 22 \)
\( Q_3 \) is the median of the upper half (27, 30, 32, 35, 40):
\( Q_3 = 32 \)
Maximum = 40
The five-number summary is: 18, 22, 26, 32, 40.
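As a rough sketch, the five-number summary can be computed with the same median-of-halves approach used in the solution. The function names here are illustrative assumptions, and other quartile conventions would give slightly different values for \( Q_1 \) and \( Q_3 \).

```python
def _median_sorted(ordered):
    """Median of an already-sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, the middle value is left out of both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], _median_sorted(lower), _median_sorted(ordered),
            _median_sorted(upper), ordered[-1])

ages = [18, 20, 22, 23, 25, 27, 30, 32, 35, 40]
summary = five_number_summary(ages)
print(summary)  # (18, 22, 26.0, 32, 40)
```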
Boxplots are particularly useful for comparing multiple datasets side by side and for identifying outliers visually.
An outlier is a data value that is significantly different from the other values in a dataset. Outliers can occur due to measurement errors, data entry mistakes, or genuine variability. Identifying outliers helps us decide whether to investigate them further or use resistant measures like the median.
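A common rule for identifying outliers uses the IQR: a value is considered an outlier if it falls below the lower fence, \( Q_1 - 1.5 \times \text{IQR} \), or above the upper fence, \( Q_3 + 1.5 \times \text{IQR} \).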
Example: Using the test scores from an earlier example:
65, 70, 72, 75, 78, 80, 82, 85, 90.
We found \( Q_1 = 71 \), \( Q_3 = 83.5 \), and IQR = 12.5.
Are there any outliers?
Solution:
Calculate the lower fence:
\( 71 - 1.5 \times 12.5 = 71 - 18.75 = 52.25 \)
Calculate the upper fence:
\( 83.5 + 1.5 \times 12.5 = 83.5 + 18.75 = 102.25 \)
Check if any values fall outside the fences:
All values are between 52.25 and 102.25.
There are no outliers in this dataset.
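The fence check can be sketched as follows, taking \( Q_1 \) and \( Q_3 \) from the worked example as given inputs. The function name `iqr_fences` and the score of 30 are hypothetical, included only to show what a flagged value would look like.

```python
def iqr_fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

scores = [65, 70, 72, 75, 78, 80, 82, 85, 90]
lower, upper = iqr_fences(71, 83.5)
print(lower, upper)  # 52.25 102.25

# Values outside the fences are flagged as outliers.
outliers = [v for v in scores if v < lower or v > upper]
print(outliers)  # []

# Hypothetical score (not in the dataset) checked against the same fences.
hypothetical_score = 30
print(hypothetical_score < lower)  # True
```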
Not all measures of center and spread are equally useful in every situation. Choosing the right summary statistics depends on the shape of the data and the presence of outliers.
When data are roughly symmetric (the left and right sides of a histogram look similar), the mean and standard deviation are typically the best choices. The mean accurately reflects the center, and the standard deviation describes typical distances from the mean.
When data are skewed, meaning they have a long tail on one side, the median and IQR are better choices. In a right-skewed distribution (tail extends to the right), the mean is pulled higher than the median by the extreme high values. In a left-skewed distribution (tail extends to the left), the mean is pulled lower than the median.
Think of it this way: If a few billionaires move into a neighborhood, the mean income will skyrocket, but the median income (what a typical resident earns) won't change much.
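To make the neighborhood analogy concrete, here is a small sketch with entirely hypothetical incomes (none of these figures come from real data). It uses `statistics.mean` and `statistics.median` from Python's standard library.

```python
import statistics

# Hypothetical annual incomes for five residents.
incomes = [40_000, 45_000, 50_000, 55_000, 60_000]
mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)

# A hypothetical billionaire moves in.
incomes_after = incomes + [1_000_000_000]
mean_after = statistics.mean(incomes_after)
median_after = statistics.median(incomes_after)

print(mean_before, median_before)   # both are 50000
print(round(mean_after))            # roughly 166.7 million
print(median_after)                 # 52500.0
```

The mean jumps by a factor of more than three thousand, while the median moves by only 2,500.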
If a dataset contains outliers, the median and IQR provide a more accurate summary than the mean and standard deviation. The median and IQR are resistant statistics: they don't change much when outliers are present.
| Data Characteristic | Recommended Measures |
|---|---|
| Symmetric, no outliers | Mean and standard deviation |
| Skewed | Median and IQR |
| Contains outliers | Median and IQR |
| Categorical data | Mode |
Beyond center and spread, the shape of a distribution provides important information about how data are distributed.
A distribution is symmetric if the left and right sides mirror each other. The mean and median are approximately equal in symmetric distributions.
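A distribution is skewed if one tail is longer than the other: it is right-skewed if the longer tail extends to the right (toward higher values) and left-skewed if the longer tail extends to the left (toward lower values).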
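Distributions can have different numbers of peaks: a unimodal distribution has one peak, a bimodal distribution has two, and a multimodal distribution has several.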
Recognizing these patterns helps you understand the underlying processes that generated the data.
One powerful use of summary statistics is comparing two or more datasets. By calculating and comparing measures of center and spread, we can identify similarities and differences between groups.
Example: Two classes took the same math test. Class A had a mean score of 78 with a standard deviation of 6. Class B had a mean score of 78 with a standard deviation of 12.
What can we conclude about the two classes?
Solution:
Both classes have the same mean score: 78.
Class A has a smaller standard deviation (6), meaning scores are more consistent and clustered near the mean.
Class B has a larger standard deviation (12), meaning scores are more spread out with greater variability.
We conclude that Class A performed more consistently, while Class B had more variability in student performance.
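The example gives only summary statistics, not the underlying scores. As an illustration, the sketch below uses two small hypothetical score lists that share a mean of 78 but differ in spread; their standard deviations are not the 6 and 12 quoted above, since the data are invented purely to show the comparison.

```python
import statistics

# Hypothetical scores, chosen so both classes have the same mean.
class_a = [72, 75, 78, 81, 84]
class_b = [60, 69, 78, 87, 96]

mean_a, mean_b = statistics.mean(class_a), statistics.mean(class_b)
sd_a, sd_b = statistics.stdev(class_a), statistics.stdev(class_b)

print(mean_a, mean_b)   # 78 78
print(round(sd_a, 2))   # about 4.74
print(round(sd_b, 2))   # about 14.23
```

Equal means with a larger standard deviation for the second list match the pattern described for Class B: the same center, but less consistent scores.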
When comparing distributions, consider the center, the spread, the shape, and any outliers in each dataset.
Summarizing quantitative data is essential across many fields.
For instance, if you're shopping for a new phone and see that Battery Model A lasts an average of 10 hours with a standard deviation of 0.5 hours, while Battery Model B lasts an average of 10 hours with a standard deviation of 2 hours, you'd probably choose Model A, because its battery life is more consistent and predictable.
Summarizing quantitative data transforms long lists of numbers into meaningful insights. By calculating measures of center (mean, median, mode), measures of spread (range, IQR, standard deviation), and examining the shape of distributions, we can describe datasets clearly and make informed comparisons. The key is choosing the right measures for the situation: use the mean and standard deviation for symmetric data without outliers, and use the median and IQR for skewed data or data with outliers. Understanding these tools equips you to interpret data in school, in everyday life, and in future careers where data-driven decisions are essential.