Statistics is a branch of mathematics concerned with collecting, organising, analysing, interpreting, and presenting data. In statistical theory, a statistic is a function of the sample values alone: it can be computed from the observed data and does not depend on any unknown parameter of the underlying distribution.
The most important statistics formulas are listed below:
Mean
- Definition: The arithmetical mean (or average) of a set of numbers is the sum of the numbers divided by the count of numbers. The mean gives a measure of the central tendency of the data.
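- Formula: Population mean μ = (Σ x_i) ÷ N. Sample mean x̄ = (Σ x_i) ÷ n. Grouped-data mean = (Σ f_i m_i) ÷ (Σ f_i), where m_i is the mid-point of class i and f_i is its frequency.
- Remarks: Use the population mean formula when the data represent an entire population. Use the sample mean as an estimator when the data are a sample from a larger population. For grouped data, multiply each class mid-point by its frequency, sum the products, and divide by the total frequency.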
Example (simple data):
Find the mean of the numbers 2, 4, 7 and 9.
Sum of the numbers.
2 + 4 + 7 + 9 = 22
Number of observations.
4
Mean = Sum ÷ Number of observations.
Mean = 22 ÷ 4 = 5.5
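The calculation above can be sketched in Python. This is a minimal illustration rather than a reference implementation; the function names `mean` and `grouped_mean` and the grouped-data figures (mid-points 5, 15, 25 with frequencies 2, 3, 5) are made up for the example.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


def grouped_mean(midpoints, frequencies):
    """Grouped-data mean: sum of (mid-point x frequency) over total frequency."""
    total_frequency = sum(frequencies)
    if total_frequency == 0:
        raise ValueError("total frequency must be positive")
    weighted_sum = sum(m * f for m, f in zip(midpoints, frequencies))
    return weighted_sum / total_frequency


print(mean([2, 4, 7, 9]))                      # 5.5, as in the worked example
print(grouped_mean([5, 15, 25], [2, 3, 5]))    # 18.0 for the hypothetical classes
```

The same function serves for a population or a sample; only the interpretation of the result (parameter or estimate) differs.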
Median
- Definition: In a sorted list (ascending or descending), the median is the middle value that divides the data into two equal halves. The median is often more representative than the mean for skewed distributions.
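- Formulas: With the n observations sorted, median = the ((n + 1) ÷ 2)-th value when n is odd, and median = the average of the (n ÷ 2)-th and ((n ÷ 2) + 1)-th values when n is even. For grouped data, median = L + (((N ÷ 2) - cf) ÷ f) × h, where L is the lower limit of the median class, N the total frequency, cf the cumulative frequency before the median class, f the frequency of the median class, and h the class width.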
- Remarks: For an odd number of observations the median is the middle value. For an even number of observations the median is the average of the two middle values. For grouped data, compute the median by linear interpolation inside the median class.
Example (odd n):
Find the median of 3, 8, 11, 14, 20.
Number of observations = 5, which is odd.
Median is the 3rd value (middle value) in the sorted list.
Median = 11
Example (even n):
Find the median of 4, 6, 9, 13.
Number of observations = 4, which is even.
Median = average of 2nd and 3rd values = (6 + 9) ÷ 2
Median = 15 ÷ 2 = 7.5
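Both worked examples follow the same rule, which can be sketched as a short Python function. The name `median` is chosen for illustration only; the function sorts a copy of the data first, so the input need not be ordered.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value.
        return ordered[mid]
    # Even n: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 8, 11, 14, 20]))  # 11
print(median([4, 6, 9, 13]))       # 7.5
```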
Mode
- Definition: The mode of a data set is the value that occurs most frequently. A distribution may be unimodal, bimodal, or multimodal depending on the number of modes.
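- Grouped data: For frequency distributions, the mode lies inside the modal class (the class with the highest frequency). A commonly used linear interpolation formula is Mode = L + ((f1 - f0) ÷ (2f1 - f0 - f2)) × h, where L is the lower limit of the modal class, f1 its frequency, f0 and f2 the frequencies of the preceding and following classes, and h the class width.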
- Remarks: In discrete ungrouped data, the mode is the value with the maximum count. For grouped data, use the modal-class formula with the preceding and following class frequencies and the class width.
Example (ungrouped):
Find the mode of 2, 5, 2, 9, 5, 2.
Frequency of 2 is 3, frequency of 5 is 2, frequency of 9 is 1.
Mode = 2 (most frequent value)
Example (grouped - illustration of method):
Identify the modal class and apply the grouped mode formula using the modal class lower limit, frequency of modal class, frequencies of neighbouring classes and class width.
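Both cases can be sketched in Python. The grouped version assumes the commonly used interpolation form Mode = L + ((f1 - f0) ÷ (2f1 - f0 - f2)) × h; the function names and the grouped figures (lower limit 20, frequencies 8, 12, 6, class width 10) are hypothetical and serve only to illustrate the method.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


def grouped_mode(lower_limit, f_modal, f_prev, f_next, width):
    """Interpolated mode inside the modal class (assumed standard formula)."""
    denominator = 2 * f_modal - f_prev - f_next
    if denominator == 0:
        raise ValueError("formula undefined when 2*f1 - f0 - f2 is zero")
    return lower_limit + (f_modal - f_prev) * width / denominator


print(modes([2, 5, 2, 9, 5, 2]))        # [2]
print(grouped_mode(20, 12, 8, 6, 10))   # 24.0 for the hypothetical classes
```

Returning a list lets the same function report bimodal or multimodal data, for example `modes([1, 1, 2, 2])` gives both 1 and 2.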
Standard Deviation
- Definition: Standard deviation measures the dispersion or spread of values about the mean. It is the square root of the variance and has the same units as the data.
- Population and sample: Population standard deviation uses denominator N (population size); sample standard deviation uses denominator (n - 1) to correct bias (Bessel's correction).
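- Formula: Population standard deviation σ = √(Σ (x_i - μ)² ÷ N). Sample standard deviation s = √(Σ (x_i - x̄)² ÷ (n - 1)).
- Remarks: Taking the square root returns the measure to the units of the original data. For large samples the difference between dividing by n and by n - 1 is small, but for small samples use n - 1 so that the sample variance is an unbiased estimator of the population variance.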
Example (sample standard deviation):
Data: 3, 7, 7, 19.
Compute the sample mean.
Mean = (3 + 7 + 7 + 19) ÷ 4 = 36 ÷ 4 = 9
Compute squared deviations from the mean and sum them.
(3 - 9)² + (7 - 9)² + (7 - 9)² + (19 - 9)² = 36 + 4 + 4 + 100 = 144
Sample variance = Sum of squared deviations ÷ (n - 1).
Sample variance = 144 ÷ 3 = 48
Sample standard deviation = √48 = 4√3 ≈ 6.928
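The steps of this example can be sketched in Python. The function names are illustrative; the standard library's `statistics.stdev` and `statistics.pstdev` compute the same two quantities and are used here only as a cross-check.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    m = sum(values) / n
    squared_deviations = sum((x - m) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


def population_std(values):
    """Population standard deviation: divide the squared deviations by N."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one observation is required")
    m = sum(values) / n
    return math.sqrt(sum((x - m) ** 2 for x in values) / n)


data = [3, 7, 7, 19]
print(round(sample_std(data), 3))        # 6.928, matching the worked example
print(population_std(data))              # 6.0, since 144 / 4 = 36
print(round(statistics.stdev(data), 3))  # 6.928, library cross-check
```

On the same data the population formula gives a smaller value than the sample formula, which is the effect of dividing by N rather than n - 1.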
Variance
- Definition: Variance is the expectation of the squared deviation of a random variable from its mean. It quantifies dispersion by averaging squared distances from the mean.
- Formula (population): σ² = Σ (x_i - μ)² ÷ N, where μ is the population mean and N the population size.
- Alternative (computational) formula: For population variance, Var(X) = (Σx² ÷ N) - (mean)². This formula is often useful for manual calculation when Σx and Σx² are known.
- Relation: Standard deviation = √(variance).
Example (population variance using computational formula):
Data (population): 2, 4, 6.
Compute mean.
Mean = (2 + 4 + 6) ÷ 3 = 12 ÷ 3 = 4
Compute Σx².
Σx² = 2² + 4² + 6² = 4 + 16 + 36 = 56
Population variance = (Σx² ÷ N) - (mean)².
Population variance = (56 ÷ 3) - 4² = 18.666... - 16 = 2.666... = 8/3
Population standard deviation = √(8/3) ≈ 1.633
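As a check on this example, the sketch below computes the population variance two ways, from the definition and from the computational formula, and confirms that they agree. The function names are illustrative only.

```python
import math


def population_variance(values):
    """Definition: average squared deviation from the mean."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n


def population_variance_computational(values):
    """Computational form: (sum of x^2 / N) - mean^2."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return sum_x2 / n - (sum_x / n) ** 2


data = [2, 4, 6]
v1 = population_variance(data)
v2 = population_variance_computational(data)
print(round(v1, 3), round(v2, 3))    # 2.667 2.667
print(math.isclose(v1, v2))          # True
print(round(math.sqrt(v1), 3))       # 1.633
```

The two results are compared with `math.isclose` rather than `==` because floating-point rounding can make them differ in the last digits. When the mean is large relative to the spread, the computational form subtracts two nearly equal numbers and can lose precision, so the definitional form is the safer choice in code.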
Other useful formulas
- Weighted mean: When observations have different weights, weighted mean = (Σ w_i x_i) ÷ (Σ w_i). Use for averages where items contribute unequally.
- Geometric mean: For n positive numbers, geometric mean = (Π x_i)^(1/n). Useful for growth rates and multiplicative processes.
- Harmonic mean: For positive numbers, harmonic mean = n ÷ (Σ 1/x_i). Useful when averaging rates or ratios.
- Coefficient of variation (CV): CV = (Standard deviation ÷ Mean) × 100%. Use CV to compare relative variability between data sets with different units or means.
- Percentiles and quartiles: The p-th percentile divides data so that p% of observations are at or below that value. The 25th, 50th and 75th percentiles are the first quartile (Q1), median (Q2) and third quartile (Q3) respectively.
- Mean of combined groups: For two groups with means μ1, μ2 and sizes n1, n2, combined mean = (n1μ1 + n2μ2) ÷ (n1 + n2).
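Several of the formulas in this list translate directly into one-line Python functions. The sketch below is illustrative: the function names and all input figures are made up, and `math.prod` requires Python 3.8 or later. Percentiles are left out because several interpolation conventions exist and the text does not fix one.

```python
import math


def weighted_mean(values, weights):
    """(sum of w_i * x_i) / (sum of w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def geometric_mean(values):
    """n-th root of the product of n positive numbers."""
    return math.prod(values) ** (1 / len(values))


def harmonic_mean(values):
    """n divided by the sum of reciprocals of positive numbers."""
    return len(values) / sum(1 / x for x in values)


def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


def combined_mean(mean1, n1, mean2, n2):
    """Mean of two groups pooled together."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


print(weighted_mean([80, 90], [1, 3]))        # 87.5
print(round(geometric_mean([2, 8]), 6))       # 4.0
print(round(harmonic_mean([40, 60]), 6))      # 48.0
print(coefficient_of_variation(2, 8))         # 25.0
print(combined_mean(10, 2, 20, 3))            # 16.0
```

The harmonic mean example reflects the usual use for rates: travelling equal distances at 40 and 60 units of speed gives an average speed of 48, not 50.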
Practical notes and interpretation
- Choice of measure: Use mean for symmetric distributions without outliers, median for skewed distributions or when outliers are present, and mode when the most frequent value is of interest.
- Units: Mean and median retain the units of the data. Variance has squared units; standard deviation returns to original units.
- Comparisons: Use coefficient of variation to compare spread across different datasets. Use quartiles and interquartile range (IQR = Q3 - Q1) to measure spread robustly against outliers.
- Grouped data caution: All grouped-data formulas rely on class mid-points or linear interpolation; precision depends on class width and distribution within classes.
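The advice on outliers can be illustrated with a small Python sketch using the standard `statistics` module. The two data sets are invented for the demonstration, and the quartiles use the module's `method="inclusive"` convention; other percentile conventions can give slightly different quartiles.

```python
import statistics


def summarize(values):
    """Return (mean, median, IQR) using inclusive-method quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.mean(values), statistics.median(values), q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # last value replaced by an outlier

clean_mean, clean_median, clean_iqr = summarize(clean)
out_mean, out_median, out_iqr = summarize(with_outlier)

# The mean moves from 5 to 14; the median (5) and IQR (4) do not change.
print(clean_mean, clean_median, clean_iqr)
print(out_mean, out_median, out_iqr)
```

A single extreme value shifts the mean substantially while leaving the median and interquartile range untouched, which is why those two measures are preferred for skewed data or data with outliers.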
Summary: The primary measures of central tendency are the mean, median and mode. Measures of dispersion include the variance and standard deviation. Use the formula appropriate to population or sample data, apply grouped-data formulas when only class frequencies are available, and choose the measure that best represents the data and the decision at hand.