Measures of Central Tendency
Mean (Arithmetic Average)
Sample Mean:
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}
\]
- \(\bar{x}\) = sample mean
- \(x_i\) = individual data values
- \(n\) = number of observations in the sample
Population Mean:
\[
\mu = \frac{1}{N}\sum_{i=1}^{N}x_i
\]
- \(\mu\) = population mean
- \(N\) = total number of observations in the population
Weighted Mean:
\[
\bar{x}_w = \frac{\sum_{i=1}^{n}w_i x_i}{\sum_{i=1}^{n}w_i}
\]
- \(\bar{x}_w\) = weighted mean
- \(w_i\) = weight assigned to observation \(x_i\)
- \(x_i\) = individual data values
- \(n\) = number of observations
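The sample mean and weighted mean formulas above translate directly into code. This is a minimal sketch; the data values and weights are made up for illustration.

```python
def sample_mean(xs):
    """x_bar = (1/n) * sum(x_i)"""
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    """x_bar_w = sum(w_i * x_i) / sum(w_i)"""
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

data = [2.0, 4.0, 6.0, 8.0]
weights = [1, 1, 1, 3]          # last observation weighted 3x

print(sample_mean(data))             # 5.0
print(weighted_mean(data, weights))  # (2 + 4 + 6 + 24) / 6 = 6.0
```

Note how the extra weight on the largest value pulls the weighted mean above the ordinary mean.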
Median
Definition: The middle value when the data are arranged in ascending or descending order.
For odd number of observations:
\[
\text{Median} = x_{\frac{n+1}{2}}
\]
For even number of observations:
\[
\text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}
\]
- \(n\) = number of observations
- \(x_i\) = data values arranged in order
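The two median cases (odd vs. even \(n\)) can be sketched as a single function, assuming the data is sorted first; the example values are illustrative.

```python
def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:                    # odd n: the single middle value x_{(n+1)/2}
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2  # even n: average of the two middle values

print(median([7, 1, 3]))      # 3
print(median([7, 1, 3, 9]))   # (3 + 7) / 2 = 5.0
```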
Mode
Definition: The value that occurs most frequently in a dataset.
- A dataset can have no mode, one mode (unimodal), two modes (bimodal), or multiple modes (multimodal)
- Mode is the only measure of central tendency applicable to nominal data
Measures of Dispersion (Variability)
Range
\[
\text{Range} = x_{max} - x_{min}
\]
- \(x_{max}\) = maximum value in dataset
- \(x_{min}\) = minimum value in dataset
Variance
Sample Variance:
\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
\]
- \(s^2\) = sample variance
- \(x_i\) = individual data values
- \(\bar{x}\) = sample mean
- \(n\) = number of observations in the sample
- Note: Division by \(n-1\) provides an unbiased estimate (Bessel's correction)
Population Variance:
\[
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
\]
- \(\sigma^2\) = population variance
- \(\mu\) = population mean
- \(N\) = total number of observations in the population
Computational Formula for Sample Variance:
\[
s^2 = \frac{\sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}}{n-1}
\]
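The definitional and computational formulas for the sample variance are algebraically identical; a short sketch (with illustrative data) confirms they agree numerically.

```python
def var_definitional(xs):
    """s^2 = sum((x_i - x_bar)^2) / (n - 1)"""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def var_computational(xs):
    """s^2 = (sum(x_i^2) - (sum(x_i))^2 / n) / (n - 1)"""
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    return (sxx - sx * sx / n) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(var_definitional(data))    # 32 / 7 ≈ 4.5714
print(var_computational(data))   # same value
```

The computational form avoids a second pass over the data, though the definitional form is numerically safer when values are large relative to their spread.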
Standard Deviation
Sample Standard Deviation:
\[
s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}
\]
- \(s\) = sample standard deviation
- Units are the same as the original data
Population Standard Deviation:
\[
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}
\]
- \(\sigma\) = population standard deviation
Coefficient of Variation
\[
CV = \frac{s}{\bar{x}} \times 100\%
\]
or for population:
\[
CV = \frac{\sigma}{\mu} \times 100\%
\]
- CV = coefficient of variation (expressed as percentage)
- \(s\) = sample standard deviation
- \(\bar{x}\) = sample mean
- Note: Dimensionless measure of relative variability; useful for comparing variability between datasets with different units or means
Interquartile Range (IQR)
\[
IQR = Q_3 - Q_1
\]
- IQR = interquartile range
- \(Q_3\) = third quartile (75th percentile)
- \(Q_1\) = first quartile (25th percentile)
- Measures the spread of the middle 50% of the data
- Resistant to outliers
Measures of Position
Percentiles
Position of kth Percentile:
\[
L_k = \frac{k}{100}(n+1)
\]
- \(L_k\) = position of the kth percentile
- \(k\) = desired percentile (0 to 100)
- \(n\) = number of observations
- Note: If \(L_k\) is not an integer, interpolate between the two nearest data values
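The position rule and the interpolation note above can be sketched as follows; the data values are illustrative, and this is one of several common percentile conventions.

```python
def percentile(sorted_xs, k):
    """kth percentile via L_k = (k/100)(n+1) with linear interpolation."""
    n = len(sorted_xs)
    pos = k / 100 * (n + 1)   # 1-based position in the ordered data
    lo = int(pos)             # integer part
    frac = pos - lo           # fractional part drives the interpolation
    if lo < 1:
        return sorted_xs[0]
    if lo >= n:
        return sorted_xs[-1]
    return sorted_xs[lo - 1] + frac * (sorted_xs[lo] - sorted_xs[lo - 1])

data = [10, 20, 30, 40, 50]
print(percentile(data, 25))   # L = 1.5 -> halfway between 10 and 20 = 15.0
print(percentile(data, 50))   # L = 3.0 -> 30.0 (the median)
```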
Quartiles
- \(Q_1\) = first quartile = 25th percentile
- \(Q_2\) = second quartile = 50th percentile = median
- \(Q_3\) = third quartile = 75th percentile
Standard Score (Z-Score)
Sample Z-Score:
\[
z = \frac{x - \bar{x}}{s}
\]
Population Z-Score:
\[
z = \frac{x - \mu}{\sigma}
\]
- \(z\) = standardized score
- \(x\) = data value
- \(\bar{x}\) or \(\mu\) = mean
- \(s\) or \(\sigma\) = standard deviation
- Z-score indicates how many standard deviations a value is from the mean
- Positive z-score: value is above the mean
- Negative z-score: value is below the mean
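A minimal z-score sketch using the sample formula \(z = (x - \bar{x})/s\); the exam-score numbers are made up for illustration.

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies from the mean."""
    return (x - mean) / sd

# e.g. a score of 85 when the class mean is 70 with s = 10
print(z_score(85, 70, 10))   # 1.5 -> above the mean
print(z_score(55, 70, 10))   # -1.5 -> below the mean
```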
Measures of Shape
Skewness
Sample Skewness (adjusted Fisher-Pearson moment coefficient):
\[
g_1 = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3
\]
Approximate Skewness:
\[
\text{Skewness} \approx \frac{3(\bar{x} - \text{Median})}{s}
\]
- \(g_1\) = sample skewness coefficient
- \(\bar{x}\) = sample mean
- \(s\) = sample standard deviation
- Interpretation:
- \(g_1 = 0\): symmetric distribution
- \(g_1 > 0\): positively skewed (right-skewed, tail extends to the right)
- \(g_1 < 0\): negatively skewed (left-skewed, tail extends to the left)
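Both skewness formulas above can be sketched in a few lines, assuming \(n \geq 3\); the right-skewed data values are made up for illustration.

```python
def sample_skewness(xs):
    """Moment coefficient g_1 with the n/((n-1)(n-2)) adjustment factor."""
    n = len(xs)
    xbar = sum(xs) / n
    s = (sum((x - xbar) ** 2 for x in xs) / (n - 1)) ** 0.5
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in xs)

def median_skew_approx(xs):
    """Approximate skewness 3*(mean - median)/s (odd n keeps the median simple)."""
    n = len(xs)
    xbar = sum(xs) / n
    s = (sum((x - xbar) ** 2 for x in xs) / (n - 1)) ** 0.5
    return 3 * (xbar - sorted(xs)[n // 2]) / s

data = [1, 2, 2, 3, 10]            # long right tail
print(sample_skewness(data))       # positive -> right-skewed
print(median_skew_approx(data))    # also positive, but a different magnitude
```

The two measures agree in sign but not in value; the median-based form is only a rough approximation.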
Kurtosis
Sample Kurtosis (excess kurtosis):
\[
g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}
\]
- \(g_2\) = excess kurtosis coefficient
- Interpretation:
- \(g_2 = 0\): mesokurtic (normal distribution)
- \(g_2 > 0\): leptokurtic (heavier tails, more peaked)
- \(g_2 < 0\): platykurtic (lighter tails, less peaked)
Correlation and Covariance
Covariance
Sample Covariance:
\[
s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
\]
Population Covariance:
\[
\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)
\]
- \(s_{xy}\) = sample covariance between variables x and y
- \(\bar{x}\), \(\bar{y}\) = sample means of x and y
- \(n\) = number of paired observations
- Positive covariance indicates variables tend to move together
- Negative covariance indicates variables tend to move in opposite directions
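The sample covariance formula above is a direct sum over paired deviations; the paired data here is illustrative.

```python
def sample_cov(xs, ys):
    """s_xy = sum((x_i - x_bar)(y_i - y_bar)) / (n - 1)"""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]              # y increases with x
print(sample_cov(xs, ys))      # positive: the variables move together
```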
Correlation Coefficient
Pearson Correlation Coefficient (r):
\[
r = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
\]
Alternative Computational Formula:
\[
r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - (\sum x_i)^2}\sqrt{n\sum y_i^2 - (\sum y_i)^2}}
\]
- \(r\) = Pearson correlation coefficient
- \(s_x\), \(s_y\) = sample standard deviations of x and y
- Range: \(-1 \leq r \leq +1\)
- Interpretation:
- \(r = +1\): perfect positive linear relationship
- \(r = -1\): perfect negative linear relationship
- \(r = 0\): no linear relationship
- \(|r| > 0.7\): strong correlation
- \(0.3 \leq |r| \leq 0.7\): moderate correlation
- \(|r| < 0.3\): weak correlation
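The computational formula for \(r\) needs only running sums, so it can be sketched without computing the means first; the paired data is illustrative.

```python
def pearson_r(xs, ys):
    """r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))"""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = ((n * sxx - sx * sx) ** 0.5) * ((n * syy - sy * sy) ** 0.5)
    return num / den

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # points lie exactly on y = 2x,
print(pearson_r(xs, ys))       # so r is 1 (up to floating-point rounding)
```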
Coefficient of Determination
\[
r^2 = \text{(Pearson correlation coefficient)}^2
\]
- \(r^2\) = coefficient of determination
- Range: \(0 \leq r^2 \leq 1\)
- Represents the proportion of variance in one variable that is predictable from the other variable
- Expressed as a percentage when multiplied by 100
Linear Regression
Simple Linear Regression Model
\[
y = a + bx
\]
or
\[
\hat{y} = a + bx
\]
- \(\hat{y}\) = predicted value of dependent variable
- \(x\) = independent variable
- \(a\) = y-intercept
- \(b\) = slope of the regression line
Slope of Regression Line
\[
b = \frac{s_{xy}}{s_x^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
\]
Alternative Computational Formula:
\[
b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2}
\]
- \(b\) = slope (regression coefficient)
- \(s_{xy}\) = covariance of x and y
- \(s_x^2\) = variance of x
Y-Intercept of Regression Line
\[
a = \bar{y} - b\bar{x}
\]
- \(a\) = y-intercept
- \(\bar{x}\), \(\bar{y}\) = means of x and y
- \(b\) = slope
- The regression line always passes through the point \((\bar{x}, \bar{y})\)
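The slope and intercept formulas above fit a least-squares line in two passes; the points below are made up to lie exactly on \(y = 1 + 2x\).

```python
def fit_line(xs, ys):
    """Least squares: b = s_xy / s_x^2, a = y_bar - b * x_bar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sxy / sxx
    a = ybar - b * xbar        # forces the line through (x_bar, y_bar)
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)                    # 1.0 2.0, recovering y = 1 + 2x
```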
Residual
\[
e_i = y_i - \hat{y}_i
\]
- \(e_i\) = residual for observation i
- \(y_i\) = actual observed value
- \(\hat{y}_i\) = predicted value from regression equation
Sum of Squared Errors (SSE)
\[
SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}e_i^2
\]
- SSE = sum of squared errors (residuals)
- Measure of variation not explained by the regression model
Standard Error of Estimate
\[
s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}}
\]
- \(s_e\) = standard error of the estimate
- \(n\) = number of observations
- Measures the typical deviation of observed values from the regression line
- Smaller values indicate better fit
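Residuals, SSE, and \(s_e\) chain together directly; in this sketch the fitted line \(\hat{y} = 1 + 2x\) and the slightly noisy data points are both illustrative.

```python
def std_error_of_estimate(xs, ys, a, b):
    """s_e = sqrt(SSE / (n - 2)) for a given fitted line y_hat = a + b*x."""
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # e_i = y_i - y_hat_i
    sse = sum(e * e for e in residuals)
    return (sse / (len(xs) - 2)) ** 0.5

xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]      # near the line y = 1 + 2x
print(std_error_of_estimate(xs, ys, 1.0, 2.0))   # ≈ 0.2236
```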
Frequency Distributions
Class Width
\[
\text{Class Width} = \frac{\text{Range}}{\text{Number of Classes}} = \frac{x_{max} - x_{min}}{k}
\]
- \(k\) = number of classes
- Typically round up to a convenient number
Sturges' Rule for Number of Classes
\[
k = 1 + 3.322 \log_{10}(n)
\]
or
\[
k = 1 + \frac{\log(n)}{\log(2)}
\]
- \(k\) = suggested number of classes
- \(n\) = number of observations
- Provides a starting point; final number may be adjusted
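Both forms of Sturges' rule give the same value, since \(3.322 \approx 1/\log_{10} 2\); a quick sketch, rounding up as the text suggests:

```python
import math

def sturges(n):
    """k = 1 + log2(n), equivalent to 1 + 3.322*log10(n)."""
    return 1 + math.log2(n)

for n in (50, 100, 1000):
    print(n, math.ceil(sturges(n)))   # 50 -> 7, 100 -> 8, 1000 -> 11 classes
```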
Class Midpoint
\[
\text{Midpoint} = \frac{\text{Lower Class Limit} + \text{Upper Class Limit}}{2}
\]
Relative Frequency
\[
\text{Relative Frequency} = \frac{\text{Class Frequency}}{n}
\]
- \(n\) = total number of observations
- Sum of all relative frequencies equals 1.0
Cumulative Frequency
- Sum of frequencies up to and including the current class
- The cumulative frequency of the last class equals \(n\)
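Relative and cumulative frequencies tie together as sketched below for an illustrative table of four class frequencies.

```python
freqs = [4, 7, 6, 3]             # class frequencies (made up)
n = sum(freqs)                   # 20 observations in total

rel = [f / n for f in freqs]     # relative frequencies sum to 1.0
cum = []
running = 0
for f in freqs:                  # running total up to and including each class
    running += f
    cum.append(running)

print(rel)   # [0.2, 0.35, 0.3, 0.15]
print(cum)   # [4, 11, 17, 20] -> last entry equals n
```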
Outlier Detection
Interquartile Range (IQR) Method
Lower Fence:
\[
\text{Lower Fence} = Q_1 - 1.5 \times IQR
\]
Upper Fence:
\[
\text{Upper Fence} = Q_3 + 1.5 \times IQR
\]
- Data points below the lower fence or above the upper fence are considered outliers
- \(Q_1\) = first quartile
- \(Q_3\) = third quartile
- \(IQR\) = interquartile range = \(Q_3 - Q_1\)
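The fence rule can be sketched as below, assuming the quartiles are already known (here \(Q_1 = 10\), \(Q_3 = 20\) for illustration).

```python
def iqr_fences(q1, q3):
    """Lower and upper fences at 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = iqr_fences(10, 20)
print(lower, upper)              # -5.0 35.0

data = [-9, 2, 11, 14, 19, 40]
outliers = [x for x in data if x < lower or x > upper]
print(outliers)                  # [-9, 40]
```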
Z-Score Method
- A data point is typically considered an outlier if:
- \(|z| > 2\) (for small samples)
- \(|z| > 3\) (for large samples or more conservative approach)
- Where \(z\) is the z-score of the data point
Special Statistical Properties
Properties of Mean
- Sum of deviations from the mean equals zero: \(\sum_{i=1}^{n}(x_i - \bar{x}) = 0\)
- Mean is sensitive to outliers and extreme values
- Mean of a constant times a variable: \(\overline{kx} = k\bar{x}\)
- Mean of sum/difference: \(\overline{x \pm y} = \bar{x} \pm \bar{y}\)
Properties of Variance
- Variance of a constant is zero: \(\text{Var}(k) = 0\)
- Variance of a constant times a variable: \(\text{Var}(kx) = k^2 \text{Var}(x)\)
- For independent variables: \(\text{Var}(x \pm y) = \text{Var}(x) + \text{Var}(y)\)
Properties of Standard Deviation
- Standard deviation of a constant is zero: \(\text{SD}(k) = 0\)
- Standard deviation of a constant times a variable: \(\text{SD}(kx) = |k| \times \text{SD}(x)\)
Empirical Rule (68-95-99.7 Rule)
For normal (bell-shaped) distributions:
- Approximately 68% of data falls within \(\mu \pm \sigma\) (one standard deviation from mean)
- Approximately 95% of data falls within \(\mu \pm 2\sigma\) (two standard deviations from mean)
- Approximately 99.7% of data falls within \(\mu \pm 3\sigma\) (three standard deviations from mean)
Chebyshev's Theorem
For any distribution (regardless of shape):
\[
\text{Minimum proportion} = 1 - \frac{1}{k^2}
\]
- \(k\) = number of standard deviations from the mean (\(k > 1\))
- At least \(1 - \frac{1}{k^2}\) of the data falls within \(k\) standard deviations of the mean
- Examples:
- \(k = 2\): at least 75% of data within \(\mu \pm 2\sigma\)
- \(k = 3\): at least 89% of data within \(\mu \pm 3\sigma\)
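The two worked examples follow directly from the bound; a quick numeric check:

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2

print(chebyshev_bound(2))   # 0.75 -> at least 75% within 2 standard deviations
print(chebyshev_bound(3))   # 8/9 ≈ 0.889 -> at least ~89% within 3
```

Unlike the Empirical Rule, this bound holds for any distribution, which is why its guarantees (75%, 89%) are weaker than the normal-distribution figures (95%, 99.7%).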