
Formula Sheet: Descriptive Statistics

Measures of Central Tendency

Mean (Arithmetic Average)

Sample Mean: \[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i = \frac{x_1 + x_2 + ... + x_n}{n} \]
  • \(\bar{x}\) = sample mean
  • \(x_i\) = individual data values
  • \(n\) = number of observations in the sample
Population Mean: \[ \mu = \frac{1}{N}\sum_{i=1}^{N}x_i \]
  • \(\mu\) = population mean
  • \(N\) = total number of observations in the population
Weighted Mean: \[ \bar{x}_w = \frac{\sum_{i=1}^{n}w_i x_i}{\sum_{i=1}^{n}w_i} \]
  • \(\bar{x}_w\) = weighted mean
  • \(w_i\) = weight assigned to observation \(x_i\)
  • \(x_i\) = individual data values
  • \(n\) = number of observations
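The mean formulas above can be sketched directly in Python as a quick numerical check (the data values and weights below are made up purely for illustration):

```python
# Sample mean and weighted mean, computed straight from the formulas above.

def sample_mean(xs):
    # x-bar = (1/n) * sum of x_i
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    # x-bar_w = sum(w_i * x_i) / sum(w_i)
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

data = [2.0, 4.0, 6.0, 8.0]
print(sample_mean(data))                    # 5.0
print(weighted_mean(data, [1, 1, 1, 3]))    # 6.0: the weight of 3 pulls the mean toward 8
```

Note that when all weights are equal, the weighted mean reduces to the ordinary sample mean.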

Median

Definition: The middle value when the data are arranged in ascending or descending order. For an odd number of observations: \[ \text{Median} = x_{\frac{n+1}{2}} \] For an even number of observations: \[ \text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} \]
  • \(n\) = number of observations
  • \(x_i\) = data values arranged in order
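Both cases can be implemented in a few lines of Python (example data are hypothetical):

```python
# Median for odd and even n, following the two cases above.

def median(xs):
    s = sorted(xs)                      # the data must be ordered first
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                   # odd n: the single middle value
    return (s[mid - 1] + s[mid]) / 2    # even n: average of the two middle values

print(median([7, 1, 3]))        # 3
print(median([7, 1, 3, 5]))     # 4.0
```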

Mode

Definition: The value that occurs most frequently in a dataset.
  • A dataset can have no mode, one mode (unimodal), two modes (bimodal), or multiple modes (multimodal)
  • Mode is the only measure of central tendency applicable to nominal data

Measures of Dispersion (Variability)

Range

\[ \text{Range} = x_{max} - x_{min} \]
  • \(x_{max}\) = maximum value in dataset
  • \(x_{min}\) = minimum value in dataset

Variance

Sample Variance: \[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \]
  • \(s^2\) = sample variance
  • \(x_i\) = individual data values
  • \(\bar{x}\) = sample mean
  • \(n\) = number of observations in the sample
  • Note: Division by \(n-1\) provides an unbiased estimate (Bessel's correction)
Population Variance: \[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 \]
  • \(\sigma^2\) = population variance
  • \(\mu\) = population mean
  • \(N\) = total number of observations in the population
Computational Formula for Sample Variance: \[ s^2 = \frac{\sum_{i=1}^{n}x_i^2 - \frac{(\sum_{i=1}^{n}x_i)^2}{n}}{n-1} \]
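The definitional and computational forms can be checked against each other numerically; they agree up to floating-point rounding (the dataset below is hypothetical):

```python
def sample_variance(xs):
    # Definitional form with Bessel's correction: divide by n - 1.
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    # Computational form: (sum x_i^2 - (sum x_i)^2 / n) / (n - 1)
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x * x for x in xs)
    return (sx2 - sx * sx / n) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(sample_variance(data))            # 32/7 ≈ 4.571
print(sample_variance_shortcut(data))   # same value
```

The computational form avoids a second pass over the data, which is why it appears on calculator-oriented formula sheets, though the definitional form is numerically safer for large values.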

Standard Deviation

Sample Standard Deviation: \[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \]
  • \(s\) = sample standard deviation
  • Units are the same as the original data
Population Standard Deviation: \[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} \]
  • \(\sigma\) = population standard deviation

Coefficient of Variation

\[ CV = \frac{s}{\bar{x}} \times 100\% \] or for population: \[ CV = \frac{\sigma}{\mu} \times 100\% \]
  • CV = coefficient of variation (expressed as percentage)
  • \(s\) = sample standard deviation
  • \(\bar{x}\) = sample mean
  • Note: Dimensionless measure of relative variability; useful for comparing variability between datasets with different units or means
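The scale-free property is easy to demonstrate: two hypothetical datasets with different magnitudes but the same relative spread produce the same CV.

```python
import math

def coefficient_of_variation(xs):
    # CV = s / x-bar * 100%
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return s / xbar * 100

a = [10.0, 12.0, 14.0]       # mean 12,  s = 2   -> CV ≈ 16.7%
b = [100.0, 120.0, 140.0]    # mean 120, s = 20  -> CV ≈ 16.7%
print(coefficient_of_variation(a), coefficient_of_variation(b))
```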

Interquartile Range (IQR)

\[ IQR = Q_3 - Q_1 \]
  • IQR = interquartile range
  • \(Q_3\) = third quartile (75th percentile)
  • \(Q_1\) = first quartile (25th percentile)
  • Measures the spread of the middle 50% of the data
  • Resistant to outliers

Measures of Position

Percentiles

Position of kth Percentile: \[ L_k = \frac{k}{100}(n+1) \]
  • \(L_k\) = position of the kth percentile
  • \(k\) = desired percentile (0 to 100)
  • \(n\) = number of observations
  • Note: If \(L_k\) is not an integer, interpolate between the two nearest data values
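The position formula with interpolation can be sketched as below; note that this is the \((n+1)\)-based convention used on this sheet, and statistical software may use slightly different percentile conventions.

```python
def percentile(xs, k):
    # kth percentile via L_k = (k/100)(n+1), with linear
    # interpolation when L_k is not an integer.
    s = sorted(xs)
    n = len(s)
    pos = k / 100 * (n + 1)   # 1-based position L_k
    lo = int(pos)
    if lo < 1:                # position falls before the first value
        return s[0]
    if lo >= n:               # position falls beyond the last value
        return s[-1]
    frac = pos - lo
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

data = [1, 3, 5, 7, 9, 11, 13]            # n = 7
q1, q3 = percentile(data, 25), percentile(data, 75)
print(q1, q3, q3 - q1)                    # Q1 = 3.0, Q3 = 11.0, IQR = 8.0
```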

Quartiles

  • \(Q_1\) = first quartile = 25th percentile
  • \(Q_2\) = second quartile = 50th percentile = median
  • \(Q_3\) = third quartile = 75th percentile

Standard Score (Z-Score)

Sample Z-Score: \[ z = \frac{x - \bar{x}}{s} \] Population Z-Score: \[ z = \frac{x - \mu}{\sigma} \]
  • \(z\) = standardized score
  • \(x\) = data value
  • \(\bar{x}\) or \(\mu\) = mean
  • \(s\) or \(\sigma\) = standard deviation
  • Z-score indicates how many standard deviations a value is from the mean
  • Positive z-score: value is above the mean
  • Negative z-score: value is below the mean
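A short sketch of the sample z-score (hypothetical data):

```python
import math

def z_score(x, xs):
    # z = (x - x-bar) / s
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((v - xbar) ** 2 for v in xs) / (n - 1))
    return (x - xbar) / s

data = [2.0, 4.0, 6.0, 8.0, 10.0]    # mean 6, s = sqrt(10)
print(z_score(10.0, data))           # positive: 10 is above the mean
print(z_score(2.0, data))            # negative: 2 is below the mean
```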

Measures of Shape

Skewness

Sample Skewness (adjusted Fisher-Pearson moment coefficient): \[ g_1 = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^3 \] Approximate Skewness (Pearson's second skewness coefficient): \[ \text{Skewness} \approx \frac{3(\bar{x} - \text{Median})}{s} \]
  • \(g_1\) = sample skewness coefficient
  • \(\bar{x}\) = sample mean
  • \(s\) = sample standard deviation
  • Interpretation:
    • \(g_1 = 0\): symmetric distribution
    • \(g_1 > 0\): positively skewed (right-skewed, tail extends to the right)
    • \(g_1 < 0\): negatively skewed (left-skewed, tail extends to the left)

Kurtosis

Sample Kurtosis (excess kurtosis): \[ g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} \]
  • \(g_2\) = excess kurtosis coefficient
  • Interpretation:
    • \(g_2 = 0\): mesokurtic (normal distribution)
    • \(g_2 > 0\): leptokurtic (heavier tails, more peaked)
    • \(g_2 < 0\): platykurtic (lighter tails, flatter peak)
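The two shape coefficients above can be implemented directly; a symmetric dataset gives \(g_1 \approx 0\), while an outlier-heavy right tail gives \(g_1 > 0\) (both datasets below are hypothetical):

```python
import math

def _mean_and_s(xs):
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return xbar, s

def sample_skewness(xs):
    # g1 = n / ((n-1)(n-2)) * sum(((x_i - x-bar) / s)^3)
    n = len(xs)
    xbar, s = _mean_and_s(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in xs)

def sample_excess_kurtosis(xs):
    # g2 = n(n+1)/((n-1)(n-2)(n-3)) * sum(((x_i - x-bar)/s)^4)
    #      - 3(n-1)^2 / ((n-2)(n-3))
    n = len(xs)
    xbar, s = _mean_and_s(xs)
    fourth = sum(((x - xbar) / s) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 10.0]
print(sample_skewness(symmetric))       # ≈ 0 (symmetric)
print(sample_skewness(right_skewed))    # > 0 (long right tail)
print(sample_excess_kurtosis(symmetric))  # < 0 (flatter than normal)
```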

Correlation and Covariance

Covariance

Sample Covariance: \[ s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \] Population Covariance: \[ \sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y) \]
  • \(s_{xy}\) = sample covariance between variables x and y
  • \(\bar{x}\), \(\bar{y}\) = sample means of x and y
  • \(n\) = number of paired observations
  • Positive covariance indicates variables tend to move together
  • Negative covariance indicates variables tend to move in opposite directions

Correlation Coefficient

Pearson Correlation Coefficient (r): \[ r = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \] Alternative Computational Formula: \[ r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - (\sum x_i)^2}\sqrt{n\sum y_i^2 - (\sum y_i)^2}} \]
  • \(r\) = Pearson correlation coefficient
  • \(s_x\), \(s_y\) = sample standard deviations of x and y
  • Range: \(-1 \leq r \leq +1\)
  • Interpretation:
    • \(r = +1\): perfect positive linear relationship
    • \(r = -1\): perfect negative linear relationship
    • \(r = 0\): no linear relationship
    • \(|r| > 0.7\): strong correlation
    • \(0.3 < |r| < 0.7\): moderate correlation
    • \(|r| < 0.3\): weak correlation
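The definitional form of Pearson's r can be sketched as follows; the perfectly linear hypothetical data give \(r = \pm 1\) exactly:

```python
import math

def pearson_r(xs, ys):
    # r = sum((x_i - x-bar)(y_i - y-bar)) / sqrt(sum dx^2 * sum dy^2)
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # +1: perfect positive linear fit
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1: perfect negative linear fit
```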

Coefficient of Determination

\[ r^2 = \text{(Pearson correlation coefficient)}^2 \]
  • \(r^2\) = coefficient of determination
  • Range: \(0 \leq r^2 \leq 1\)
  • Represents the proportion of variance in one variable that is predictable from the other variable
  • Expressed as a percentage when multiplied by 100

Linear Regression

Simple Linear Regression Model

\[ y = a + bx \] or \[ \hat{y} = a + bx \]
  • \(\hat{y}\) = predicted value of dependent variable
  • \(x\) = independent variable
  • \(a\) = y-intercept
  • \(b\) = slope of the regression line

Slope of Regression Line

\[ b = \frac{s_{xy}}{s_x^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \] Alternative Computational Formula: \[ b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - (\sum x_i)^2} \]
  • \(b\) = slope (regression coefficient)
  • \(s_{xy}\) = covariance of x and y
  • \(s_x^2\) = variance of x

Y-Intercept of Regression Line

\[ a = \bar{y} - b\bar{x} \]
  • \(a\) = y-intercept
  • \(\bar{x}\), \(\bar{y}\) = means of x and y
  • \(b\) = slope
  • The regression line always passes through the point \((\bar{x}, \bar{y})\)

Residual

\[ e_i = y_i - \hat{y}_i \]
  • \(e_i\) = residual for observation i
  • \(y_i\) = actual observed value
  • \(\hat{y}_i\) = predicted value from regression equation

Sum of Squared Errors (SSE)

\[ SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}e_i^2 \]
  • SSE = sum of squared errors (residuals)
  • Measure of variation not explained by the regression model

Standard Error of Estimate

\[ s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}} \]
  • \(s_e\) = standard error of the estimate
  • \(n\) = number of observations
  • Measures the typical deviation of observed values from the regression line
  • Smaller values indicate better fit
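The regression formulas from slope through standard error chain together naturally; the sketch below fits a line to a small hypothetical dataset and reports \(b\), \(a\), SSE, and \(s_e\):

```python
import math

def fit_line(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    # Slope: b = sum((x_i - x-bar)(y_i - y-bar)) / sum((x_i - x-bar)^2)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar                      # line passes through (x-bar, y-bar)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)      # unexplained variation
    se = math.sqrt(sse / (n - 2))            # standard error of the estimate
    return a, b, sse, se

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b, sse, se = fit_line(xs, ys)
print(a, b)        # intercept 2.2, slope 0.6
print(sse, se)     # SSE = 2.4, s_e = sqrt(0.8) ≈ 0.894
```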

Frequency Distributions

Class Width

\[ \text{Class Width} = \frac{\text{Range}}{\text{Number of Classes}} = \frac{x_{max} - x_{min}}{k} \]
  • \(k\) = number of classes
  • Typically round up to a convenient number

Sturges' Rule for Number of Classes

\[ k = 1 + 3.322 \log_{10}(n) \] or \[ k = 1 + \frac{\log(n)}{\log(2)} \]
  • \(k\) = suggested number of classes
  • \(n\) = number of observations
  • Provides a starting point; final number may be adjusted
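Sturges' rule and the class-width formula can be combined in a short sketch (rounding conventions vary; rounding up is shown here, and the sample range is hypothetical):

```python
import math

def sturges_classes(n):
    # k = 1 + 3.322 * log10(n), equivalently 1 + log2(n); round up in practice
    return math.ceil(1 + 3.322 * math.log10(n))

def class_width(xmax, xmin, k):
    # Range / number of classes; typically rounded up to a convenient value
    return (xmax - xmin) / k

k = sturges_classes(100)
print(k)                            # 8 classes suggested for n = 100
print(class_width(99.0, 12.0, k))   # 10.875 -> round up, e.g. to 11
```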

Class Midpoint

\[ \text{Midpoint} = \frac{\text{Lower Class Limit} + \text{Upper Class Limit}}{2} \]

Relative Frequency

\[ \text{Relative Frequency} = \frac{\text{Class Frequency}}{n} \]
  • \(n\) = total number of observations
  • Sum of all relative frequencies equals 1.0

Cumulative Frequency

  • Sum of frequencies up to and including the current class
  • The cumulative frequency of the last class equals \(n\)

Outlier Detection

Interquartile Range (IQR) Method

Lower Fence: \[ \text{Lower Fence} = Q_1 - 1.5 \times IQR \] Upper Fence: \[ \text{Upper Fence} = Q_3 + 1.5 \times IQR \]
  • Data points below the lower fence or above the upper fence are considered outliers
  • \(Q_1\) = first quartile
  • \(Q_3\) = third quartile
  • \(IQR\) = interquartile range = \(Q_3 - Q_1\)
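The fence test can be sketched as below; the quartiles are passed in as arguments since quartile conventions differ between texts and software, and the data are hypothetical:

```python
def iqr_outliers(xs, q1, q3):
    # Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in xs if x < lower_fence or x > upper_fence]

data = [1, 3, 5, 7, 9, 50]
# With Q1 = 3 and Q3 = 9: IQR = 6, fences at -6 and 18
print(iqr_outliers(data, 3, 9))   # [50]
```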

Z-Score Method

  • A data point is typically considered an outlier if:
  • \(|z| > 2\) (for small samples)
  • \(|z| > 3\) (for large samples or more conservative approach)
  • Where \(z\) is the z-score of the data point

Special Statistical Properties

Properties of Mean

  • Sum of deviations from the mean equals zero: \(\sum_{i=1}^{n}(x_i - \bar{x}) = 0\)
  • Mean is sensitive to outliers and extreme values
  • Mean of a constant times a variable: \(\overline{kx} = k\bar{x}\)
  • Mean of sum/difference: \(\overline{x \pm y} = \bar{x} \pm \bar{y}\)

Properties of Variance

  • Variance of a constant is zero: \(\text{Var}(k) = 0\)
  • Variance of a constant times a variable: \(\text{Var}(kx) = k^2 \text{Var}(x)\)
  • For independent variables: \(\text{Var}(x \pm y) = \text{Var}(x) + \text{Var}(y)\)

Properties of Standard Deviation

  • Standard deviation of a constant is zero: \(\text{SD}(k) = 0\)
  • Standard deviation of a constant times a variable: \(\text{SD}(kx) = |k| \times \text{SD}(x)\)

Empirical Rule (68-95-99.7 Rule)

For normal (bell-shaped) distributions:
  • Approximately 68% of data falls within \(\mu \pm \sigma\) (one standard deviation from mean)
  • Approximately 95% of data falls within \(\mu \pm 2\sigma\) (two standard deviations from mean)
  • Approximately 99.7% of data falls within \(\mu \pm 3\sigma\) (three standard deviations from mean)

Chebyshev's Theorem

For any distribution (regardless of shape): \[ \text{Minimum proportion} = 1 - \frac{1}{k^2} \]
  • \(k\) = number of standard deviations from the mean (\(k > 1\))
  • At least \(1 - \frac{1}{k^2}\) of the data falls within \(k\) standard deviations of the mean
  • Examples:
    • \(k = 2\): at least 75% of data within \(\mu \pm 2\sigma\)
    • \(k = 3\): at least \(8/9 \approx 88.9\%\) of data within \(\mu \pm 3\sigma\)
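The bound is a one-line function; comparing it with the Empirical Rule percentages shows how conservative Chebyshev's guarantee is for near-normal data:

```python
def chebyshev_min_proportion(k):
    # At least 1 - 1/k^2 of ANY distribution lies within k standard
    # deviations of the mean (valid for k > 1).
    return 1 - 1 / k ** 2

print(chebyshev_min_proportion(2))   # 0.75  -> at least 75% (Empirical Rule: ~95%)
print(chebyshev_min_proportion(3))   # 0.888... -> at least ~89% (Empirical Rule: ~99.7%)
```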
The document Formula Sheet: Descriptive Statistics is a part of the PE Exam Course Engineering Fundamentals Revision for PE.