
Formula Sheet: Regression & Correlation

Linear Regression

Simple Linear Regression Model

General Form:

\[y = a + bx\]
  • y = dependent variable (response variable)
  • x = independent variable (predictor variable)
  • a = y-intercept (constant term)
  • b = slope (regression coefficient)

Estimated Regression Line (Least Squares Line):

\[\hat{y} = a + bx\]
  • \(\hat{y}\) = predicted value of y
  • Minimizes the sum of squared residuals

Calculation of Regression Coefficients

Slope (b):

\[b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}\]

Alternative formula for slope:

\[b = \frac{S_{xy}}{S_{xx}}\]

Y-Intercept (a):

\[a = \bar{y} - b\bar{x}\]
  • n = number of data points
  • \(\bar{x}\) = mean of x values = \(\frac{\sum x}{n}\)
  • \(\bar{y}\) = mean of y values = \(\frac{\sum y}{n}\)
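As a quick check, the slope and intercept formulas can be evaluated on a small made-up data set (the five points below are illustrative, not from the notes):

```python
# Least-squares slope and intercept from raw sums.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sum_x = sum(x)                                   # 15
sum_y = sum(y)                                   # 20
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # 66
sum_x2 = sum(xi ** 2 for xi in x)                # 55

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = sum_y / n - b * (sum_x / n)                  # intercept: a = ybar - b*xbar

print(round(b, 4), round(a, 4))  # 0.6 2.2
```

The fitted line for this data set is \(\hat{y} = 2.2 + 0.6x\).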

Sum of Squares

Sum of Squares for x (\(S_{xx}\)):

\[S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}\]

Sum of Squares for y (\(S_{yy}\)):

\[S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}\]

Sum of Cross Products (\(S_{xy}\)):

\[S_{xy} = \sum xy - \frac{\sum x \sum y}{n}\]
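The sum-of-squares shortcuts can be verified on the same illustrative five-point data set; the slope from \(b = S_{xy}/S_{xx}\) must match the raw-sum formula:

```python
# S_xx, S_yy, S_xy via the computational shortcuts.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n                  # 55 - 45 = 10.0
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n                  # 86 - 80 = 6.0
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y)/n    # 66 - 60 = 6.0

b = Sxy / Sxx
print(Sxx, Syy, Sxy, b)  # 10.0 6.0 6.0 0.6
```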

Residuals and Error Analysis

Residual:

\[e_i = y_i - \hat{y}_i\]
  • \(e_i\) = residual for observation i
  • \(y_i\) = observed value
  • \(\hat{y}_i\) = predicted value

Sum of Squared Errors (SSE):

\[SSE = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2\]

Alternative formula for SSE:

\[SSE = S_{yy} - b \cdot S_{xy}\]

Total Sum of Squares (SST):

\[SST = \sum (y_i - \bar{y})^2 = S_{yy}\]

Regression Sum of Squares (SSR):

\[SSR = \sum (\hat{y}_i - \bar{y})^2\]

Relationship:

\[SST = SSR + SSE\]
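The decomposition SST = SSR + SSE can be confirmed numerically on the illustrative data set used above:

```python
import math

# Verify SST = SSR + SSE from the definitions.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

Sxx = sum((xi - xbar)**2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = Sxy / Sxx
a = ybar - b * xbar
yhat = [a + b * xi for xi in x]                      # predicted values

SSE = sum((yi - yh)**2 for yi, yh in zip(y, yhat))   # error sum of squares
SSR = sum((yh - ybar)**2 for yh in yhat)             # regression sum of squares
SST = sum((yi - ybar)**2 for yi in y)                # total sum of squares

print(round(SSE, 4), round(SSR, 4), round(SST, 4))   # 2.4 3.6 6.0
assert math.isclose(SST, SSR + SSE)
```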

Standard Error of Estimate

Standard Error of the Estimate (\(s_e\) or \(s_{y/x}\)):

\[s_e = \sqrt{\frac{SSE}{n-2}}\]
  • Measures the typical distance data points fall from the regression line
  • Denominator uses (n-2) degrees of freedom for simple linear regression
  • Units are the same as the dependent variable y

Alternative formula:

\[s_e = \sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}}\]
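Using the shortcut \(SSE = S_{yy} - b \cdot S_{xy}\), the standard error for the illustrative data set works out as:

```python
import math

# Standard error of the estimate for the illustrative 5-point fit.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n                  # 10.0
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n                  # 6.0
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y)/n    # 6.0

b = Sxy / Sxx                 # 0.6
SSE = Syy - b * Sxy           # 2.4
s_e = math.sqrt(SSE / (n - 2))  # df = n - 2 for simple regression
print(round(s_e, 4))          # 0.8944
```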

Correlation

Correlation Coefficient

Pearson Correlation Coefficient (r):

\[r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}}\]

Alternative formula:

\[r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\]
  • Range: -1 ≤ r ≤ +1
  • r = +1: perfect positive linear correlation
  • r = -1: perfect negative linear correlation
  • r = 0: no linear correlation
  • r is dimensionless (no units)
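Both forms of r give the same value, as a check on the illustrative data set:

```python
import math

# Pearson r by both formulas.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Sum-of-squares form
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y)/n
r1 = Sxy / math.sqrt(Sxx * Syy)

# Raw-sum form
num = n * sum(xi*yi for xi, yi in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(xi**2 for xi in x) - sum(x)**2) *
                (n * sum(yi**2 for yi in y) - sum(y)**2))
r2 = num / den

print(round(r1, 4), round(r2, 4))  # 0.7746 0.7746
```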

Relationship between r and slope b:

\[r = b \cdot \frac{\sqrt{S_{xx}}}{\sqrt{S_{yy}}}\]

Coefficient of Determination

Coefficient of Determination (\(r^2\) or \(R^2\)):

\[r^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}\]
  • Represents the proportion of variance in y explained by x
  • Range: 0 ≤ \(r^2\) ≤ 1
  • Expressed as percentage: multiply by 100
  • For simple linear regression: \(r^2\) = (correlation coefficient)\(^2\)

Interpretation:

  • \(r^2\) = 0.85 means 85% of the variation in y is explained by the linear relationship with x
  • Remaining 15% is unexplained variance
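For simple linear regression the identity \(r^2 = (\text{correlation coefficient})^2\) can be verified directly on the illustrative data set:

```python
import math

# r^2 as 1 - SSE/SST, compared with the square of Pearson r.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y)/n

b = Sxy / Sxx
SSE = Syy - b * Sxy           # 2.4
SST = Syy                     # 6.0
r_sq = 1 - SSE / SST          # coefficient of determination
r = Sxy / math.sqrt(Sxx * Syy)

print(round(r_sq, 4), round(r**2, 4))  # 0.6 0.6
```

Here 60% of the variation in y is explained by the linear relationship with x.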

Multiple Linear Regression

Multiple Regression Model

General Form:

\[y = a + b_1x_1 + b_2x_2 + \cdots + b_kx_k\]
  • y = dependent variable
  • \(x_1, x_2, \ldots, x_k\) = independent variables
  • a = y-intercept
  • \(b_1, b_2, \ldots, b_k\) = partial regression coefficients
  • k = number of independent variables

Adjusted Coefficient of Determination

Adjusted \(R^2\):

\[R_{adj}^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}\]

Alternative formula:

\[R_{adj}^2 = 1 - (1-R^2)\frac{n-1}{n-k-1}\]
  • n = number of observations
  • k = number of independent variables
  • Adjusts for the number of predictors in the model
  • Penalizes addition of non-significant variables
  • Used for comparing models with different numbers of predictors
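With illustrative numbers \(R^2 = 0.6\), n = 5 and a single predictor (k = 1), the adjustment works out as:

```python
# Adjusted R^2 via the (1 - R^2)(n-1)/(n-k-1) form.
n, k, R2 = 5, 1, 0.6           # illustrative values, not from the notes
R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print(round(R2_adj, 4))        # 0.4667
```

Note that the adjusted value is lower than R², reflecting the penalty for the predictor count relative to the small sample size.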

Standard Error for Multiple Regression

Standard Error of the Estimate:

\[s_e = \sqrt{\frac{SSE}{n-k-1}}\]
  • Denominator uses (n-k-1) degrees of freedom
  • k = number of independent variables

Prediction and Confidence Intervals

Prediction Using Regression

Point Estimate:

\[\hat{y} = a + bx\]
  • Substitute the value of x into the regression equation
  • Valid only within the range of the original data (interpolation)
  • Extrapolation beyond data range is unreliable

Confidence Interval for Mean Response

Confidence Interval for \(E(y|x_0)\):

\[\hat{y} \pm t_{\alpha/2, n-2} \cdot s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}\]
  • \(x_0\) = specific value of x
  • \(t_{\alpha/2, n-2}\) = t-value for desired confidence level with (n-2) degrees of freedom
  • Estimates the mean value of y for a given x

Prediction Interval for Individual Response

Prediction Interval for individual y value:

\[\hat{y} \pm t_{\alpha/2, n-2} \cdot s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}\]
  • Wider than confidence interval for mean response
  • Accounts for individual variation plus uncertainty in mean
  • Note the "1+" under the square root compared to confidence interval
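Both intervals can be computed for the illustrative five-point fit at \(x_0 = 3\); the critical value \(t_{0.025,3} = 3.182\) is taken from a standard t-table (n = 5 gives df = 3):

```python
import math

# 95% confidence interval (mean response) and prediction interval
# (individual response) at x0 = 3 for the illustrative fit.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar)**2 for xi in x)                          # 10.0
Syy = sum((yi - ybar)**2 for yi in y)                          # 6.0
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))   # 6.0

b = Sxy / Sxx
a = ybar - b * xbar
s_e = math.sqrt((Syy - b * Sxy) / (n - 2))                     # 0.8944

x0 = 3
y_hat = a + b * x0                                             # 4.0
t = 3.182                                                      # t_{0.025, 3} from table
ci_margin = t * s_e * math.sqrt(1/n + (x0 - xbar)**2 / Sxx)
pi_margin = t * s_e * math.sqrt(1 + 1/n + (x0 - xbar)**2 / Sxx)
print(round(ci_margin, 3), round(pi_margin, 3))  # 1.273 3.118
```

As expected, the prediction-interval half-width (3.118) is much larger than the confidence-interval half-width (1.273) because of the extra "1+" under the root.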

Hypothesis Testing in Regression

Testing Significance of Slope

Null Hypothesis:

\[H_0: \beta = 0\]
  • \(\beta\) = population slope (b is its sample estimate)
  • Tests whether there is a significant linear relationship

Test Statistic:

\[t = \frac{b - 0}{s_b}\]

Standard Error of Slope (\(s_b\)):

\[s_b = \frac{s_e}{\sqrt{S_{xx}}}\]
  • Compare calculated t-value to critical t-value with (n-2) degrees of freedom
  • Reject \(H_0\) if |t| > \(t_{\alpha/2, n-2}\)
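Running the slope test on the illustrative five-point fit at the 5% level (critical value \(t_{0.025,3} = 3.182\) from a t-table):

```python
import math

# t-test for the slope on the illustrative 5-point fit.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y)/n

b = Sxy / Sxx                                  # 0.6
s_e = math.sqrt((Syy - b * Sxy) / (n - 2))     # 0.8944
s_b = s_e / math.sqrt(Sxx)                     # standard error of slope
t = b / s_b
print(round(t, 3), abs(t) > 3.182)  # 2.121 False -> fail to reject H0
```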

Testing Significance of Correlation

Null Hypothesis:

\[H_0: \rho = 0\]
  • \(\rho\) = population correlation coefficient

Test Statistic:

\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\]
  • r = sample correlation coefficient
  • n = sample size
  • Degrees of freedom = n-2
  • Equivalent to testing if slope b = 0 in simple linear regression
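On the illustrative data set the correlation test statistic comes out identical to the slope test statistic, confirming the stated equivalence:

```python
import math

# t-test for the correlation coefficient.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y)/n

r = Sxy / math.sqrt(Sxx * Syy)                 # 0.7746
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(t, 3))  # 2.121 -- same as the slope t-statistic
```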

Nonlinear Regression

Transformations to Linear Form

Exponential Model:

\[y = ae^{bx}\]

Linearization: Take natural logarithm of both sides

\[\ln(y) = \ln(a) + bx\]
  • Plot ln(y) vs. x
  • Slope = b, intercept = ln(a)
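The transformation can be demonstrated with data generated from a known exponential model (a = 2, b = 0.5 are made-up, noise-free values, so the regression on ln(y) recovers them exactly):

```python
import math

# Fit y = a*e^(bx) by regressing ln(y) on x.
x = [1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]       # synthetic exponential data

ly = [math.log(yi) for yi in y]                # linearize: ln(y) = ln(a) + b*x
n = len(x)
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Sxy = sum(xi * li for xi, li in zip(x, ly)) - sum(x) * sum(ly) / n

b = Sxy / Sxx                                  # slope of ln(y) vs x -> b
ln_a = sum(ly) / n - b * sum(x) / n            # intercept -> ln(a)
a = math.exp(ln_a)
print(round(b, 4), round(a, 4))  # 0.5 2.0
```

The power and reciprocal models below are fitted the same way, just with different transformed variables on the axes.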

Power Model:

\[y = ax^b\]

Linearization: Take logarithm of both sides

\[\ln(y) = \ln(a) + b\ln(x)\]
  • Plot ln(y) vs. ln(x)
  • Slope = b, intercept = ln(a)

Logarithmic Model:

\[y = a + b\ln(x)\]
  • Already in linear form
  • Plot y vs. ln(x)

Reciprocal Model:

\[y = \frac{1}{a + bx}\]

Linearization:

\[\frac{1}{y} = a + bx\]
  • Plot 1/y vs. x

Important Notes and Conditions

Assumptions of Linear Regression

  • Linearity: Relationship between x and y is linear
  • Independence: Observations are independent
  • Homoscedasticity: Constant variance of residuals
  • Normality: Residuals are normally distributed (especially important for small samples)
  • No multicollinearity: Independent variables are not highly correlated (for multiple regression)

Correlation vs. Causation

  • Correlation does not imply causation
  • A significant correlation indicates association, not necessarily cause-and-effect
  • Confounding variables may influence both x and y

Outliers and Influential Points

  • Outlier: Point with large residual (unusual y-value)
  • Influential point: Point whose removal significantly changes regression equation
  • Leverage: Points with extreme x-values have high leverage
  • Always examine scatter plots and residual plots

Interpolation vs. Extrapolation

  • Interpolation: Predicting within the range of observed data (generally reliable)
  • Extrapolation: Predicting outside the range of observed data (unreliable and risky)
  • Relationship may not hold beyond observed data range
The document Formula Sheet: Regression & Correlation is a part of the PE Exam Course Engineering Fundamentals Revision for PE.