Linear Regression
Simple Linear Regression Model
General Form:
\[y = a + bx + \varepsilon\]
- y = dependent variable (response variable)
- x = independent variable (predictor variable)
- a = y-intercept (constant term)
- b = slope (regression coefficient)
- \(\varepsilon\) = random error term (accounts for scatter about the line)
Estimated Regression Line (Least Squares Line):
\[\hat{y} = a + bx\]
- \(\hat{y}\) = predicted value of y
- Minimizes the sum of squared residuals
Calculation of Regression Coefficients
Slope (b):
\[b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}\]
Alternative formula for slope:
\[b = \frac{S_{xy}}{S_{xx}}\]
Y-Intercept (a):
\[a = \bar{y} - b\bar{x}\]
- n = number of data points
- \(\bar{x}\) = mean of x values = \(\frac{\sum x}{n}\)
- \(\bar{y}\) = mean of y values = \(\frac{\sum y}{n}\)
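These computational formulas can be sketched directly in Python. The data below are hypothetical values chosen only for illustration:

```python
# Hypothetical sample data (illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope: b = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²)
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# Intercept: a = ȳ - b*x̄
a = sum_y / n - b * (sum_x / n)
print(b, a)  # b ≈ 0.6, a ≈ 2.2 for this data
```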
Sum of Squares
Sum of Squares for x (\(S_{xx}\)):
\[S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}\]
Sum of Squares for y (\(S_{yy}\)):
\[S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}\]
Sum of Cross Products (\(S_{xy}\)):
\[S_{xy} = \sum xy - \frac{\sum x \sum y}{n}\]
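The shortcut sums above give the same slope as the raw-sum formula. A minimal sketch, again with made-up data:

```python
x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)

S_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
S_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
S_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = S_xy / S_xx   # identical to the raw-sum slope formula
print(S_xx, S_yy, S_xy, b)  # 10.0 6.0 6.0 0.6
```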
Residuals and Error Analysis
Residual:
\[e_i = y_i - \hat{y}_i\]
- \(e_i\) = residual for observation i
- \(y_i\) = observed value
- \(\hat{y}_i\) = predicted value
Sum of Squared Errors (SSE):
\[SSE = \sum (y_i - \hat{y}_i)^2 = \sum e_i^2\]
Alternative formula for SSE:
\[SSE = S_{yy} - b \cdot S_{xy}\]
Total Sum of Squares (SST):
\[SST = \sum (y_i - \bar{y})^2 = S_{yy}\]
Regression Sum of Squares (SSR):
\[SSR = \sum (\hat{y}_i - \bar{y})^2\]
Relationship:
\[SST = SSR + SSE\]
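The decomposition SST = SSR + SSE can be verified numerically. A sketch using hypothetical data (a and b are the least-squares estimates for this particular data):

```python
x = [1, 2, 3, 4, 5]       # hypothetical data
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6           # least-squares fit for this data
n = len(x)
y_bar = sum(y) / n
y_hat = [a + b * xi for xi in x]

SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # ≈ 2.4
SSR = sum((yh - y_bar) ** 2 for yh in y_hat)            # ≈ 3.6
SST = sum((yi - y_bar) ** 2 for yi in y)                # ≈ 6.0
# Decomposition holds up to floating-point rounding
assert abs(SST - (SSR + SSE)) < 1e-9
```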
Standard Error of Estimate
Standard Error of the Estimate (\(s_e\) or \(s_{y/x}\)):
\[s_e = \sqrt{\frac{SSE}{n-2}}\]
- Measures the typical distance data points fall from the regression line
- Denominator uses (n-2) degrees of freedom for simple linear regression
- Units are the same as the dependent variable y
Alternative formula:
\[s_e = \sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}}\]
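A short sketch of the standard error computation, with the same style of hypothetical data:

```python
import math

x = [1, 2, 3, 4, 5]       # hypothetical data
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6           # least-squares fit for this data
n = len(x)

SSE = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(SSE / (n - 2))   # sqrt(2.4 / 3) ≈ 0.894, in units of y
```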
Correlation
Correlation Coefficient
Pearson Correlation Coefficient (r):
\[r = \frac{S_{xy}}{\sqrt{S_{xx} \cdot S_{yy}}}\]
Alternative formula:
\[r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\]
- Range: -1 ≤ r ≤ +1
- r = +1: perfect positive linear correlation
- r = -1: perfect negative linear correlation
- r = 0: no linear correlation
- r is dimensionless (no units)
Relationship between r and slope b:
\[r = b \cdot \frac{\sqrt{S_{xx}}}{\sqrt{S_{yy}}}\]
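Both formulas for r, and the relationship to the slope, can be checked in a few lines (data are hypothetical):

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
S_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
S_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
S_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

r = S_xy / math.sqrt(S_xx * S_yy)        # ≈ 0.775 for this data
b = S_xy / S_xx
# Check the relationship r = b * sqrt(S_xx / S_yy)
assert abs(r - b * math.sqrt(S_xx / S_yy)) < 1e-12
```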
Coefficient of Determination
Coefficient of Determination (\(r^2\) or \(R^2\)):
\[r^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}\]
- Represents the proportion of variance in y explained by x
- Range: 0 ≤ \(r^2\) ≤ 1
- Expressed as percentage: multiply by 100
- For simple linear regression: \(r^2\) = (correlation coefficient)\(^2\)
Interpretation:
- \(r^2\) = 0.85 means 85% of the variation in y is explained by the linear relationship with x
- Remaining 15% is unexplained variance
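A numeric check of \(r^2 = 1 - SSE/SST\), using hypothetical data whose least-squares fit is a = 2.2, b = 0.6:

```python
x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6       # least-squares fit for this data
n = len(x)
y_bar = sum(y) / n
y_hat = [a + b * xi for xi in x]

SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
SST = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - SSE / SST   # ≈ 0.6: 60% of the variation in y is explained
```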
Multiple Linear Regression
Multiple Regression Model
General Form:
\[y = a + b_1x_1 + b_2x_2 + \dots + b_kx_k + \varepsilon\]
- y = dependent variable
- \(x_1, x_2, \dots, x_k\) = independent variables
- a = y-intercept
- \(b_1, b_2, \dots, b_k\) = partial regression coefficients
- k = number of independent variables
- \(\varepsilon\) = random error term
Adjusted Coefficient of Determination
Adjusted \(R^2\):
\[R_{adj}^2 = 1 - \frac{SSE/(n-k-1)}{SST/(n-1)}\]
Alternative formula:
\[R_{adj}^2 = 1 - (1-R^2)\frac{n-1}{n-k-1}\]
- n = number of observations
- k = number of independent variables
- Adjusts for the number of predictors in the model
- Penalizes addition of non-significant variables
- Used for comparing models with different numbers of predictors
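A quick numeric sketch of the second formula. The values of \(R^2\), n, and k below are made up purely for illustration:

```python
# Hypothetical values, purely for illustration
R2, n, k = 0.85, 30, 3

R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print(round(R2_adj, 4))  # 0.8327: slightly below R2, as expected
```

Note that adding a useless predictor raises k without lowering SSE much, so \(R^2_{adj}\) falls even though \(R^2\) never decreases.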
Standard Error for Multiple Regression
Standard Error of the Estimate:
\[s_e = \sqrt{\frac{SSE}{n-k-1}}\]
- Denominator uses (n-k-1) degrees of freedom
- k = number of independent variables
Prediction and Confidence Intervals
Prediction Using Regression
Point Estimate:
\[\hat{y} = a + bx\]
- Substitute the value of x into the regression equation
- Valid only within the range of the original data (interpolation)
- Extrapolation beyond data range is unreliable
Confidence Interval for Mean Response
Confidence Interval for \(E(y|x_0)\):
\[\hat{y} \pm t_{\alpha/2, n-2} \cdot s_e \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}\]
- \(x_0\) = specific value of x
- \(t_{\alpha/2, n-2}\) = t-value for desired confidence level with (n-2) degrees of freedom
- Estimates the mean value of y for a given x
Prediction Interval for Individual Response
Prediction Interval for individual y value:
\[\hat{y} \pm t_{\alpha/2, n-2} \cdot s_e \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}\]
- Wider than confidence interval for mean response
- Accounts for individual variation plus uncertainty in mean
- Note the "1+" under the square root compared to confidence interval
Hypothesis Testing in Regression
Testing Significance of Slope
Null Hypothesis:
\[H_0: \beta = 0\]
- \(\beta\) = population slope (estimated by b)
- Tests whether there is a significant linear relationship between x and y
Test Statistic:
\[t = \frac{b - 0}{s_b}\]
Standard Error of Slope (\(s_b\)):
\[s_b = \frac{s_e}{\sqrt{S_{xx}}}\]
- Compare calculated t-value to critical t-value with (n-2) degrees of freedom
- Reject \(H_0\) if |t| > \(t_{\alpha/2, n-2}\)
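The test statistic for the slope, sketched with hypothetical data:

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6       # least-squares fit for this data
n = len(x)
S_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
SSE = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(SSE / (n - 2))

s_b = s_e / math.sqrt(S_xx)
t = (b - 0) / s_b   # ≈ 2.12; |t| < t_{0.025,3} ≈ 3.18, so do not reject H0 at 5%
```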
Testing Significance of Correlation
Null Hypothesis:
\[H_0: \rho = 0\]
- \(\rho\) = population correlation coefficient
Test Statistic:
\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\]
- r = sample correlation coefficient
- n = sample size
- Degrees of freedom = n-2
- Equivalent to testing if slope b = 0 in simple linear regression
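The equivalence of the two tests can be verified numerically: the t statistic computed from r equals the t statistic computed from the slope (hypothetical data):

```python
import math

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
S_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
S_yy = sum(yi ** 2 for yi in y) - sum(y) ** 2 / n
S_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

r = S_xy / math.sqrt(S_xx * S_yy)
t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # correlation test

b = S_xy / S_xx
s_e = math.sqrt((S_yy - b * S_xy) / (n - 2))
t_b = b / (s_e / math.sqrt(S_xx))                    # slope test
assert abs(t_r - t_b) < 1e-9                         # same statistic
```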
Nonlinear Regression
Transformations to Linear Form
Exponential Model:
\[y = ae^{bx}\]
Linearization: Take natural logarithm of both sides
\[\ln(y) = \ln(a) + bx\]
- Plot ln(y) vs. x
- Slope = b, intercept = ln(a)
Power Model:
\[y = ax^b\]
Linearization: Take logarithm of both sides
\[\ln(y) = \ln(a) + b\ln(x)\]
- Plot ln(y) vs. ln(x)
- Slope = b, intercept = ln(a)
Logarithmic Model:
\[y = a + b\ln(x)\]
- Already in linear form
- Plot y vs. ln(x)
Reciprocal Model:
\[y = \frac{1}{a + bx}\]
Linearization:
\[\frac{1}{y} = a + bx\]
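As a worked check of the first transformation, the sketch below generates noise-free data from a hypothetical exponential model \(y = 2e^{0.5x}\) and recovers a and b by fitting ln(y) against x:

```python
import math

# Hypothetical data generated from y = 2*e^(0.5x), noise-free for illustration
x = [1.0, 2.0, 3.0, 4.0]
y = [2 * math.exp(0.5 * xi) for xi in x]

# Linearize: ln(y) = ln(a) + b*x, then apply ordinary least squares
ln_y = [math.log(yi) for yi in y]
n = len(x)
S_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
S_xy = sum(xi * li for xi, li in zip(x, ln_y)) - sum(x) * sum(ln_y) / n

b = S_xy / S_xx                                 # slope of the linearized fit
a = math.exp(sum(ln_y) / n - b * sum(x) / n)    # intercept is ln(a), so exponentiate
print(a, b)  # recovers a ≈ 2.0, b ≈ 0.5
```

The same pattern applies to the power model (plot ln(y) vs. ln(x)) and the reciprocal model (plot 1/y vs. x).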
Important Notes and Conditions
Assumptions of Linear Regression
- Linearity: Relationship between x and y is linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance of residuals
- Normality: Residuals are normally distributed (especially important for small samples)
- No multicollinearity: Independent variables are not highly correlated (for multiple regression)
Correlation vs. Causation
- Correlation does not imply causation
- A significant correlation indicates association, not necessarily cause-and-effect
- Confounding variables may influence both x and y
Outliers and Influential Points
- Outlier: Point with large residual (unusual y-value)
- Influential point: Point whose removal significantly changes regression equation
- Leverage: Points with extreme x-values have high leverage
- Always examine scatter plots and residual plots
Interpolation vs. Extrapolation
- Interpolation: Predicting within the range of observed data (generally reliable)
- Extrapolation: Predicting outside the range of observed data (unreliable and risky)
- Relationship may not hold beyond observed data range