When we collect data about two different things and want to understand how they relate to each other, we often discover patterns. For instance, if we measure the height and shoe size of many people, we might notice that taller people tend to have larger shoe sizes. Regression is a statistical method that helps us find a mathematical relationship between two variables so we can make predictions. It gives us an equation that describes how one variable changes when the other variable changes. In this chapter, we will explore how to find the line or curve that best fits a set of data points, how to interpret that fit, and how to use it to make informed predictions.
Before we can perform regression, we need to understand what bivariate data means. Bivariate data is simply data that involves two variables measured on the same group of subjects or objects. Each observation consists of a pair of values, one for each variable.
For example, imagine we collect data on the number of hours students study per week and their test scores. Each student gives us two pieces of information: hours studied and test score. We call one variable the independent variable (also called the explanatory variable or predictor) and the other the dependent variable (also called the response variable). The independent variable is the one we think might influence or predict the other. The dependent variable is what we are trying to predict or explain.
We typically plot bivariate data on a scatter plot, with the independent variable on the horizontal axis (x-axis) and the dependent variable on the vertical axis (y-axis). Each point on the scatter plot represents one observation from our data set.
A scatter plot is our first tool for understanding the relationship between two variables. When we look at a scatter plot, we look for several characteristics:
The direction tells us whether the variables move together or in opposite directions: in a positive association, larger values of one variable tend to go with larger values of the other; in a negative association, larger values of one tend to go with smaller values of the other.
The form describes the shape of the pattern in the scatter plot: it may be linear, curved, or show no clear pattern at all.
The strength tells us how closely the points follow the pattern: tightly clustered points indicate a strong relationship, while widely scattered points indicate a weak one.
An outlier is a point that doesn't fit the general pattern of the other points. It stands far away from where we would expect it to be based on the relationship shown by the rest of the data. Outliers can strongly influence regression results and should be investigated carefully.
When the relationship between two variables appears to be linear on a scatter plot, we use linear regression to find the best-fitting straight line through the data points. This line is called the regression line, the line of best fit, or the least-squares regression line.
The regression line has the same form as any linear equation. We write it as:
\[ \hat{y} = a + bx \]
or sometimes as:
\[ \hat{y} = b_0 + b_1x \]
In these equations, \( \hat{y} \) is the predicted value of the dependent variable, \( x \) is the value of the independent variable, \( a \) (or \( b_0 \)) is the y-intercept, and \( b \) (or \( b_1 \)) is the slope of the line.
The slope is particularly important for interpretation. If the slope is positive, we have a positive association. If the slope is negative, we have a negative association. The magnitude (size) of the slope tells us how steep the relationship is.
The "best-fitting" line is determined using a method called least squares. This method finds the line that minimizes the sum of the squared vertical distances between the actual data points and the predicted points on the line. These vertical distances are called residuals.
For each data point, the residual is calculated as:
\[ \text{residual} = y - \hat{y} \]
This is the actual value minus the predicted value. A positive residual means the actual point is above the regression line, and a negative residual means it's below the line. The least-squares method finds the line where the sum of all the squared residuals is as small as possible.
The formulas for calculating the slope and y-intercept are:
\[ b = r \cdot \frac{s_y}{s_x} \] \[ a = \bar{y} - b\bar{x} \]
where \( r \) is the correlation coefficient, \( s_x \) and \( s_y \) are the standard deviations of the x- and y-values, and \( \bar{x} \) and \( \bar{y} \) are their means.
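These formulas can be sketched in a few lines of Python using only the standard library (an equivalent form of the slope, \( b = \sum(x - \bar{x})(y - \bar{y}) / \sum(x - \bar{x})^2 \), avoids computing \( r \) separately); the study-hours data from the example below serves as a check:

```python
import statistics as stats

def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x.

    b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), which is
    algebraically the same as b = r * s_y / s_x; a = y_bar - b * x_bar.
    """
    x_bar, y_bar = stats.mean(xs), stats.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Study-hours data from the worked example below
a, b = least_squares_line([2, 4, 5, 7, 9], [65, 75, 80, 90, 95])
print(round(a, 2), round(b, 2))  # 57.33 4.38
```

Note that the intercept is computed from the means, which guarantees the fitted line passes through the point \( (\bar{x}, \bar{y}) \).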
Example: A teacher collected data on the number of hours five students studied for an exam and their scores.
The data is: (2, 65), (4, 75), (5, 80), (7, 90), (9, 95), where x = hours studied and y = test score.
Find the equation of the least-squares regression line.
Solution:
First, we calculate the means:
\( \bar{x} = \frac{2 + 4 + 5 + 7 + 9}{5} = \frac{27}{5} = 5.4 \)
\( \bar{y} = \frac{65 + 75 + 80 + 90 + 95}{5} = \frac{405}{5} = 81 \)
Next, we calculate the standard deviations. For x:
\( s_x = \sqrt{\frac{(2-5.4)^2 + (4-5.4)^2 + (5-5.4)^2 + (7-5.4)^2 + (9-5.4)^2}{5-1}} \)
\( s_x = \sqrt{\frac{11.56 + 1.96 + 0.16 + 2.56 + 12.96}{4}} = \sqrt{\frac{29.2}{4}} = \sqrt{7.3} \approx 2.702 \)
For y:
\( s_y = \sqrt{\frac{(65-81)^2 + (75-81)^2 + (80-81)^2 + (90-81)^2 + (95-81)^2}{4}} \)
\( s_y = \sqrt{\frac{256 + 36 + 1 + 81 + 196}{4}} = \sqrt{\frac{570}{4}} = \sqrt{142.5} \approx 11.937 \)
We also need the correlation coefficient. Using the computational formula (or technology), we find \( r \approx 0.992 \).
Now we calculate the slope:
\( b = 0.992 \times \frac{11.937}{2.702} \approx 0.992 \times 4.418 \approx 4.383 \)
And the y-intercept:
\( a = 81 - 4.383 \times 5.4 \approx 81 - 23.668 \approx 57.332 \)
The regression equation is \( \hat{y} = 57.33 + 4.38x \) (rounded to reasonable precision).
This means that for each additional hour of study, the test score is predicted to increase by approximately 4.38 points.
The correlation coefficient, denoted by \( r \), measures the strength and direction of the linear relationship between two variables. It is a number between -1 and 1.
The formula for the correlation coefficient is:
\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) \]
This formula shows that \( r \) is based on standardized values (z-scores) of x and y. In practice, we typically use technology (calculators or statistical software) to compute \( r \).
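As a sketch, the z-score formula above translates directly into Python (standard library only; the study-hours data from the earlier example is reused as a check):

```python
import statistics as stats

def correlation(xs, ys):
    """Pearson r: the sum of products of z-scores, divided by n - 1."""
    x_bar, y_bar = stats.mean(xs), stats.mean(ys)
    s_x, s_y = stats.stdev(xs), stats.stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Study-hours data from the earlier example
print(round(correlation([2, 4, 5, 7, 9], [65, 75, 80, 90, 95]), 3))  # 0.992
```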
Example: The correlation between the number of hours spent watching TV per week and GPA for a group of students is \( r = -0.72 \).
Interpret this correlation coefficient.
Solution:
The negative sign tells us there is a negative association: as TV watching time increases, GPA tends to decrease.
The magnitude of 0.72 indicates this is a moderately strong linear relationship.
We can say there is a moderately strong negative linear relationship between hours of TV watched per week and GPA, meaning students who watch more TV tend to have lower GPAs.
The coefficient of determination, denoted \( r^2 \), is simply the square of the correlation coefficient. It has a very useful interpretation: \( r^2 \) tells us the proportion (or percentage) of the variation in the dependent variable that is explained by the linear relationship with the independent variable.
For example, if \( r = 0.8 \), then \( r^2 = 0.64 \), which means 64% of the variation in y can be explained by its linear relationship with x. The remaining 36% of the variation is due to other factors not included in the model.
Example: A regression analysis of the relationship between advertising spending (in thousands of dollars) and product sales (in thousands of units) yields \( r = 0.90 \).
Calculate and interpret \( r^2 \).
Solution:
\( r^2 = (0.90)^2 = 0.81 \)
This means that 81% of the variation in product sales can be explained by the linear relationship with advertising spending.
The other 19% of variation in sales is due to other factors not captured by advertising spending alone.
Once we have found the regression equation, we can use it to make predictions. We simply substitute the value of the independent variable into the equation and calculate the predicted value of the dependent variable.
Interpolation means making a prediction for an x-value that falls within the range of x-values in our original data set. This is generally safe and reliable, assuming the linear relationship holds throughout that range.
Extrapolation means making a prediction for an x-value that falls outside the range of our original data. This is risky because we don't know if the linear relationship continues beyond the range of our data. The relationship might become nonlinear, or other factors might come into play.
Example: Using the regression equation from our earlier example, \( \hat{y} = 57.33 + 4.38x \), where x is hours studied and y is test score.
The original data had x-values ranging from 2 to 9 hours.
Predict the test score for a student who studies 6 hours. Is this interpolation or extrapolation?
Solution:
We substitute x = 6 into the equation:
\( \hat{y} = 57.33 + 4.38(6) \)
\( \hat{y} = 57.33 + 26.28 \)
\( \hat{y} = 83.61 \)
The predicted test score is approximately 83.6 points.
Since 6 falls within the range of 2 to 9 hours in our original data, this is interpolation and is reasonably reliable.
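This check can be automated with a small hypothetical helper that flags whether a requested x-value calls for interpolation or extrapolation (the coefficients are the rounded values from the study-hours example):

```python
def predict(x, a, b, x_min, x_max):
    """Return y-hat = a + b*x and whether x lies inside the observed range."""
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind

# Fitted line from the study-hours example; observed x ran from 2 to 9
y_hat, kind = predict(6, a=57.33, b=4.38, x_min=2, x_max=9)
print(round(y_hat, 1), kind)  # 83.6 interpolation
```

Calling `predict(15, ...)` with the same range would return the label "extrapolation", a reminder that the prediction is not trustworthy.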
Remember that a residual is the difference between an actual y-value and the predicted y-value from the regression line:
\[ \text{residual} = y - \hat{y} \]
Examining residuals helps us assess whether a linear model is appropriate for our data. We create a residual plot by plotting the residuals on the vertical axis and either the x-values or the predicted values \( \hat{y} \) on the horizontal axis.
A good linear model should produce a residual plot with the following characteristics: the residuals are scattered randomly above and below zero, they show no curved pattern or trend, and their spread is roughly constant across all x-values.
If the residual plot shows a pattern (such as a curve, a funnel shape, or clusters), this suggests that a straight line is not an appropriate model for the data, or that the variability of the response is not constant across x-values.
Residual plots are also useful for identifying outliers. Points with large residuals (far from zero) don't fit the model well and deserve special attention.
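As a sketch, the residuals for the study-hours data can be computed directly from the fitted line; a useful built-in check is that least-squares residuals always sum to zero (up to floating-point rounding):

```python
import statistics as stats

def residuals(xs, ys):
    """Fit the least-squares line, then return each residual y - y_hat."""
    x_bar, y_bar = stats.mean(xs), stats.mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]

res = residuals([2, 4, 5, 7, 9], [65, 75, 80, 90, 95])
print([round(r, 2) for r in res])  # [-1.1, 0.14, 0.75, 1.99, -1.78]
# Least-squares residuals sum to zero
print(abs(round(sum(res), 6)))  # 0.0
```

Plotting these residuals against x (all small, no curve) would support the linear model for this data.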
Not all data points have equal influence on the regression line. Some points have more impact on the slope and position of the line than others.
Outlier in y: A point that has an unusual y-value given its x-value. It will have a large residual. These points are easy to spot in a residual plot.
Outlier in x: A point that has an x-value far from the mean of all x-values. These points can have high leverage, meaning they have the potential to strongly influence the regression line.
Influential point: A point whose removal would substantially change the regression equation. A point with high leverage that doesn't follow the pattern of the other points is likely to be influential.
When you identify an unusual point, you should first check whether it is a data entry or measurement error, investigate whether some special circumstance explains it, and compare the regression results computed with and without the point to see how much influence it has.
Never remove a data point simply because it doesn't fit your expectations. Only remove points that are clearly errors or have a documented special cause that makes them not representative of the population you're studying.
For our inference about regression (such as creating confidence intervals or performing hypothesis tests) to be valid, certain conditions must be met. We can remember these as the LINE conditions: Linearity (the true relationship between x and y is linear), Independence (the observations are independent of one another), Normality (for each x, the responses vary normally around the line), and Equal variance (the spread of the responses is the same at every x).
If these conditions are not met, the predictions from the regression line may still be useful, but we cannot reliably perform statistical inference (like creating confidence intervals or performing significance tests).
When the relationship between two variables is nonlinear, we cannot appropriately use linear regression on the original data. However, we can sometimes transform one or both variables to create a linear relationship, then perform linear regression on the transformed data.
Logarithmic transformation: If the scatter plot shows an exponential growth or decay pattern, try taking the logarithm of the y-values. If the relationship between x and log(y) is linear, then the original relationship is exponential.
Power transformation: If the data shows a power relationship, taking logs of both variables might linearize it. If the relationship between log(x) and log(y) is linear, then the original relationship is a power function.
After transforming and finding the regression equation using the transformed variables, we can transform predictions back to the original scale.
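The transform, fit, and back-transform recipe can be sketched on hypothetical exponential data, \( y = 5e^{0.3x} \); after regressing ln(y) on x, the slope recovers the growth rate and the exponentiated intercept recovers the starting value:

```python
import math

# Hypothetical exponential data: y = 5 * e^(0.3 x)
xs = [0, 1, 2, 3, 4]
ys = [5 * math.exp(0.3 * x) for x in xs]

# Transform: regress ln(y) on x. If the original relationship is
# exponential, the transformed points are exactly linear.
log_ys = [math.log(y) for y in ys]
x_bar = sum(xs) / len(xs)
l_bar = sum(log_ys) / len(log_ys)
b = sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, log_ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
a = l_bar - b * x_bar

# Back-transform: slope = growth rate, exp(intercept) = starting value
print(round(b, 3), round(math.exp(a), 3))  # 0.3 5.0
```

A prediction on the original scale is then `math.exp(a + b * x)` rather than `a + b * x`.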
Example: Population data for a city shows exponential growth.
After taking the natural logarithm of population values, the regression line using years since 2000 as x is:
\( \ln(\text{population}) = 10.5 + 0.08x \)
Predict the population in 2025.
Solution:
For the year 2025, x = 2025 - 2000 = 25 years since 2000.
Substitute into the equation:
\( \ln(\text{population}) = 10.5 + 0.08(25) \)
\( \ln(\text{population}) = 10.5 + 2.0 \)
\( \ln(\text{population}) = 12.5 \)
To find the actual population, we take the exponential of both sides:
\( \text{population} = e^{12.5} \approx 268{,}337 \)
The predicted population in 2025 is approximately 268,337 people.
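The back-transformation step can be verified numerically:

```python
import math

# The example's fitted line on the log scale: ln(pop) = 10.5 + 0.08x,
# evaluated at x = 25 years since 2000, then exponentiated
ln_pop = 10.5 + 0.08 * 25
print(round(math.exp(ln_pop)))  # 268337
```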
While it's important to understand the concepts and formulas behind regression, in practice we almost always use technology to perform the calculations. Graphing calculators, spreadsheet software like Excel, and statistical programs like R or MINITAB can quickly compute regression equations, correlation coefficients, residuals, and create plots.
When using technology, always look at a scatter plot before trusting the regression output, keep full precision in intermediate results and round only at the end, and make sure you know which variable the software treats as x and which as y.
Technology allows us to focus on the important work of interpretation and decision-making rather than spending time on tedious calculations. However, understanding what the technology is doing "under the hood" helps us use it wisely and interpret results correctly.