When we collect data on two variables, we often want to know if there is a relationship between them. For example, does studying more hours lead to higher test scores? Does height relate to shoe size? When we create a scatterplot of paired data, we can often see a pattern or trend. A least-squares regression equation, also called a line of best fit, is a mathematical tool that gives us the single straight line that best represents the relationship between two quantitative variables. This line allows us to make predictions and understand how one variable tends to change as the other changes. In plain English, the least-squares regression line is the line that comes closest to all the data points on average, minimizing the total of the squared vertical distances between the line and the actual data points.
Before we can find a least-squares regression equation, we need to understand what kind of data works well with this approach. A linear relationship means that as one variable increases, the other variable tends to increase or decrease at a relatively constant rate. When graphed on a scatterplot, the data points roughly follow a straight-line pattern rather than a curve.
We use specific vocabulary to describe the two variables:

- The explanatory variable (usually \( x \)) is the variable we use to explain or predict changes in the other variable.
- The response variable (usually \( y \)) is the variable whose changes we are trying to explain or predict.
Think of it this way: if you're studying how hours of study time (explanatory variable) affects test scores (response variable), you're saying that test scores respond to or depend on study time, not the other way around.
Every straight line can be described using an equation. The most common form for writing the equation of a line is called slope-intercept form:
\[ y = mx + b \]

In this equation:

- \( m \) is the slope of the line, and
- \( b \) is the y-intercept, the value of \( y \) where the line crosses the y-axis.
In statistics, we often write the regression equation using slightly different notation to emphasize that we're making a prediction:
\[ \hat{y} = a + bx \]

Here, \( \hat{y} \) (read as "y-hat") represents our predicted value for the response variable, \( a \) is the y-intercept, and \( b \) is the slope. Both forms represent the same concept; we'll use the statistical notation \( \hat{y} = a + bx \) throughout this discussion.
The slope \( b \) tells us how much the response variable changes, on average, when the explanatory variable increases by one unit. More specifically:

- If \( b \) is positive, the response variable tends to increase as the explanatory variable increases.
- If \( b \) is negative, the response variable tends to decrease as the explanatory variable increases.
The y-intercept \( a \) tells us the predicted value of \( y \) when \( x = 0 \). In some contexts, this has a meaningful interpretation. In other contexts, \( x = 0 \) might be impossible or outside the range of our data, making the y-intercept less meaningful on its own. However, the y-intercept is still necessary for making accurate predictions across the range of the data.
When we have a collection of data points, there are infinitely many lines we could draw through or near them. The least-squares regression line is special because it is the one line that minimizes the sum of the squared vertical distances between each data point and the line itself.
For each data point \( (x_i, y_i) \), we can calculate a residual, which is the difference between the actual y-value and the predicted y-value:
\[ \text{residual} = y_i - \hat{y}_i \]

The residual tells us how far off our prediction is for that particular point. A positive residual means the actual value is above the line; a negative residual means it's below the line.
The least-squares method finds the line that makes the sum of all the squared residuals as small as possible:
\[ \text{Minimize: } \sum (y_i - \hat{y}_i)^2 \]

Why square the residuals? Squaring ensures that positive and negative errors don't cancel each other out, and it gives more weight to larger errors, so the fitted line is strongly discouraged from straying far from any single point. (A side effect is that the line can be sensitive to outliers, a limitation discussed later.)
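The defining property can be checked numerically. The sketch below uses hypothetical data and NumPy's `polyfit` for the degree-1 least-squares fit; it computes the sum of squared residuals for the fitted line and confirms that nearby lines all do worse:

```python
import numpy as np

# Hypothetical paired data: study hours (x) and test scores (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([52.0, 60.0, 57.0, 68.0, 70.0, 75.0])

def ssr(a, b):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

# Least-squares slope and intercept (np.polyfit returns the
# highest-degree coefficient first, so slope comes before intercept).
b_ls, a_ls = np.polyfit(x, y, 1)
best = ssr(a_ls, b_ls)

# Perturbing the intercept or the slope always increases the sum of squares.
for da, db in [(1, 0), (-1, 0), (0, 0.5), (0, -0.5)]:
    assert ssr(a_ls + da, b_ls + db) > best
```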
To find the least-squares regression line \( \hat{y} = a + bx \), we need to calculate the slope \( b \) and the y-intercept \( a \) using our data. The formulas involve several statistical measures you may already know: the mean (average) and the standard deviation of each variable, as well as the correlation coefficient \( r \).
The slope of the least-squares regression line is calculated as:
\[ b = r \cdot \frac{s_y}{s_x} \]

Where:

- \( r \) is the correlation coefficient between \( x \) and \( y \),
- \( s_y \) is the standard deviation of the response variable, and
- \( s_x \) is the standard deviation of the explanatory variable.
The correlation coefficient \( r \) always falls between -1 and +1. A value near +1 indicates a strong positive linear relationship, a value near -1 indicates a strong negative linear relationship, and a value near 0 indicates little or no linear relationship.
Once we have calculated the slope, we can find the y-intercept using:
\[ a = \bar{y} - b\bar{x} \]

Where:

- \( \bar{y} \) is the mean of the response variable,
- \( \bar{x} \) is the mean of the explanatory variable, and
- \( b \) is the slope calculated above.
This formula guarantees that the least-squares regression line always passes through the point \( (\bar{x}, \bar{y}) \), which is called the point of averages.
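Both formulas can be verified against software. A minimal sketch (hypothetical data; NumPy assumed) computes \( b \) and \( a \) from the summary statistics and checks two properties: the result matches `np.polyfit`, and the line passes through the point of averages:

```python
import numpy as np

# Hypothetical paired data (x = hours, y = score).
x = np.array([2.0, 4.0, 5.0, 6.0, 8.0])
y = np.array([60.0, 72.0, 75.0, 80.0, 93.0])

x_bar, y_bar = x.mean(), y.mean()
s_x = x.std(ddof=1)          # sample standard deviations
s_y = y.std(ddof=1)
r = np.corrcoef(x, y)[0, 1]  # correlation coefficient

b = r * s_y / s_x            # slope:      b = r * s_y / s_x
a = y_bar - b * x_bar        # intercept:  a = y-bar - b * x-bar

# The line passes through the point of averages (x-bar, y-bar).
assert abs((a + b * x_bar) - y_bar) < 1e-8

# The formulas agree with NumPy's own least-squares fit.
b_np, a_np = np.polyfit(x, y, 1)
assert abs(b - b_np) < 1e-8 and abs(a - a_np) < 1e-8
```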
To find the least-squares regression equation from a set of data, follow these steps:

1. Calculate the mean and standard deviation of each variable (\( \bar{x} \), \( s_x \), \( \bar{y} \), \( s_y \)) and the correlation coefficient \( r \).
2. Calculate the slope: \( b = r \cdot \frac{s_y}{s_x} \).
3. Calculate the y-intercept: \( a = \bar{y} - b\bar{x} \).
4. Write the equation: \( \hat{y} = a + bx \).
Example: A teacher collects data on the number of hours students studied for a test and their test scores.
The data summary statistics are:
Mean study hours: \( \bar{x} = 5 \) hours, Standard deviation of study hours: \( s_x = 2 \) hours
Mean test score: \( \bar{y} = 78 \) points, Standard deviation of test scores: \( s_y = 10 \) points
Correlation coefficient: \( r = 0.8 \)

Find the least-squares regression equation to predict test score from study hours.
Solution:
First, calculate the slope using \( b = r \cdot \frac{s_y}{s_x} \):
\( b = 0.8 \cdot \frac{10}{2} = 0.8 \cdot 5 = 4 \)
Next, calculate the y-intercept using \( a = \bar{y} - b\bar{x} \):
\( a = 78 - 4(5) = 78 - 20 = 58 \)
Write the regression equation:
\( \hat{y} = 58 + 4x \)
The least-squares regression equation is \( \hat{y} = 58 + 4x \), where \( x \) is study hours and \( \hat{y} \) is the predicted test score.
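As a quick check, the same arithmetic in Python, using only the summary statistics given in the example:

```python
# Summary statistics from the worked example above.
x_bar, s_x = 5, 2      # mean and SD of study hours
y_bar, s_y = 78, 10    # mean and SD of test scores
r = 0.8                # correlation coefficient

b = r * s_y / s_x      # slope
a = y_bar - b * x_bar  # y-intercept

print(b, a)  # 4.0 58.0
```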
Once you have the regression equation, you need to be able to explain what it means in the context of the problem.
The slope tells you how much the response variable is predicted to change for each one-unit increase in the explanatory variable. Always state the slope interpretation in context using this template:
"For each additional [one unit of x], the predicted [y] increases/decreases by [slope value] [units of y]."
Example: Using the regression equation from the previous example: \( \hat{y} = 58 + 4x \), where \( x \) is study hours and \( \hat{y} \) is predicted test score.
Interpret the slope in context.
Solution:
The slope is 4.
Context interpretation: For each additional hour of study time, the predicted test score increases by 4 points.
The y-intercept tells you the predicted value of the response variable when the explanatory variable equals zero. Use this template:
"When [x] is 0 [units], the predicted [y] is [y-intercept value] [units of y]."
Always consider whether \( x = 0 \) makes sense in the context. If it doesn't, note that the y-intercept may not have a meaningful interpretation but is still necessary for the equation.
Example: Using the equation \( \hat{y} = 58 + 4x \).
Interpret the y-intercept in context.
Solution:
The y-intercept is 58.
When a student studies 0 hours, the predicted test score is 58 points.
This interpretation makes sense in this context, representing a baseline score without any study time.
One of the most practical uses of the least-squares regression equation is making predictions. Once you have the equation \( \hat{y} = a + bx \), you can substitute any value of \( x \) to predict the corresponding value of \( y \).
Example: A study finds that the regression equation relating outdoor temperature (\( x \), in degrees Fahrenheit) to ice cream sales (\( y \), in dollars) is:
\( \hat{y} = -200 + 8x \)

Predict the ice cream sales when the temperature is 85°F.
Solution:
Substitute \( x = 85 \) into the equation:
\( \hat{y} = -200 + 8(85) \)
\( \hat{y} = -200 + 680 \)
\( \hat{y} = 480 \)
When the temperature is 85°F, the predicted ice cream sales are $480.
Extrapolation means using the regression equation to make predictions for x-values that are outside the range of the data used to create the equation. This can be risky because we don't know if the linear relationship continues beyond the observed data range. The relationship might change, curve, or break down entirely.
For example, if our ice cream sales data only included temperatures between 60°F and 95°F, predicting sales at 110°F or 30°F would be extrapolation and might not be reliable.
Always check whether your prediction falls within the range of your original data. Predictions within this range are called interpolation and are generally more reliable.
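The prediction step, together with an extrapolation check, can be sketched as a small function. The 60 to 95°F range is the data range mentioned above; the function name and structure are purely for illustration:

```python
def predict_sales(temp_f, x_min=60, x_max=95):
    """Predict ice cream sales (dollars) from temperature using
    y-hat = -200 + 8x.

    x_min and x_max describe the temperature range of the original
    data; predictions outside that range are extrapolation.
    """
    if not (x_min <= temp_f <= x_max):
        print(f"Warning: {temp_f} is outside [{x_min}, {x_max}]; "
              "this prediction is extrapolation.")
    return -200 + 8 * temp_f

print(predict_sales(85))   # within the data range: 480
print(predict_sales(110))  # prints a warning first, then the prediction
```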
The correlation coefficient \( r \) plays a central role in the least-squares regression equation, and understanding it helps us interpret how well the regression line fits the data.
Closely related to \( r \) is \( r^2 \), called the coefficient of determination. This value tells us what fraction (or percentage) of the variation in the response variable is explained by the linear relationship with the explanatory variable.
For example, if \( r = 0.8 \), then \( r^2 = 0.64 \), which means 64% of the variation in the response variable can be explained by its linear relationship with the explanatory variable. The remaining 36% of variation is due to other factors not captured by this model.
Example: A study examining the relationship between hours of weekly exercise and resting heart rate finds a correlation of \( r = -0.7 \).
Calculate and interpret \( r^2 \).
Solution:
\( r^2 = (-0.7)^2 = 0.49 \)
This means that 49% of the variation in resting heart rate can be explained by the linear relationship with hours of weekly exercise.
The remaining 51% of variation is due to other factors not included in this model.
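The interpretation of \( r^2 \) as explained variation can be verified numerically: for a least-squares line, \( r^2 = 1 - \text{SSR}/\text{SST} \). A sketch with hypothetical exercise-versus-heart-rate data (NumPy assumed):

```python
import numpy as np

# Hypothetical data: weekly exercise hours (x) vs resting heart rate (y),
# with a negative association echoing the example above.
x = np.array([0.0, 2.0, 3.0, 5.0, 6.0, 8.0])
y = np.array([78.0, 74.0, 75.0, 68.0, 66.0, 62.0])

r = np.corrcoef(x, y)[0, 1]

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
ss_res = np.sum((y - y_hat) ** 2)     # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y

# r^2 equals the fraction of variation in y explained by the line.
assert abs(r**2 - (1 - ss_res / ss_tot)) < 1e-9
```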
After creating a regression equation, we should always check whether a linear model is appropriate for our data. The primary tool for this is examining the residuals.
As mentioned earlier, a residual is the difference between an actual observed value and the predicted value:
\[ \text{residual} = y - \hat{y} \]

If we calculate the residual for every data point and create a residual plot (plotting residuals on the y-axis against the x-values or predicted values on the x-axis), we can check whether our linear model is appropriate.
A good linear model should produce a residual plot with the following characteristics:

- No obvious pattern: the residuals appear randomly scattered.
- Roughly equal numbers of points above and below the zero line.
- Roughly constant spread across all x-values.
If the residual plot shows a curved pattern, a fan shape (changing spread), or other systematic patterns, this suggests that a linear model may not be appropriate for the data. You might need to consider a different type of model or transformation of the variables.
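A quick way to see a systematic residual pattern is to fit a line to deliberately curved data. In this hypothetical sketch, the residuals sum to zero (as they always do for a least-squares line) but form a clear U-shape, signalling that a straight line is the wrong model:

```python
import numpy as np

# Hypothetical data where the true relationship is curved.
x = np.arange(1.0, 9.0)
y = x ** 2                  # clearly nonlinear

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

# Residuals always sum to zero for a least-squares line...
assert abs(residuals.sum()) < 1e-9

# ...but here they are positive at both ends and negative in the
# middle: the U-shape of a curved relationship fit with a line.
print(np.round(residuals, 1))
```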
While least-squares regression is a powerful tool, it's important to understand its limitations:
Even if two variables have a strong correlation and a good regression equation, this does not mean that one variable causes the other to change. There might be a third variable (a lurking variable or confounding variable) that influences both, or the association might be coincidental.
For example, ice cream sales and drowning incidents are positively correlated, but eating ice cream doesn't cause drowning. Both increase during hot summer weather, which is a lurking variable.
The least-squares regression line only captures linear relationships. If the true relationship between variables is curved or more complex, a straight line will not provide accurate predictions. Always create a scatterplot first to check whether a linear model makes sense.
Regression lines can be influenced by outliers, especially those that are far from the rest of the data in the x-direction. A single influential point can dramatically change the slope and y-intercept. Always examine your data for outliers and consider their impact on your model.
An interesting property of regression is that extreme x-values tend to be associated with less extreme predicted y-values. This phenomenon, called regression to the mean, occurs because the slope \( b = r \cdot \frac{s_y}{s_x} \) includes the correlation coefficient \( r \), which is always between -1 and 1. Unless \( r = \pm 1 \) (a perfect linear relationship), predictions will be pulled toward the mean.
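Using the study-hours example from earlier (\( r = 0.8 \), \( \bar{x} = 5 \), \( s_x = 2 \), \( \bar{y} = 78 \), \( s_y = 10 \)), a small calculation shows the pull toward the mean: a student 2 standard deviations above average in study hours is predicted to score only 1.6 standard deviations above average.

```python
# Summary statistics from the earlier worked example.
r, s_x, s_y = 0.8, 2, 10
x_bar, y_bar = 5, 78

b = r * s_y / s_x
a = y_bar - b * x_bar

x_extreme = x_bar + 2 * s_x       # 2 SDs above the mean in x
y_hat = a + b * x_extreme
z_y = (y_hat - y_bar) / s_y       # prediction in SD units of y

print(z_y)  # 1.6 = r * 2: less extreme than the input
```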
In practice, most regression equations are calculated using technology: graphing calculators, spreadsheet software, or statistical programs. These tools can quickly compute all the necessary statistics and provide the regression equation, correlation coefficient, residual plots, and more.
When using technology, you should:

- Enter the data carefully and identify which variable is explanatory and which is the response.
- Examine a scatterplot before fitting the line to confirm that a linear model is reasonable.
- Check the residual plot and the value of \( r^2 \) in the output.
- Report the equation in context, with both variables clearly defined.
Understanding the underlying concepts (what the equation means, how to interpret it, and when it's appropriate to use) is just as important as being able to calculate it.
The least-squares regression equation is a mathematical summary of the linear relationship between two quantitative variables. It provides the single best-fitting line through a set of data points by minimizing the sum of squared residuals. The equation \( \hat{y} = a + bx \) allows us to make predictions and understand how variables relate to each other.
To use regression effectively, you need to:

- Verify that the relationship is approximately linear, using a scatterplot and a residual plot.
- Calculate the slope and y-intercept from the data or its summary statistics.
- Interpret the slope, y-intercept, and \( r^2 \) in the context of the problem.
- Make predictions within the range of the data, and be cautious about extrapolation.
- Remember that correlation does not imply causation.
Mastering least-squares regression opens the door to understanding relationships in data across many fields: science, economics, health, sports, and more. It is one of the most widely used statistical techniques and forms the foundation for more advanced modeling methods you may encounter in future studies.