
Chapter Notes: Regression

When we collect data about two different things and want to understand how they relate to each other, we often discover patterns. For instance, if we measure the height and shoe size of many people, we might notice that taller people tend to have larger shoe sizes. Regression is a statistical method that helps us find a mathematical relationship between two variables so we can make predictions. It gives us an equation that describes how one variable changes when the other variable changes. In this chapter, we will explore how to find the line or curve that best fits a set of data points, how to interpret that fit, and how to use it to make informed predictions.

Understanding Bivariate Data

Before we can perform regression, we need to understand what bivariate data means. Bivariate data is simply data that involves two variables measured on the same group of subjects or objects. Each observation consists of a pair of values, one for each variable.

For example, imagine we collect data on the number of hours students study per week and their test scores. Each student gives us two pieces of information: hours studied and test score. We call one variable the independent variable (also called the explanatory variable or predictor) and the other the dependent variable (also called the response variable). The independent variable is the one we think might influence or predict the other. The dependent variable is what we are trying to predict or explain.

  • Independent variable (x): The variable we use to make predictions. In our example, this would be hours studied.
  • Dependent variable (y): The variable we want to predict or explain. In our example, this would be the test score.

We typically plot bivariate data on a scatter plot, with the independent variable on the horizontal axis (x-axis) and the dependent variable on the vertical axis (y-axis). Each point on the scatter plot represents one observation from our data set.

Scatter Plots and Patterns

A scatter plot is our first tool for understanding the relationship between two variables. When we look at a scatter plot, we look for several characteristics:

Direction of Association

The direction tells us whether the variables move together in the same direction or in opposite directions.

  • Positive association: As one variable increases, the other variable tends to increase as well. The points slope upward from left to right. Example: As the temperature increases, ice cream sales tend to increase.
  • Negative association: As one variable increases, the other variable tends to decrease. The points slope downward from left to right. Example: As the amount of time spent exercising increases, body weight tends to decrease.
  • No association: There is no clear pattern. The points are scattered randomly with no discernible slope.

Form of Association

The form describes the shape of the pattern in the scatter plot.

  • Linear: The points roughly follow a straight line.
  • Nonlinear (curved): The points follow a curved pattern, such as exponential, quadratic, or logarithmic.

Strength of Association

The strength tells us how closely the points follow the pattern.

  • Strong: The points are tightly clustered around a line or curve.
  • Moderate: The points follow a general pattern but with noticeable scatter.
  • Weak: The points are very spread out with only a vague pattern visible.

Outliers

An outlier is a point that doesn't fit the general pattern of the other points. It stands far away from where we would expect it to be based on the relationship shown by the rest of the data. Outliers can strongly influence regression results and should be investigated carefully.

Linear Regression

When the relationship between two variables appears to be linear on a scatter plot, we use linear regression to find the best-fitting straight line through the data points. This line is called the regression line, the line of best fit, or the least-squares regression line.

The Equation of the Regression Line

The regression line has the same form as any linear equation. We write it as:

\[ \hat{y} = a + bx \]

or sometimes as:

\[ \hat{y} = b_0 + b_1x \]

In these equations:

  • \( \hat{y} \) (read as "y-hat") is the predicted value of the dependent variable
  • \( x \) is the value of the independent variable
  • \( a \) or \( b_0 \) is the y-intercept, which is the predicted value of y when x equals zero
  • \( b \) or \( b_1 \) is the slope, which tells us how much y changes when x increases by one unit

The slope is particularly important for interpretation. If the slope is positive, we have a positive association. If the slope is negative, we have a negative association. The magnitude (size) of the slope tells us how steep the relationship is.

The Least-Squares Method

The "best-fitting" line is determined using a method called least squares. This method finds the line that minimizes the sum of the squared vertical distances between the actual data points and the predicted points on the line. These vertical distances are called residuals.

For each data point, the residual is calculated as:

\[ \text{residual} = y - \hat{y} \]

This is the actual value minus the predicted value. A positive residual means the actual point is above the regression line, and a negative residual means it's below the line. The least-squares method finds the line where the sum of all the squared residuals is as small as possible.

The formulas for calculating the slope and y-intercept are:

\[ b = r \cdot \frac{s_y}{s_x} \]

\[ a = \bar{y} - b\bar{x} \]

Where:

  • \( r \) is the correlation coefficient (discussed in the next section)
  • \( s_x \) is the standard deviation of the x-values
  • \( s_y \) is the standard deviation of the y-values
  • \( \bar{x} \) is the mean of the x-values
  • \( \bar{y} \) is the mean of the y-values
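These two formulas translate directly into code. Below is a minimal Python sketch (the function and variable names are our own, chosen for illustration) that computes the slope and intercept from the summary statistics:

```python
def regression_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Least-squares slope and intercept from summary statistics."""
    b = r * s_y / s_x        # slope: b = r * (s_y / s_x)
    a = y_bar - b * x_bar    # intercept: a = y-bar minus b times x-bar
    return a, b
```

Plugging in the correlation, standard deviations, and means from the example that follows reproduces its fitted line.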

Example:  A teacher collected data on the number of hours five students studied for an exam and their scores.
The data is: (2, 65), (4, 75), (5, 80), (7, 90), (9, 95), where x = hours studied and y = test score.

Find the equation of the least-squares regression line.

Solution:

First, we calculate the means:
\( \bar{x} = \frac{2 + 4 + 5 + 7 + 9}{5} = \frac{27}{5} = 5.4 \)
\( \bar{y} = \frac{65 + 75 + 80 + 90 + 95}{5} = \frac{405}{5} = 81 \)

Next, we calculate the standard deviations. For x:
\( s_x = \sqrt{\frac{(2-5.4)^2 + (4-5.4)^2 + (5-5.4)^2 + (7-5.4)^2 + (9-5.4)^2}{5-1}} \)
\( s_x = \sqrt{\frac{11.56 + 1.96 + 0.16 + 2.56 + 12.96}{4}} = \sqrt{\frac{29.2}{4}} = \sqrt{7.3} \approx 2.702 \)

For y:
\( s_y = \sqrt{\frac{(65-81)^2 + (75-81)^2 + (80-81)^2 + (90-81)^2 + (95-81)^2}{4}} \)
\( s_y = \sqrt{\frac{256 + 36 + 1 + 81 + 196}{4}} = \sqrt{\frac{570}{4}} = \sqrt{142.5} \approx 11.937 \)

We also need the correlation coefficient. Using the computational formula (or technology), we find \( r \approx 0.992 \).

Now we calculate the slope:
\( b = 0.992 \times \frac{11.937}{2.702} \approx 0.992 \times 4.418 \approx 4.383 \)

And the y-intercept:
\( a = 81 - 4.383 \times 5.4 \approx 81 - 23.668 \approx 57.33 \)

The regression equation is \( \hat{y} = 57.33 + 4.38x \) (rounded to reasonable precision).

This means that for each additional hour of study, the test score is predicted to increase by approximately 4.38 points.
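As a check on the arithmetic, here is a short Python sketch (names are our own) that fits the same line from the raw data, using the algebraically equivalent slope formula \( b = S_{xy}/S_{xx} \):

```python
from statistics import mean

def least_squares(xs, ys):
    """Least-squares slope and intercept from raw (x, y) pairs."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

a, b = least_squares([2, 4, 5, 7, 9], [65, 75, 80, 90, 95])
```

Running this gives b ≈ 4.38 and a ≈ 57.33, matching the hand calculation up to rounding.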

Correlation

The correlation coefficient, denoted by \( r \), measures the strength and direction of the linear relationship between two variables. It is a number between -1 and 1.

Properties of the Correlation Coefficient

  • Range: \( -1 \leq r \leq 1 \)
  • Sign: The sign of \( r \) indicates the direction of the relationship. Positive \( r \) means positive association; negative \( r \) means negative association.
  • Magnitude: The absolute value of \( r \) indicates the strength of the linear relationship:
    • \( |r| \) close to 1 indicates a strong linear relationship
    • \( |r| \) around 0.5 indicates a moderate linear relationship
    • \( |r| \) close to 0 indicates a weak or no linear relationship
  • Perfect correlation: \( r = 1 \) or \( r = -1 \) means all points lie exactly on a straight line
  • No correlation: \( r = 0 \) means there is no linear relationship (though there might be a nonlinear relationship)

The formula for the correlation coefficient is:

\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) \]

This formula shows that \( r \) is based on standardized values (z-scores) of x and y. In practice, we typically use technology (calculators or statistical software) to compute \( r \).
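The z-score formula above can be written out directly. A minimal Python sketch (function name ours; note that `statistics.stdev` uses the n − 1 divisor, matching \( s_x \) and \( s_y \)):

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r as the sum of z-score products, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)
```

Applied to the study-hours data from the earlier example, it returns r ≈ 0.99.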

Important Notes About Correlation

  • Correlation does not imply causation: Just because two variables are correlated doesn't mean one causes the other. There might be a third variable influencing both, or the relationship might be coincidental.
  • Correlation measures only linear relationships: Two variables might have a strong nonlinear relationship but a correlation coefficient close to zero.
  • Outliers can strongly affect \( r \): A single outlier can dramatically change the correlation coefficient.
  • Correlation is unitless: The value of \( r \) doesn't depend on the units of measurement.

Example:  The correlation between the number of hours spent watching TV per week and GPA for a group of students is \( r = -0.72 \).

Interpret this correlation coefficient.

Solution:

The negative sign tells us there is a negative association: as TV watching time increases, GPA tends to decrease.

The magnitude of 0.72 indicates this is a moderately strong linear relationship.

We can say there is a moderately strong negative linear relationship between hours of TV watched per week and GPA, meaning students who watch more TV tend to have lower GPAs.

Coefficient of Determination

The coefficient of determination, denoted \( r^2 \), is simply the square of the correlation coefficient. It has a very useful interpretation: \( r^2 \) tells us the proportion (or percentage) of the variation in the dependent variable that is explained by the linear relationship with the independent variable.

For example, if \( r = 0.8 \), then \( r^2 = 0.64 \), which means 64% of the variation in y can be explained by its linear relationship with x. The remaining 36% of the variation is due to other factors not included in the model.

  • \( r^2 \) always ranges from 0 to 1 (or 0% to 100%)
  • Higher values of \( r^2 \) indicate that the regression line fits the data better
  • \( r^2 \) is always positive, regardless of whether the correlation is positive or negative

Example:  A regression analysis of the relationship between advertising spending (in thousands of dollars) and product sales (in thousands of units) yields \( r = 0.90 \).

Calculate and interpret \( r^2 \).

Solution:

\( r^2 = (0.90)^2 = 0.81 \)

This means that 81% of the variation in product sales can be explained by the linear relationship with advertising spending.

The other 19% of variation in sales is due to other factors not captured by advertising spending alone.

Making Predictions Using the Regression Line

Once we have found the regression equation, we can use it to make predictions. We simply substitute the value of the independent variable into the equation and calculate the predicted value of the dependent variable.

Interpolation vs. Extrapolation

Interpolation means making a prediction for an x-value that falls within the range of x-values in our original data set. This is generally safe and reliable, assuming the linear relationship holds throughout that range.

Extrapolation means making a prediction for an x-value that falls outside the range of our original data. This is risky because we don't know if the linear relationship continues beyond the range of our data. The relationship might become nonlinear, or other factors might come into play.

Example:  Using the regression equation from our earlier example, \( \hat{y} = 57.33 + 4.38x \), where x is hours studied and y is test score.
The original data had x-values ranging from 2 to 9 hours.

Predict the test score for a student who studies 6 hours. Is this interpolation or extrapolation?

Solution:

We substitute x = 6 into the equation:
\( \hat{y} = 57.33 + 4.38(6) \)
\( \hat{y} = 57.33 + 26.28 \)
\( \hat{y} = 83.61 \)

The predicted test score is approximately 83.6 points.

Since 6 falls within the range of 2 to 9 hours in our original data, this is interpolation and is reasonably reliable.
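A prediction like this is a single substitution, and the interpolation-versus-extrapolation check can be automated at the same time. A small sketch, assuming the rounded coefficients from the worked example:

```python
def predict(x, a=57.33, b=4.38, x_min=2, x_max=9):
    """Return the predicted y-hat and whether x lies inside the data range."""
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind
```

Here `predict(6)` returns roughly 83.6 flagged as interpolation, while `predict(15)` would be flagged as extrapolation and should be treated with caution.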

Residuals and Residual Plots

Remember that a residual is the difference between an actual y-value and the predicted y-value from the regression line:

\[ \text{residual} = y - \hat{y} \]

Examining residuals helps us assess whether a linear model is appropriate for our data. We create a residual plot by plotting the residuals on the vertical axis and either the x-values or the predicted values \( \hat{y} \) on the horizontal axis.

Interpreting Residual Plots

A good linear model should produce a residual plot with the following characteristics:

  • Random scatter: The residuals should be randomly scattered around the horizontal line at residual = 0, with no clear pattern
  • Constant spread: The vertical spread of residuals should be roughly the same across all x-values
  • No curvature: There should be no curved pattern in the residuals

If the residual plot shows a pattern (such as a curve, a funnel shape, or clusters), this suggests that:

  • A linear model may not be appropriate
  • A nonlinear model might fit the data better
  • There might be other variables we need to consider

Residual plots are also useful for identifying outliers. Points with large residuals (far from zero) don't fit the model well and deserve special attention.
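Computing residuals is mechanical once the line is fitted. A sketch (names are our own); a useful sanity check is that least-squares residuals always sum to zero when the model includes an intercept:

```python
from statistics import mean

def residuals(xs, ys, a, b):
    """Residual (actual minus predicted) for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# fit the study-hours data, then inspect the residuals
xs, ys = [2, 4, 5, 7, 9], [65, 75, 80, 90, 95]
x_bar, y_bar = mean(xs), mean(ys)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar
res = residuals(xs, ys, a, b)
```

Plotting `res` against `xs` (or scanning it for unusually large values) is exactly the residual-plot check described above.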

Influential Points and Outliers

Not all data points have equal influence on the regression line. Some points have more impact on the slope and position of the line than others.

Types of Unusual Points

Outlier in y: A point that has an unusual y-value given its x-value. It will have a large residual. These points are easy to spot in a residual plot.

Outlier in x: A point that has an x-value far from the mean of all x-values. These points can have high leverage, meaning they have the potential to strongly influence the regression line.

Influential point: A point whose removal would substantially change the regression equation. A point with high leverage that doesn't follow the pattern of the other points is likely to be influential.

What to Do About Unusual Points

When you identify an unusual point, you should:

  1. Check if it's a data entry error
  2. Investigate whether there's a special explanation for why that observation is different
  3. Consider running the regression both with and without the point to see how much influence it has
  4. Report your findings transparently, including whether you chose to include or exclude the point and why

Never remove a data point simply because it doesn't fit your expectations. Only remove points that are clearly errors or have a documented special cause that makes them not representative of the population you're studying.

Conditions for Linear Regression

For our inference about regression (such as creating confidence intervals or performing hypothesis tests) to be valid, certain conditions must be met. We can remember these as the LINE conditions:

  • L - Linear: The relationship between x and y must be linear. Check this with a scatter plot and residual plot.
  • I - Independent: The observations must be independent of each other. This is usually satisfied if we have a random sample or a randomized experiment.
  • N - Normal: The residuals should be approximately normally distributed. This is most important for small sample sizes. Check with a histogram or normal probability plot of the residuals.
  • E - Equal variance: The variability of the residuals should be roughly constant across all values of x. Check this with a residual plot; the vertical spread should be consistent.

If these conditions are not met, the predictions from the regression line may still be useful, but we cannot reliably perform statistical inference (like creating confidence intervals or performing significance tests).

Transformations for Nonlinear Data

When the relationship between two variables is nonlinear, we cannot appropriately use linear regression on the original data. However, we can sometimes transform one or both variables to create a linear relationship, then perform linear regression on the transformed data.

Common Transformations

Logarithmic transformation: If the scatter plot shows an exponential growth or decay pattern, try taking the logarithm of the y-values. If the relationship between x and log(y) is linear, then the original relationship is exponential.

Power transformation: If the data shows a power relationship, taking logs of both variables might linearize it. If the relationship between log(x) and log(y) is linear, then the original relationship is a power function.

After transforming and finding the regression equation using the transformed variables, we can transform predictions back to the original scale.
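Here is a sketch of the exponential case on hypothetical data (generated from y = 5e^(0.3x), so we know the true answer): fit a least-squares line to (x, ln y), then exponentiate to predict on the original scale.

```python
import math
from statistics import mean

# hypothetical data following y = 5 * e^(0.3x) exactly
xs = [0, 1, 2, 3, 4]
ys = [5 * math.exp(0.3 * x) for x in xs]

# step 1: transform y and fit a least-squares line to (x, ln y)
ln_ys = [math.log(y) for y in ys]
x_bar, l_bar = mean(xs), mean(ln_ys)
b = sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, ln_ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
a = l_bar - b * x_bar            # ln(y) = a + b*x

# step 2: back-transform predictions to the original scale
def predict_original(x):
    return math.exp(a + b * x)   # y-hat = e^(a + b*x)
```

Because these data are exactly exponential, the fit recovers b ≈ 0.3 and a ≈ ln 5, and the back-transformed predictions land on the true curve; with real data there would be scatter around it.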

Example:  Population data for a city shows exponential growth.
After taking the natural logarithm of population values, the regression line using years since 2000 as x is:
ln(population) = 10.5 + 0.08x

Predict the population in 2025.

Solution:

For the year 2025, x = 2025 - 2000 = 25 years since 2000.

Substitute into the equation:
ln(population) = 10.5 + 0.08(25)
ln(population) = 10.5 + 2.0
ln(population) = 12.5

To find the actual population, we take the exponential of both sides:
population = e^(12.5) ≈ 268,337

The predicted population in 2025 is approximately 268,337 people.
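The back-transformation here is one line of arithmetic, easily checked with Python's math module:

```python
import math

ln_pop = 10.5 + 0.08 * 25      # = 12.5 on the log scale
population = math.exp(ln_pop)  # back-transform: e^(12.5)
```

This confirms the predicted population of roughly 268,337.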

Using Technology for Regression

While it's important to understand the concepts and formulas behind regression, in practice we almost always use technology to perform the calculations. Graphing calculators, spreadsheet software like Excel, and statistical programs like R or MINITAB can quickly compute regression equations, correlation coefficients, residuals, and create plots.

When using technology:

  • Always create a scatter plot first to visualize the relationship
  • Check that the linear model is appropriate before interpreting the output
  • Examine the residual plot to verify that conditions are met
  • Look at the value of \( r^2 \) to see how well the model fits
  • Be cautious about extrapolation beyond your data range
  • Always interpret your results in the context of the problem

Technology allows us to focus on the important work of interpretation and decision-making rather than spending time on tedious calculations. However, understanding what the technology is doing "under the hood" helps us use it wisely and interpret results correctly.

The document Chapter Notes: Regression is a part of the Grade 9 Course Statistics & Probability.