CHAPTER 17
CORRELATION AND REGRESSION
After reading this chapter, students will be able to understand:
• The meaning of bivariate data and the techniques of preparing a bivariate distribution;
• The concept of correlation between two variables and the quantitative measurement of
correlation, including the interpretation of positive, negative and zero correlation;
• The concept of regression and its application in estimating a variable from a known set of data.
CHAPTER OVERVIEW
[Figure: chapter overview chart. Bivariate Data branches into Correlation Analysis, Bivariate
Frequency Distribution and Regression Analysis. Correlation Analysis: Types of Correlation
(Positive Correlation, Negative Correlation) and Measures of Correlation (Scatter Diagram,
Karl Pearson Product Moment Correlation Coefficient, Spearman's Correlation Coefficient,
Coefficient of Concurrent Deviations). Bivariate Frequency Distribution: Marginal Distribution,
Conditional Distribution. Regression Analysis: Estimation of Regression Analysis, Method of
Least Squares, Regression Lines (Regression equation y on x, Regression equation x on y).]
© The Institute of Chartered Accountants of India
In the previous chapter, we discussed many statistical measures relating to a univariate distribution,
i.e. the distribution of one variable such as height, weight, marks, profit or wage. However, some
situations demand the study of more than one variable simultaneously. A businessman may
be keen to know what amount of investment would yield a desired level of profit, or a student
may want to know whether performing better in the selection test would enhance his or her
chance of doing well in the final examination. To answer questions of this kind,
we need to study more than one variable at the same time. Correlation Analysis and Regression
Analysis are the two analyses made from a multivariate distribution, i.e. a distribution of
more than one variable. In particular, when there are two variables, say x and y, we study a bivariate
distribution. We restrict our discussion to bivariate distributions only.
Correlation analysis helps us to find an association, or the lack of it, between the
two variables x and y. Thus if x and y stand for the profit and investment of a firm, or the marks in
Statistics and Mathematics for a group of students, then we may be interested to know whether
x and y are associated or independent of each other. The extent or amount of correlation between
x and y is provided by different measures of correlation, namely the Product Moment Correlation
Coefficient, the Rank Correlation Coefficient and the Coefficient of Concurrent Deviations. In correlation
analysis, we must be careful about inferring a cause and effect relation between the variables under
consideration, because x and y may be related through the influence of a
third variable although no causal relationship exists between the two variables.
Regression analysis, on the other hand, is concerned with predicting the value of the dependent
variable corresponding to a known value of the independent variable, on the assumption of a
mathematical relationship between the two variables that holds on an average.
When data are collected on two variables simultaneously, they are known as bivariate data and
the corresponding frequency distribution, derived from it, is known as Bivariate Frequency
Distribution. If x and y denote marks in Maths and Stats for a group of 30 students, then the
corresponding bivariate data would be $(x_i, y_i)$ for i = 1, 2, …, 30, where $(x_1, y_1)$ denotes the marks
in Mathematics and Statistics for the student with serial number or Roll Number 1, $(x_2, y_2)$ that
for the student with Roll Number 2, and so on, and lastly $(x_{30}, y_{30})$ denotes the pair of marks for the
student bearing Roll Number 30.
As in the case of a univariate distribution, we need to construct the frequency distribution
for bivariate data. Such a distribution classifies the observations in respect of both
variables simultaneously. Usually, we make the horizontal classification in respect of x and
the vertical classification in respect of the other variable y. Such a distribution is known as a
Bivariate Frequency Distribution, a Joint Frequency Distribution or a two-way classification
of the two variables x and y.
ILLUSTRATIONS:
Example 17.1: Prepare a Bivariate Frequency table for the following data relating to the marks in
Statistics (x) and Mathematics (y):
(15, 13), (1, 3), (2, 6), (8, 3), (15, 10), (3, 9), (13, 19),
(10, 11), (6, 4), (18, 14), (10, 19), (12, 8), (11, 14), (13, 16),
(17, 15), (18, 18), (11, 7), (10, 14), (14, 16), (16, 15), (7, 11),
(5, 1), (11, 15), (9, 4), (10, 15), (13, 12), (14, 17), (10, 11),
(6, 9), (13, 17), (16, 15), (6, 4), (4, 8), (8, 11), (9, 12),
(14, 11), (16, 15), (9, 10), (4, 6), (5, 7), (3, 11), (4, 16),
(5, 8), (6, 9), (7, 12), (15, 6), (18, 11), (18, 19), (17, 16),
(10, 14)
Use mutually exclusive classes for both the variables, the first class interval being 0-4 for
both.
Solution:
From the given data, we find that
Range for x = 18–1 = 17
Range for y = 19–1 = 18
We take the class intervals 0-4, 4-8, 8-12, 12-16, 16-20 for both the variables. Since the classes are
mutually exclusive, the upper limit of each class is excluded, so that a mark of 4 belongs to the class
4-8 and not to 0-4. The first pair of marks is (15, 13); 15 belongs to the fourth class interval (12-16)
for x and 13 belongs to the fourth class interval for y, so we put a stroke in the (4, 4)-th cell. We
carry on giving tally marks till the list is exhausted.
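The tallying procedure described above can also be sketched in Python. This is only an illustrative sketch: the names `pairs`, `class_index` and `table` are our own, and integer division by the class width is used as one simple way of locating the class of a mark when the upper class limit is excluded.

```python
# Marks (Statistics x, Mathematics y) of the 50 students in Example 17.1.
pairs = [
    (15, 13), (1, 3), (2, 6), (8, 3), (15, 10), (3, 9), (13, 19),
    (10, 11), (6, 4), (18, 14), (10, 19), (12, 8), (11, 14), (13, 16),
    (17, 15), (18, 18), (11, 7), (10, 14), (14, 16), (16, 15), (7, 11),
    (5, 1), (11, 15), (9, 4), (10, 15), (13, 12), (14, 17), (10, 11),
    (6, 9), (13, 17), (16, 15), (6, 4), (4, 8), (8, 11), (9, 12),
    (14, 11), (16, 15), (9, 10), (4, 6), (5, 7), (3, 11), (4, 16),
    (5, 8), (6, 9), (7, 12), (15, 6), (18, 11), (18, 19), (17, 16),
    (10, 14),
]

WIDTH = 4       # class width of the intervals 0-4, 4-8, ..., 16-20
N_CLASSES = 5


def class_index(mark):
    """Return the 0-based class of a mark; the upper class limit is excluded."""
    return mark // WIDTH


# table[i][j] = number of students with x in class i and y in class j
table = [[0] * N_CLASSES for _ in range(N_CLASSES)]
for x, y in pairs:
    table[class_index(x)][class_index(y)] += 1   # one tally mark per pair

row_totals = [sum(row) for row in table]          # totals for Statistics (x)
col_totals = [sum(col) for col in zip(*table)]    # totals for Mathematics (y)

labels = [f"{WIDTH * k}-{WIDTH * (k + 1)}" for k in range(N_CLASSES)]
for label, row, total in zip(labels, table, row_totals):
    print(f"{label:>6}", *(f"{f:3d}" for f in row), f"{total:4d}")
print(f"{'Total':>6}", *(f"{t:3d}" for t in col_totals), f"{sum(row_totals):4d}")
```

Each printed row corresponds to a class of Statistics marks and each column to a class of Mathematics marks, with the totals in the last column and the last row.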
Table 17.1
Bivariate Frequency Distribution of Marks in Statistics and Mathematics

Marks in                 Marks in Mathematics (y)
Statistics (x)     0-4    4-8   8-12  12-16  16-20   Total
0-4                  1      1      2      0      0       4
4-8                  1      4      5      1      1      12
8-12                 1      2      4      6      1      14
12-16                0      1      3      2      5      11
16-20                0      0      1      5      3       9
Total                3      8     15     14     10      50
We note, from the above table, that some of the cell frequencies ($f_{ij}$) are zero. Starting from the
above Bivariate Frequency Distribution, we can obtain two types of univariate distributions which
are known as:
(a) Marginal distribution.
(b) Conditional distribution.
If we consider the classes of Statistics marks along with the marginal totals presented in the
last column of Table 17.1, we get the marginal distribution of marks in Statistics. Similarly, the
last row of Table 17.1 gives the marginal distribution of marks in Mathematics. The following
table shows the marginal distribution of marks in Statistics.
Table 17.2
Marginal Distribution of Marks in Statistics

Marks      No. of Students
0-4               4
4-8              12
8-12             14
12-16            11
16-20             9
Total            50
We can find the mean and standard deviation of marks in Statistics from Table 17.2. These are
known as the marginal mean and marginal SD of Statistics marks. Similarly, we can obtain the
marginal mean and marginal SD of Mathematics marks. Any other statistical measure in respect
of x or y can be computed in a similar manner.
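As an illustration, the marginal mean and marginal SD of Statistics marks can be computed from Table 17.2 along the following lines. The sketch rests on two assumptions that the text does not state here: each class is represented by its mid-point, and the SD uses the divisor N. The variable names are our own.

```python
from math import sqrt

# Marginal distribution of Statistics marks (Table 17.2).
classes = [(0, 4), (4, 8), (8, 12), (12, 16), (16, 20)]
freqs = [4, 12, 14, 11, 9]

# Assumption: every class is represented by its mid-point.
mids = [(lo + hi) / 2 for lo, hi in classes]
n = sum(freqs)

mean = sum(f * m for f, m in zip(freqs, mids)) / n

# Assumption: SD computed with divisor N (not N - 1).
variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids)) / n
sd = sqrt(variance)

print(f"N = {n}, marginal mean = {mean:.2f}, marginal SD = {sd:.2f}")
```

Under these assumptions, the marginal mean of Statistics marks works out to 10.72 and the marginal SD to about 4.85.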
If we want to study the distribution of Statistics marks for a particular group of students, say for
those students who got marks between 8 and 12 in Mathematics, we come across another univariate
distribution known as a conditional distribution.

Table 17.3
Conditional Distribution of Marks in Statistics for Students
having Mathematics Marks between 8 and 12

Marks      No. of Students
0-4               2
4-8               5
8-12              4
12-16             3
16-20             1
Total            15
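We may obtain the mean and SD from the above table. These are known as the conditional
mean and conditional SD of marks in Statistics. Conditional distributions of marks in Mathematics
can be obtained in the same way. In particular, if there are m classifications for x and n
classifications for y, then there is one conditional distribution of x for each of the n classes of y
and one conditional distribution of y for each of the m classes of x, so that there would be
altogether (m + n) conditional distributions.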
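The extraction of a conditional distribution from the bivariate table can be sketched as follows. The cell frequencies are those of Table 17.1. The helper `grouped_mean_sd`, the use of class mid-points and the divisor N for the SD are assumptions made for illustration only.

```python
from math import sqrt

# Cell frequencies of Table 17.1: rows are Statistics classes (x),
# columns are Mathematics classes (y), both in the order 0-4, ..., 16-20.
table = [
    [1, 1, 2, 0, 0],
    [1, 4, 5, 1, 1],
    [1, 2, 4, 6, 1],
    [0, 1, 3, 2, 5],
    [0, 0, 1, 5, 3],
]
mids = [2, 6, 10, 14, 18]   # assumed class mid-points


def grouped_mean_sd(freqs):
    """Mean and SD (divisor N) of a grouped distribution over the mid-points."""
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, mids)) / n
    var = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids)) / n
    return mean, sqrt(var)


# Conditional distribution of x given y in 8-12: the third column of the table.
y_class = 2
cond_x = [row[y_class] for row in table]
cond_mean, cond_sd = grouped_mean_sd(cond_x)

# One conditional distribution of x per class of y, and one of y per class of x.
m, n = len(table), len(table[0])
n_conditional = m + n

print(cond_x, sum(cond_x))
print(f"conditional mean = {cond_mean:.2f}, conditional SD = {cond_sd:.2f}")
print("number of conditional distributions:", n_conditional)
```

The extracted column reproduces the frequencies 2, 5, 4, 3, 1 of Table 17.3, and with five classes for each variable the count of conditional distributions is 5 + 5 = 10.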
While studying two variables at the same time, if it is found that a change in one variable is
accompanied by a corresponding change in the other variable, either directly or inversely, then the
two variables are said to be associated or correlated. Otherwise, the two variables are said
to be dissociated or uncorrelated or independent. There are two types of correlation.
(i) Positive correlation
(ii) Negative correlation
If two variables move in the same direction, i.e. an increase (or decrease) on the part of one variable
is accompanied by an increase (or decrease) on the part of the other variable, then the two variables are
said to be positively correlated. For example, height and weight, yield and rainfall, and profit
and investment are positively correlated.
On the other hand, if the two variables move in opposite directions, i.e. an increase (or a
decrease) on the part of one variable results in a decrease (or an increase) on the part of the other
variable, then the two variables are said to have a negative correlation. The price and demand
of an item, and the profits of an insurance company and the number of claims it has to meet, are
examples of variables having a negative correlation.
The two variables are said to be uncorrelated if a movement on the part of one variable does
not produce any movement of the other variable in a particular direction. For example, shoe
size and intelligence are uncorrelated.