When two variables, such as x and y, fluctuate together either in tandem or in opposite directions, they are described as being correlated or associated. Correlation denotes the connection between these variables, which is often observed in certain types of data. For instance, a correlation is evident between income and expenditure, absenteeism and production, or advertisement expenses and sales. However, the nature of correlation can vary depending on the specific variables being examined.
To analyze such relationships, scatter diagrams are utilized. These diagrams plot different datasets on a graph, providing valuable insights. Firstly, they allow us to visually identify patterns between variables, indicating whether there is a relationship between them. Secondly, if a relationship is present, the scatter diagram can provide clues about the type of correlation that exists. Various types of correlations can be observed through scatter diagrams, as depicted in Figure.
Figure: Possible Relationships Between Two Variables, X and Y
Illustration 1
Table: A Company’s Advertising Expenses and Sales Data (Rs. in crore)
The sales manager of the company asserts that the fluctuations in sales result from the marketing department's frequent alterations in advertising expenditure. While confident that a relationship exists between sales and advertising, the manager is uncertain about the nature of this relationship. The various scenarios illustrated in Figure represent potential descriptions of the relationship between sales and advertising expenditure for the company. To ascertain the precise relationship, we need to create a scatter diagram, as depicted in Figure, taking into account the values provided in Table.
Figure: Scatter Diagram of Sales and Advertising Expenditure for a Company.
Figure indicates that advertising expenditure and sales seem to be linearly (positively) related. However, the strength of this relationship is not yet known; that is, how closely the points fall along a straight line is yet to be determined. The quantitative measure of the strength of the linear relationship between two variables (here, sales and advertising expenditure) is called the correlation coefficient. In the next section, therefore, we shall study the methods for determining the coefficient of correlation.
As explained above, the coefficient of correlation helps in measuring the degree of relationship between two variables, X and Y. The methods used to measure the degree of relationship are discussed below.
Karl Pearson’s Correlation Coefficient
Karl Pearson’s coefficient of correlation (r) is one of the mathematical methods of measuring the degree of correlation between any two variables X and Y. It is given as:
The simplified formulae (which are algebraically equivalent to the above formula) are:
Note: This formula is used when the means of X and Y (X̄ and Ȳ) are integers.
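As a rough sketch of how such a computation can be carried out, the Python snippet below computes r in two ways: from deviations about the means, and from raw sums. Both are the standard textbook forms of Pearson's coefficient and are assumed here, since the formulae themselves are not reproduced above. The function names and the five data pairs are hypothetical and serve only as an illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means (assumed standard form)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations of X from its mean
    dy = [y - mean_y for y in ys]  # deviations of Y from its mean
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    if sum_dx2 == 0 or sum_dy2 == 0:
        raise ValueError("r is undefined when a variable does not vary")
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)


def pearson_r_raw(xs, ys):
    """Algebraically equivalent form using raw sums of X, Y, XY, X^2 and Y^2."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator


# Hypothetical figures, not taken from the company's table.
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = pearson_r(advertising, sales)
r_raw = pearson_r_raw(advertising, sales)
print(f"r (deviation form) = {r:.4f}")
print(f"r (raw-sum form)   = {r_raw:.4f}")
```

Here the means (3 and 4) are integers, so the deviation form is the more convenient one to work by hand; the raw-sum form avoids fractional deviations when the means are not integers.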
Before we delve into an example to gauge the degree of correlation, it's important to highlight several key points:
It's crucial to exercise caution when interpreting correlation results. Although a change in advertising might lead to a change in sales, a correlated relationship between two variables does not necessarily imply a cause-and-effect relationship. Often, two seemingly unrelated variables may exhibit high correlation. For instance, significant correlation may be observed between individuals' height and income, or between shoe size and exam scores, despite the absence of any conceivable causal relationship. This type of correlation is termed spurious or nonsense correlation. Hence, it's essential to refrain from drawing conclusions solely based on spurious correlation.
Illustration 2: Using the data on advertisement expenditure (X) and sales (Y) of a company over ten years, as presented in Table, compute the correlation coefficient between these variables.
Solution: Refer to Table for the Calculation of Correlation Coefficient.
We know that
The calculated coefficient of correlation r = 0.9835 shows that there is a high degree of association between the sales and advertisement expenditure. For this particular problem, it indicates that an increase in advertisement expenditure is likely to yield higher sales. If the results of the calculation show a strong correlation for the data, either negative or positive, then the line of best fit to that data will be useful for forecasting (it is discussed in Section on ‘Simple Linear Regression’).
You may notice that manual calculations become cumbersome in real-life research work. Therefore, statistical packages such as Minitab, SPSS and SAS may be used to calculate ‘r’ and other measures as well.
Referring to the table of t-distribution for (n–2) degrees of freedom, we can find the critical value of t at any desired level of significance (the 5% level is commonly used). If the calculated value of t (as obtained by the above formula) is less than or equal to the table value of t, we accept the null hypothesis (H0), that is, we fail to reject it, meaning that the correlation between the two variables is not significantly different from zero.
The following example will illustrate the use of this test.
Illustration 3
Suppose a random sample of 12 pairs of observations from a normal population gives a correlation coefficient of 0.55. Is it likely that the two variables in the population are uncorrelated?
Solution: Let us take the null hypothesis (H0) that the variables in the population are uncorrelated.
Applying t-test,
From the t-distribution (refer to the table given at the end of this unit) with 10 degrees of freedom at a 5% level of significance, we see that the table value of t0.05/2, (12–2) is 2.228. The calculated value of t is less than the table value of t. Therefore, we can conclude that this r of 0.55 for n = 12 is not significantly different from zero. Hence we do not reject our hypothesis (H0), i.e., the variables in the population may be regarded as uncorrelated.
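The test in this illustration can be sketched in a few lines of Python. The sketch assumes the usual test statistic for a correlation coefficient, t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom, which is not reproduced in the text above. The function name is hypothetical; the critical value 2.228 is the table value quoted in the solution.

```python
import math


def t_statistic_for_r(r, n):
    """t statistic for H0: population correlation = 0 (assumed standard form)."""
    if n <= 2:
        raise ValueError("need more than 2 pairs of observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r, n = 0.55, 12  # values given in Illustration 3
t_calc = t_statistic_for_r(r, n)
t_table = 2.228  # two-tailed 5% table value for 10 d.f., as quoted in the text

reject_h0 = abs(t_calc) > t_table
print(f"calculated t = {t_calc:.3f}, table t = {t_table}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The calculated value comes to roughly 2.08, which is below 2.228, so the sketch agrees with the conclusion drawn above.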
Karl Pearson’s correlation coefficient, discussed above, is not applicable in cases where direct quantitative measurement of the phenomenon under study is not possible. Sometimes we are required to examine the extent of association between two ordinally scaled variables, such as two rank orderings. For example, we can study efficiency, performance, competitive events, attitudinal surveys, etc. In such cases, a measure to ascertain the degree of association between the ranks of two variables, X and Y, is called Rank Correlation. It was developed by Charles Edward Spearman. Its coefficient (R) is expressed by the following formula:
R = 1 − 6Σd² / (n(n² − 1))
where n is the number of pairs of observations and Σd² is the sum of the squares of the differences between the ranks of the two variables.
The following example illustrates the computation of rank correlation coefficient.
Illustration 5
Salesmen employed by a company were given one month's training. At the end of the training, a test was conducted on a sample of 10 salesmen, who were ranked on the basis of their performance in the test. They were then posted to their respective areas. After six months, they were rated in terms of their sales performance. Find the degree of association between the two sets of ranks.
Solution: Table: Calculation of Coefficient of Rank Correlation.
Using the Spearman’s formula, we obtain
We can say that there is a high degree of positive correlation between the training and sales performance of the salesmen. Now we proceed to test the significance of the results obtained. We are interested in testing the null hypothesis (H0) that the two sets of ranks are not associated in the population and that the observed value of R differs from zero only by chance. The test used is the t-test.
Referring to the t-distribution table for 8 degrees of freedom (n–2), the critical value of t at a 5% level of significance [t0.05/2, (10–2)] is 2.306. The calculated value of t is greater than the table value. Hence, we reject the null hypothesis and conclude that performance in training and performance in sales are closely associated.
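The rank correlation procedure can be sketched as follows. The snippet assumes Spearman's usual formula R = 1 − 6Σd² / (n(n² − 1)) and the same form of t statistic as is used for r, with n − 2 degrees of freedom. The two sets of ranks are hypothetical and are not the ranks from the illustration's table; only the table value 2.306 is taken from the text.

```python
import math


def spearman_rank_correlation(rank_x, rank_y):
    """Spearman's R for two sets of untied ranks (assumed standard formula)."""
    if len(rank_x) != len(rank_y) or len(rank_x) < 2:
        raise ValueError("need two equal-length rank lists with at least 2 items")
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))  # sum of squared rank differences
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical ranks for 10 salesmen (1 = best).
training_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sales_rank = [2, 1, 4, 3, 5, 7, 6, 10, 8, 9]

n = len(training_rank)
R = spearman_rank_correlation(training_rank, sales_rank)

# Significance test with n - 2 degrees of freedom (assumed form of the statistic).
t_calc = R * math.sqrt(n - 2) / math.sqrt(1 - R ** 2)
t_table = 2.306  # two-tailed 5% table value for 8 d.f., as quoted in the text

print(f"R = {R:.4f}, calculated t = {t_calc:.3f}")
print("reject H0" if abs(t_calc) > t_table else "do not reject H0")
```

With these made-up ranks, Σd² is 12, so R is about 0.93 and the null hypothesis of no association is rejected.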
Sometimes the data relating to a qualitative phenomenon may be available not as ranks but as values. In such a situation, the researcher must assign ranks to the values. Ranks may be assigned by taking either the highest value as 1 or the lowest value as 1, but the same method must be followed for both variables.
Tied Ranks
Sometimes there is a tie between two or more ranks in the first and/or second series. For example, if two items share the 4th rank, then instead of awarding the 4th rank to both observations, we award (4 + 5)/2 = 4.5 to each of them, so that the mean of the ranks is unaffected. In such cases, an adjustment is made in Spearman’s formula. For this, Σd² is increased by (t³ − t)/12 for each tie, where t stands for the number of observations in that tie. The formula can thus be expressed as:
R = 1 − 6[Σd² + (t₁³ − t₁)/12 + (t₂³ − t₂)/12 + ...] / (n(n² − 1))
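The handling of ties can be illustrated with a short Python sketch. It assigns average ranks to tied values (highest value ranked 1) and then adds (t³ − t)/12 to Σd² for every tie, as described above. The function names and the scores are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Rank values with the highest as 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = ((pos + 1) + (end + 1)) / 2  # e.g. ranks 4 and 5 give 4.5
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def tie_correction(values):
    """Sum of (t^3 - t)/12 over every group of t tied observations."""
    return sum((t ** 3 - t) / 12 for t in Counter(values).values() if t > 1)


def spearman_with_ties(xs, ys):
    """Spearman's R with the tie adjustment added to the sum of squared differences."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    adjusted = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * adjusted / (n * (n ** 2 - 1))


# Hypothetical scores with one tie in each series.
test_scores = [50, 60, 60, 70, 80]
sales_scores = [30, 45, 40, 55, 55]

print(average_ranks(test_scores))
print(f"R = {spearman_with_ties(test_scores, sales_scores):.2f}")
```

In this example each series has one tie of two observations, so each contributes (2³ − 2)/12 = 0.5 to the adjustment.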
The objective of simple linear regression is to represent the relationship between two variables with a model of the form:
Yi = β0 + β1Xi + ei
wherein Yi = value of the dependent variable,
β0 = Y-intercept,
β1 = slope of the regression line,
Xi = value of the independent variable,
ei = error term (i.e., the difference between the actual Y value and the value of Y predicted by the model).
If we consider the two variables (X and Y), we shall have two regression lines. They are: (i) the regression line of Y on X, and (ii) the regression line of X on Y.
The first regression line (Y on X) estimates the value of Y for a given value of X. The second regression line (X on Y) estimates the value of X for a given value of Y. These two regression lines will coincide if the correlation between the variables is either perfectly positive or perfectly negative.
Regression Equations:
As mentioned earlier, there are two regression equations, also known as estimating equations, corresponding to the two regression lines (Y on X and X on Y). These equations serve as algebraic expressions of the regression lines and are formulated as follows:
Regression Equation of Y on X
The regression equation of Y on X can be expressed as Ŷ = a + bX, where Ŷ represents the computed value of Y (the dependent variable) for a given X. In this equation, 'a' and 'b' are constants (fixed values): 'a' determines the level of the fitted line at the Y-axis (the Y-intercept), 'b' determines the slope of the regression line, and X represents a given value of the independent variable.
An alternative simplified expression for the above equation is: Y − Ȳ = byx (X − X̄).
The regression equation of X on Y can be represented as X = a + bY. An alternative simplified expression for this equation is: X − X̄ = bxy (Y − Ȳ).
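As an illustrative sketch, the constants of both regression equations can be estimated by the least squares method as shown below. The snippet assumes the usual estimates, slope = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and intercept = Ȳ − slope × X̄, with the roles of X and Y exchanged for the line of X on Y. The function name and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for estimating ys from xs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # forces the line through the means
    return intercept, slope


# Hypothetical figures, not taken from the company's table.
advertising = [1, 2, 3, 4, 5]  # X
sales = [2, 4, 5, 4, 5]        # Y

a_yx, b_yx = fit_line(advertising, sales)  # regression of Y on X
a_xy, b_xy = fit_line(sales, advertising)  # regression of X on Y

print(f"Y on X: Y = {a_yx:.2f} + {b_yx:.2f} X")
print(f"X on Y: X = {a_xy:.2f} + {b_xy:.2f} Y")

# Estimate Y for a given X using the Y-on-X line.
x_new = 6
print(f"estimated Y at X = {x_new}: {a_yx + b_yx * x_new:.2f}")
```

Note that the two lines are different: the Y-on-X line minimises the squared errors in Y, whereas the X-on-Y line minimises the squared errors in X.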
It is worthwhile to note that the estimated simple regression line always passes through the point of means (X̄, Ȳ) (which is shown in Figure). The following illustration shows how the estimated regression equations are obtained, and hence how they are used to estimate the value of Y for a given value of X.
Illustration 6
From the following 12 months' sample data of a company, estimate the regression lines, and also estimate the value of sales when the company decides to spend Rs. 2,50,000 on advertising during the next quarter.
Solution: Table: Calculations for Least Square Estimates of a Company.
Now we establish the best regression line (estimated by the least square method). We know the regression equation of Y on X is:
which is shown in Figure. Note that, as said earlier, this line passes through the means X̄ (2.733) and Ȳ (32).
Figure: The Least Squares Regression Line for a Company's Advertising Expenditure and Sales.
Thus, an advertising expenditure of Rs. 2.5 lakh is estimated to generate sales for the company to the tune of Rs. 30,64,850. Similarly, we can also establish the best regression line of X on Y as follows:
Regression Equation of X on Y
The following points about the regression should be noted:
Consider the values of regression coefficients from the previous illustration to know the degree of correlation between advertising expenditure and sales.
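The link between the regression coefficients and the correlation coefficient can be sketched as follows. The snippet assumes the standard identity r = ±√(byx × bxy), where r takes the common sign of the two coefficients; this identity is not reproduced in the text above. The function names and data are hypothetical.

```python
import math


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy): the slopes of Y on X and of X on Y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sxx, sxy / syy


def r_from_coefficients(b_yx, b_xy):
    """r as the square root of the product of the two coefficients, signed like them."""
    magnitude = math.sqrt(b_yx * b_xy)  # the product is never negative
    return math.copysign(magnitude, b_yx)


# Hypothetical figures, not taken from the company's table.
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

b_yx, b_xy = regression_coefficients(advertising, sales)
r = r_from_coefficients(b_yx, b_xy)
print(f"b_yx = {b_yx:.2f}, b_xy = {b_xy:.2f}, r = {r:.4f}")
```

Both coefficients share the sign of Σ(X − X̄)(Y − Ȳ), which is why their product cannot be negative and why r carries the same sign.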
where Se is the standard error of estimate, Y denotes the values of the dependent variable, Ŷ denotes the estimated values from the estimating equation corresponding to each Y value, and n is the number of observations (sample size).
Let us take up an illustration to calculate Se in a given situation.
Illustration 7: Consider the following data on expenditure on research and development and annual profits of a firm during 1998–2004.
The estimated regression equation in this situation is found to be Ŷ = 14.44 + 4.31X. Calculate the standard error of estimate.
Note: Before proceeding to compute Se, you may calculate the regression equation of Y on X on your own to check whether the given equation for the above data is correct.
Solution: To calculate Se for this problem, we must first obtain the value of ∑(Y − Ŷ)². We have done this in Table.
We can, now, find the standard error of estimate as follows.
Standard error of estimate of annual profit is Rs. 1.875 lakh.
We also notice, as discussed in Section 10.5, that ∑(Y − Ŷ) = 0. This is one way to verify the accuracy of the regression line fitted by the least squares method.
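The computation of Se, together with the check that the residuals sum to zero, can be sketched in Python as below. The snippet assumes the commonly used form Se = √(∑(Y − Ŷ)² / (n − 2)); the divisor n − 2 is an assumption, since the formula itself is not reproduced above. The function names and data are hypothetical and are not the firm's research and development figures.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept and slope of Y on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def standard_error_of_estimate(xs, ys, intercept, slope):
    """Return (Se, sum of residuals) for the fitted line; divisor n - 2 is assumed."""
    if len(xs) <= 2:
        raise ValueError("need more than 2 observations")
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sum_sq = sum(e * e for e in residuals)
    return math.sqrt(sum_sq / (len(xs) - 2)), sum(residuals)


# Hypothetical figures for illustration only.
expenditure = [1, 2, 3, 4, 5]  # X
profit = [2, 4, 5, 4, 5]       # Y

a, b = fit_line(expenditure, profit)
se, residual_sum = standard_error_of_estimate(expenditure, profit, a, b)
print(f"Se = {se:.4f}")
print(f"sum of residuals = {residual_sum:.6f}")
```

Because of floating-point rounding, the sum of residuals may print as a very small number rather than exactly zero; it should be compared with zero only within a small tolerance.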
After gaining an understanding of the concept and application of simple correlation and simple regression, we can discern the differences between them.