Linear Discriminant Analysis - cs534

Given a training set $\{(x_i, y_i) : i = 1, \ldots, N\}$ with $y_i \in \{0, 1\}$ and $x_i \in \mathbb{R}^d$, we aim to learn $p(x, y)$. Specifically, we factorize $p(x, y) = p(x \mid y)\, p(y)$, where $p(y)$ is the prior distribution of $y$. We denote $p(y = 1) = \pi$. We further make the simplifying assumption that $p(x \mid y) = \mathcal{N}(x \mid \mu_y, \Sigma)$. That is, we assume that the data from class 0 and class 1 come from two different Gaussians, each with a distinct mean but a shared covariance. This is the basic assumption behind Linear Discriminant Analysis.

1 Parameter Estimation

There are two problems we need to solve. First, the learning problem: given the training set, we need to learn the parameters that fully specify the joint distribution $p(x, y)$, namely $\pi$, $\mu_0$, $\mu_1$, and $\Sigma$. We apply maximum likelihood estimation. The log-likelihood is

$$\log P(D \mid M) = \sum_{i=1}^{N} \log p(x_i, y_i) = \sum_{i=1}^{N} \log\left\{ \left[\pi\, \mathcal{N}(x_i \mid \mu_1, \Sigma)\right]^{y_i} \left[(1 - \pi)\, \mathcal{N}(x_i \mid \mu_0, \Sigma)\right]^{1 - y_i} \right\}$$
$$= \sum_{i=1}^{N} \left\{ y_i \log \pi + y_i \log \mathcal{N}(x_i \mid \mu_1, \Sigma) + (1 - y_i) \log(1 - \pi) + (1 - y_i) \log \mathcal{N}(x_i \mid \mu_0, \Sigma) \right\}.$$

Let us first estimate $\pi$. Collecting the terms that contain $\pi$, we have

$$\sum_{i=1}^{N} \left\{ y_i \log \pi + (1 - y_i) \log(1 - \pi) \right\}.$$

Taking the derivative with respect to $\pi$ gives

$$\frac{1}{\pi} \sum_{i=1}^{N} y_i - \frac{1}{1 - \pi} \sum_{i=1}^{N} (1 - y_i).$$

Setting it to zero and letting $N_1 = \sum_{i=1}^{N} y_i$ and $N_2 = \sum_{i=1}^{N} (1 - y_i)$, we have

$$\frac{N_1}{\pi} = \frac{N_2}{1 - \pi} \quad \Longrightarrow \quad \pi = \frac{N_1}{N_1 + N_2}.$$

We now move on to estimating $\mu_1$ (the case of $\mu_0$ is exactly the same). Collecting the terms that contain $\mu_1$, we have

$$\sum_{i=1}^{N} y_i \log \mathcal{N}(x_i \mid \mu_1, \Sigma) = -\sum_{i=1}^{N} y_i \frac{(x_i - \mu_1)^T \Sigma^{-1} (x_i - \mu_1)}{2} + \text{const} = -\sum_{y_i = 1} \frac{(x_i - \mu_1)^T \Sigma^{-1} (x_i - \mu_1)}{2} + \text{const}.$$

Taking the derivative with respect to $\mu_1$ and setting it to zero, we have

$$\sum_{y_i = 1} \Sigma^{-1} (x_i - \mu_1) = 0 \quad \Longrightarrow \quad \mu_1 = \frac{1}{N_1} \sum_{y_i = 1} x_i.$$

Similarly,

$$\mu_0 = \frac{1}{N_2} \sum_{y_i = 0} x_i.$$
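In code, these estimators reduce to class counts and class means. Below is a minimal numpy sketch; the function name estimate_prior_and_means and the array names X and y are illustrative choices, not from the notes.

```python
import numpy as np

def estimate_prior_and_means(X, y):
    """MLE under the LDA model: X is (N, d), y is (N,) with 0/1 labels."""
    N1 = np.sum(y == 1)            # N_1 = sum_i y_i
    N2 = np.sum(y == 0)            # N_2 = sum_i (1 - y_i)
    pi = N1 / (N1 + N2)            # pi = N_1 / (N_1 + N_2)
    mu1 = X[y == 1].mean(axis=0)   # mu_1 = (1/N_1) sum over class-1 x_i
    mu0 = X[y == 0].mean(axis=0)   # mu_0 = (1/N_2) sum over class-0 x_i
    return pi, mu0, mu1
```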
Finally, we estimate $\Sigma$. Define the per-class scatter matrices $S_1 = \frac{1}{N_1} \sum_{y_i = 1} (x_i - \mu_1)(x_i - \mu_1)^T$ and $S_2 = \frac{1}{N_2} \sum_{y_i = 0} (x_i - \mu_0)(x_i - \mu_0)^T$. Collecting the terms that contain $\Sigma$, we have

$$-\frac{N_1}{2} \ln |\Sigma| - \frac{1}{2} \sum_{i=1}^{N} y_i (x_i - \mu_1)^T \Sigma^{-1} (x_i - \mu_1) - \frac{N_2}{2} \ln |\Sigma| - \frac{1}{2} \sum_{i=1}^{N} (1 - y_i)(x_i - \mu_0)^T \Sigma^{-1} (x_i - \mu_0)$$
$$= -\frac{N}{2} \ln |\Sigma| - \frac{N_1}{2} \operatorname{Tr}(\Sigma^{-1} S_1) - \frac{N_2}{2} \operatorname{Tr}(\Sigma^{-1} S_2)$$
$$= -\frac{N}{2} \ln |\Sigma| - \frac{N}{2} \operatorname{Tr}\!\left( \Sigma^{-1} \left( \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2 \right) \right)$$
$$= -\frac{N}{2} \ln |\Sigma| - \frac{N}{2} \operatorname{Tr}(\Sigma^{-1} S).$$

Taking the derivative with respect to $\Sigma$ and setting it to zero, we have

$$\Sigma = S = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2,$$

i.e., the weighted average of the two per-class covariance estimates.
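Continuing the sketch, the pooled covariance can be computed from the centered class blocks; as above, the names are illustrative.

```python
import numpy as np

def estimate_shared_covariance(X, y, mu0, mu1):
    """MLE of the shared covariance: Sigma = (N_1/N) S_1 + (N_2/N) S_2."""
    D1 = X[y == 1] - mu1   # class-1 points centered at mu_1
    D0 = X[y == 0] - mu0   # class-0 points centered at mu_0
    # D1.T @ D1 = N_1 * S_1 and D0.T @ D0 = N_2 * S_2, so the weighted
    # average of S_1 and S_2 is the total scatter divided by N.
    return (D1.T @ D1 + D0.T @ D0) / len(y)
```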
2 Decision Boundary

The second problem we need to solve is the inference problem: given $x$, infer its $y$ value, i.e., the prediction problem. Recall from our previous lectures that to minimize the probability of misclassifying a given example $x$, we predict $y$ to be the value that maximizes $p(x, y)$. Thus, we can consider the ratio

$$\frac{p(x, y = 1)}{p(x, y = 0)}$$

and predict 1 if this ratio is greater than 1. This is equivalent to predicting $y = 1$ if $\log \frac{p(x, y = 1)}{p(x, y = 0)} > 0$. Note that

$$\log \frac{p(x, y = 1)}{p(x, y = 0)} = \log \frac{\pi\, \mathcal{N}(x \mid \mu_1, \Sigma)}{(1 - \pi)\, \mathcal{N}(x \mid \mu_0, \Sigma)} = \log \frac{\pi}{1 - \pi} + \log \frac{\mathcal{N}(x \mid \mu_1, \Sigma)}{\mathcal{N}(x \mid \mu_0, \Sigma)}$$
$$= \log \frac{\pi}{1 - \pi} - \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)$$
$$= \log \frac{\pi}{1 - \pi} - \frac{1}{2} x^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} x^T \Sigma^{-1} x - \mu_0^T \Sigma^{-1} x + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0$$
$$= (\mu_1 - \mu_0)^T \Sigma^{-1} x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0 + \log \frac{\pi}{1 - \pi} = w^T x + w_0,$$

where $w = \Sigma^{-1}(\mu_1 - \mu_0)$ and $w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0 + \log \frac{\pi}{1 - \pi}$. This shows that LDA learns a linear decision boundary. Note that if we relax our modeling assumption so that the classes have different covariance matrices, the same derivation yields a quadratic decision boundary, because the $x^T \Sigma^{-1} x$ terms no longer cancel.

3 Dimension reduction view of LDA

As shown in the slides posted on the class website, we can also arrive at a similar solution by seeking a projection vector $w$ that maximizes the separation between the two classes. It turns out that $w = \Sigma^{-1}(\mu_1 - \mu_0)$ is the optimal projection vector in this sense. We will skip this part at this point of the class and revisit it when we discuss dimension reduction techniques.
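To connect Sections 2 and 3 in code: the sketch below evaluates $w^T x + w_0$ using the $w$ and $w_0$ derived above, and the same $w$ serves as the projection direction mentioned in Section 3. The naming is again illustrative.

```python
import numpy as np

def lda_decision_function(X, pi, mu0, mu1, Sigma):
    """Return log p(x, y=1) - log p(x, y=0) = w^T x + w_0 for each row of X."""
    w = np.linalg.solve(Sigma, mu1 - mu0)             # w = Sigma^{-1}(mu_1 - mu_0)
    w0 = (-0.5 * mu1 @ np.linalg.solve(Sigma, mu1)    # -1/2 mu_1^T Sigma^{-1} mu_1
          + 0.5 * mu0 @ np.linalg.solve(Sigma, mu0)   # +1/2 mu_0^T Sigma^{-1} mu_0
          + np.log(pi / (1 - pi)))                    # + log(pi / (1 - pi))
    return X @ w + w0   # predict y = 1 wherever this score is positive
```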

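For completeness, a quick end-to-end check of the sketches above on synthetic data drawn from the assumed model (two Gaussians with a shared covariance); all names remain illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two classes with distinct means and a shared (identity) covariance.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([2.0, 1.0], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

pi, mu0, mu1 = estimate_prior_and_means(X, y)
Sigma = estimate_shared_covariance(X, y, mu0, mu1)
y_hat = (lda_decision_function(X, pi, mu0, mu1, Sigma) > 0).astype(int)
print("training accuracy:", (y_hat == y).mean())
```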