
# Linear Discriminant Analysis Notes | EduRev


Linear Discriminant Analysis - cs534

Given a training set $\{(x_i, y_i) : i = 1, \ldots, N\}$, with $y_i \in \{0, 1\}$ and $x_i \in \mathbb{R}^d$, we aim to learn $p(x, y)$. Specifically, we factorize $p(x, y) = p(x \mid y)\,p(y)$, where $p(y)$ is the prior distribution of $y$. We will denote $p(y = 1) = \pi$. We further make the simplifying assumption that $p(x \mid y) = N(x \mid \mu_y, \Sigma)$. That is, we assume that the data from class 0 and class 1 come from two different Gaussians, each with a distinct mean but a shared covariance. This is the basic assumption behind Linear Discriminant Analysis.

## 1 Parameter Estimation

There are two problems we need to solve. First, the learning problem: given the training set, we need to learn the parameters that fully specify the joint distribution $p(x, y)$, namely $\pi$, $\mu_0$, $\mu_1$, and $\Sigma$. We will apply maximum likelihood estimation. The log-likelihood is as follows:

$$\log P(D \mid M) = \sum_{i=1}^{N} \log p(x_i, y_i) = \sum_{i=1}^{N} \log \left\{ \left[\pi\, N(x_i \mid \mu_1, \Sigma)\right]^{y_i} \left[(1-\pi)\, N(x_i \mid \mu_0, \Sigma)\right]^{1-y_i} \right\}$$

$$= \sum_{i=1}^{N} \left\{ y_i \log \pi + y_i \log N(x_i \mid \mu_1, \Sigma) + (1-y_i) \log(1-\pi) + (1-y_i) \log N(x_i \mid \mu_0, \Sigma) \right\}$$

Let's first estimate $\pi$. Collecting the terms that contain $\pi$, we have:

$$\sum_{i=1}^{N} \left\{ y_i \log \pi + (1-y_i) \log(1-\pi) \right\}$$

Taking the derivative with respect to $\pi$:

$$\frac{1}{\pi} \sum_{i=1}^{N} y_i - \frac{1}{1-\pi} \sum_{i=1}^{N} (1-y_i)$$

Setting this to zero and letting $N_1 = \sum_{i=1}^{N} y_i$ and $N_2 = \sum_{i=1}^{N} (1-y_i)$, we have:

$$\frac{N_1}{\pi} = \frac{N_2}{1-\pi} \quad\Longrightarrow\quad \pi = \frac{N_1}{N_1 + N_2}$$

We now move on to estimating $\mu_1$ (the derivation for $\mu_0$ is exactly the same). Collecting the terms that contain $\mu_1$, we have:

$$\sum_{i=1}^{N} y_i \log N(x_i \mid \mu_1, \Sigma) = -\sum_{i=1}^{N} y_i \frac{(x_i - \mu_1)^T \Sigma^{-1} (x_i - \mu_1)}{2} + \text{const} = -\sum_{y_i = 1} \frac{(x_i - \mu_1)^T \Sigma^{-1} (x_i - \mu_1)}{2} + \text{const}$$

Taking the derivative with respect to $\mu_1$ and setting it to zero, we have:
$$\sum_{y_i = 1} \Sigma^{-1} (x_i - \mu_1) = 0 \quad\Longrightarrow\quad \mu_1 = \frac{1}{N_1} \sum_{y_i = 1} x_i$$

Similarly, we have:

$$\mu_0 = \frac{1}{N_2} \sum_{y_i = 0} x_i$$

Finally, we will estimate $\Sigma$. Taking the part of the log-likelihood that contains $\Sigma$, we have:

$$-\frac{N_1}{2} \ln|\Sigma| - \frac{1}{2} \sum_{i=1}^{N} y_i (x_i - \mu_1)^T \Sigma^{-1} (x_i - \mu_1) - \frac{N_2}{2} \ln|\Sigma| - \frac{1}{2} \sum_{i=1}^{N} (1-y_i)(x_i - \mu_0)^T \Sigma^{-1} (x_i - \mu_0)$$

$$= -\frac{N}{2} \ln|\Sigma| - \frac{N_1}{2} \operatorname{Tr}(\Sigma^{-1} S_1) - \frac{N_2}{2} \operatorname{Tr}(\Sigma^{-1} S_2)$$

$$= -\frac{N}{2} \ln|\Sigma| - \frac{N}{2} \operatorname{Tr}\!\left(\Sigma^{-1} \left(\frac{N_1}{N} S_1 + \frac{N_2}{N} S_2\right)\right)$$

$$= -\frac{N}{2} \ln|\Sigma| - \frac{N}{2} \operatorname{Tr}(\Sigma^{-1} S)$$

where $S_1 = \frac{1}{N_1} \sum_{y_i=1} (x_i - \mu_1)(x_i - \mu_1)^T$ and $S_2 = \frac{1}{N_2} \sum_{y_i=0} (x_i - \mu_0)(x_i - \mu_0)^T$ are the within-class sample covariances. Taking the derivative with respect to $\Sigma$ and setting it to zero, we have:

$$\Sigma = S = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2$$
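The estimators derived above ($\pi$, the class means, and the pooled covariance) translate directly into code. The following is a minimal NumPy sketch; the function name `fit_lda` and the variable names are my own, not from the notes:

```python
import numpy as np

def fit_lda(X, y):
    """Maximum-likelihood LDA parameters from data X (N x d) and labels y in {0, 1}."""
    X1, X0 = X[y == 1], X[y == 0]
    N1, N2 = len(X1), len(X0)
    pi = N1 / (N1 + N2)                  # pi = N1 / (N1 + N2)
    mu1 = X1.mean(axis=0)                # mu_1 = (1/N1) sum over class-1 x_i
    mu0 = X0.mean(axis=0)                # mu_0 = (1/N2) sum over class-0 x_i
    # Within-class sample covariances, pooled as S = (N1/N) S1 + (N2/N) S2
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X0 - mu0).T @ (X0 - mu0) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)
    return pi, mu0, mu1, Sigma
```

Note that the pooled covariance uses the maximum-likelihood normalization ($1/N$), not the unbiased $1/(N-1)$ normalization that `np.cov` defaults to.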
## 2 Decision Boundary

The second problem that we need to solve is the inference problem: given $x$, we need to infer its $y$ value, i.e., the prediction problem. Recall from our previous lectures that to minimize the probability of misclassifying a given example $x$, we predict $y$ to be the value that maximizes $p(x, y)$. Thus, we can consider the ratio

$$\frac{p(x, y=1)}{p(x, y=0)}$$

and predict 1 if this ratio is greater than 1. This is equivalent to predicting $y = 1$ if $\log \frac{p(x, y=1)}{p(x, y=0)} > 0$. Note that

$$\log \frac{p(x, y=1)}{p(x, y=0)} = \log \frac{\pi\, N(x \mid \mu_1, \Sigma)}{(1-\pi)\, N(x \mid \mu_0, \Sigma)} = \log \frac{\pi}{1-\pi} + \log \frac{N(x \mid \mu_1, \Sigma)}{N(x \mid \mu_0, \Sigma)}$$

$$= \log \frac{\pi}{1-\pi} - \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)$$

$$= \log \frac{\pi}{1-\pi} - \frac{1}{2} x^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} x^T \Sigma^{-1} x - \mu_0^T \Sigma^{-1} x + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0$$

$$= (\mu_1 - \mu_0)^T \Sigma^{-1} x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0 + \log \frac{\pi}{1-\pi} = w^T x + w_0$$
where $w = \Sigma^{-1}(\mu_1 - \mu_0)$ and $w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_0^T \Sigma^{-1} \mu_0 + \log \frac{\pi}{1-\pi}$. This indicates that LDA learns a linear decision boundary.
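As a self-contained sketch of the resulting classifier (the function name `lda_predict` and the parameter values below are illustrative, not from the notes):

```python
import numpy as np

def lda_predict(x, pi, mu0, mu1, Sigma):
    """Predict y = 1 iff w^T x + w_0 > 0, with w and w_0 as derived above."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu0)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0 + np.log(pi / (1 - pi))
    return int(w @ x + w0 > 0)

# Illustrative 1-d example: unit variance, equal priors, means at -1 and +1.
# Here w = 2 and w0 = 0, so the decision boundary is simply x = 0.
pi, mu0, mu1 = 0.5, np.array([-1.0]), np.array([1.0])
Sigma = np.array([[1.0]])
```

With equal priors and a shared covariance, the intercept terms cancel for symmetric means and the boundary is the hyperplane midway between the class means.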
Note that if we relax our modeling assumption so that the different classes have different covariance matrices, the above derivation will result in a quadratic decision boundary.
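To see why, note that with class-specific covariances $\Sigma_0$ and $\Sigma_1$ the quadratic terms no longer cancel in the expansion above; following the same steps (a sketch of mine, with the determinant and prior terms collected into the constant), the log-ratio becomes

```latex
\log\frac{p(x, y=1)}{p(x, y=0)}
  = -\frac{1}{2}\, x^T \left(\Sigma_1^{-1} - \Sigma_0^{-1}\right) x
    + \left(\mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1}\right) x
    + \text{const}
```

which is quadratic in $x$ unless $\Sigma_1 = \Sigma_0$.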
## 3 Dimension reduction view of LDA

As shown in the slides posted on the class website, we can also arrive at a similar solution by seeking a projection vector $w$ that maximizes the separation between the two classes. It turns out that $w = \Sigma^{-1}(\mu_1 - \mu_0)$ is the optimal projection vector in this sense. We will skip this part for now and revisit it when we discuss dimension reduction techniques.
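A minimal sketch of this projection view (the data and names here are illustrative assumptions, not from the notes): projecting each example onto $w = \Sigma^{-1}(\mu_1 - \mu_0)$ collapses the $d$-dimensional data to one dimension, where a single threshold separates the classes.

```python
import numpy as np

# Projection direction from the derivation: w = Sigma^{-1} (mu1 - mu0)
mu0 = np.array([0.0, 0.0])
mu1 = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.0], [0.0, 1.0]])
w = np.linalg.solve(Sigma, mu1 - mu0)  # solve, rather than forming the inverse

# Two points near each class mean, projected to scalar scores z = w^T x
X = np.array([[0.1, -0.2], [-0.3, 0.1], [1.9, 1.1], [2.2, 0.8]])
z = X @ w
# The class-1-like points (last two rows) receive larger scores,
# so a threshold on z separates the classes.
```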