1. The Univariate Normal Distribution
It is useful to first visit the single-variable case; that is, the well-known continuous probability distribution of a single random variable X. The normal density is a function of the mean µ and variance σ² of the random variable, and is shown below.
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
This model is ubiquitous in applications across Biology, Chemistry, Physics, Computer Science, and the Social Sciences. Its discovery is dated as early as 1738 (perhaps ironically, the distribution's common name, the Gaussian, does not honor its discoverer; the discovery is credited, instead, to Abraham de Moivre). But why is this model still so popular, and why has it seemed to gain relevance over time? Why do we use it as a paradigm for cat images and email data? The next section addresses three applications of the normal distribution and, in the process, derives its formula using elementary techniques.
1.1. The Continuous Approximation of the Binomial Distribution
Any introductory course in probability introduces counting arguments as a way to discuss probability; fundamentally, probability deals in subsets of larger supersets, so being able to count the cardinality of the subset and superset allows us to build a framework for thinking about probability. One of the first-introduced discrete distributions based on counting arguments is the binomial distribution, which counts the number of successes (or failures) in n independent trials that each have a probability p of success. The binomial distribution is defined as,
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \dots, n.$$
The argument is simple: the probability of one specific outcome with k successes is $p^k (1-p)^{n-k}$, by the independence of the trials. But there is more than one way to achieve k successes in n trials. Therefore, to correct the undercounting, we note that the number of ways to place k successes among n trials is given by the binomial coefficient
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!},$$
which completes the argument. It can be shown that a binomial random variable X has expectation E(X) = np, variance Var(X) = np(1−p), and mode (the number of successes with maximal probability) ⌊np + p⌋.
In practice, we often want more than just the probability of a specific number of successes. A classical problem in this area of study is factory production. Let’s say I own a factory that produces iPads that are either functional or defective. If I produce defective iPads with 2% probability, what’s the probability that more than 50 in 10,000 units are defective (assuming independence)? This problem is somewhat difficult to solve using the binomial formula, since I have a sum of combinatorial terms, i.e.,
$$P(X > 50) = \sum_{k=51}^{10000} \binom{10000}{k} (0.02)^k (0.98)^{10000-k}.$$
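As a rough illustration of the factory problem, the tail sum can be evaluated term by term in log-space to avoid enormous binomial coefficients. This is only a sketch under stated assumptions: the function names `log_binom_pmf` and `binom_upper_tail` are hypothetical, and the log-gamma trick is one of several possible ways to do the computation, not the notes' own method.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    # log of the binomial coefficient n! / (k! (n-k)!) via log-gamma
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

def binom_upper_tail(threshold, n, p):
    """P(X > threshold), computed as 1 minus the sum of the terms for k <= threshold."""
    lower = sum(math.exp(log_binom_pmf(k, n, p)) for k in range(threshold + 1))
    return 1.0 - lower

# The factory example: n = 10000 units, defect probability p = 0.02.
tail = binom_upper_tail(50, 10_000, 0.02)
print(f"P(X > 50) = {tail:.6f}")
```

Since the expected number of defects is np = 200, far above 50, the printed probability is extremely close to 1. Summing the 51 lower terms and subtracting from 1 is a design choice that keeps the loop short.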
For larger n, this can be difficult to compute, even using software. Therefore, we are motivated to obtain a continuous distribution that approximates the binomial distribution in question and has a well-known cumulative distribution function (the probability of an observation being less than a certain quantity). This leads to the following theorem.
Theorem 1.1.1 (The Normal Approximation to the Binomial Distribution) The continuous approximation to the binomial distribution has the form of the normal density, with µ = np and σ² = np(1−p).
Proof. The proof follows the basic ideas of Jim Pitman in Probability.¹
Writing P(k) for P(X = k), define the height function H as the ratio between the probability of k successes and the probability of m successes, where m is the mode of the binomial distribution, i.e.
$$H(k) = \frac{P(k)}{P(m)}.$$
Define the consecutive heights ratio function R as the ratio of the heights of a bucket, or number of successes, k and its predecessor bucket, k−1, i.e.
$$R(k) = \frac{P(k)}{P(k-1)} = \frac{n-k+1}{k} \cdot \frac{p}{1-p}.$$
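For k > m, H(k) is the product of (k−m) consecutive heights ratios,
$$H(k) = \frac{P(m+1)}{P(m)} \cdot \frac{P(m+2)}{P(m+1)} \cdots \frac{P(k)}{P(k-1)} = R(m+1)\,R(m+2) \cdots R(k).$$
And, for k < m,
$$H(k) = \frac{1}{R(k+1)\,R(k+2) \cdots R(m)}.$$
Denoting the natural logarithm function as log, we apply it to the products above to turn them into sums:
$$\log H(k) = \sum_{j=m+1}^{k} \log R(j) \quad (k > m), \qquad \log H(k) = -\sum_{j=k+1}^{m} \log R(j) \quad (k < m).$$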
In order to evaluate the sum more explicitly, we express the logarithm of the consecutive heights ratio in a useful form, using some approximations from calculus. Namely, recall that log(1 + δ) ≈ δ for small δ. Letting k = m + x ≈ np + x, for k > m,
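One way to fill in the intermediate steps for k > m is sketched here. It is a reconstruction under assumptions, not the notes' own derivation: q = 1 − p is an abbreviation introduced for this sketch, only the approximations named in the text are used, and the original step numbering is not reproduced.

```latex
\begin{align*}
\log R(k) &= \log\left(\frac{n-k+1}{k}\cdot\frac{p}{q}\right)
  \approx \log\left(\frac{(nq - x)\,p}{(np + x)\,q}\right)
  && \text{using } k \approx np + x,\; n - k + 1 \approx n - k \\
 &= \log\left(1 - \frac{x}{nq}\right) - \log\left(1 + \frac{x}{np}\right) \\
 &\approx -\frac{x}{nq} - \frac{x}{np} = -\frac{x}{npq}
  && \text{using } \log(1+\delta) \approx \delta, \\[6pt]
\log H(m + x) &= \sum_{j=1}^{x} \log R(m + j)
  \approx -\sum_{j=1}^{x} \frac{j}{npq}
  = -\frac{x(x+1)}{2npq} \approx -\frac{x^2}{2npq}.
\end{align*}
```

The last approximation drops the lower-order term x/(2npq), which is small relative to x²/(2npq) when x is large.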
There is an analogous formula for k < m, which is left as an exercise for the reader. In the above steps, we also used a few “large value” assumptions to obtain the results: from (2) to (3), we assume np + p ≈ np, which is a reasonable assumption for large n. We also assume, in the same step, that k + 1 ≈ k, which is reasonable for large k. Now, summing to obtain H(k), we obtain the non-normalized height of the normal density,
$$\log H(k) \approx -\frac{(k-np)^2}{2np(1-p)}.$$
Therefore,
$$H(k) \approx \exp\left(-\frac{(k-np)^2}{2np(1-p)}\right). \qquad (8)$$
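However, the quantity we seek is not the height of the histogram, but the probability P(k) = P(X = k) of an arbitrary number of successes k, that is,
$$P(k) = P(m)\,H(k).$$
But, we know that
$$\sum_{j=0}^{n} P(j) = P(m) \sum_{j=0}^{n} H(j) = 1.$$
So,
$$P(k) = \frac{H(k)}{\sum_{j=0}^{n} H(j)}.$$
Using a continuous argument on the approximation in (8), we replace the sum in the denominator by the integral of the function H over all reals and obtain the normalizing constant $\sqrt{2\pi np(1-p)}$, which completes the proof.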
This is a powerful result, albeit one with limitations. Specifically, if k or n is small, the approximation is much less accurate than we may otherwise desire. Other results in statistical theory, developed in the centuries since this approximation was discovered, deal with the issue of smaller samples, but will not be covered here.
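To get a feel for how the accuracy depends on n, the exact binomial probabilities can be compared with the approximating normal density. The sketch below is an assumption-laden illustration: the helper names and the choice p = 0.5 are hypothetical, and the maximum absolute gap is just one convenient error measure.

```python
import math

def binom_pmf(k, n, p):
    """Exact binomial probability P(X = k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_pdf(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return math.exp(-(x - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def max_abs_error(n, p):
    """Largest gap between the binomial pmf and the normal density over k = 0..n."""
    mu, var = n * p, n * p * (1 - p)
    return max(abs(binom_pmf(k, n, p) - normal_pdf(k, mu, var)) for k in range(n + 1))

# The gap shrinks as the number of trials grows.
for n in (10, 100, 400):
    print(n, max_abs_error(n, 0.5))
```

The values of n are kept moderate so that the binomial coefficients still fit in a floating-point number.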
1.2. The Central Limit Theorem
The Central Limit Theorem is a seminal result that places us one step closer to establishing the practical footing that allows us to understand the underlying processes by which data are generated. Formally, the central limit theorem is stated below.
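Theorem 1.2.1 (The Central Limit Theorem) Suppose that X1,...,Xn is a finite sequence of independent, identically distributed random variables with common mean E(Xi) = µ and finite variance σ² = Var(Xi). Then, if we let
$$S_n = X_1 + \dots + X_n,$$
we have that
$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0,1) \quad \text{as } n \to \infty.$$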
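Proof. The proof relies on the characteristic function from probability. The characteristic function of a random variable X is defined to be,
$$\varphi_X(t) = E\left(e^{itX}\right),$$
where $i = \sqrt{-1}$. For a continuous distribution with density $f_X$, using the formula for expectation, we have,
$$\varphi_X(t) = \int_{-\infty}^{\infty} e^{itx} f_X(x)\,dx.$$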
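This is the Fourier transform of the probability density function. It uniquely determines the probability distribution, and is useful for deriving analytical results about probability distributions. The characteristic function for the univariate normal distribution is computed from the formula,
$$\varphi_X(t) = \int_{-\infty}^{\infty} e^{itx}\, \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx.$$
Recall from single-variable calculus that the integral
$$\int_{-\infty}^{\infty} e^{-u^2/2}\,du = \sqrt{2\pi}.$$
Then, completing the square in the exponent and using the technique of u-substitution, we have,
$$\varphi_X(t) = \exp\left(i\mu t - \frac{\sigma^2 t^2}{2}\right). \qquad (9)$$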
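It is a fact of characteristic functions that the characteristic function of the sum Z = X + Y of two independent random variables is the product of their characteristic functions. That is,
$$\varphi_Z(t) = \varphi_X(t)\,\varphi_Y(t).$$
The proof of this little lemma is rather simple, so it is included below.
$$\varphi_Z(t) = E\left(e^{it(X+Y)}\right) = E\left(e^{itX} e^{itY}\right) = E\left(e^{itX}\right) E\left(e^{itY}\right) = \varphi_X(t)\,\varphi_Y(t),$$
where the third equality uses independence. Then, for the sum of n independent random variables Z = X1 + ... + Xn, the characteristic function is,
$$\varphi_Z(t) = \prod_{j=1}^{n} \varphi_{X_j}(t).$$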
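For the central limit theorem, we specifically examine the random variable Sn = X1 + ... + Xn, when the Xj are independent and identically distributed. Denote the centered random variables Zj = Xj − µ. It turns out that the characteristic function of the random variable,
$$\frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{1}{\sigma\sqrt{n}} \sum_{j=1}^{n} Z_j,$$
is easier to study than that of the sum Sn, so we consider the characteristic function of this random variable below. Because the Zj are independent and identically distributed, it is a product of n identical factors,
$$\varphi_{(S_n - n\mu)/(\sigma\sqrt{n})}(t) = \left[\varphi_{Z_1}\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n.$$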
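Using the Maclaurin series for e^x, and ignoring the normalizing constant 1/(σ√n) for a moment, we have that,
$$\varphi_{Z_1}(t) = E\left(e^{itZ_1}\right) = E\left(1 + itZ_1 + \frac{(it)^2 Z_1^2}{2!} + \dots\right) = 1 + it\,E(Z_1) - \frac{t^2}{2}\,E(Z_1^2) + \dots,$$
where the linearity of expectation was used in the ultimate step. Incorporating the normalizing constant,
$$\varphi_{Z_1}\left(\frac{t}{\sigma\sqrt{n}}\right) = 1 + \frac{it}{\sigma\sqrt{n}}\,E(Z_1) - \frac{t^2}{2\sigma^2 n}\,E(Z_1^2) + \varepsilon_n.$$
Finally, we take n → ∞ in order to get the limit theorem in question. Note that E(X1 − µ) = 0 and that E((X1 − µ)²) = Var(X1 − µ) = σ². Critically, the terms of order higher than two in the Maclaurin polynomial, which we denote εn, vanish faster than 1/n as n → ∞. Then,
$$\varphi_{(S_n - n\mu)/(\sigma\sqrt{n})}(t) = \left(1 - \frac{t^2}{2n} + \varepsilon_n\right)^n \to e^{-t^2/2}. \qquad (10)$$
Note that (10) is, by way of the derivation of the characteristic function for the normal density in (9), the characteristic function of a normal random variable with µ = 0 and σ² = 1,
$$e^{-t^2/2} = \varphi_{N(0,1)}(t),$$
which completes the proof. We could also use the inverse Fourier transform to compute the closed-form probability density function, which would be the probability density function of the N(0,1) random variable, but this is left as an exercise for the reader.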
The idea of the Central Limit Theorem is unintuitive. It states that, even for the most poorly-behaved of distributions with finite variance (non-symmetric, bimodal, etc.), the distribution of the standardized sum is approximately normal for large n. This result has in some sense defined classical statistical inference. It allows us to develop the idea of a hypothesis test, which allows us to compare data to a theoretical distribution from which we hypothesize the data has been drawn, leading to a transition out of probability theory and into the realm of theoretical statistics. It is useful, therefore, for comparing proposed means, variances, and proportions to ones actually drawn from the population, which leads to a variety of applications that span from A/B testing at technology companies to polling results that allow scientists to obtain reasonable predictions about the results of an election (except, it seems, in 2016).²
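The claim can be checked by simulation. The sketch below is a hypothetical illustration: the choice of the (strongly skewed) Exponential(1) distribution, the sample sizes, and the seed are all assumptions made for this example. It standardizes sums exactly as in Theorem 1.2.1 and compares a few summary statistics with those of N(0,1).

```python
import math
import random
import statistics

def standardized_sum(n, rng):
    """Sum n draws from Exponential(1) (mean 1, variance 1) and standardize the sum."""
    mu, sigma = 1.0, 1.0
    s = sum(rng.expovariate(1.0) for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [standardized_sum(50, rng) for _ in range(10_000)]

mean = statistics.fmean(samples)
var = statistics.pvariance(samples)
# For N(0,1), about 68% of the mass lies within one standard deviation of 0.
within_one = sum(abs(z) <= 1 for z in samples) / len(samples)
print(mean, var, within_one)
```

The sample mean and variance should land near 0 and 1, and the fraction within one unit of zero near the familiar 68% of the standard normal.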
The Central Limit Theorem also has fascinating applications in signal processing. Consider the following (potentially contrived) example. When using a cell phone, I make contact with a single cell tower for which my cell phone is in broadcast range. At each time step t, I send a deterministic (that is, not stochastic/random) signal X to the tower. However, the process is fairly noisy, so the signal that is received by the tower, Y, is,
$$Y = X + \varepsilon,$$
where ε is drawn from an arbitrary zero-mean noise distribution. Then, assuming independence of signal and noise, we have that,
$$E(Y) = X, \qquad Var(Y) = Var(\varepsilon).$$
If Var(ε) is high, the signal that is received at the tower may be quite noisy. However, say I disperse two hundred tools for measuring cell signals throughout the area, all of which feed data to the single cell tower. Let the measurements be Y1,...,Y200, with,
$$Y_i = X + \varepsilon_i,$$
where the εi are independent draws from the same error distribution as before. Then, the cell tower can compute,
$$\bar{Y} = \frac{1}{200} \sum_{i=1}^{200} Y_i.$$
The Central Limit Theorem tells us that this quantity is approximately normally distributed, with expectation and variance,
$$E(\bar{Y}) = X, \qquad Var(\bar{Y}) = \frac{Var(\varepsilon)}{200}.$$
Therefore, the magic of the central limit theorem allows us to more precisely estimate the signal sent from one electronic device by using arithmetic averages.
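The variance reduction from averaging can be seen in a small simulation. Everything specific here is an assumption for illustration: the signal value 5.0, the uniform noise on [−3, 3] (zero mean, variance 3), and the function name `received` are hypothetical.

```python
import random
import statistics

def received(signal, rng):
    """One noisy measurement Y = X + eps, with eps uniform on [-3, 3] (variance 3)."""
    return signal + rng.uniform(-3.0, 3.0)

rng = random.Random(1)  # fixed seed so the run is repeatable
signal = 5.0            # hypothetical deterministic signal X

# Variance of a single measurement versus the average of 200 measurements.
single = [received(signal, rng) for _ in range(5_000)]
averaged = [
    statistics.fmean([received(signal, rng) for _ in range(200)])
    for _ in range(5_000)
]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print(var_single, var_averaged, var_single / var_averaged)
```

The ratio of the two variances should come out close to 200, the number of measuring devices.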
1.3. The Noisy Process
The final example in the preceding section leads us to the ultimate and most useful interpretation of the Gaussian distribution: as an error curve. This leads us into a story whose characters are the scientific celebrities of the late 18th and early 19th centuries. The search for a model for error and imprecision was pioneered by astronomers. Galileo observed in 1632 that inexactness was ubiquitous in measurement, and he conjectured that an error distribution would capture five inherent truths:
1. There is only one number which gives the distance of a star from the center of the earth, the true distance.
2. All observations are encumbered with errors, due to the observer, the instruments, and the other observational conditions.
3. All observations are distributed symmetrically about the true value; that is, the errors are distributed symmetrically about zero.
4. Small errors occur more frequently than large errors.
5. The calculated distance is a function of the direct angular observations such that small adjustments of the observations may result in a large adjustment of the distance.
Following the precedent of the philosophy of Galileo, and in light of celestial events nearly 170 years later, a young mathematician by the name of Carl Friedrich Gauss gained fame in Europe for accurately predicting the orbit of the “heavenly body” Ceres.³ He explained to the scientific community that he used a “least squares” estimate in order to locate the orbit that best fit the observations. His theory was grounded in the following three assumptions, which resemble the points of Galileo:
1. Small errors are more likely than large errors.
2. For any real number ε, errors of magnitudes ε and −ε are equally likely.
3. In the presence of several measurements of the same quantity, the most likely value of the quantity being measured is their average (from the “least squares” formulation - it can be shown that the average minimizes the sum of squares).
Based on these observations, Gauss settled on the error curve that we now know as the normal density.
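The remark in the third assumption, that the average minimizes the sum of squares, can be checked numerically. This is a minimal sketch with made-up measurements; the data values and the grid search are assumptions for illustration only.

```python
import statistics

def sum_of_squares(center, data):
    """Sum of squared deviations of the data from a candidate value."""
    return sum((x - center) ** 2 for x in data)

measurements = [9.8, 10.1, 10.4, 9.9, 10.3]  # hypothetical repeated measurements
mean = statistics.fmean(measurements)

# Search a fine grid of candidate values around the mean for the minimizer.
grid = [mean + d / 100 for d in range(-100, 101)]
best = min(grid, key=lambda c: sum_of_squares(c, measurements))
print(mean, best)
```

Expanding the square shows why: the sum of squares about any value c equals the sum about the mean plus n(c − mean)², which is smallest at c = mean.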
Theorem 1.3.1 (The Normal Distribution as an Error Curve) The probability density for the error curve is,
$$\varphi(x) = \frac{h}{\sqrt{\pi}}\, e^{-h^2 x^2},$$
where h is a positive constant that represents the “precision of the measurement process”.
Proof. Begin the proof by assuming that we have measurements from a process with true value m. Denote the measurements m1,...,mn, and denote by φ(x) the probability density function of the random errors. Since the distribution is symmetric, the function is even, so φ(x) = φ(−x). Assume that the function φ(x) is differentiable, with derivative φ'(x). Assuming independent measurements, the likelihood of a particular set of observations, given the true value, is,
$$L(m) = \varphi(m_1 - m)\,\varphi(m_2 - m) \cdots \varphi(m_n - m).$$
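Gauss assumed (from assumption 3 above) that the most likely value for the quantity m, given the n observations, that is, the maximum likelihood estimate of m, was the mean of the measurements, m̄. Therefore, the likelihood of a particular set of observations is maximized when it is written as the product of differences from the mean,
$$\varphi(m_1 - \bar{m})\,\varphi(m_2 - \bar{m}) \cdots \varphi(m_n - \bar{m}).$$
Differentiating with respect to m gives an important characteristic of the error curve.
$$\frac{d}{dm}\Big[\varphi(m_1 - m) \cdots \varphi(m_n - m)\Big]_{m = \bar{m}} = 0$$
It follows that,
$$\sum_{i=1}^{n} \frac{\varphi'(m_i - \bar{m})}{\varphi(m_i - \bar{m})} = 0.$$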
It can be shown that this condition implies that the ratio of the derivative of the function to the function itself is linear, i.e.,
$$\frac{\varphi'(x)}{\varphi(x)} = kx,$$
for some real constant k. Solving this differential equation, we have,
$$\varphi(x) = A\, e^{kx^2/2},$$
for some positive constant A. In order for the density to be maximal at x = 0 (and integrable), k/2 must be some negative constant, so we let k/2 = −h² for some positive constant h. Then, the final form of the distribution is obtained by integrating the density to obtain the normalizing constant A.
$$\int_{-\infty}^{\infty} A\, e^{-h^2 x^2}\,dx = \frac{A\sqrt{\pi}}{h} = 1$$
Therefore,
$$A = \frac{h}{\sqrt{\pi}}, \qquad \varphi(x) = \frac{h}{\sqrt{\pi}}\, e^{-h^2 x^2}.$$
Of course, this is exactly the form of the normal density, when the mean is zero (which Gauss assumed), and the constant h is
$$h = \frac{1}{\sigma\sqrt{2}}.$$
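The correspondence between Gauss's precision constant and the standard deviation can be verified numerically. In this sketch the value σ = 1.7, the integration range, and the function names are assumptions chosen for illustration; the check uses the relation h = 1/(σ√2), which follows from matching h² with 1/(2σ²).

```python
import math

def error_curve(x, h):
    """Gauss's error curve (h / sqrt(pi)) * exp(-h^2 x^2)."""
    return h / math.sqrt(math.pi) * math.exp(-(h * x) ** 2)

def normal_pdf(x, sigma):
    """Zero-mean normal density with standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

sigma = 1.7                      # hypothetical standard deviation
h = 1 / (sigma * math.sqrt(2))   # corresponding precision constant

# Riemann sum over [-20, 20], which covers more than 11 standard deviations.
step = 0.001
grid = [-20 + step * i for i in range(40_001)]
area = step * sum(error_curve(x, h) for x in grid)
max_gap = max(abs(error_curve(x, h) - normal_pdf(x, sigma)) for x in grid)
print(area, max_gap)
```

The area should be numerically indistinguishable from 1, and the two curves should agree to rounding error at every grid point.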
The results of this section are highly relevant to our exploration of digits, images, and emails in Machine Learning throughout the semester. The data generation and measurement process can be represented as the sum of a “truth” and a linear combination of symmetric error curves. The Central Limit Theorem and the definition of the noisy process in this section provide us with some assurance that the normal distribution adequately models these errors, and therefore motivate its use in statistical models. Next, we turn to the derivation of the Multivariate Gaussian, which is an adaptation of the univariate density for high-dimensional data vectors.
2. The Multivariate Normal Distribution
Our goal in this section is to derive the well-known density function for the multivariate normal distribution, which deals with random vectors instead of just individual random variables. To do this, we will start with a collection (a vector) of independent, normally distributed random variables and work our way up to the general case where they are no longer independent.
2.1. The Basic Case: Independent Univariate Normals
To derive the general case of the multivariate normal, we will start with a vector consisting of N independent, normally distributed random variables with mean 0 and variance 1: Z = (Z1,Z2,...,ZN), where Zi ∼ N(0,1). We denote the density of a single Zi as fZi. Then, since the variables are independent, the joint probability density function fZ of all N variables will just be the product of their densities. That is,
$$f_Z(z) = \prod_{i=1}^{N} f_{Z_i}(z_i) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}}\, e^{-z_i^2/2} = (2\pi)^{-N/2} \exp\left(-\frac{1}{2} z^T z\right),$$
where $z^T z = z_1^2 + \dots + z_N^2$. In this case we say that Z ∼ N(0, I). Unfortunately, this derivation is restricted to the case where these entries are independent and 0-centered. However, we will see in the next few sections that we can derive the general case using this result.
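As a quick sanity check of the independent case, the product of N standard normal densities can be compared with the closed form (2π)^(−N/2) exp(−zᵀz/2). The function names and the test point below are hypothetical.

```python
import math

def std_normal_pdf(z):
    """Density of a single N(0, 1) random variable."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def joint_pdf_product(z):
    """Joint density of independent N(0, 1) entries, as a product of marginals."""
    return math.prod(std_normal_pdf(zi) for zi in z)

def joint_pdf_closed_form(z):
    """The same joint density written as (2 pi)^(-N/2) exp(-z^T z / 2)."""
    n = len(z)
    return (2 * math.pi) ** (-n / 2) * math.exp(-0.5 * sum(zi * zi for zi in z))

point = [0.3, -1.2, 2.0]  # hypothetical evaluation point in R^3
print(joint_pdf_product(point), joint_pdf_closed_form(point))
```

The two printed values agree up to floating-point rounding, since the exponents simply add when the densities are multiplied.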
2.2. Affine Transformations of a Random Vector
Consider an affine transformation L : R^N → R^N, L(x) = Ax + b, for an invertible matrix A ∈ R^(N×N) and a constant vector b ∈ R^N. It is easy to verify that when we apply this transformation to a random vector Z, with mean E(Z) = µ and covariance cov(Z) = ΣZ, we get a new random vector X = L(Z) such that:
$$E(X) = A\mu + b, \qquad cov(X) = A\,\Sigma_Z A^T.$$
In our case, the affine transformation will consist of a rotation followed by a scaling along the principal axes of our rotation, with one translation. That is, for a symmetric, positive definite matrix Σ and constant vector µ, we will be looking at the transformation X = Σ^(1/2)Z + µ. It is interesting to note that, given an eigendecomposition Σ = UΛUᵀ, where U is orthogonal and Λ is a diagonal matrix consisting of the eigenvalues of Σ, so that Σ^(1/2) = UΛ^(1/2)Uᵀ, entry xi of the new random vector is a weighted sum of the originally independent random variables in Z. Let Ai denote the ith row of a matrix A. Then,
$$x_i = \left(U \Lambda^{1/2} U^T\right)_i Z + \mu_i = \sum_{j=1}^{N} \left(\Sigma^{1/2}\right)_{ij} Z_j + \mu_i.$$
This is, in a sense, the source of the covariance between entries of X.
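The induced covariance can be observed by sampling. The sketch below is a two-dimensional illustration under assumptions: the matrix Σ, the vector µ, and the seed are made up, and instead of an eigendecomposition it uses a closed form for the symmetric square root that holds only for 2×2 positive definite matrices, namely (Σ + sI)/t with s = √det Σ and t = √(tr Σ + 2s).

```python
import math
import random

def sqrt_spd_2x2(a, b, d):
    """Symmetric square root of [[a, b], [b, d]] via the 2x2-only closed form."""
    s = math.sqrt(a * d - b * b)      # square root of the determinant
    t = math.sqrt(a + d + 2 * s)      # square root of (trace + 2 s)
    return (a + s) / t, b / t, (d + s) / t

a, b, d = 2.0, 0.8, 1.0               # hypothetical covariance [[2, 0.8], [0.8, 1]]
mu = (1.0, -2.0)                      # hypothetical mean vector
r11, r12, r22 = sqrt_spd_2x2(a, b, d)

rng = random.Random(0)                # fixed seed so the run is repeatable
xs = []
for _ in range(50_000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)   # independent N(0, 1)
    xs.append((r11 * z1 + r12 * z2 + mu[0], r12 * z1 + r22 * z2 + mu[1]))

n = len(xs)
m1 = sum(x[0] for x in xs) / n
m2 = sum(x[1] for x in xs) / n
var1 = sum((x[0] - m1) ** 2 for x in xs) / n
var2 = sum((x[1] - m2) ** 2 for x in xs) / n
cov12 = sum((x[0] - m1) * (x[1] - m2) for x in xs) / n
print((m1, m2), (var1, cov12, var2))
```

Although z1 and z2 are drawn independently, the sample covariance of the transformed entries should come out near the off-diagonal entry 0.8 of Σ.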
We now just need one more fact about a change of variables to derive the general multivariate normal PDF for this new random vector.
2.3. PDF of a Transformed Random Vector
Suppose that Z is a random vector taking on values in a subset S ⊆ R^N, with a continuous probability density function f. Suppose that X = r(Z), where r is a one-to-one differentiable function from S onto some other subset T ⊆ R^N. Then the probability density function g of X is given by:
$$g(x) = f\left(r^{-1}(x)\right) \left|\det\left(\frac{\partial r^{-1}(x)}{\partial x}\right)\right|, \qquad x \in T,$$
where, if you recall from multivariable calculus, $\partial r^{-1}(x)/\partial x$ is the Jacobian of the inverse of r, and det(·) denotes the determinant of a matrix.⁴ Returning to our previous discussion, where X = Σ^(1/2)Z + µ, we can see that the inverse transformation is given by Z = Σ^(−1/2)(X − µ). It is easily verifiable that the determinant of the Jacobian of this inverse is given by det(Σ^(−1/2)) = det(Σ)^(−1/2). Now we have everything we need for the general case.
2.4. The Multivariate Normal PDF
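Consider the random vector Z ∼ N(0, I), where I is the identity matrix. As before, we let X = Σ^(1/2)Z + µ for positive definite Σ and a constant vector µ. We can now find the density function g of X from the known density function f for Z.
$$g(x) = f\left(\Sigma^{-1/2}(x - \mu)\right) \det(\Sigma)^{-1/2} = (2\pi)^{-N/2} \det(\Sigma)^{-1/2} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
This is the probability density function for a multivariate normal distribution with mean vector µ and covariance matrix Σ. We say that X ∼ N(µ, Σ).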
One important insight is that by performing an affine transformation on a vector consisting of independent, normally distributed variables, we have “induced” a measure of dependence, the covariance, between the entries of our resulting random vector. By using some properties related to a change of variables, we then derived the density function for the resulting distribution.
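For concreteness, the density can be evaluated directly in two dimensions, where the inverse and determinant of Σ have simple explicit forms. This is a minimal sketch, not a general implementation: the function name `mvn_pdf_2d` and the example values are hypothetical, and the code assumes Σ is a symmetric positive definite 2×2 matrix.

```python
import math

def mvn_pdf_2d(x, mu, sigma):
    """Bivariate normal density; sigma is ((a, b), (c, d)), symmetric positive definite."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx0, dx1 = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), using the explicit 2x2 inverse.
    quad = (d * dx0 * dx0 - (b + c) * dx0 * dx1 + a * dx1 * dx1) / det
    # Normalizer (2 pi)^(-N/2) det(Sigma)^(-1/2) with N = 2.
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# With a diagonal covariance the joint density factors into univariate densities.
x, mu = (0.5, -1.0), (1.0, -2.0)
diagonal = ((2.0, 0.0), (0.0, 0.5))
print(mvn_pdf_2d(x, mu, diagonal), normal_pdf(0.5, 1.0, 2.0) * normal_pdf(-1.0, -2.0, 0.5))

# With nonzero covariance the entries are no longer independent.
print(mvn_pdf_2d(x, mu, ((2.0, 0.8), (0.8, 1.0))))
```

The first line prints two matching numbers, which recovers the independent case of Section 2.1 as a special case of the general formula.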