Elementary Bayesian inference - 1

Abstract

The Bayesian interpretation of probability is one of two broad categories of interpretations. Bayesian inference updates knowledge about unknowns (parameters) with information from data. The LaplacesDemon package is a complete environment for Bayesian inference within R, and this vignette provides an introduction to the topic. This article introduces Bayes’ theorem, model-based Bayesian inference, components of Bayesian inference, prior distributions, hierarchical Bayes, conjugacy, likelihood, numerical approximation, prediction, Bayes factors, model fit, posterior predictive checks, and ends by comparing advantages and disadvantages of Bayesian inference.
 

This article is an introduction to Bayesian inference for users of the LaplacesDemon package (Statisticat LLC. 2015) in R (R Core Team 2014), often referred to as LD. LaplacesDemonCpp is an extension package that uses C++. A formal introduction to LaplacesDemon is provided in an accompanying vignette entitled “LaplacesDemon Tutorial”. Merriam-Webster defines ‘Bayesian’ as follows

Bayesian : being, relating to, or involving statistical methods that assign probabilities or distributions to events (as rain tomorrow) or parameters (as a population mean) based on experience or best guesses before experimentation and data collection and that apply Bayes’ theorem to revise the probabilities and distributions after obtaining experimental data.

In statistical inference, there are two broad categories of interpretations of probability: Bayesian inference and frequentist inference. These views often differ from each other on the fundamental nature of probability. Frequentist inference loosely defines probability as the limit of an event’s relative frequency in a large number of trials, and only in the context of experiments that are random and well-defined. Bayesian inference, on the other hand, is able to assign probabilities to any statement, even when a random process is not involved. In Bayesian inference, probability is a way to represent an individual’s degree of belief in a statement, given the evidence.

Within Bayesian inference, there are also different interpretations of probability, and different approaches based on those interpretations. The most popular interpretations and approaches are objective Bayesian inference (Berger 2006) and subjective Bayesian inference (Anscombe and Aumann 1963; Goldstein 2006). Objective Bayesian inference is often associated with Bayes and Price (1763), Laplace (1814), and Jeffreys (1961). Subjective Bayesian inference is often associated with Ramsey (1926), De Finetti (1931), and Savage (1954). The first major event to bring about the rebirth of Bayesian inference was De Finetti (1937). Differences in the interpretation of probability are best explored outside of this article.1

This article is intended as an approachable introduction to Bayesian inference, or as a handy summary for experienced Bayesians. It is assumed that the reader has at least an elementary understanding of statistics, and this article focuses on applied, rather than theoretical, material. Equations and statistical notation are included, but the presentation hopefully does not require an intricate understanding of how to solve integrals, for example, only the basic concept of integration. Please be aware that it is difficult to summarize Bayesian inference in such a short article; for a more thorough and formal introduction, consider Robert (2007).


1. Bayes’ Theorem

Bayes’ theorem shows the relation between two conditional probabilities that are the reverse of each other. This theorem is named after Reverend Thomas Bayes (1701-1761), and is also referred to as Bayes’ law or Bayes’ rule (Bayes and Price 1763)2. Bayes’ theorem expresses the conditional probability, or ‘posterior probability’, of an event A after B is observed in terms of the ‘prior probability’ of A, prior probability of B, and the conditional probability of B given A. Bayes’ theorem is valid in all common interpretations of probability. The two (related) examples below should be sufficient to introduce Bayes’ theorem.


1.1. Bayes’ Theorem, Example 1

Bayes’ theorem provides an expression for the conditional probability of A given B, which is equal to
Pr(A|B) = Pr(B|A)Pr(A) / Pr(B)           (1)

For example, suppose one asks the question: what is the probability of going to Hell, conditional on consorting (or given that a person consorts) with Laplace’s Demon3. By replacing A with Hell and B with Consort, the question becomes

Pr(Hell|Consort) = Pr(Consort|Hell)Pr(Hell) / Pr(Consort)

Note that a common fallacy is to assume that Pr(A|B) = Pr(B|A), which is called the conditional probability fallacy.
 

1.2. Bayes’ Theorem, Example 2

Another way to state Bayes’ theorem is

Pr(Ai|B) = Pr(B|Ai)Pr(Ai) / Σj Pr(B|Aj)Pr(Aj)

Let’s examine our burning question, by replacing Ai with Hell or Heaven, and replacing B with Consort

  •  Pr(A1) = Pr(Hell)
  •  Pr(A2) = Pr(Heaven)
  •  Pr(B) = Pr(Consort)
  • Pr(A1|B) = Pr(Hell|Consort)
  • Pr(A2|B) = Pr(Heaven|Consort)
  • Pr(B|A1) = Pr(Consort|Hell)
  • Pr(B|A2) = Pr(Consort|Heaven)
     

Laplace’s Demon was conjured and asked for some data. He was glad to oblige. Data

  • 6 people consorted out of 9 who went to Hell.
  •  5 people consorted out of 7 who went to Heaven.
  • 75% of the population goes to Hell.
  • 25% of the population goes to Heaven.

Now, Bayes’ theorem is applied to the data. Four pieces are worked out as follows

  •  Pr(Consort|Hell) = 6/9 = 0.666
  • Pr(Consort|Heaven) = 5/7 = 0.714
  • Pr(Hell) = 0.75
  •  Pr(Heaven) = 0.25

Finally, the desired conditional probability Pr(Hell|Consort) is calculated using Bayes’ theorem 

  •  Pr(Hell|Consort) = (0.666 × 0.75) / [(0.666 × 0.75) + (0.714 × 0.25)]
  •  Pr(Hell|Consort) = 0.737

The probability of going to Hell, given consorting with Laplace’s Demon, is 73.7%, which is less than the prevalence of 75% in the population. According to these findings, consorting with Laplace’s Demon does not increase the probability of going to Hell. With that in mind, please continue...
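Since the arithmetic above is simple, it can be verified directly. The following base R sketch (no LaplacesDemon functions are needed) reproduces the calculation, and also shows why the conditional probability fallacy from Example 1 matters here: Pr(Consort|Hell) and Pr(Hell|Consort) are different quantities.

# Data supplied by Laplace's Demon
p_consort_given_hell   <- 6/9    # Pr(Consort|Hell)
p_consort_given_heaven <- 5/7    # Pr(Consort|Heaven)
p_hell   <- 0.75                 # Pr(Hell)
p_heaven <- 0.25                 # Pr(Heaven)

# Bayes' theorem for Pr(Hell|Consort)
numerator   <- p_consort_given_hell * p_hell
denominator <- numerator + p_consort_given_heaven * p_heaven
round(numerator / denominator, 3)          # 0.737

# Compare with Pr(Consort|Hell) = 0.667: the two conditionals differ.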
 

2. Model-Based Bayesian Inference

The basis for Bayesian inference is derived from Bayes’ theorem. Here is Bayes’ theorem, equation 1, again

Pr(A|B) = Pr(B|A)Pr(A) / Pr(B)

Replacing B with observations y, A with parameter set Θ, and probabilities Pr with densities p (or sometimes π or function f), results in the following

p(Θ|y) = p(y|Θ)p(Θ) / p(y)

where p(y) will be discussed below, p(Θ) is the set of prior distributions of parameter set Θ before y is observed, p(y|Θ) is the likelihood of y under a model, and p(Θ|y) is the joint posterior distribution, sometimes called the full posterior distribution, of parameter set Θ that expresses uncertainty about parameter set Θ after taking both the prior and data into account. Since there are usually multiple parameters, Θ represents a set of j parameters, and may be considered hereafter in this article as

Θ = θ1,...,θj

The denominator

p(y) = ∫ p(y|Θ)p(Θ) dΘ

defines the “marginal likelihood” of y, or the “prior predictive distribution” of y, and may be set to an unknown constant c. The prior predictive distribution4 indicates what y should look like, given the model, before y has been observed. Only the set of prior probabilities and the model’s likelihood function are used for the marginal likelihood of y.

The presence of the marginal likelihood of y normalizes the joint posterior distribution, p(Θ|y), ensuring it is a proper distribution and integrates to one. By replacing p(y) with c, which is short for a ‘constant of proportionality’, the model-based formulation of Bayes’ theorem becomes

p(Θ|y) = p(y|Θ)p(Θ) / c

By removing c from the equation, the relationship changes from ’equals’ (=) to ’proportional to’ (∝)5 

p(Θ|y) ∝ p(y|Θ)p(Θ)           (2)

This form can be stated as the unnormalized joint posterior being proportional to the likelihood times the prior. However, the goal in model-based Bayesian inference is usually not to summarize the unnormalized joint posterior distribution, but to summarize the marginal distributions of the parameters. The full parameter set Θ can typically be partitioned into

Θ = {Φ,Λ}

where Φ is the sub-vector of interest, and Λ is the complementary sub-vector of Θ, often referred to as a vector of nuisance parameters. In a Bayesian framework, the presence of nuisance parameters does not pose any formal, theoretical problems. A nuisance parameter is a parameter that exists in the joint posterior distribution of a model, though it is not a parameter of interest. The marginal posterior distribution of Φ, the sub-vector of interest, can simply be written as

p(Φ|y) = ∫ p(Φ,Λ|y) dΛ

 In model-based Bayesian inference, Bayes’ theorem is used to estimate the unnormalized joint posterior distribution, and finally the user can assess and make inferences from the marginal posterior distributions.
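As a concrete, if deliberately crude, illustration of equation 2 and of marginalizing a nuisance parameter, the base R sketch below uses a grid approximation for a normal model with unknown mean μ (the sub-vector of interest) and unknown standard deviation σ (treated as a nuisance parameter). The data, priors, and grid are hypothetical choices made only for this example; in practice, LaplacesDemon provides proper numerical approximation algorithms for this task.

set.seed(1)
y <- rnorm(20, mean = 5, sd = 2)          # hypothetical data

mu    <- seq(0, 10, length.out = 200)     # grid for the parameter of interest
sigma <- seq(0.5, 6, length.out = 200)    # grid for the nuisance parameter
grid  <- expand.grid(mu = mu, sigma = sigma)

# log prior (weakly informative choices for illustration) plus log likelihood
log_prior <- dnorm(grid$mu, 0, 100, log = TRUE) +
  dunif(grid$sigma, 0.5, 6, log = TRUE)
log_lik <- mapply(function(m, s) sum(dnorm(y, m, s, log = TRUE)),
                  grid$mu, grid$sigma)

# unnormalized joint posterior, proportional to likelihood times prior (equation 2)
unnorm <- exp(log_lik + log_prior - max(log_lik + log_prior))
joint  <- unnorm / sum(unnorm)            # normalizing plays the role of p(y)

# marginal posterior of mu: integrate (here, sum) the joint over sigma
marginal_mu <- tapply(joint, grid$mu, sum)
mu[which.max(marginal_mu)]                # posterior mode of mu, close to mean(y)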


3. Components of Bayesian Inference

The components6 of Bayesian inference are

1. p(Θ) is the set of prior distributions for parameter set Θ, and uses probability as a means of quantifying uncertainty about Θ before taking the data into account.

2. p(y|Θ) is the likelihood or likelihood function, in which all variables are related in a full probability model.

3. p(Θ|y) is the joint posterior distribution that expresses uncertainty about parameter set Θ after taking both the prior and the data into account. If parameter set Θ is partitioned into a single parameter of interest φ and the remaining parameters are considered nuisance parameters, then p(φ|y) is the marginal posterior distribution.
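To make the three components concrete, here is a minimal base R sketch for a toy model with a normal likelihood, known standard deviation, and a single unknown mean θ. The data, prior, and candidate value of θ are hypothetical and serve only to show which quantity plays which role.

y     <- c(4.1, 5.3, 4.8, 6.0, 5.5)   # hypothetical data
theta <- 5                             # a candidate value of the parameter

log_prior      <- dnorm(theta, mean = 0, sd = 100, log = TRUE)       # p(theta)
log_likelihood <- sum(dnorm(y, mean = theta, sd = 1, log = TRUE))    # p(y|theta)

# The joint posterior p(theta|y) is proportional to the likelihood times the prior;
# the normalizing constant p(y) is not needed to compare candidate values of theta.
log_unnormalized_posterior <- log_likelihood + log_prior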
 

4. Prior Distributions

In Bayesian inference, a prior probability distribution, often called simply the prior, of an uncertain parameter θ or latent variable is a probability distribution that expresses uncertainty about θ before the data are taken into account7. The parameters of a prior distribution are called hyperparameters, to distinguish them from the parameters (Θ) of the model.

When applying Bayes’ theorem, the prior is multiplied by the likelihood function and then normalized to estimate the posterior probability distribution, which is the conditional distribution of Θ given the data. Moreover, the prior distribution affects the posterior distribution. Prior probability distributions have traditionally belonged to one of two categories: informative priors and uninformative priors. Here, four categories of priors are presented according to information8 and the goal in the use of the prior. The four categories are informative, weakly informative, least informative, and uninformative.
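That the prior affects the posterior can be seen with a small numeric sketch. Below, the same binomial data are combined with two different beta priors on a probability θ; the beta-binomial pairing is used only because its posterior has a simple closed form (this kind of pairing is discussed later under conjugacy), and all of the numbers are hypothetical.

successes <- 7
trials    <- 10

# Prior 1: Beta(1, 1), flat over (0, 1)
# Prior 2: Beta(20, 80), informative, centered near 0.2
post1 <- c(a = 1 + successes,  b = 1 + trials - successes)    # Beta(8, 4)
post2 <- c(a = 20 + successes, b = 80 + trials - successes)   # Beta(27, 83)

post_mean <- function(p) unname(p["a"] / (p["a"] + p["b"]))
post_mean(post1)   # about 0.667
post_mean(post2)   # about 0.245: the informative prior pulls the estimate toward 0.2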


4.1. Informative Priors

When prior information is available about θ, it should be included in the prior distribution of θ. For example, if the present model form is similar to a previous model form, and the present model is intended to be an updated version based on more current data, then the posterior distribution of θ from the previous model may be used as the prior distribution of θ for the present model.

In this way, each version of a model is not starting from scratch, based only on the present data, but the cumulative effects of all data, past and present, can be taken into account. To ensure the current data do not overwhelm the prior, Ibrahim and Chen (2000) introduced the power prior. The power prior is a class of informative prior distribution that takes previous data and results into account. If the present data is very similar to the previous data, then the precision of the posterior distribution increases when including more and more information from previous models. If the present data differs considerably, then the posterior distribution of θ may be in the tails of the prior distribution for θ, so the prior distribution contributes less density in its tails. Hierarchical Bayes is also a popular way to combine data sets.
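A minimal sketch of the idea that a previous posterior can serve as the current prior, again using the beta-binomial form purely for algebraic convenience and with hypothetical counts, is shown below.

# Previous study: 12 successes in 40 trials, starting from a Beta(1, 1) prior
prior_a <- 1 + 12
prior_b <- 1 + 28     # previous posterior: Beta(13, 29)

# Present study: 9 successes in 20 trials, with the previous posterior as the prior
post_a <- prior_a + 9
post_b <- prior_b + 11    # present posterior: Beta(22, 40)

post_a / (post_a + post_b)   # about 0.355, reflecting both data sets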

Sometimes informative prior information is not simply ready to be used, such as when it resides in another person, as in an expert. In this case, their personal beliefs about the probability of the event must be elicited into the form of a proper probability density function. This process is called prior elicitation.
 

4.2. Weakly Informative Priors 

Weakly Informative Prior (WIP) distributions use prior information for regularization9 and stabilization, providing enough prior information to prevent results that contradict our knowledge or problems such as an algorithmic failure to explore the state-space. Another goal is for WIPs to use less prior information than is actually available. A WIP should provide some of the benefit of prior information while avoiding some of the risk from using information that doesn’t exist. WIPs are the most common priors in practice, and are favored by subjective Bayesians.

Selecting a WIP can be tricky. WIP distributions should change with the sample size, because a model should have enough prior information to learn from the data, but the prior information must also be weak enough to learn from the data. Following is an example of a WIP in practice. It is popular, for good reasons, to center and scale all continuous predictors (Gelman 2008). Although centering and scaling predictors is not discussed here, it should be obvious that the potential range of the posterior distribution of θ for a centered and scaled predictor should be small. A popular WIP for a centered and scaled predictor may be 

θ ∼ N(0, 10000)

where θ is normally-distributed according to a mean of 0 and a variance of 10,000, which is equivalent to a standard deviation of 100, or precision of 1.0E-4. In this case, the density for θ is nearly flat. Nonetheless, the fact that it is not perfectly flat yields good properties for numerical approximation algorithms. In both Bayesian and frequentist inference, it is possible for numerical approximation algorithms to become stuck in regions of flat density, which become more common as sample size decreases or model complexity increases. Numerical approximation algorithms in frequentist inference function as though a flat prior were used, so numerical approximation algorithms in frequentist inference become stuck more frequently than numerical approximation algorithms in Bayesian inference. Prior distributions that are not completely flat provide enough information for the numerical approximation algorithm to continue to explore the target density, the posterior distribution. After updating a model in which WIPs exist, the user should examine the posterior to see if the posterior contradicts knowledge. If the posterior contradicts knowledge, then the WIP must be revised by including information that will make the posterior consistent with knowledge (Gelman 2008).
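The claim that θ ∼ N(0, 10000) is nearly flat over the range that matters for a centered and scaled predictor is easy to check numerically, as in the short sketch below (the chosen range of -5 to 5 is an assumption about what counts as plausible for such a coefficient).

theta <- c(-5, -1, 0, 1, 5)               # plausible values of a scaled coefficient
dens  <- dnorm(theta, mean = 0, sd = 100) # density of the N(0, 10000) prior
dens / dens[theta == 0]
# All ratios are within about 0.2% of 1, so the prior is essentially flat here,
# yet it still decays far from zero, which keeps it proper and helps numerical
# approximation algorithms avoid perfectly flat regions.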

A popular objective Bayesian criticism against WIPs is that there is no precise, mathematical form to derive the optimal WIP for a given model and data.

Vague Priors

A vague prior, also called a diffuse prior10, is difficult to define, after considering WIPs. The first formal move from vague to weakly informative priors is Lambert, Sutton, Burton, Abrams, and Jones (2005). After conjugate priors were introduced (Raiffa and Schlaifer 1961), most applied Bayesian modeling has used vague priors, parameterized to approximate the concept of uninformative priors (better considered as least informative priors, see section 4.3). For more information on conjugate priors, see section 6.

Typically, a vague prior is a conjugate prior with a large scale parameter. However, vague priors can pose problems when the sample size is small. Most problems with vague priors and small sample size are associated with scale, rather than location, parameters. The problem can be particularly acute in random-effects models, and the term random-effects is used rather loosely here to imply exchangeable11, hierarchical, and multilevel structures. A vague prior is defined here as usually being a conjugate prior that is intended to approximate an uninformative prior (or actually, a least informative prior), and without the goals of regularization and stabilization.
 

4.3. Least Informative Priors

The term ‘Least Informative Priors’, or LIPs, is used here to describe a class of prior in which the goal is to minimize the amount of subjective information content, and to use a prior that is determined solely by the model and observed data. The rationale for using LIPs is often said to be ‘to let the data speak for themselves’. LIPs are favored by objective Bayesians.

Flat Priors

The flat prior was historically the first attempt at an uninformative prior. The unbounded, uniform distribution, often called a flat prior, is

θ ∼ U(−∞, ∞)

where θ is uniformly-distributed from negative infinity to positive infinity. Although this seems to allow the posterior distribution to be affected solely by the data with no impact from prior information, this should generally be avoided because this probability distribution is improper, meaning it will not integrate to one, since the integral of the assumed p(θ) is infinity (which violates the assumption that the probabilities sum to one). This may cause the posterior to be improper, which invalidates the model.

Reverend Thomas Bayes (1701-1761) was the first to use inverse probability (Bayes and Price 1763), and used a flat prior for his billiard example so that all possible values of θ are equally likely a priori (Gelman, Carlin, Stern, and Rubin 2004, p. 34-36). Pierre-Simon Laplace (1749-1827) also used the flat prior to estimate the proportion of female births in a population, and for all estimation problems presented or justified as a reasonable expression of ignorance. Laplace’s use of this prior distribution was later referred to as the ‘principle of indifference’ or ‘principle of insufficient reason’, and is now called the flat prior (Gelman et al. 2004, p. 39). Laplace was aware that it was not truly uninformative, and used it as a LIP. Another problem with the flat prior is that it is not invariant to transformation. For example, a flat prior on a standard deviation parameter is not also flat for its variance or precision.
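The lack of invariance to transformation can be seen numerically. In the sketch below, a prior that is flat for a standard deviation σ over a bounded interval (bounded only so that sampling is possible) implies a prior for the variance σ² that is far from flat; the interval and number of draws are arbitrary choices for this illustration.

set.seed(1)
sigma  <- runif(1e5, min = 0, max = 10)   # draws from a flat prior on a standard deviation
sigma2 <- sigma^2                         # implied prior draws for the variance

# By the change of variables, the implied density of the variance is proportional
# to 1 / (2 * sqrt(sigma2)) on (0, 100), so it piles up near zero rather than being flat.
hist(sigma,  breaks = 50, main = "Flat prior on sigma")
hist(sigma2, breaks = 50, main = "Implied prior on sigma^2 (not flat)")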
 

Hierarchical Prior

A hierarchical prior is a prior in which the parameters of the prior distribution are estimated from data via hyperpriors, rather than with subjective information (Gelman 2008). Parameters of hyperprior distributions are called hyperparameters. Subjective Bayesians prefer the hierarchical prior as the LIP, and the hyperparameters are usually specified as WIPs. Hierarchical priors are presented later in more detail in the section entitled ‘Hierarchical Bayes’.
 

Jeffreys Prior

Jeffreys prior, also called Jeffreys rule, was introduced in an attempt to establish a least informative prior that is invariant to transformations (Jeffreys 1961). Jeffreys prior works well for a single parameter, but in multi-parameter situations, inappropriate aspects of the prior may accumulate across dimensions to detrimental effect.
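For a single parameter, a standard textbook illustration (not specific to LaplacesDemon) is the Bernoulli success probability θ, for which Jeffreys prior is proportional to the square root of the Fisher information and works out to the Beta(1/2, 1/2) density. The sketch below checks that proportionality numerically.

theta <- seq(0.05, 0.95, by = 0.05)

# Square root of the Fisher information for a single Bernoulli(theta) observation
jeffreys_unnorm <- sqrt(1 / (theta * (1 - theta)))

# Beta(1/2, 1/2) density at the same points
beta_half <- dbeta(theta, shape1 = 0.5, shape2 = 0.5)

# The ratio is constant (equal to pi, the normalizing constant), confirming proportionality
unique(round(jeffreys_unnorm / beta_half, 10))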
 

MAXENT

A MAXENT prior, proposed by Jaynes (1968), is a prior probability distribution that is selected among other candidate distributions as the prior of choice when it has the maximum entropy (MAXENT) in the considered set, given constraints on the candidate set. More entropy is associated with less information, and the least informative prior is preferred as a MAXENT prior. The principle of minimum cross-entropy generalizes MAXENT priors from mere selection to updating the prior given constraints while seeking the maximum possible entropy.
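A small numeric check of the entropy idea, using discrete distributions on the support {1, ..., 6} with no constraint other than the support (the candidate probability vectors are hypothetical): the uniform distribution attains the maximum entropy and would therefore be the MAXENT choice.

entropy <- function(p) -sum(p * log(p))   # Shannon entropy, natural log

candidates <- list(
  uniform = rep(1/6, 6),
  skewed  = c(0.40, 0.25, 0.15, 0.10, 0.06, 0.04),
  peaked  = c(0.05, 0.05, 0.70, 0.10, 0.05, 0.05)
)

sapply(candidates, entropy)
# The uniform distribution has entropy log(6), about 1.79, the largest of the three,
# so it carries the least information about the outcome.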
 

Reference Priors

Introduced by Bernardo (1979), reference priors do not express personal beliefs. Instead, reference priors allow the data to dominate the prior and posterior (Berger, Bernardo, and Dongchu 2009). Reference priors are estimated by maximizing the expected intrinsic discrepancy between the posterior distribution and prior distribution. This maximizes the expected posterior information about y when the prior density is p(y). In some sense, p(y) is the ‘least informative’ prior about y (Bernardo 2005). Reference priors are often the objective prior of choice in multivariate problems, since other rules (e.g., Jeffreys rule) may result in priors with problematic behavior. When reference priors are used, the analysis is called reference analysis, and the posterior is called the reference posterior.

Subjective Bayesian criticisms of reference priors are that the concepts of regularization and stabilization are not taken into account, results that contradict knowledge are not prevented, a numerical approximation algorithm may become stuck in low-probability or flat regions, and it may not be desirable to let the data speak fully.
 

4.4. Uninformative Priors
Traditionally, most of the above descriptions of prior distributions were categorized as uninformative priors. However, uninformative priors do not truly exist (Irony and Singpurwalla 1997), and all priors are informative in some way. Traditionally, there have been many names associated with uninformative priors, including diffuse, minimal, non-informative, objective, reference, uniform, vague, and perhaps weakly informative.
