Statistical Modeling and Inference
A brief review of the foundations of statistical inference
Statistical Models and Inference
Statistical inference: a formal approach to characterizing a random phenomenon using observations, either by providing a description of a past phenomenon or by giving predictions about future phenomena of a similar nature.
1. Statistical Models
The first step in statistical inference is to specify a statistical model, under some simplifying assumptions (e.g. independence assumptions).
- Hierarchical models: the probability distribution of one parameter depends on the values of other hierarchical parameters (i.e. conditional independence).
Steps:
Set the assumptions (e.g. independence), the parameters, and the model. Make the assumptions about the probability distributions explicit.
- We focus on parametric modeling, because data are often limited.
- Nonparametric methods are not considered here; they are used for hypothesis testing or when the sample size is very large.
Once the model is specified, choose a method of inference, as well as an algorithm to obtain estimates. The most commonly used are:
- Maximum likelihood inference
- Bayesian inference
2. Maximum likelihood inference
Quantifying confidence: the Fisher Information Matrix
Newton’s algorithm
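A minimal sketch of both ideas, on simulated exponential data (the model, data, and starting value are assumptions made for this illustration, not part of the notes): Newton's algorithm finds the MLE of the rate, and the observed Fisher information gives an approximate standard error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=0.5, size=500)   # simulated data; true rate lambda = 1/scale = 2
n, s = len(x), x.sum()

# Newton's algorithm for the rate of an exponential model:
# log-likelihood l(lam) = n*log(lam) - lam*s
lam = 1.0                                  # starting value
for _ in range(50):
    score = n / lam - s                    # first derivative of l
    hessian = -n / lam**2                  # second derivative of l
    step = score / hessian
    lam -= step                            # Newton update
    if abs(step) < 1e-10:
        break

# Observed Fisher information at the MLE -> approximate standard error
fisher_info = n / lam**2
se = 1.0 / np.sqrt(fisher_info)
print(f"MLE: {lam:.4f} (closed form n/s = {n / s:.4f}), approx. SE: {se:.4f}")
```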
Approximate Techniques
- Monte Carlo Sampling for intractable likelihoods
- Composite likelihood
3. Bayesian Inference
A statistical model describes the uncertainty about how the data were produced. The ultimate aim of statistical inference is to obtain information about the unknown parameter $\theta$ given the data $\mathcal{D}$.
- Frequentist: $\theta$ is a fixed but unknown quantity.
- Bayesian: use a fully probabilistic model and treat $\theta$ as a random quantity. To do so,
- choose an appropriate prior distribution $\mathbb{P}(\theta)$, which reflects the knowledge (i.e. uncertainty) about $\theta$ prior to the experiment
- the goal is to update the knowledge given the information contained in the data $\mathcal{D}$.
- the updated knowledge (i.e. reduced uncertainty) is encapsulated in the posterior distribution $\mathbb{P}(\theta \vert \mathcal{D})$, which is calculated via Bayes' theorem.
$$ \mathbb{P}(\boldsymbol{\theta} | \mathcal{D})=\frac{\mathbb{P}(\mathcal{D} | \boldsymbol{\theta}) \mathbb{P}(\boldsymbol{\theta})}{\mathbb{P}(\mathcal{D})} $$
The Bayesian paradigm boils down to the slogan: posterior $\propto$ likelihood $\times$ prior.
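To make the slogan concrete, here is a minimal sketch (my own illustration, with an assumed Bernoulli example of 7 successes in 10 trials): evaluate likelihood $\times$ prior on a grid of parameter values and normalize numerically, with the normalizing constant playing the role of $\mathbb{P}(\mathcal{D})$.

```python
import numpy as np

# Assumed data: 7 successes in 10 Bernoulli trials
k, n = 7, 10

theta = np.linspace(0.001, 0.999, 999)            # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                       # locally uniform prior
likelihood = theta**k * (1 - theta)**(n - k)      # P(D | theta), up to a binomial constant

unnormalized = likelihood * prior                 # posterior ∝ likelihood × prior
evidence = (unnormalized * dtheta).sum()          # numerical stand-in for P(D)
posterior = unnormalized / evidence               # integrates to 1 on the grid

posterior_mean = (theta * posterior * dtheta).sum()
print("posterior mean:", posterior_mean)          # close to (k + 1) / (n + 2) ≈ 0.667
```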
3.1 Choice of prior distributions
- Conjugate priors
  - the prior and the posterior lie in the same class of distributions.
  - often chosen because they lead to a well-known form of the posterior, which simplifies the calculations (see the sketch after this list).
- Noninformative priors
  - choose a prior that contains as little information about the parameter as possible.
  - a first choice would, of course, be a locally uniform prior. Under a uniform prior we have $\mathbb{P}(\boldsymbol{\theta} \vert \mathcal{D}) \propto \mathcal{L}(\boldsymbol{\theta})$.
  - the Jeffreys prior is another option, but it is often hard to obtain.
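As a sketch of conjugacy (the Beta–Binomial pair and the hyperparameter values are my own assumptions for illustration): a Beta prior combined with a binomial likelihood gives a Beta posterior with simply updated parameters, so no integration is needed.

```python
from scipy import stats

# Beta(a, b) prior on the success probability; data: k successes in n binomial trials
a, b = 2.0, 2.0          # assumed prior hyperparameters
k, n = 7, 10             # assumed data

# Conjugacy: Beta prior + binomial likelihood -> Beta posterior with updated parameters
a_post, b_post = a + k, b + (n - k)
posterior = stats.beta(a_post, b_post)

print("posterior mean:", posterior.mean())
print("95% equal-tailed interval:", posterior.interval(0.95))
```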
3.2 Bayesian point estimates and confidence intervals
Bayesian point estimates: the posterior mean, mode, and median. For example, the posterior mean is
$$ \hat{\theta}=\mathbb{E}[\theta | \mathcal{D}]=\int \theta\, \mathbb{P}(\theta | \mathcal{D})\, \mathrm{d} \theta $$
Confidence: the highest posterior density (HPD) region.
Choose a threshold value $\pi$ such that the region $\mathcal{C}_{\alpha}=\{\theta: \mathbb{P}(\theta \vert \mathcal{D})>\pi\}$ satisfies
$$ \int_{\mathcal{C}_{\alpha}} \mathbb{P}(\theta | \mathcal{D}) \mathrm{d} \theta=1-\alpha $$
This region $\mathcal{C}_{\alpha}$ is the HPD region.
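Given draws from the posterior (e.g. from the MCMC methods discussed next), both the point estimate and the HPD region are easy to approximate. The sketch below is my own illustration on assumed samples, using the common "shortest interval containing a fraction $1-\alpha$ of the sorted samples" approximation, which is valid for a unimodal posterior.

```python
import numpy as np

def hpd_interval(samples, alpha=0.05):
    """Shortest interval containing a fraction 1 - alpha of the samples
    (approximates the HPD region for a unimodal posterior)."""
    s = np.sort(samples)
    n = len(s)
    m = int(np.floor((1 - alpha) * n))    # number of samples inside the interval
    widths = s[m:] - s[: n - m]           # widths of all candidate intervals
    i = int(np.argmin(widths))            # index of the shortest one
    return s[i], s[i + m]

rng = np.random.default_rng(1)
samples = rng.normal(loc=0.3, scale=0.1, size=20_000)   # stand-in posterior samples

print("posterior mean:", samples.mean())
print("95% HPD interval:", hpd_interval(samples, alpha=0.05))
```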
3.3 Markov Chain Monte Carlo
A common challenge in Bayesian inference is that the integral
$$ \mathbb{P}(\mathcal{D})=\int \mathbb{P}(\mathcal{D} | \theta)\, \mathbb{P}(\theta)\, \mathrm{d} \theta $$
can’t be solved analytically.
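Markov chain Monte Carlo sidesteps this: algorithms such as Metropolis–Hastings only require the unnormalized posterior $\mathbb{P}(\mathcal{D} | \theta)\,\mathbb{P}(\theta)$, so $\mathbb{P}(\mathcal{D})$ never has to be evaluated. A minimal random-walk Metropolis sketch (my own illustration, reusing the assumed Bernoulli example from above):

```python
import numpy as np

def log_unnormalized_posterior(theta, k=7, n=10):
    """log[ P(D | theta) * P(theta) ] for the Bernoulli example with a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return k * np.log(theta) + (n - k) * np.log(1.0 - theta)

rng = np.random.default_rng(2)
theta, chain = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.1)               # random-walk proposal
    log_ratio = log_unnormalized_posterior(proposal) - log_unnormalized_posterior(theta)
    if np.log(rng.uniform()) < log_ratio:                  # Metropolis acceptance step
        theta = proposal
    chain.append(theta)

samples = np.array(chain[5_000:])                          # discard burn-in
print("posterior mean from MCMC:", samples.mean())
```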
to be continued…
3.4 Empirical Bayes for Latent Variable Problems
The first step is to infer point estimates for the parameters at the higher levels by integrating out those at the lower levels, and then to infer posterior distributions for the lower-level parameters while fixing the higher-level ones at their point estimates.
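A toy sketch of this two-step procedure (the normal–normal hierarchy, known noise level, and simulated data are my own assumptions): integrating out the lower-level $\theta_i$ gives the marginal $y_i \sim N(\mu, \sigma^2+\tau^2)$, from which $\mu$ and $\tau^2$ are point-estimated and then plugged back in to obtain posteriors for each $\theta_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0                                      # assumed known observation noise
true_theta = rng.normal(loc=2.0, scale=0.5, size=200)
y = rng.normal(loc=true_theta, scale=sigma)      # one observation per lower-level parameter

# Step 1: point estimates of the higher-level parameters (mu, tau^2),
# using the marginal obtained by integrating out theta_i: y_i ~ N(mu, sigma^2 + tau^2)
mu_hat = y.mean()
tau2_hat = max(y.var() - sigma**2, 0.0)

# Step 2: plug the point estimates in and compute the posterior of each theta_i
shrink = tau2_hat / (tau2_hat + sigma**2)        # shrinkage factor
post_mean = mu_hat + shrink * (y - mu_hat)       # posterior means of the theta_i
post_var = shrink * sigma**2                     # posterior variance (same for every i)

print("estimated mu, tau^2:", mu_hat, tau2_hat)
print("first posterior means:", post_mean[:3], "posterior variance:", post_var)
```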
3.5 Approximate Bayesian Computation
Approximate Bayesian computation (ABC) is a class of simulation-based techniques for conducting Bayesian inference under models with intractable likelihoods.
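A minimal rejection-ABC sketch (my own illustration, again on the assumed Bernoulli example, with the number of successes as the summary statistic): draw $\theta$ from the prior, simulate data under the model, and keep $\theta$ whenever the simulated summary is close enough to the observed one.

```python
import numpy as np

rng = np.random.default_rng(4)
k_obs, n = 7, 10       # observed summary statistic: 7 successes in 10 trials (assumed example)
tol = 0                # tolerance on the summary statistic (0 = exact match)

accepted = []
while len(accepted) < 5_000:
    theta = rng.uniform()                    # 1. draw theta from the prior
    k_sim = rng.binomial(n, theta)           # 2. simulate data under the model
    if abs(k_sim - k_obs) <= tol:            # 3. keep theta if the summaries are close enough
        accepted.append(theta)

samples = np.array(accepted)
print("ABC posterior mean:", samples.mean())  # close to the exact Beta posterior mean (k+1)/(n+2)
```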
3.6 Model selection
How can we compare several candidate models explaining the data $\mathcal{D}$?
The most commonly used methods are:
- likelihood ratio statistic
- model posterior probabilities
- Bayes factors
- others: cross-validation, Akaike’s information criterion (AIC), Bayesian information criterion (BIC)
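A small sketch of the likelihood ratio statistic, AIC, and BIC on assumed data (two nested Gaussian models of my own choosing, not from the notes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=0.2, scale=1.0, size=200)     # assumed data
n = len(x)

# Model 0: N(0, sigma^2), 1 free parameter
sigma0 = np.sqrt(np.mean(x**2))
ll0 = stats.norm(0.0, sigma0).logpdf(x).sum()

# Model 1: N(mu, sigma^2), 2 free parameters
mu1, sigma1 = x.mean(), x.std()
ll1 = stats.norm(mu1, sigma1).logpdf(x).sum()

# Likelihood ratio statistic for the nested comparison (chi-squared, 1 degree of freedom)
lr = 2.0 * (ll1 - ll0)
p_value = stats.chi2(df=1).sf(lr)
print(f"LR statistic: {lr:.2f}, p-value: {p_value:.4f}")

# Information criteria: penalized negative log-likelihoods (lower is better)
for name, k, ll in [("M0", 1, ll0), ("M1", 2, ll1)]:
    aic = 2 * k - 2 * ll
    bic = k * np.log(n) - 2 * ll
    print(f"{name}: AIC = {aic:.2f}, BIC = {bic:.2f}")
```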
4. Naive Bayes and Bayesian estimation
Naive Bayes and Bayesian estimation are two different concepts!
Naive Bayes
is a statistical learning method. For a given training set, learn the joint probability distribution $P(X,Y)$. Based on this model, for a given input $x$, output the $y$ with maximal posterior probability (via Bayes' theorem).
The prior probability distribution:
$$ P\left(Y=c_{k}\right), \quad k=1,2, \cdots, K $$
The conditional probability distribution:
$$ P\left(X=x | Y=c_{k}\right)=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} | Y=c_{k}\right), \quad k=1,2, \cdots, K $$
Naive Bayes
makes the strong assumption that the features are all conditionally independent given the class, that is:
$$ \begin{aligned} P\left(X=x | Y=c_{k}\right) &=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} | Y=c_{k}\right) \cr &=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right) \end{aligned} $$
Then, the posterior probability is:
$$ P\left(Y=c_{k} | X=x\right)=\frac{P\left(X=x | Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x | Y=c_{k}\right) P\left(Y=c_{k}\right)} $$
If the input variables are not conditionally independent, the model becomes a Bayesian network!
The naive Bayes classifier is:
$$ y=f(x)=\arg \max_{c_{k}} \frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right) }{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)} $$
and its short form:
$$ y=f(x)= \arg \max_{c_{k}} \overbrace{P\left(Y=c_{k}\right)}^{\text{prior}} \overbrace{ \prod_{j} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)}^{\text{likelihood}} $$
In sentiment analysis in NLP, the naive Bayes classifier makes two assumptions.
- Bag-of-words assumption: position doesn't matter. Each feature only encodes word identity, not position.
- Naive Bayes assumption: conditional independence.
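A minimal multinomial naive Bayes sketch under these two assumptions (the tiny training corpus and the Laplace smoothing constant are my own assumptions for illustration):

```python
import math
from collections import Counter, defaultdict

# Tiny assumed training corpus of (document, sentiment label) pairs
train = [
    ("good great fun", "pos"),
    ("great acting good plot", "pos"),
    ("boring plot bad acting", "neg"),
    ("bad boring waste", "neg"),
]

# Class counts for the prior P(Y = c_k) and word counts for P(X^(j) = w | Y = c_k)
class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for doc, label in train:
    for word in doc.split():
        word_counts[label][word] += 1
        vocab.add(word)

def predict(doc, alpha=1.0):
    """argmax over classes of log P(c_k) + sum_j log P(w_j | c_k), with Laplace smoothing."""
    scores = {}
    for c in class_counts:
        log_prior = math.log(class_counts[c] / len(train))
        total = sum(word_counts[c].values())
        log_likelihood = sum(
            math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in doc.split() if w in vocab      # bag of words: position is ignored
        )
        scores[c] = log_prior + log_likelihood
    return max(scores, key=scores.get)

print(predict("good fun plot"))   # expected: "pos"
```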