1. Maximum Likelihood Estimation
Suppose we have a probability distribution $P(x;\theta)$, and $\theta$ is the parameter of this probability distribution. From this probability distribution we sample an observation $x_1$. The probability of obtaining $x_1$ from probability distribution $P(x;\theta)$ is $p(x_1; \theta)$. When we have $n$ independent observations $X = \{x_1, ..., x_n\}$, the probability of observing $X$ under probability distribution $P(x; \theta)$ is:
\begin{equation}
p(X; \theta) = \prod_{i=1}^{n} P(x_i; \theta)
\end{equation}
For a given dataset $X$, $p(X; \theta)$ is a function of $\theta$:
\begin{equation}
\label{eq-likelihood}
L(\theta) = p(X; \theta) = \prod_{i=1}^{n} P(x_i; \theta)
\end{equation}
We call this function the likelihood function: $L(\theta)$. The maximum likelihood estimation is a process of finding the best $\theta$ that maximize the value of likelihood function. This process can be described as following:
\begin{equation}
\label{eq-lmax}
\hat{\theta} = argmax_{\theta}\ L(\theta)=argmax_{\theta}\prod_{i=1}^{n} P(x_i; \theta)
\end{equation}
Usually, it is easier to process summations instead of products. Thus we would like to add a log in both sides of equation \eqref{eq-likelihood}. And we will give $log(L(\theta))$ a new name: log-likelihood function, and denote it by $l(\theta)$:
\begin{equation}
l(\theta) = log(L(\theta)) = \sum_{i=1}^{n} log(P(x_i; \theta))
\end{equation}
Because the log is a non-decreasing function, maximizing the likelihood function is equivalent to maximizing the log-likelihood function. Equation \eqref{eq-lmax} can be written as following:
\begin{equation}
\hat{\theta} = argmax_{\theta}\ l(\theta) = argmax_{\theta}\sum_{i=1}^{n} log(P(x_i; \theta))
\end{equation}
To find the optimized $\theta$, we can use gradient descent, coordinate descent, Newton's iteration, etc.
2. Expectation Maximization
2.1 Difficulties of MLE
When we are able to observe all the random variables in a model, we can use MLE to find the optimized parameter as discussed in previous section. However, in some situations there are hidden variables in a model. We can not observe those variables, and the MLE method needs to be changed a little to incorporate those hidden variables. The expectation maximization (EM) is a popular method to solve the parameter estimation problem with hidden variables. EM can be understood as a variation of MLE method.
We denote the probability distribution with hidden random variables as $P(x,z;\theta)$. $x$ is observable random variables, $z$ is hidden random variables, and $\theta$ is the parameter for this model. Assume we have one observation $x_1$. The probability of observing $x_1$ under this probability distribution is
\begin{equation}
p(x_1;\theta) = \sum_{z_1}P(x_1, z_1; \theta)
\end{equation}
Here $p(x;\theta)$ is the marginal distribution of x, thus we need to sum over all possible values of hidden random variable $z$. When we observed n independent observable random variables $X=\{x_1, ..., x_n\}$, we can obtain the likelihood function $L(\theta)$ as well as log-likelihood function $l(\theta)$:
\begin{equation}
L(\theta) = \prod_{i=1}^n p(x_i;\theta) = \prod_{i=1}^n \sum_{z_i}P(x_i, z_i ; \theta)
\end{equation}
\begin{equation}
\label{eq-hiddenll}
l(\theta) = \sum_{i=1}^n log(p(x_i;\theta)) = \sum_{i=1}^n log(\sum_{z_i}P(x_i, z_i; \theta))
\end{equation}
In this setup, the MLE method is to find the $\theta$ that maximize the log-likelihood \eqref{eq-hiddenll}. It can be written as following:
\begin{equation}
\label{eq-mleh}
\hat{\theta} = argmax_{\theta}\ l(\theta) = argmax_{\theta}\sum_{i=1}^n log(\sum_{z_i}P(x_i, z_i; \theta))
\end{equation}
Here we can find that there is a summation in log. This makes equation \eqref{eq-mleh} very difficult to optimize. Thus the traditional MLE method is not feasible for this kind of problems. In order to solve this problem, we introduce the expectation maximization (EM) algorithm.
2.2 Jensen's Inequality
In EM algorithm, we modify the log-likelihood through Jensen's inequality, and make the log-likelihood easier to solve. The general form of Jensen's Inequality is:
\begin{equation}
\phi(E[X]) \geq E[\phi(X)],\\ given\ \phi(X)\ is\ concave
\end{equation}
The equality holds if and only if $X$ is constant. In EM algorithm, we modify the log-likelihood as following:
\begin{eqnarray}
l(\theta)&=\sum_{i=1}^n log(\sum_{z_i}P(x_i, z_i; \theta))\\
&= \sum_{i=1}^n log(\sum_{z_i}q(z_i)\frac{P(x_i, z_i; \theta)}{q(z_i)})\\
&= \sum_{i=1}^n log(E_{q(z_i)}[\frac{P(x_i, z_i; \theta)}{q(z_i)}])\\
&\geq \sum_{i=1}^n E_{q(z_i)}[log(\frac{P(x_i, z_i; \theta)}{q(z_i)})]\\
&= \sum_{i=1}^n \sum_{z_i}q(z_i)log(\frac{P(x_i, z_i; \theta)}{q(z_i)})
\end{eqnarray}
2.3 Iterations and Q-function
This optimization problem is going to be solved by iterations, which means the log-likelihood will be maximized step by step. We use $t$ to denote steps, for example $t = 0,1,2,\dots$. Under this notation, we denote the estimated parameter $\theta$ at step $t$ to be $\theta^{(t)}$. The $q(z_i)$ is a user constructed probability mass function. Without loss of generality, we assume that at step $t$, the user constructs $q(z_i)$ based on the parameter $\theta^{(t)}$, which means $q(z_i)$ is also a function of $\theta^{(t)}$:
\begin{equation}
At\ step\ t:\ q(z_i) = q(z_i, \theta^{(t)})
\end{equation}
Bring $q(z_i, \theta^{(t)})$ back to previously modified log-likelihood function, we finally obtain the objective function of EM algorithm (and we call it Q-function: $Q(\theta, \theta^{(t)}))$:
\begin{equation}
Q(\theta, \theta^{(t)})= \sum_{i=1}^n \sum_{z_i}q(z_i, \theta^{(t)})log(P(x_i, z_i; \theta))
\end{equation}
Now, we have a new question:
How should we construct $q(z_i)$ such that we can truly maximize $l(\theta)$ by maximizing $Q(\theta, \theta^{(t)})$ through iterations?
Here we choose the $q(z_i, \theta^{(t)})$ to be:
\begin{equation}
q(z_i, \theta^{(t)}) = \frac{P(z_i, x_i ; \theta^{(t)})}{\sum_{z_i}P(z_i, x_i ; \theta^{(t)})} = P(z_i | x_i; \theta^{(t)})
\end{equation}
Later we will prove that under this choice of $q(z_i, \theta^{(t)})$, the log-likelihood $l(\theta)$ is non-decreasing in EM iterations. In other words, we keep on finding $\theta's$ such that make $l(\theta)$ larger and larger.
2.4 Final Form of EM Algorithm
Here is the final form of EM algorithm:
\begin{eqnarray}
\begin{aligned}
&t = 0\\
&Initialize\ \theta^{(0)}\\
&While\ \theta^{(t)}\ not\ converge:\\
&\qquad q(z_i, \theta^{(t)})= P(z_i | x_i; \theta^{(t)})\\
&\qquad Q(\theta, \theta^{(t)})= \sum_{i=1}^n \sum_{z_i}q(z_i, \theta^{(t)})log(P(x_i, z_i; \theta))\\
&\qquad \theta^{(t+1)} = \underset{\theta}{argmax}\ Q(\theta, \theta^{(t)})\\
&\qquad t = t+1
\end{aligned}
\end{eqnarray}
3. Converge of EM Algorithm
In this section, we will prove that in EM steps the $l(\theta)$ is non-decreasing.
\begin{eqnarray}
&l(\theta^{(t+1)})=\sum_{i=1}^n log(\sum_{z_i}P(x_i, z_i; \theta^{(t+1)}))\\
& \geq \sum_{i=1}^n \sum_{z_i}q(z_i)log(\frac{P(x_i, z_i; \theta^{(t+1)})}{q(z_i)})\\
&= \sum_{i=1}^n \sum_{z_i}q(z_i, \theta^{(t)})log(\frac{P(x_i, z_i; \theta^{(t+1)})}{q(z_i, \theta^{(t)})})\\
&\geq \sum_{i=1}^n \sum_{z_i}q(z_i, \theta^{(t)})log(\frac{P(x_i, z_i; \theta^{(t)})}{q(z_i, \theta^{(t)})})\\
&= \sum_{i=1}^n log(\sum_{z_i}q(z_i, \theta^{(t)})\frac{P(x_i, z_i; \theta^{(t)})}{q(z_i, \theta^{(t)})})\\
&=\sum_{i=1}^n log(\sum_{z_i}P(x_i, z_i; \theta^{(t)}))\\
&=l(\theta^{(t)})
\end{eqnarray}
Thus, in EM algorithm it is guaranteed that:
\begin{equation}
l(\theta^{(t+1)}) \geq l(\theta^{(t)})
\end{equation}