Maximum Likelihood with Exponential Graphical Models


Graphical models provide a powerful framework for modeling interactions between large families of random variables. Following pioneering work by J. Besag (JRSS-B 36, 1974), they were introduced to the image processing community through the famous paper by S. and D. Geman (T-PAMI 6, 1984). Parameter estimation for this kind of model is quite difficult, however, and the algorithm described below, which belongs to the family of stochastic gradient methods, addresses this issue in an iterative way. The resulting approach is more flexible than iterative scaling (Darroch and Ratcliff, Annals of Mathematical Statistics, 1972), and more efficient (in the statistical sense) than the pseudo-likelihood estimator introduced by Besag. More details can be found in the following papers.

Estimation and annealing for Gibbsian fields, L. Younes, Annales de l'IHP, Probabilités et Statistiques 24, 269--294 (1988)
Parametric inference for imperfectly observed Gibbsian fields, L. Younes, Probability Theory and Related Fields 82, 625--645 (1989)
Maximum likelihood estimation for Gibbsian fields, L. Younes, Lecture Notes--Monograph Series, 403--426 (1991)
Stochastic gradient estimation strategies for Markov random fields, L. Younes, SPIE's International Symposium on Optical Science, Engineering, and Instrumentation, 315--325 (1998)
Parameter estimation for imperfectly observed Gibbs fields and some comments on Chalmond's EM Gibbsian algorithm, L. Younes, in Stochastic Models, Statistical Methods, and Algorithms in Image Analysis, Lecture Notes in Statistics 74, 240--258 (1992)
On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, L. Younes, Stochastics: An International Journal of Probability and Stochastic Processes 65, 177--228 (1999)




The following papers also use stochastic gradient methods in different contexts related to image processing.

A stochastic algorithm for probabilistic independent component analysis, S. Allassonniere and L. Younes, The Annals of Applied Statistics 6, 125--160 (2012)
A stochastic algorithm for feature selection in pattern recognition, S. Gadat and L. Younes, The Journal of Machine Learning Research 8, 509--547 (2007)



This paper explores an alternative method for the calibration of Markov random field parameters, based on a paradigm in which parameters are fixed in order to constrain the local minimizers of a cost or energy function, instead of following a standard statistical approach. A similar paradigm, called energy-based learning, was subsequently introduced by Y. Le Cun.


Calibrating parameters of cost functionals, L. Younes, Computer Vision—ECCV 2000    212--223  (2000)


The following discussion is taken from lecture notes written for my class Graphical Models at JHU.


Consider a parametrized model for a Gibbs distribution

\[
\pi_\theta(x) = \frac{1}{Z_\theta} \exp(-\theta^TU(x))
\] where \(\theta\) is a $d$-dimensional parameter and $U$ is a function from $F_V$ to $\mathbb R^d$. For example, if $\pi$ is an Ising model with
\[
\pi(x) = \frac{1}{Z} \exp\big( \alpha \sum_{s\in V} x_s + \beta \sum_{s\sim t}
x_s x_t\big),
\]
we would take $\theta = (\alpha, \beta)$ and $U(x) = - (\sum_s x_s, \sum_{s\sim t} x_s x_t)$. Most of the Markov random field models used in practice can be put in this form.
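As a small illustration of this encoding (a sketch, not from the notes: the 2x2 grid, edge list, and function name are assumptions), the Ising statistic $U(x) = -(\sum_s x_s, \sum_{s\sim t} x_s x_t)$ can be computed directly:

```python
import numpy as np

# 2x2 grid with nearest-neighbour edges; spins x_s take values in {-1, +1}
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]

def ising_U(x, edges):
    """Sufficient statistic U(x) = -(sum_s x_s, sum_{s~t} x_s x_t)."""
    field = sum(x)
    interaction = sum(x[s] * x[t] for s, t in edges)
    return np.array([-field, -interaction], dtype=float)

# with theta = (alpha, beta), pi_theta(x) is proportional to exp(-theta^T U(x))
print(ising_U((1, 1, -1, 1), edges))   # [-2.  0.]
```

With this convention, $-\theta^T U(x) = \alpha \sum_s x_s + \beta \sum_{s\sim t} x_s x_t$, matching the Ising energy above.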

The constant $Z_\theta$ is
$$
Z_\theta  = \sum_{x\in F_V} \exp(-\theta^TU(x))
$$
and is usually not computable.
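To see why, note that the sum defining $Z_\theta$ runs over $|F|^{|V|}$ configurations. For a model small enough to enumerate (the toy 2x2 Ising grid and names below are illustrative assumptions), it can still be computed exactly:

```python
import itertools
import numpy as np

edges = [(0, 1), (2, 3), (0, 2), (1, 3)]   # toy 2x2 Ising grid
def U(x):
    return np.array([-sum(x), -sum(x[s] * x[t] for s, t in edges)], float)

def log_Z(theta, U_all):
    """ln Z_theta = ln sum_x exp(-theta^T U(x)), via log-sum-exp."""
    energies = U_all @ theta
    m = (-energies).max()                  # shift for numerical stability
    return m + np.log(np.exp(-energies - m).sum())

configs = list(itertools.product([-1, 1], repeat=4))   # all 2^4 states
U_all = np.array([U(x) for x in configs])
print(log_Z(np.zeros(2), U_all))   # ln 16 ≈ 2.7726 when theta = 0
```

With $n$ binary sites the sum has $2^n$ terms, so brute-force enumeration stops being feasible already for modest image sizes.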

Now, assume that an $N$-sample, $x^{(1)}, \ldots, x^{(N)}$, is observed from this distribution. The maximum likelihood estimator maximizes
$$
\ell(\theta) = \frac{1}{N} \sum_{k=1}^N \ln \pi_\theta(x^{(k)}) = -\theta^T \bar
U_N - \ln Z_\theta
$$
with $\bar U_N = (U(x^{(1)})+\cdots+U(x^{(N)}))/N$.

We have the following proposition, which is a well-known property of exponential families of probability distributions.

The log-likelihood, $\ell$, is a concave function of $\theta$, with
\[
\nabla{\ell}(\theta) = E_\theta(U) - \bar U_N
\]
\[
D^2(\ell)(\theta) = - \text{Var}_\theta(U)
\]
where $E_\theta$ denotes the expectation with respect to $\pi_\theta$ and $\text{Var}_\theta$ the covariance matrix under the same distribution.
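The gradient formula in the proposition is easy to check numerically on a model small enough to enumerate (the toy grid and names below are assumptions): a centered finite difference of $\ell$ should match $E_\theta(U) - \bar U_N$.

```python
import itertools
import numpy as np

edges = [(0, 1), (2, 3), (0, 2), (1, 3)]   # toy 2x2 Ising grid
def U(x):
    return np.array([-sum(x), -sum(x[s] * x[t] for s, t in edges)], float)

configs = list(itertools.product([-1, 1], repeat=4))
U_all = np.array([U(x) for x in configs])
U_bar = U_all.mean(axis=0)                 # plays the role of \bar U_N

def log_lik(theta):
    """l(theta) = -theta^T U_bar - ln Z_theta, by exhaustive enumeration."""
    return -theta @ U_bar - np.log(np.exp(-(U_all @ theta)).sum())

def grad(theta):
    """E_theta(U) - U_bar, the gradient given by the proposition."""
    p = np.exp(-(U_all @ theta))
    p /= p.sum()                           # pi_theta on all 16 configurations
    return p @ U_all - U_bar

theta = np.array([0.3, -0.2])
eps = 1e-6
fd = np.array([(log_lik(theta + eps * e) - log_lik(theta - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.max(np.abs(fd - grad(theta))))    # tiny: the two expressions agree
```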


This proposition implies that a local maximum of $\theta \mapsto \ell(\theta)$ must also be global. Any such maximum must be a solution of
$$
E_\theta(U) = \bar U_N,
$$
and conversely.

The function \(\ell\) has a finite maximum if and only if there is no nonzero direction $u\in \mathbb R^d$ such that $u^T(U(x) -
\bar U_N) \leq 0$ for all $x\in F_V$. Equivalently, $\bar U_N$ must belong to the interior of the convex hull of the finite set
$$
\{U(x), x\in F_V\}\subset \mathbb R^d.
$$

In such a case, which we assume from now on, computing the maximum likelihood estimator boils down to solving the equation
$$
E_\theta (U) = \bar U_N.
$$
Because the maximization problem is concave, we know that numerical algorithms like gradient ascent,
\[
\theta(t+1) = \theta(t) + \epsilon (E_{\theta(t)} (U) - \bar U_N),
\]
or Newton-Raphson,
\[
\theta(t+1) = \theta(t) + \epsilon \text{Var}_{\theta(t)}(U)^{-1} (E_{\theta(t)} (U) - \bar U_N),
\]
which is more efficient, converge to the optimal parameter. Unfortunately, the computation of the expectations and covariance matrices can be done explicitly only for a very specific class of models. For general loopy graphical models, the expectation can be estimated iteratively using Monte Carlo methods. It turns out that this estimation can be synchronized with the gradient iterations to obtain a consistent algorithm.
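On a model small enough for exact expectations, the gradient ascent above can be run verbatim (the toy grid, the "true" parameter $(0.5, 0.3)$ generating the target moment, and the step size are assumptions chosen for illustration):

```python
import itertools
import numpy as np

edges = [(0, 1), (2, 3), (0, 2), (1, 3)]   # toy 2x2 Ising grid
def U(x):
    return np.array([-sum(x), -sum(x[s] * x[t] for s, t in edges)], float)

configs = list(itertools.product([-1, 1], repeat=4))
U_all = np.array([U(x) for x in configs])

def expectation(theta):
    """E_theta(U), computed exactly by enumerating all configurations."""
    p = np.exp(-(U_all @ theta))
    p /= p.sum()
    return p @ U_all

# synthetic moment: \bar U_N generated by the "true" parameter (0.5, 0.3)
U_bar = expectation(np.array([0.5, 0.3]))

theta = np.zeros(2)
for _ in range(5000):
    theta = theta + 0.05 * (expectation(theta) - U_bar)   # ascent on l
print(theta)   # converges toward the true parameter (0.5, 0.3)
```

The step size must stay below $2/L$, where $L$ bounds the spectral norm of $\text{Var}_\theta(U)$; here the statistics are bounded by 4 in absolute value, so $0.05$ is safely small.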

As remarked above, for fixed $\theta$, there exist Markov chain Monte Carlo algorithms that asymptotically sample from $\pi_\theta$. Select one of these algorithms, and let $P_\theta$ be the corresponding stochastic matrix that provides the associated transition probabilities for a given $\theta$. Then, define the iterative algorithm, initialized with an arbitrary $\theta(0)$ and $x(0) \in F_V$, that loops over the following two steps.

(SG1) Sample $x(t+1)$ from the transition probabilities $P_{\theta(t)}(x(t), \cdot)$.

(SG2) Update the parameter with
\[
\theta(t+1) = \theta(t) + \gamma(t+1)\big(U(x(t+1)) - \bar U_N\big).
\]

Then, the following holds.

If $P_\theta$ corresponds to the Gibbs sampler or to the Metropolis algorithm, and $\gamma(t+1) = \epsilon/(t+1)$ for small enough $\epsilon$, the algorithm that iterates (SG1) and (SG2) converges to the maximum likelihood estimator.

This is a particular instance of a stochastic gradient algorithm, in which the expectation $E_{\theta(t)}(U)$ in the update step is crudely estimated at each step by the evaluation of $U$ at the single random sample $x(t+1)$. One of the reasons for the convergence (and almost a necessary condition for it) is that the average of the update term in (SG2) over $x(t+1)$, taken under the invariant distribution of $P_{\theta(t)}$, is precisely the gradient $\nabla\ell(\theta(t))$. This averaging effect then takes place over time to provide the correct limit.
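A minimal sketch of the (SG1)-(SG2) loop for a toy 2x2 Ising model, with one Gibbs-sampler sweep per iteration. Several choices here are assumptions of the sketch, not part of the notes: the grid and seed, the target moment (an exact $E_{\theta^*}(U)$ standing in for $\bar U_N$, so the answer is known), and a small offset $t_0$ in the gain schedule $\gamma(t+1) = \epsilon/(t+1+t_0)$ added for early-iteration stability.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]        # toy 2x2 Ising grid
nbrs = {s: [] for s in range(4)}
for s, t in edges:
    nbrs[s].append(t)
    nbrs[t].append(s)

def U(x):
    return np.array([-sum(x), -sum(x[s] * x[t] for s, t in edges)], float)

def gibbs_sweep(x, theta):
    """One sweep of the Gibbs sampler for pi_theta, theta = (alpha, beta)."""
    alpha, beta = theta
    x = x.copy()
    for s in range(len(x)):
        local = alpha + beta * sum(x[t] for t in nbrs[s])
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * local))   # P(x_s = +1 | neighbours)
        x[s] = 1 if rng.random() < p_plus else -1
    return x

# target moment standing in for \bar U_N: exact E_theta*(U) at theta* = (0.5, 0.3)
configs = list(itertools.product([-1, 1], repeat=4))
U_all = np.array([U(c) for c in configs])
p = np.exp(-(U_all @ np.array([0.5, 0.3])))
U_bar = (p / p.sum()) @ U_all

theta = np.zeros(2)
x = np.ones(4, dtype=int)
eps, t0 = 2.0, 10                                # gain-schedule assumptions
for t in range(100000):
    x = gibbs_sweep(x, theta)                            # (SG1): one MCMC step
    theta = theta + eps / (t + 1 + t0) * (U(x) - U_bar)  # (SG2): parameter update
print(theta)   # drifts toward theta* = (0.5, 0.3)
```

With gains proportional to $1/t$, the Monte Carlo noise in $U(x(t+1))$ is averaged out over time, exactly as described above; the price is slow convergence, and in practice larger gains combined with trajectory averaging are common.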