Maximum Likelihood with Exponential Graphical Models
Graphical models provide a powerful framework to model
interactions between large families of random variables.
Following pioneering work from J. Besag (JRSS-B 36, 1974),
they have been introduced to the image processing community
through the famous paper from S. and D. Geman (T-PAMI, 6
1984). Parameter estimation for models of this kind is
quite difficult, however, and the algorithm described
below, which belongs to the family of stochastic gradient
methods, addresses this issue in an iterative way. The
resulting approach is more flexible than iterative scaling
(Darroch and Ratcliff, Annals of Mathematical Statistics,
1972), and more efficient (in the statistical sense) than
the pseudo-likelihood estimator introduced by Besag. More
details can be found in the following papers.
Estimation and annealing for Gibbsian fields, L. Younes, Annales de l'IHP, Probabilités et Statistiques 24, 269--294 (1988)
Parametric inference for imperfectly observed Gibbsian fields, L. Younes, Probability Theory and Related Fields 82, 625--645 (1989)
Maximum likelihood estimation for Gibbsian fields, L. Younes, Lecture Notes-Monograph Series, 403--426 (1991)
Stochastic gradient estimation strategies for Markov random fields, L. Younes, SPIE's International Symposium on Optical Science, Engineering, and Instrumentation, 315--325 (1998)
Parameter estimation for imperfectly observed Gibbs fields and some comments on Chalmond's EM Gibbsian algorithm, L. Younes, Stochastic Models, Statistical Methods, and Algorithms in Image Analysis, Lecture Notes in Statistics 74, 240--258 (1992)
On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, L. Younes, Stochastics: An International Journal of Probability and Stochastic Processes 65, 177--228 (1999)
The following papers also use
stochastic gradient methods in different contexts related to
image processing.
A stochastic algorithm for probabilistic independent component analysis, S. Allassonnière and L. Younes, The Annals of Applied Statistics 6, 125--160 (2012)
A stochastic algorithm for feature selection in pattern recognition, S. Gadat and L. Younes, The Journal of Machine Learning Research 8, 509--547 (2007)
This paper explores an alternative method for the calibration
of Markov random field parameters, based on a paradigm in
which parameters are chosen so as to constrain the local
minimizers of a cost or energy function, instead of using a
standard statistical approach. A similar paradigm, called
energy-based learning, has subsequently been introduced by
Y. LeCun.
Calibrating parameters of cost functionals, L. Younes, Computer Vision -- ECCV 2000, 212--223 (2000)
The following discussion is taken from lecture
notes written for my class Graphical Models at JHU.
Consider a parametrized model for a Gibbs distribution
\[
\pi_\theta(x) = \frac{1}{Z_\theta} \exp(-\theta^TU(x))
\] where \(\theta\) is a \(d\)-dimensional parameter and \(U\)
is a function from the configuration space \(F_V\) to \(\mathbb R^d\). For example, if
\(\pi\) is an Ising model with
\[
\pi(x) = \frac{1}{Z} \exp\big( \alpha \sum_{s\in V} x_s +
\beta \sum_{s\sim t}
x_s x_t\big),
\]
we would take \(\theta = (\alpha, \beta)\) and \(U(x) = -
(\sum_s x_s, \sum_{s\sim t} x_s x_t)\). Most of the Markov
random fields models that are used in practice can be put
in this form.
The constant \(Z_\theta\) is equal to
\(
\sum_{x\in F_V} \exp(-\theta^TU(x))
\)
and is usually not computable.
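To see why, the sum defining \(Z_\theta\) can be written out on a toy model. The sketch below (a small, illustrative example of my choosing: a 2×2 Ising grid with four sites and four nearest-neighbour edges) enumerates all \(2^4 = 16\) configurations; the same loop for an \(n\times n\) grid would have \(2^{n^2}\) terms.

```python
import itertools
import math

# Toy 2x2 Ising grid (illustrative choice): 4 sites, 4 nearest-neighbour edges.
EDGES = [(0, 1), (2, 3), (0, 2), (1, 3)]

def U(x):
    """Sufficient statistic U(x) = -(sum_s x_s, sum_{s~t} x_s x_t)."""
    return (-sum(x), -sum(x[s] * x[t] for s, t in EDGES))

def Z(theta):
    """Partition function by brute-force enumeration of all 2^4 configurations."""
    return sum(
        math.exp(-(theta[0] * U(x)[0] + theta[1] * U(x)[1]))
        for x in itertools.product([-1, 1], repeat=4)
    )
```

At \(\theta = 0\) every configuration has weight 1, so \(Z = 16\); already for a modest 32×32 binary image the sum would run over \(2^{1024}\) terms.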
Now, assume that an \(N\)-sample \(x^{(1)}, \ldots, x^{(N)}\)
is observed from this distribution. The maximum likelihood
estimator maximizes
\[
\ell(\theta) = \frac{1}{N} \sum_{k=1}^N \ln
\pi_\theta(x^{(k)}) = -\theta^T \bar
U_N - \ln Z_\theta
\]
with \(\bar U_N = (U(x^{(1)})+\cdots+U(x^{(N)}))/N\).
We have the following proposition, which is a well-known
property of exponential families of probabilities.
The log-likelihood, \(\ell\), is a concave function of
\(\theta\), with
\[
\nabla{\ell}(\theta) = E_\theta(U) - \bar U_N
\]
\[
D^2(\ell)(\theta) = - \text{Var}_\theta(U)
\]
where \(E_\theta\) denotes the expectation with respect to
\(\pi_\theta\) and \(\text{Var}_\theta\) the covariance matrix
under the same distribution.
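The gradient identity in the proposition can be checked numerically on a model small enough to enumerate. The sketch below (using the same illustrative 2×2 Ising grid as above) computes \(\ell\) and the closed-form gradient \(E_\theta(U) - \bar U_N\); a finite difference of `log_lik` reproduces `grad`.

```python
import itertools
import math

# Toy 2x2 Ising grid (illustrative); its 16 configurations can be enumerated.
EDGES = [(0, 1), (2, 3), (0, 2), (1, 3)]
STATES = list(itertools.product([-1, 1], repeat=4))

def U(x):
    """Sufficient statistic U(x) = -(sum_s x_s, sum_{s~t} x_s x_t)."""
    return (-sum(x), -sum(x[s] * x[t] for s, t in EDGES))

def log_lik(theta, U_bar):
    """l(theta) = -theta^T U_bar - ln Z_theta, with Z_theta by enumeration."""
    Z = sum(math.exp(-(theta[0] * U(x)[0] + theta[1] * U(x)[1])) for x in STATES)
    return -(theta[0] * U_bar[0] + theta[1] * U_bar[1]) - math.log(Z)

def grad(theta, U_bar):
    """Closed form from the proposition: grad l(theta) = E_theta(U) - U_bar."""
    w = [math.exp(-(theta[0] * U(x)[0] + theta[1] * U(x)[1])) for x in STATES]
    Z = sum(w)
    EU = [sum(wi * U(x)[i] for wi, x in zip(w, STATES)) / Z for i in range(2)]
    return [EU[i] - U_bar[i] for i in range(2)]
```

Comparing `grad` against a central finite difference of `log_lik` gives agreement up to the discretization error.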
This proposition implies that a local maximum of
\(\theta \mapsto \ell(\theta)\) must also be global. Any
such maximum must be a solution of
\[
E_\theta(U) = \bar U_N,
\]
and conversely.
The function \(\ell\) has a finite maximum if and only if
there is no nonzero direction \(u\in \mathbb R^d\) such that
\(u^T(U(x) - \bar U_N) \leq 0\) for all \(x\in F_V\). Equivalently, \(\bar
U_N\) must belong to the interior of the convex hull of the
finite set
\[
\{U(x), x\in F_V\}\subset \mathbb R^d.
\]
In this case, which we assume from now on, computing the
maximum likelihood estimator boils down to solving the
equation
\[
E_\theta (U) = \bar U_N.
\]
Because the maximization problem is concave, we know that
numerical algorithms like gradient ascent,
\[
\theta(t+1) = \theta(t) + \epsilon (E_{\theta(t)} (U) -
\bar U_N),
\]
or Newton-Raphson,
\[
\theta(t+1) = \theta(t) + \epsilon
\text{Var}_{\theta(t)}(U)^{-1} (E_{\theta(t)} (U) - \bar
U_N),
\]
which is more efficient, converge to the optimal
parameter. Unfortunately, the expectations and covariance
matrices can be computed explicitly only for a very
specific class of models. For general loopy graphical
models, the expectation can be estimated iteratively using
Monte-Carlo methods. It turns out that this estimation can
be synchronized with the gradient ascent to obtain a
consistent algorithm.
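On a model small enough for \(E_\theta(U)\) to be enumerated exactly, the ascent iteration above can be run as written. The sketch below (illustrative 2×2 Ising grid, step size and iteration count of my choosing) recovers \(\theta^*\) when \(\bar U_N = E_{\theta^*}(U)\), i.e. it solves \(E_\theta(U) = \bar U_N\).

```python
import itertools
import math

# Toy 2x2 Ising grid (illustrative); small enough for exact expectations.
EDGES = [(0, 1), (2, 3), (0, 2), (1, 3)]
STATES = list(itertools.product([-1, 1], repeat=4))

def U(x):
    """Sufficient statistic U(x) = -(sum_s x_s, sum_{s~t} x_s x_t)."""
    return (-sum(x), -sum(x[s] * x[t] for s, t in EDGES))

U_VALS = [U(x) for x in STATES]  # precomputed statistics, one per configuration

def E_U(theta):
    """Exact E_theta(U), by enumeration of the 16 configurations."""
    w = [math.exp(-(theta[0] * u[0] + theta[1] * u[1])) for u in U_VALS]
    Z = sum(w)
    return [sum(wi * u[i] for wi, u in zip(w, U_VALS)) / Z for i in range(2)]

def fit(U_bar, eps=0.1, n_iter=5_000):
    """Gradient ascent: theta <- theta + eps * (E_theta(U) - U_bar)."""
    theta = [0.0, 0.0]
    for _ in range(n_iter):
        EU = E_U(theta)
        theta = [theta[i] + eps * (EU[i] - U_bar[i]) for i in range(2)]
    return theta
```

The enumeration inside `E_U` is exactly what becomes infeasible on realistic grids, which motivates the stochastic version below.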
As remarked above, for fixed \(\theta\), there exist
Markov-chain Monte Carlo algorithms that asymptotically
sample from \(\pi_\theta\). Select one of these algorithms,
and let \(P_\theta\) be the corresponding stochastic matrix
that provides the associated transition probabilities for a
given \(\theta\). Then, define the iterative algorithm,
initialized with arbitrary \(\theta(0)\) and \(x(0) \in
F_V\), that loops over the following two steps.
(SG1) Sample \(x(t+1)\) from the transition probabilities \(P_{\theta(t)}(x(t), \cdot)\).
(SG2) Set \(\theta(t+1) = \theta(t) + \gamma(t+1)\,\big(U(x(t+1)) - \bar U_N\big)\).
Then, the following holds.
If \(P_\theta\) corresponds to the Gibbs sampler or
Metropolis algorithm, and \(\gamma(t+1) = \epsilon/(t+1)\) for
small enough \(\epsilon\), the algorithm
that iterates (SG1) and (SG2) converges to the maximum
likelihood estimator.
This is a particular instance of a stochastic gradient
algorithm, in which the expectation in the update step is
crudely estimated at each step by the evaluation of \(U\)
on the random sample \(x(t+1)\). One of the reasons for the
convergence (and almost a necessary condition for it) is
that the average of the term \(U(x(t+1)) - \bar U_N\) in
(SG2), taken over \(x(t+1)\) under the invariant
distribution of \(P_{\theta(t)}\), is precisely the
gradient of \(\ell\) at \(\theta(t)\). This averaging
effect then takes place over time to provide the correct
limit.
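The (SG1)/(SG2) loop can be sketched end to end on a toy model, using a systematic-scan Gibbs sampler as \(P_\theta\). Everything here is an illustrative choice (the 2×2 grid, \(\epsilon\), the iteration count, the use of exact enumeration only to build a target \(\bar U_N\)), not the setting of the convergence theorem itself.

```python
import itertools
import math
import random

# Toy 2x2 Ising grid (illustrative); x_s in {-1, +1}.
EDGES = [(0, 1), (2, 3), (0, 2), (1, 3)]
NEIGHBORS = {s: [t for e in EDGES for t in e if s in e and t != s] for s in range(4)}

def U(x):
    """Sufficient statistic U(x) = -(sum_s x_s, sum_{s~t} x_s x_t)."""
    return (-sum(x), -sum(x[s] * x[t] for s, t in EDGES))

def exact_EU(theta):
    """Exact E_theta(U) by enumeration; used only to build a target U_bar."""
    states = list(itertools.product([-1, 1], repeat=4))
    w = [math.exp(-(theta[0] * U(x)[0] + theta[1] * U(x)[1])) for x in states]
    Z = sum(w)
    return [sum(wi * U(x)[i] for wi, x in zip(w, states)) / Z for i in range(2)]

def gibbs_sweep(x, theta, rng):
    """One systematic sweep of the Gibbs sampler: a transition of P_theta."""
    for s in range(4):
        # Conditional log-odds of x_s = +1 given its neighbours.
        field = theta[0] + theta[1] * sum(x[t] for t in NEIGHBORS[s])
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
        x[s] = 1 if rng.random() < p_plus else -1
    return x

def sg_fit(U_bar, eps=0.5, n_iter=100_000, seed=0):
    """Loop over (SG1) and (SG2) with decreasing gains gamma(t+1) = eps/(t+1)."""
    rng = random.Random(seed)
    theta, x = [0.0, 0.0], [1, 1, 1, 1]
    for t in range(n_iter):
        x = gibbs_sweep(x, theta, rng)   # (SG1): one transition of P_theta(t)
        gamma = eps / (t + 1)
        Ux = U(x)
        theta = [theta[i] + gamma * (Ux[i] - U_bar[i]) for i in range(2)]  # (SG2)
    return theta
```

Note that each parameter update uses a single chain transition, not a full Monte-Carlo estimate of \(E_\theta(U)\); the averaging discussed above happens across iterations through the decreasing gains.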