**Graphical models** provide a powerful framework for modeling interactions between large families of random variables. Following the pioneering work of J. Besag (JRSS-B 36, 1974), they were introduced to the image processing community through the famous paper of S. and D. Geman (T-PAMI 6, 1984). Parameter estimation for this kind of model is quite difficult, however, and the algorithm described below, which belongs to the family of stochastic gradient methods, addresses this issue in an iterative way. The resulting approach is more flexible than iterative scaling (Darroch and Ratcliff, Annals of Mathematical Statistics, 1972), and more efficient (in the statistical sense) than the pseudo-likelihood estimator introduced by Besag. More details can be found in the following papers.

Estimation and annealing for Gibbsian fields, L. Younes, Annales de l'IHP Probabilités et Statistiques 24, 269–294 (1988)

Parametric inference for imperfectly observed Gibbsian fields, L. Younes, Probability Theory and Related Fields 82, 625–645 (1989)

Maximum likelihood estimation for Gibbsian fields, L. Younes, Lecture Notes–Monograph Series, 403–426 (1991)

Stochastic gradient estimation strategies for Markov random fields, L. Younes, SPIE's International Symposium on Optical Science, Engineering, and Instrumentation, 315–325 (1998)

Parameter estimation for imperfectly observed Gibbs fields and some comments on Chalmond's EM Gibbsian algorithm, L. Younes, Stochastic Models, Statistical Methods, and Algorithms in Image Analysis, Lecture Notes in Statistics 74, 240–258 (1992)

On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, L. Younes, Stochastics: An International Journal of Probability and Stochastic Processes 65, 177–228 (1999)

The following papers also use stochastic gradient methods in different contexts related to image processing.

A stochastic algorithm for probabilistic independent component analysis, S. Allassonniere and L. Younes, The Annals of Applied Statistics 6, 125–160 (2012)

The following paper explores an alternative method for the calibration of Markov random field parameters, based on a paradigm in which parameters are fixed in order to constrain the local minimizers of a cost or energy function, instead of using a standard statistical approach.

Calibrating parameters of cost functionals, L. Younes, Computer Vision – ECCV 2000, 212–223 (2000)

Consider a parametrized model for a Gibbs distribution

\[
\pi_\theta(x) = \frac{1}{Z_\theta} \exp(-\theta^TU(x))
\]
where \(\theta\) is a $d$-dimensional parameter and $U$ is a function from $F_V$ to $\mathbb R^d$. For example, if $\pi$ is an Ising model with

\[
\pi(x) = \frac{1}{Z} \exp\big( \alpha \sum_{s\in V} x_s + \beta \sum_{s\sim t} x_s x_t\big),
\]

we would take $\theta = (\alpha, \beta)$ and $U(x) = - \big(\sum_s x_s, \sum_{s\sim t} x_s x_t\big)$. Most of the Markov random field models that are used in practice can be put in this form.
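To make the sign conventions concrete, the Ising sufficient statistic $U$ can be computed directly from a configuration. The sketch below is only an illustration, using a hypothetical 2×2 grid with its four nearest-neighbour edges:

```python
import numpy as np

def ising_U(x, edges):
    """Sufficient statistic U(x) = -(sum_s x_s, sum_{s~t} x_s x_t)."""
    return -np.array([x.sum(), sum(x[s] * x[t] for s, t in edges)])

# Hypothetical 2x2 grid, sites 0..3, four nearest-neighbour edges.
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
x = np.array([1, -1, 1, 1])   # a spin configuration in {-1, +1}^4
print(ising_U(x, edges))      # -> [-2  0]
```

With $\theta = (\alpha, \beta)$, the energy $-\theta^T U(x)$ then recovers $\alpha \sum_s x_s + \beta \sum_{s\sim t} x_s x_t$.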

The constant $Z_\theta$ is

$$
Z_\theta = \sum_{x\in F_V} \exp(-\theta^TU(x))
$$

and is usually not computable.

Now, assume that an $N$-sample $x^{(1)}, \ldots, x^{(N)}$ is observed from this distribution. The maximum likelihood estimator maximizes

$$
\ell(\theta) = \frac{1}{N} \sum_{k=1}^N \ln \pi_\theta(x^{(k)}) = -\theta^T \bar U_N - \ln Z_\theta
$$

with $\bar U_N = (U(x^{(1)})+\cdots+U(x^{(N)}))/N$.

We have the following proposition, which is a well-known property of exponential families of probability distributions.

The log-likelihood, $\ell$, is a concave function of $\theta$, with

\[
\nabla \ell(\theta) = E_\theta(U) - \bar U_N
\]
\[
D^2\ell(\theta) = - \text{Var}_\theta(U)
\]

where $E_\theta$ denotes the expectation with respect to $\pi_\theta$ and $\text{Var}_\theta$ the covariance matrix under the same distribution.
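On a model small enough to enumerate, the proposition can be checked numerically. The sketch below (the 2×2 Ising grid and the empirical mean $\bar U_N$ are made up for illustration) compares a finite-difference gradient of $\ell$ with $E_\theta(U) - \bar U_N$:

```python
import itertools
import numpy as np

# Hypothetical 2x2 Ising grid, small enough to enumerate F_V exactly.
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
def U(x):
    return -np.array([sum(x), sum(x[s] * x[t] for s, t in edges)])

Us = np.array([U(x) for x in itertools.product([-1, 1], repeat=4)])

def log_Z(theta):
    return np.log(np.exp(-Us @ theta).sum())

def E_U(theta):
    w = np.exp(-Us @ theta)
    w /= w.sum()
    return w @ Us

theta = np.array([0.3, -0.5])
U_bar = np.array([0.1, 0.2])               # made-up empirical mean
ell = lambda th: -th @ U_bar - log_Z(th)   # log-likelihood

eps = 1e-6
fd_grad = np.array([(ell(theta + eps * e) - ell(theta - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(np.allclose(fd_grad, E_U(theta) - U_bar, atol=1e-5))  # -> True
```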

This proposition implies that a local maximum of $\theta \mapsto \ell(\theta)$ must also be global. Any such maximum must be a solution of

$$
E_\theta(U) = \bar U_N,
$$

and conversely, any solution of this equation is a global maximum.

The function \(\ell\) has a finite maximum if and only if there is no nonzero direction $u\in \mathbb R^d$ such that $u^T(U(x) - \bar U_N) \leq 0$ for all $x\in F_V$. Equivalently, $\bar U_N$ must belong to the interior of the convex hull of the finite set

$$
\{U(x), x\in F_V\}\subset \mathbb R^d.
$$

In this case, which we hereafter assume, computing the maximum likelihood estimator boils down to solving the equation

$$
E_\theta (U) = \bar U_N.
$$

Because the maximization problem is concave, we know that numerical algorithms like gradient ascent,

\[
\theta(t+1) = \theta(t) + \epsilon (E_{\theta(t)} (U) - \bar U_N),
\]

or Newton-Raphson,

\[
\theta(t+1) = \theta(t) + \epsilon \text{Var}_{\theta(t)}(U)^{-1} (E_{\theta(t)} (U) - \bar U_N),
\]

which is more efficient, converge to the optimal parameter. Unfortunately, the computation of the expectations and covariance matrices can only be made explicit for a very specific class of models. For general loopy graphical models, the expectation can be estimated iteratively using Monte Carlo methods. It turns out that this estimation can be synchronized with gradient ascent to obtain a consistent algorithm.
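On models small enough to enumerate $F_V$, the exact gradient ascent above can be run directly. In the sketch below, the tiny 2×2 Ising grid is a made-up example, and $\bar U_N$ is taken to be the exact expectation under a known $\theta^*$, so that the maximum likelihood estimator is $\theta^*$ itself:

```python
import itertools
import numpy as np

edges = [(0, 1), (2, 3), (0, 2), (1, 3)]   # hypothetical 2x2 grid
def U(x):
    return -np.array([sum(x), sum(x[s] * x[t] for s, t in edges)])

Us = np.array([U(x) for x in itertools.product([-1, 1], repeat=4)])

def E_U(theta):                  # exact E_theta(U) by enumeration
    w = np.exp(-Us @ theta)
    w /= w.sum()
    return w @ Us

theta_star = np.array([0.4, -0.3])
U_bar = E_U(theta_star)          # "observed" mean: the MLE is theta_star

theta = np.zeros(2)
for _ in range(5000):            # gradient ascent with a small step
    theta += 0.05 * (E_U(theta) - U_bar)
print(np.round(theta, 3))        # converges to theta_star
```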

As remarked above, for fixed $\theta$, there exist Markov chain Monte Carlo algorithms that asymptotically sample from $\pi_\theta$. Select one of these algorithms, and let $P_\theta$ be the corresponding stochastic matrix that provides the associated transition probabilities for a given $\theta$. Then, define the iterative algorithm, initialized with arbitrary $\theta(0)$ and $x(0) \in F_V$, that loops over the following two steps.

- (SG1) Sample from the distribution $P_{\theta(t)}(x(t),\cdot)$ to obtain a new configuration $x(t+1)$.
- (SG2) Update the parameter using $\theta(t+1) = \theta(t) + \gamma(t+1) (U(x(t+1)) - \bar U_N)$.
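A minimal sketch of (SG1)-(SG2), assuming a hypothetical 2×2 Ising grid with a single-site Gibbs sweep as the transition $P_\theta$, and with $\bar U_N$ computed exactly under a known $\theta^*$ so that convergence can be monitored:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
nbrs = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}   # 2x2 grid neighbours
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]

def U(x):
    return -np.array([sum(x), sum(x[s] * x[t] for s, t in edges)])

# Exact E_{theta*}(U) by enumeration, standing in for the empirical mean U_bar.
Us = np.array([U(x) for x in itertools.product([-1, 1], repeat=4)])
theta_star = np.array([0.5, 0.2])
w = np.exp(-Us @ theta_star); w /= w.sum()
U_bar = w @ Us

theta = np.zeros(2)
x = np.ones(4, dtype=int)
for t in range(1, 20001):
    # (SG1): one Gibbs sweep, i.e. a transition of P_{theta(t)}.
    a, b = theta
    for s in range(4):
        local = a + b * sum(x[v] for v in nbrs[s])
        x[s] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * local)) else -1
    # (SG2): stochastic gradient update with decreasing gain gamma(t).
    theta = theta + (1.0 / (t + 10)) * (U(x) - U_bar)
print(np.round(theta, 2))   # fluctuates around theta_star
```

The Gibbs conditional uses $P(x_s = 1 \mid x_{V\setminus s}) = \sigma\big(2(\alpha + \beta \sum_{t\sim s} x_t)\big)$, which follows from the Ising energy above.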

Then, under suitable conditions on the gains $\gamma(t)$ (decreasing, typically of order $1/t$), $\theta(t)$ converges to the maximum likelihood estimator.

This is a particular instance of a stochastic gradient algorithm, in which the expectation in the update step is crudely estimated at each step by the evaluation of $U$ on the random sample $x(t+1)$. One of the reasons for the convergence (and almost a necessary condition for it) is that the average of the increment in (SG2) over $x(t+1)$, under the invariant distribution of $P_{\theta(t)}$, is precisely the gradient $\nabla \ell(\theta(t))$. This averaging effect then takes place over time to provide the correct limit.