Understanding Dropout
Dropout regularization is widely used in neural network training and entails stochastically setting the output of hidden units to 0 at each iteration of training. Although Dropout is a popular means of preventing networks from overfitting, a precise mathematical understanding of the regularization it induces is often lacking. In this project we characterize the regularization properties induced by using Dropout during network training.
The simplest model in which to study Dropout is matrix factorization. Matrix factorization is just a single-hidden-layer neural network with no non-linearity on the hidden units, and ‘dropping out’ a hidden unit simply removes one rank-1 term from the factorization. For example, if one has the following matrix factorization problem:
$$\min_{U,V} ||X - UV^\top||_F^2$$
then training the model with Dropout regularization is equivalent to performing stochastic gradient descent (SGD) on an objective of the form
$$\min_{U,V} \mathbb{E}_{r_\theta} ||X - \frac{1}{\theta} U \text{diag}(r_\theta) V^\top ||_F^2$$
where $r_\theta$ is a vector of iid Bernoulli random variables which take value 1 with probability $\theta$ (the scaling by $\frac{1}{\theta}$ is simply for algebraic convenience). The fact that Dropout training of the original model is just SGD on the above objective follows from noting that at each iteration Dropout draws a sample of the $r_\theta$ vector and then performs a gradient descent step (a small numerical sketch of this procedure is given after the derivation below). As a result, evaluating the above expectation gives the deterministic regularization induced by Dropout:
$$\min_{U,V} \mathbb{E}_{r_\theta} ||X - \frac{1}{\theta} U \text{diag}(r_\theta) V^\top ||_F^2 = \min_{U,V} ||X - UV^\top ||_F^2 + \tfrac{1-\theta}{\theta} \sum_{i=1}^d || U_i||_2^2 || V_i||_2^2$$
where $U_i$ and $V_i$ denote the $i^\text{th}$ columns of $U$ and $V$, respectively, and $d$ is the number of columns in $(U,V)$ (this identity is verified numerically in a sketch below). To understand this deterministic regularization, we show that if the Dropout rate $(1-\theta)$ is adapted to the number of columns in $(U,V)$, then the regularization induced by Dropout is closely related to nuclear norm regularization. In particular, it can be shown that under certain conditions the above Dropout training formulation yields solutions to the following optimization problem in the product space, $A = UV^\top$:
$$ \min_{A} || X-A ||_F^2 + \lambda || A ||_*^2$$
where $||\cdot||_*$ denotes the nuclear norm (the sum of the singular values) and $\lambda$ is a constant that depends on $\theta$. Because the nuclear norm promotes low-rank solutions, one interpretation of Dropout training for this problem is that it induces a low-rank regularization on the output of the model.
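To make the equivalence above concrete, the following is a minimal NumPy sketch of Dropout training for the matrix factorization model: at each iteration it samples a Bernoulli mask $r_\theta$ and takes a gradient step on the sampled objective $||X - \frac{1}{\theta} U \text{diag}(r_\theta) V^\top||_F^2$. The problem sizes, step size, and iteration count are illustrative choices, not values taken from the reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes and keep probability (not from the reference).
n, m, d = 30, 20, 10      # X is n x m, the factorization uses d columns
theta = 0.5               # probability that a hidden unit is kept

X = rng.standard_normal((n, m))
U = 0.1 * rng.standard_normal((n, d))
V = 0.1 * rng.standard_normal((m, d))

step = 1e-3
for _ in range(20000):
    r = rng.binomial(1, theta, size=d)   # sample the Bernoulli mask r_theta
    M = (1.0 / theta) * np.diag(r)       # (1/theta) * diag(r_theta)
    R = X - U @ M @ V.T                  # residual of the masked model
    # Gradient of ||X - (1/theta) U diag(r_theta) V^T||_F^2 w.r.t. U and V
    gU = -2.0 * R @ V @ M
    gV = -2.0 * R.T @ U @ M
    U -= step * gU
    V -= step * gV

print("||X - U V^T||_F^2 after Dropout training:",
      np.linalg.norm(X - U @ V.T, "fro") ** 2)
# Singular values of the learned product; the induced regularization tends to
# shrink these, consistent with the nuclear-norm interpretation above.
print(np.round(np.linalg.svd(U @ V.T, compute_uv=False), 3))
```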
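The closed-form expression for the expected Dropout objective can also be checked numerically. The sketch below compares a Monte Carlo average of the stochastic loss at a fixed $(U,V)$ against $||X - UV^\top||_F^2 + \frac{1-\theta}{\theta}\sum_{i=1}^d ||U_i||_2^2 ||V_i||_2^2$; all dimensions and the number of samples are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary illustrative sizes and keep probability.
n, m, d = 15, 12, 6
theta = 0.7

X = rng.standard_normal((n, m))
U = rng.standard_normal((n, d))
V = rng.standard_normal((m, d))

def dropout_loss(r):
    """Stochastic loss ||X - (1/theta) U diag(r) V^T||_F^2 for one mask r."""
    return np.linalg.norm(X - (U * r / theta) @ V.T, "fro") ** 2

# Monte Carlo estimate of the expected Dropout objective.
num_samples = 200_000
masks = rng.binomial(1, theta, size=(num_samples, d))
mc_estimate = np.mean([dropout_loss(r) for r in masks])

# Deterministic objective: data term plus the induced product regularizer.
data_term = np.linalg.norm(X - U @ V.T, "fro") ** 2
regularizer = (1 - theta) / theta * np.sum(
    np.sum(U**2, axis=0) * np.sum(V**2, axis=0))

# The two numbers should agree up to Monte Carlo error.
print("Monte Carlo estimate :", mc_estimate)
print("Closed-form objective:", data_term + regularizer)
```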
Beyond this simple linear model, the same style of analysis can be applied to understand the effects of Dropout in the final layer of a deep network, provided that layer is linear (as is often the case). Further, instead of sampling the $r_\theta$ vector with iid Bernoulli entries, the analysis extends to more structured forms of stochastic sampling; for example, in Block Dropout contiguous blocks of variables are dropped simultaneously (a sketch of such a structured mask is given below).
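As a rough illustration of such structured sampling, the sketch below draws masks in which contiguous blocks of hidden units are kept or dropped together, in contrast to the iid per-unit masks used above. The block size and keep probability are hypothetical choices; the rest of a Dropout training loop would be unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)

def block_dropout_mask(d, block_size, theta, rng):
    """Sample a 0/1 mask of length d in which contiguous blocks of
    `block_size` units are kept (1) or dropped (0) together."""
    num_blocks = -(-d // block_size)                 # ceil(d / block_size)
    keep_block = rng.binomial(1, theta, size=num_blocks)
    return np.repeat(keep_block, block_size)[:d]

# Example: 12 hidden units, blocks of 3, keep probability 0.5.
print(block_dropout_mask(12, 3, 0.5, rng))
# Standard Dropout mask for comparison: each unit is dropped independently.
print(rng.binomial(1, 0.5, size=12))
```

The precise form of the deterministic regularizer induced by such structured sampling schemes is characterized in the reference below.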
References
Pal, Lane, Vidal, Haeffele. “On the Regularization Properties of Structured Dropout.” CVPR (2020)