Structured Matrix Factorization
One application of this work is segmenting videos of neural calcium imaging data into temporal signals and spatial segmentations.
Matrix factorization models are widespread in machine learning and encompass many classical and fundamental algorithms, with Principal Component Analysis (PCA), Non-Negative Matrix Factorization (NMF), and Sparse Dictionary Learning (SDL) being a few well-known examples. In their most basic form, matrix factorization models simply mean that given some data matrix $Y$ we wish to find an approximation (or exact factorization) $Y \approx U V^\top$. More generally, we might also wish to perform linear regression or make some other prediction about $Y$ via a predictor $Y \approx f(UV^\top, X)$ that depends on some additional data $X$. These problems¹ can be captured via an optimization problem of the form:
$$\min_{U,V} \ell(Y, f(UV^\top, X)) + \lambda \Theta(U,V)$$
where $\ell$ is some loss function which measures the distance between the prediction $\widehat Y = f(UV^\top, X)$ and $Y$, and $\Theta$ is some function which promotes various properties or constraints in $(U,V)$. A key challenge in matrix factorization problems is that the above optimization problem is inherently non-convex, largely regardless of the choice of $\ell$, $f$, or $\Theta$, because the matrix product $UV^\top$ destroys convexity.
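As a minimal illustrative sketch (not the method proposed in this work), consider the special case $f(UV^\top, X) = UV^\top$ with squared loss and $\Theta(U,V) = \tfrac{1}{2}(\|U\|_F^2 + \|V\|_F^2)$; alternating gradient steps on $U$ and $V$ then look like the following. All names and hyperparameters here are illustrative.

```python
import numpy as np

# Sketch of the non-convex factorization problem for the special case
#   min_{U,V} 0.5 * ||Y - U V^T||_F^2 + (lam/2) * (||U||_F^2 + ||V||_F^2)
# solved by alternating gradient steps. Illustrative only.

def factorize(Y, r, lam=0.1, lr=1e-2, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    U = 0.1 * rng.standard_normal((m, r))
    V = 0.1 * rng.standard_normal((n, r))
    for _ in range(iters):
        R = U @ V.T - Y                 # residual of the current approximation
        U -= lr * (R @ V + lam * U)     # gradient step in U
        R = U @ V.T - Y                 # recompute residual with updated U
        V -= lr * (R.T @ U + lam * V)   # gradient step in V
    return U, V
```

Even in this simple quadratic case the objective is non-convex in $(U,V)$, which motivates the convex analysis developed below.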
For some problems, such as PCA, one is fortunate and the resulting non-convex optimization problem can still be solved efficiently, but for most other matrix factorization models the optimization problem is challenging to solve.
Structured Factorization
Beyond the inherent non-convexity, in many matrix factorization problems we do not want just a generic factorization $UV^\top$; we also want $U$ and $V$ to satisfy various properties. For example, in NMF $U$ and $V$ are constrained to be non-negative, while in SDL we require one of the factors to be sparse (i.e., most of its entries are 0). These requirements are usually accomplished through an appropriate choice of the $\Theta$ function, as in the sketch below.
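Illustrative column-wise penalties of this kind might look like the following; these are standard textbook examples, not necessarily the exact choices analyzed in this work.

```python
import numpy as np

def theta_nmf(u, v):
    # NMF-style: non-negativity as a hard constraint,
    # encoded as an infinite penalty when violated.
    return 0.0 if (u >= 0).all() and (v >= 0).all() else np.inf

def theta_sdl(u, v):
    # SDL-flavored: bound the dictionary atom u in l2 norm while
    # promoting sparsity in the coefficients v via the l1 norm.
    return np.linalg.norm(u, 2) * np.linalg.norm(v, 1)
```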
In this work we study the optimization problems associated with structured matrix factorization by linking them to a related convex optimization problem in the product space (i.e., $Z = UV^\top$). Specifically, we consider the case where $\Theta$ can be defined as a sum of a function that acts on the columns of $(U,V)$,
$$ \Theta(U,V) = \sum_{i=1}^r \theta(U_i,V_i) $$
where $(U_i, V_i)$ denotes the $i^\text{th}$ columns of $(U,V)$ and $r$ is the number of columns in $(U,V)$. If $\theta$ satisfies a few basic properties, then it is possible to define a convex function (convex w.r.t. $Z$)
$$ \Omega_{\theta}(Z) \equiv \min_{r,U,V} \sum_{i=1}^r \theta(U_i,V_i) \quad \text{s.t.} \quad Z = UV^\top $$
which, given a matrix $Z$, searches over all possible factorizations $Z = UV^\top$ for the one that minimizes the sum of $\theta$ terms on the columns of $(U,V)$.
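As a concrete, well-known special case (a standard fact about variational forms of the nuclear norm, not a claim about this work's specific choices), taking $\theta(U_i,V_i) = \|U_i\|_2 \|V_i\|_2$ makes $\Omega_\theta(Z)$ equal to the nuclear norm $\|Z\|_*$, the sum of the singular values of $Z$. A quick NumPy check that a balanced SVD factorization achieves this value:

```python
import numpy as np

# For theta(u, v) = ||u||_2 * ||v||_2, Omega_theta(Z) = ||Z||_* (nuclear norm).
# A balanced factorization built from the SVD achieves the minimum.

rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 4))

A, s, Bt = np.linalg.svd(Z, full_matrices=False)
U = A * np.sqrt(s)      # U_i = sqrt(s_i) * a_i
V = Bt.T * np.sqrt(s)   # V_i = sqrt(s_i) * b_i

nuclear_norm = s.sum()
column_cost = sum(np.linalg.norm(U[:, i]) * np.linalg.norm(V[:, i])
                  for i in range(len(s)))

assert np.allclose(U @ V.T, Z)                # valid factorization Z = U V^T
assert np.isclose(column_cost, nuclear_norm)  # attains the nuclear norm
```

The check works because taking $U_i = \sqrt{s_i}\, a_i$ and $V_i = \sqrt{s_i}\, b_i$ from the SVD $Z = \sum_i s_i a_i b_i^\top$ gives $\|U_i\|_2 \|V_i\|_2 = s_i$ for each column.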
From this, one can then define an optimization problem on $Z$, which is tightly linked with our original problem of interest on $(U,V)$:
$$ \min_{Z} \ell(Y, f(Z, X)) + \lambda \Omega_{\theta}(Z) $$
Further, because $\Omega_{\theta}(Z)$ is convex w.r.t. $Z$, if $\ell(Y, f(Z,X))$ is also convex w.r.t. $Z$ (as is often the case), then the overall problem above is convex w.r.t. $Z$ as well. From the definition of $\Omega_{\theta}$, this objective is also tightly linked to our non-convex objective of interest w.r.t. $(U,V)$. As a result, we can use tools from convex analysis to study the convex objective (w.r.t. $Z$) and, using the link provided by $\Omega_{\theta}$, provide theoretical guarantees about the non-convex problem over $(U,V)$.
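For intuition about the convex problem, consider the special case $f(Z,X) = Z$ with squared loss and $\Omega_\theta(Z) = \|Z\|_*$ (the nuclear-norm case above): the solution is then given in closed form by singular value thresholding, a standard result in convex optimization rather than anything specific to this work. A minimal sketch:

```python
import numpy as np

# Closed-form solution of min_Z 0.5 * ||Y - Z||_F^2 + lam * ||Z||_*
# via singular value thresholding (SVT).

def svt(Y, lam):
    A, s, Bt = np.linalg.svd(Y, full_matrices=False)
    s_thr = np.maximum(s - lam, 0.0)   # shrink singular values toward zero
    return (A * s_thr) @ Bt

Y = np.random.default_rng(1).standard_normal((8, 5))
Z = svt(Y, lam=0.5)
print(np.linalg.matrix_rank(Z))  # typically lower rank than Y
```

For general $\theta$, however, $\Omega_\theta$ typically cannot be evaluated in closed form, which is one reason the link back to the factorized problem over $(U,V)$ is useful.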
This whole approach can also be generalized in some situations to factorizations involving more than two matrices (e.g., $Z = R U V^\top$, where we optimize over $R$, $U$, and $V$), which arises in applications like separable dictionary learning, where we can provide polynomial-time guarantees for solving the non-convex objective.
References
1. Note this can be generalized even further if there are additional variables $Q$ which are 'unfactorized'. For example, in Robust PCA there is a matrix of sparse outliers. This can also be handled fairly simply if the overall objective is jointly convex w.r.t. $(Q,Z)$.