Variational Inference Techniques in Deep Learning

2017/11/24 Machine Learning

Overview

Suppose there are two random variables $\mathbf{x}$ and $\mathbf{z}$, where $\mathbf{x}$ is observed and $\mathbf{z}$ is latent. We assume these two variables are modeled with a neural network, with parameters $\theta$. The prior of $\mathbf{z}$ is $p_{\theta}(\mathbf{z})$, and the conditional distribution of $\mathbf{x}$ given $\mathbf{z}$ is $p_{\theta}(\mathbf{x} \vert \mathbf{z})$. Although computing $p_{\theta}(\mathbf{x} \vert \mathbf{z})$ is straightforward in this scenario, the posterior of $\mathbf{z}$ given $\mathbf{x}$, i.e., $p_{\theta}(\mathbf{z} \vert \mathbf{x})$, is analytically intractable.

While it is possible to use sampling-based methods to evaluate $p_{\theta}(\mathbf{x})$ and further obtain $p_{\theta}(\mathbf{z} \vert \mathbf{x}) = p_{\theta}(\mathbf{x},\mathbf{z}) / p_{\theta}(\mathbf{x})$, this often demands a large number of samples. Alternatively, one may fit another neural network $q_{\phi}(\mathbf{z} \vert \mathbf{x})$ to approximate the true posterior $p_{\theta}(\mathbf{z} \vert \mathbf{x})$, using variational inference techniques.

In the context of deep learning, a variational inference algorithm that fits $q_{\phi}(\mathbf{z} \vert \mathbf{x})$ basically needs two elements. One is the training objective, which we call the variational objective hereafter. The other is the gradient estimator, since almost all of deep learning is trained with gradient descent techniques.

Evidence Lower Bound (ELBO)

Variational Objective

The ELBO is defined in Eqn.\eqref{eqn:elbo}, and is tractable with Monte Carlo methods via Eqn.\eqref{eqn:elbo-computation}.

$$ \begin{align} \log p_{\theta}(\mathbf{x}) &\geq \log p_{\theta}(\mathbf{x}) - \operatorname{KL}\left[q_{\phi}(\mathbf{z}|\mathbf{x})\,\big\|\,p_{\theta}(\mathbf{z}|\mathbf{x})\right] \\ &= \mathcal{L}(\mathbf{x}) \tag{1}\label{eqn:elbo} \\ &= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}) + \log p_{\theta}(\mathbf{z}|\mathbf{x}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \\ &= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \\ &= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z}) + \log p_{\theta}(\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \tag{2}\label{eqn:elbo-computation} \end{align} $$

It is also possible to re-organize the ELBO into Eqn.\eqref{eqn:elbo-reorganized}, which contains a KL divergence between the posterior $q_{\phi}(\mathbf{z} \vert \mathbf{x})$ and the prior $p_{\theta}(\mathbf{z})$. In some configurations, this KL divergence is analytically tractable. However, this brings little benefit, so I would suggest using Eqn.\eqref{eqn:elbo-computation} in most situations.

$$ \begin{align} \mathcal{L}(\mathbf{x}) &= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z}) + \log p_{\theta}(\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \\ &= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right] - \operatorname{KL}\left[q_{\phi}(\mathbf{z}|\mathbf{x})\,\big\|\,p_{\theta}(\mathbf{z})\right] \tag{3}\label{eqn:elbo-reorganized} \end{align} $$
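As a concrete instance of the analytically tractable case, suppose $q_{\phi}(\mathbf{z} \vert \mathbf{x})$ is a diagonal Gaussian $\mathcal{N}(\boldsymbol{\mu}, \operatorname{diag}(\boldsymbol{\sigma}^2))$ and the prior is $\mathcal{N}(\mathbf{0}, \mathbf{I})$; the KL term of Eqn.\eqref{eqn:elbo-reorganized} then has the well-known closed form $\frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2)$. A minimal NumPy sketch (the Gaussian configuration here is an illustrative assumption, not required by the ELBO itself):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL[ N(mu, diag(exp(log_var))) || N(0, I) ], via the closed form
    0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)

# A standard normal posterior matches the prior exactly, so the KL is 0.
print(gaussian_kl(np.zeros(4), np.zeros(4)))  # -> 0.0
```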

Stochastic Gradient Variational Bayes (SGVB)

SGVB [4] is the simplest gradient estimator for the ELBO. It can be adopted if:

  1. $\mathbf{z}$ is a continuous random variable (enforced by the second condition).

  2. $\mathbf{z}$ can be written as $\mathbf{z} = f_{\phi}(\boldsymbol{\epsilon})$, where $\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})$ is some random variable independent of $\phi$, and $f_{\phi}(\boldsymbol{\epsilon})$ is a differentiable function.

Since $q_{\phi}(\mathbf{z} \vert \mathbf{x})\prod_i \mathrm{d}z_i = p(\boldsymbol{\epsilon})\prod_i \mathrm{d}\epsilon_i$, the ELBO can be re-written as Eqn.\eqref{eqn:elbo-reparameterized}.

$$ \begin{align} \mathcal{L}(\mathbf{x}) &= \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \tag{4}\label{eqn:elbo-reparameterized} \end{align} $$

This is the re-parameterization trick [4]. The gradient estimator for $\theta$ can be derived as Eqn.\eqref{eqn:sgvb-model-estimator}, and that for $\phi$ as Eqn.\eqref{eqn:sgvb-variational-estimator}; both can then be computed by Monte Carlo methods. Note that in Eqn.\eqref{eqn:sgvb-variational-estimator}, $\nabla_{\phi}$ acts through $\mathbf{z} = f_{\phi}(\boldsymbol{\epsilon})$ in both $\log p_{\theta}(\mathbf{x},\mathbf{z})$ and $\log q_{\phi}(\mathbf{z} \vert \mathbf{x})$.

$$ \begin{align} \nabla_{\theta}\,\mathcal{L}(\mathbf{x}) &= \nabla_{\theta}\,\mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \\ &= \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\nabla_{\theta}\log p_{\theta}(\mathbf{x},\mathbf{z})\right] \tag{5}\label{eqn:sgvb-model-estimator} \\ \nabla_{\phi}\,\mathcal{L}(\mathbf{x}) &= \nabla_{\phi}\,\mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \\ &= \nabla_{\phi}\int p(\boldsymbol{\epsilon})\,\Big[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\Big]\,\mathrm{d}\boldsymbol{\epsilon} \\ &= \int p(\boldsymbol{\epsilon})\,\nabla_{\phi}\Big[\log p_{\theta}(\mathbf{x},f_{\phi}(\boldsymbol{\epsilon})) - \log q_{\phi}(f_{\phi}(\boldsymbol{\epsilon})|\mathbf{x})\Big]\,\mathrm{d}\boldsymbol{\epsilon} \\ &= \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\nabla_{\phi}\Big[\log p_{\theta}(\mathbf{x},f_{\phi}(\boldsymbol{\epsilon})) - \log q_{\phi}(f_{\phi}(\boldsymbol{\epsilon})|\mathbf{x})\Big]\right] \tag{6}\label{eqn:sgvb-variational-estimator} \end{align} $$
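The reparameterized gradient can be sanity-checked on a toy objective: for $\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2] = \mu^2 + \sigma^2$, the exact gradient w.r.t. $\mu$ is $2\mu$. Writing $z = \mu + \sigma\epsilon$ and differentiating inside the expectation recovers it by Monte Carlo. The toy model and sample size below are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 1.0, 200_000

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
eps = rng.standard_normal(n)
z = mu + sigma * eps

# d/d(mu) of z^2 is 2 * z * dz/dmu = 2 * z; average it over eps.
grad_estimate = np.mean(2.0 * z)

print(grad_estimate)  # close to the exact gradient 2 * mu = 1.0
```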

Neural Variational Inference and Learning (NVIL)

SGVB allows us to train the whole network with gradient descent techniques. However, it requires $\mathbf{z}$ to be rewritten as a differentiable mapping $f_{\phi}(\boldsymbol{\epsilon})$, which is not always feasible. As an alternative, one may use Eqn.\eqref{eqn:elbo-direct-model-estimator} and Eqn.\eqref{eqn:elbo-direct-variational-estimator} as the gradient estimators.

$$ \begin{align} \nabla_{\theta}\,\mathcal{L}(\mathbf{x}) &= \nabla_{\theta}\,\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \\ &= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\nabla_{\theta}\log p_{\theta}(\mathbf{x},\mathbf{z})\right] \tag{7}\label{eqn:elbo-direct-model-estimator} \\ \nabla_{\phi}\,\mathcal{L}(\mathbf{x}) &= \nabla_{\phi}\,\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \\ &= \nabla_{\phi}\int q_{\phi}(\mathbf{z}|\mathbf{x})\,\Big[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\Big]\,\mathrm{d}\mathbf{z} \\ &= \int \Big[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\Big]\,\nabla_{\phi}\,q_{\phi}(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z} \\ &\quad - \int q_{\phi}(\mathbf{z}|\mathbf{x})\,\nabla_{\phi}\log q_{\phi}(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z} \\ &= \int q_{\phi}(\mathbf{z}|\mathbf{x})\,\Big[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\Big]\,\nabla_{\phi}\log q_{\phi}(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z} \\ &= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\Big[\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z}|\mathbf{x})\Big]\cdot\nabla_{\phi}\log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \tag{8}\label{eqn:elbo-direct-variational-estimator} \end{align} $$

We have used the fact stated in Eqn.\eqref{eqn:elbo-gradient-fact-1}, which holds for any quantity $c$ that does not depend on $\mathbf{z}$.

$$ \begin{align} \int c \cdot q_{\phi}(\mathbf{z}|\mathbf{x})\,\nabla_{\phi}\log q_{\phi}(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z} &= c \cdot \int \nabla_{\phi}\,q_{\phi}(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z} \\ &= c \cdot \nabla_{\phi}\int q_{\phi}(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z} \\ &= c \cdot \nabla_{\phi} 1 \\ &= 0 \tag{9}\label{eqn:elbo-gradient-fact-1} \end{align} $$

Eqn.\eqref{eqn:elbo-direct-variational-estimator} is an unbiased estimator; however, it is well known that this estimator has huge variance, which can cause unacceptably slow training. Another thing to notice is that Eqn.\eqref{eqn:elbo-direct-variational-estimator} effectively fits $\log q_{\phi}(\mathbf{z} \vert \mathbf{x})$ to $\log p_{\theta}(\mathbf{x},\mathbf{z})$.

NVIL [5] is a variant of the REINFORCE algorithm that deals with the difficulty of Eqn.\eqref{eqn:elbo-direct-variational-estimator}. It introduces a pair of baselines, the input-dependent baseline $C_{\psi}(\mathbf{x})$ and the input-independent baseline $c$, to reduce the variance. $C_{\psi}(\mathbf{x})$ is meant to cancel out the contribution of $\log p_{\theta}(\mathbf{x})$ from $\log p_{\theta}(\mathbf{x},\mathbf{z})$, so that $\log q_{\phi}(\mathbf{z} \vert \mathbf{x})$ is approximately fit to $\log p_{\theta}(\mathbf{z} \vert \mathbf{x})$. For shorter notation, we use $l_{\phi}(\mathbf{x},\mathbf{z})$ to denote $\log p_{\theta}(\mathbf{x},\mathbf{z}) - \log q_{\phi}(\mathbf{z} \vert \mathbf{x})$, which is called the learning signal in [5]. Taking Eqn.\eqref{eqn:elbo-gradient-fact-1} into account, subtracting any $\mathbf{z}$-independent quantity from $l_{\phi}(\mathbf{x},\mathbf{z})$ does not change the expected value of the gradient estimator, which gives Eqn.\eqref{eqn:nvil-variational-estimator}.

$$ \begin{align} \nabla_{\phi}\,\mathcal{L}(\mathbf{x}) &= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[l_{\phi}(\mathbf{x},\mathbf{z})\cdot\nabla_{\phi}\log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \\ &= \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\Big[l_{\phi}(\mathbf{x},\mathbf{z}) - C_{\psi}(\mathbf{x}) - c\Big]\cdot\nabla_{\phi}\log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \tag{10}\label{eqn:nvil-variational-estimator} \end{align} $$
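The effect of a baseline can be seen on a toy problem: estimating $\nabla_{\mu}\,\mathbb{E}_{z\sim\mathcal{N}(\mu,1)}[z^2]$ with the score-function estimator $\mathbb{E}\left[(z^2 - b)\,\nabla_{\mu}\log\mathcal{N}(z;\mu,1)\right]$. By Eqn.\eqref{eqn:elbo-gradient-fact-1}, subtracting a constant baseline $b$ leaves the mean unchanged, yet it shrinks the variance. A sketch with my own choice of toy numbers:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, n = 0.5, 200_000
z = mu + rng.standard_normal(n)          # z ~ N(mu, 1)

score = z - mu                           # d/d(mu) of log N(z; mu, 1)
baseline = np.mean(z ** 2)               # a constant baseline, roughly E[z^2]

g_plain = (z ** 2) * score               # raw score-function samples
g_base = (z ** 2 - baseline) * score     # baseline-adjusted samples

# Both estimate the true gradient 2 * mu = 1.0 ...
print(np.mean(g_plain), np.mean(g_base))
# ... but the baseline-adjusted samples have smaller variance.
print(np.var(g_plain), np.var(g_base))
```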

The input-dependent baseline $C_{\psi}(\mathbf{x})$ should be fit by minimizing an extra objective $\mathbb{E}_{q_{\phi}(\mathbf{z} \vert \mathbf{x})}\left[\left(l_{\phi}(\mathbf{x},\mathbf{z}) - C_{\psi}(\mathbf{x}) - c\right)^2\right]$, which serves to center the adjusted learning signal. According to [5], this baseline technique works by reducing the magnitude of the learning signal (i.e., $l_{\phi}(\mathbf{x},\mathbf{z})$ before applying the baselines), thus reducing the variance of the gradient estimator in every mini-batch. The gradient estimator for $\psi$ is Eqn.\eqref{eqn:nvil-baseline-estimator}.

$$ \begin{align} &\nabla_{\psi}\,\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\Big(l_{\phi}(\mathbf{x},\mathbf{z}) - C_{\psi}(\mathbf{x}) - c\Big)^2\right] \\ =\ &\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[-2\,\Big(l_{\phi}(\mathbf{x},\mathbf{z}) - C_{\psi}(\mathbf{x}) - c\Big)\,\nabla_{\psi}\,C_{\psi}(\mathbf{x})\right] \tag{11}\label{eqn:nvil-baseline-estimator} \end{align} $$

The input-independent baseline $c$ should be maintained as a moving average of $l_{\phi}(\mathbf{x},\mathbf{z})$. In an auto-grad system, it is sufficient to directly glue these objectives together with equal weights, achieving the same effect as applying Eqn.(\ref{eqn:elbo-direct-model-estimator},\ref{eqn:nvil-variational-estimator},\ref{eqn:nvil-baseline-estimator}) individually.
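The moving-average baseline can be implemented as a simple exponential moving average of the mean learning signal per mini-batch; a minimal sketch (the decay rate 0.9 is an arbitrary choice of mine):

```python
import numpy as np

def update_baseline(c, signal_batch, decay=0.9):
    """Exponential moving average of the mean learning signal."""
    return decay * c + (1.0 - decay) * np.mean(signal_batch)

# With a stationary signal, c converges to the signal's mean.
c = 0.0
for _ in range(200):
    c = update_baseline(c, np.array([3.0, 3.0]))
print(c)  # converges to 3.0
```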

Monte Carlo Objective

Objective

While it is possible to use multiple samples of $\mathbf{z}$ in the ELBO (Eqn.\eqref{eqn:elbo}), this shows no significant improvement over just one sample [4]. The Monte Carlo objective, on the other hand, directly derives a $K$-sample estimator for $p(\mathbf{x})$, instead of for $\log p(\mathbf{x})$, as Eqn.\eqref{eqn:mco}. We use $\mathbf{z}^{(1:K)}$ to denote all $K$ samples of $\mathbf{z}$ given $\mathbf{x}$, and $\mathbf{z}^{(k)}$ to denote the $k$-th sample. Note that when $K=1$, the Monte Carlo objective reduces to the ELBO.

$$ \begin{align} \hat{I}(\mathbf{z}^{(1:K)}) &= \frac{1}{K}\sum_{k=1}^K \frac{p_{\theta}(\mathbf{x},\mathbf{z}^{(k)})}{q_{\phi}(\mathbf{z}^{(k)}|\mathbf{x})} \\ \mathcal{L}_K(\mathbf{x}) &= \mathbb{E}_{q_{\phi}(\mathbf{z}^{(1:K)}|\mathbf{x})}\left[\log \hat{I}(\mathbf{z}^{(1:K)})\right] \tag{12}\label{eqn:mco} \end{align} $$

The following facts are proven in [2], suggesting that the Monte Carlo objective provides a tighter lower bound for $\log p_{\theta}(\mathbf{x})$ than the ELBO.

  1. $\log p_{\theta}(\mathbf{x}) \geq \mathcal{L}_K$.

  2. $\mathcal{L}_K \geq \mathcal{L}_M$, for $K \geq M$.

  3. $\log p_{\theta}(\mathbf{x}) = \lim_{K \to \infty} \mathcal{L}_K$, assuming $p_{\theta}(\mathbf{x},\mathbf{z}) / q_{\phi}(\mathbf{z} \vert \mathbf{x})$ is bounded.
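Facts 1 and 2 can be checked numerically on a toy conjugate model: $z \sim \mathcal{N}(0,1)$, $x \vert z \sim \mathcal{N}(z,1)$, so $\log p(x) = \log \mathcal{N}(x; 0, 2)$ in closed form. Using the prior as a deliberately poor proposal $q$, the $K$-sample bound tightens toward $\log p(x)$ as $K$ grows. The toy model and sample sizes below are my own choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = 1.0
log_px = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / (2 * 2.0)  # log N(x; 0, 2)

def bound(K, reps=20_000):
    """Monte Carlo estimate of L_K with the proposal q(z|x) = p(z) = N(0, 1)."""
    z = rng.standard_normal((reps, K))
    # log w = log p(x, z) - log q(z|x) = log p(x|z), since q is the prior.
    log_w = -0.5 * np.log(2 * np.pi) - (x - z) ** 2 / 2.0
    # log( (1/K) sum_k exp(log_w_k) ), stabilized logsumexp-style.
    m = log_w.max(axis=1, keepdims=True)
    log_I = m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1))
    return np.mean(log_I)

L1, L50 = bound(1), bound(50)
print(L1, L50, log_px)  # L_1 < L_50 <= log p(x)
```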

Relationship with the Importance Sampling

$\hat{I}(\mathbf{z}^{(1:K)})$ can be viewed from the perspective of importance sampling.

$$ \begin{align} p(\mathbf{x}) &= \int p_{\theta}(\mathbf{x}|\mathbf{z})\,p_{\theta}(\mathbf{z})\,\mathrm{d}\mathbf{z} \\ &= \int p_{\theta}(\mathbf{x}|\mathbf{z})\,\frac{p_{\theta}(\mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\,q_{\phi}(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z} \\ &= \int \frac{p_{\theta}(\mathbf{x},\mathbf{z})}{q_{\phi}(\mathbf{z}|\mathbf{x})}\,q_{\phi}(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z} \end{align} $$

Thus $\hat{I}(\mathbf{z}^{(1:K)})$ itself is an unbiased importance-sampling estimator of $p(\mathbf{x})$. Furthermore, using Jensen's inequality, we can prove:

$$ \begin{align} \mathbb{E}_{q_{\phi}(\mathbf{z}^{(1:K)}|\mathbf{x})}\left[\log \hat{I}(\mathbf{z}^{(1:K)})\right] &\leq \log \mathbb{E}_{q_{\phi}(\mathbf{z}^{(1:K)}|\mathbf{x})}\left[\hat{I}(\mathbf{z}^{(1:K)})\right] \\ &= \log p(\mathbf{x}) \end{align} $$

We use $\hat{L}(\mathbf{z}^{(1:K)})$ to denote $\log \hat{I}(\mathbf{z}^{(1:K)})$.

SGVB

The SGVB gradient estimator for Eqn.\eqref{eqn:mco} was first proposed as part of the importance weighted autoencoder [2]. With the re-parameterization trick applied to $q_{\phi}(\mathbf{z} \vert \mathbf{x})$, the general gradient estimator can be derived as Eqn.\eqref{eqn:mco-sgvb-estimator}, where $w_k$ denotes $p_{\theta}(\mathbf{x},\mathbf{z}^{(k)}) / q_{\phi}(\mathbf{z}^{(k)} \vert \mathbf{x})$, and $\tilde{w}_k$ denotes $w_k / \sum_{i=1}^K w_i$. The gradient estimators for $\theta$ and $\phi$ then follow as Eqn.\eqref{eqn:mco-sgvb-model-estimator} and Eqn.\eqref{eqn:mco-sgvb-variational-estimator}; as in Eqn.\eqref{eqn:sgvb-variational-estimator}, the gradient w.r.t. $\phi$ also propagates through $\mathbf{z}^{(k)} = f_{\phi}(\boldsymbol{\epsilon}^{(k)})$.

$$ \begin{align} \nabla\,\mathcal{L}_K(\mathbf{x}) = \nabla\,\mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\log\frac{1}{K}\sum_{k=1}^K w_k\right] &= \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\nabla\log\frac{1}{K}\sum_{k=1}^K w_k\right] = \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\frac{\sum_{k=1}^K \nabla w_k}{\sum_{k=1}^K w_k}\right] \\ &= \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\frac{\sum_{k=1}^K w_k\,\nabla\log w_k}{\sum_{k=1}^K w_k}\right] = \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\sum_{k=1}^K \tilde{w}_k\,\nabla\log w_k\right] \tag{13}\label{eqn:mco-sgvb-estimator} \end{align} $$
$$ \begin{align} \nabla_{\theta}\,\mathcal{L}_K(\mathbf{x}) &= \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\sum_{k=1}^K \tilde{w}_k\,\nabla_{\theta}\log p_{\theta}(\mathbf{x},\mathbf{z}^{(k)})\right] \tag{14}\label{eqn:mco-sgvb-model-estimator} \\ \nabla_{\phi}\,\mathcal{L}_K(\mathbf{x}) &= \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\sum_{k=1}^K \tilde{w}_k\,\nabla_{\phi}\Big[\log p_{\theta}(\mathbf{x},f_{\phi}(\boldsymbol{\epsilon}^{(k)})) - \log q_{\phi}(f_{\phi}(\boldsymbol{\epsilon}^{(k)})|\mathbf{x})\Big]\right] \tag{15}\label{eqn:mco-sgvb-variational-estimator} \end{align} $$
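A convenient consequence of Eqn.\eqref{eqn:mco-sgvb-estimator} is that the normalized weights $\tilde{w}_k$ are exactly the partial derivatives of the log-mean-exp of the log-weights, which is why auto-grad frameworks produce this estimator automatically when Eqn.\eqref{eqn:mco} is implemented with a numerically stable logsumexp. A small numerical check (the specific log-weights are arbitrary):

```python
import numpy as np

def log_mean_exp(log_w):
    """Stable log( mean( exp(log_w) ) )."""
    m = np.max(log_w)
    return m + np.log(np.mean(np.exp(log_w - m)))

log_w = np.array([-1.0, 0.5, 2.0, 0.0])
# Normalized importance weights w_tilde_k = w_k / sum_i w_i.
w_tilde = np.exp(log_w - log_mean_exp(log_w)) / len(log_w)

# Finite-difference derivative of log_mean_exp w.r.t. log_w[2].
h = 1e-6
bumped = log_w.copy()
bumped[2] += h
fd = (log_mean_exp(bumped) - log_mean_exp(log_w)) / h

print(w_tilde.sum())   # 1.0
print(fd, w_tilde[2])  # the two values agree
```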

Variational Inference for Monte Carlo Objectives (VIMCO)

VIMCO can be applied to a general family of objectives of the form of Eqn.\eqref{eqn:vimco-objective-family}.

$$ \begin{align} \mathcal{L}_K(\mathbf{x}) &= \mathbb{E}_{q_{\phi}(\mathbf{z}^{(1:K)}|\mathbf{x})}\left[\log\frac{1}{K}\sum_{k=1}^K f(\mathbf{x},\mathbf{z}^{(k)})\right] \tag{16}\label{eqn:vimco-objective-family} \end{align} $$

Hereafter we use $\hat{L}(\mathbf{z}^{(1:K)})$ to denote $\log\frac{1}{K}\sum_{k=1}^K f(\mathbf{x},\mathbf{z}^{(k)})$ for short. The general gradient is then Eqn.\eqref{eqn:vimco-direct-general-gradient}, where $\tilde{w}_k = f(\mathbf{x},\mathbf{z}^{(k)}) / \sum_{i=1}^K f(\mathbf{x},\mathbf{z}^{(i)})$.

$$ \begin{align} \nabla\,\mathcal{L}_K(\mathbf{x}) &= \mathbb{E}_{q_{\phi}(\mathbf{z}^{(1:K)}|\mathbf{x})}\left[\sum_{k=1}^K \hat{L}(\mathbf{z}^{(1:K)})\,\nabla\log q_{\phi}(\mathbf{z}^{(k)}|\mathbf{x})\right] \\ &\quad + \mathbb{E}_{q_{\phi}(\mathbf{z}^{(1:K)}|\mathbf{x})}\left[\sum_{k=1}^K \tilde{w}_k\,\nabla\log f(\mathbf{x},\mathbf{z}^{(k)})\right] \tag{17}\label{eqn:vimco-direct-general-gradient} \end{align} $$

The 2nd term of this gradient estimator is well-behaved, but the 1st is not, in that: (1) it does not implement credit assignment within each set of $K$ samples, since every sample shares the same coefficient $\hat{L}(\mathbf{z}^{(1:K)})$; and (2) the norm of the 1st term can be much larger than that of the 2nd, introducing too much noise into the gradient estimate. The baselines introduced by NVIL can tackle the 2nd problem, but not the 1st.

In order to tackle this problem, [6] introduces a per-sample baseline, rather than the per-input baselines of [5]. The VIMCO learning signal for each sample then becomes Eqn.\eqref{eqn:vimco-learning-signal}.

$$ \hat{L}(\mathbf{z}^{(k)}|\mathbf{z}^{(-k)}) = \hat{L}(\mathbf{z}^{(1:K)}) - \log\frac{1}{K}\bigg(\hat{f}(\mathbf{x},\mathbf{z}^{(-k)}) + \sum_{i \neq k} f(\mathbf{x},\mathbf{z}^{(i)})\bigg) \tag{18}\label{eqn:vimco-learning-signal} $$

Here $\mathbf{z}^{(-k)}$ denotes all the $\mathbf{z}$ samples except $\mathbf{z}^{(k)}$, and $\hat{f}(\mathbf{x},\mathbf{z}^{(-k)})$ is an estimate of $f(\mathbf{x},\mathbf{z}^{(k)})$ using only $\mathbf{x}$ and $\mathbf{z}^{(-k)}$. In [6], the authors found the geometric mean $\hat{f}(\mathbf{x},\mathbf{z}^{(-k)}) = \exp\big(\frac{1}{K-1}\sum_{i \neq k}\log f(\mathbf{x},\mathbf{z}^{(i)})\big)$ a good choice, while the arithmetic mean $\hat{f}(\mathbf{x},\mathbf{z}^{(-k)}) = \frac{1}{K-1}\sum_{i \neq k} f(\mathbf{x},\mathbf{z}^{(i)})$ is a possible alternative. The general gradient estimator of VIMCO then becomes Eqn.\eqref{eqn:vimco-general-gradient}.

$$ \begin{align} \nabla\,\mathcal{L}_K(\mathbf{x}) &= \mathbb{E}_{q_{\phi}(\mathbf{z}^{(1:K)}|\mathbf{x})}\left[\sum_{k=1}^K \hat{L}(\mathbf{z}^{(k)}|\mathbf{z}^{(-k)})\,\nabla\log q_{\phi}(\mathbf{z}^{(k)}|\mathbf{x})\right] \\ &\quad + \mathbb{E}_{q_{\phi}(\mathbf{z}^{(1:K)}|\mathbf{x})}\left[\sum_{k=1}^K \tilde{w}_k\,\nabla\log f(\mathbf{x},\mathbf{z}^{(k)})\right] \tag{19}\label{eqn:vimco-general-gradient} \end{align} $$
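The per-sample learning signals of Eqn.\eqref{eqn:vimco-learning-signal} are conveniently computed in log space for numerical stability. A sketch using the geometric-mean baseline (the function and variable names here are my own):

```python
import numpy as np

def vimco_signals(log_f):
    """Per-sample learning signals L(z_k | z_-k) from log f(x, z_k) values,
    using the geometric-mean leave-one-out baseline."""
    def log_mean_exp(v):
        m = np.max(v)
        return m + np.log(np.mean(np.exp(v - m)))

    K = len(log_f)
    L_full = log_mean_exp(log_f)             # \hat{L}(z^{(1:K)})
    signals = np.empty(K)
    for k in range(K):
        others = np.delete(log_f, k)
        log_f_hat = np.mean(others)          # log of the geometric mean
        replaced = np.concatenate([others, [log_f_hat]])
        signals[k] = L_full - log_mean_exp(replaced)
    return signals

# If every sample has the same weight, no sample deserves extra credit:
print(vimco_signals(np.array([1.0, 1.0, 1.0])))  # all zeros
```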

Reweighted Wake-Sleep Algorithm

The reweighted wake-sleep algorithm [1] is a multi-sample extension of the wake-sleep algorithm [3]. Like the wake-sleep algorithm, it uses a two-phase optimization in training: the wake phase and the sleep phase.

The wake phase optimizes the Monte Carlo objective (§3.1). The optimization algorithm in [1] turns out to be very similar to SGVB on the Monte Carlo objective (§3.3), except that the gradient flowing through $q_{\phi}(\mathbf{z} \vert \mathbf{x})$ in the expectation $\int f_{\theta,\phi}(\mathbf{x},\mathbf{z})\,q_{\phi}(\mathbf{z} \vert \mathbf{x})\,\mathrm{d}\mathbf{z}$ is simply ignored. The general gradient estimator can be formulated as Eqn.\eqref{eqn:rws-wake-gradient-estimator}.

$$ \nabla\,\mathcal{L}_{\text{wake}}(\mathbf{x}) = \mathbb{E}_{q_{\phi}(\mathbf{z}^{(1:K)}|\mathbf{x})}\left[\sum_{k=1}^K \tilde{w}_k\,\nabla\log\frac{p_{\theta}(\mathbf{x},\mathbf{z}^{(k)})}{q_{\phi}(\mathbf{z}^{(k)}|\mathbf{x})}\right] \tag{20}\label{eqn:rws-wake-gradient-estimator} $$

The sleep phase in [1] optimizes $\log q_{\phi}(\mathbf{z} \vert \mathbf{x})$ according to $(\mathbf{x},\mathbf{z}) \sim p_{\theta}(\mathbf{x},\mathbf{z})$, i.e., according to samples drawn from the generative model. The gradient estimator is thus Eqn.\eqref{eqn:rws-sleep-gradient-estimator}.

$$ \nabla\,\mathcal{L}_{\text{sleep}}(\mathbf{x}) = \mathbb{E}_{p_{\theta}(\mathbf{x},\mathbf{z})}\left[\nabla_{\phi}\log q_{\phi}(\mathbf{z}|\mathbf{x})\right] \tag{21}\label{eqn:rws-sleep-gradient-estimator} $$

References

[1] Bornschein, J. and Bengio, Y. 2014. Reweighted wake-sleep. arXiv preprint arXiv:1406.2751.
[2] Burda, Y. et al. 2015. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
[3] Hinton, G.E. et al. 1995. The “wake-sleep” algorithm for unsupervised neural networks. Science. 268, 5214 (1995), 1158.
[4] Kingma, D.P. and Welling, M. 2014. Auto-encoding variational Bayes. Proceedings of the International Conference on Learning Representations (2014).
[5] Mnih, A. and Gregor, K. 2014. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.
[6] Mnih, A. and Rezende, D. 2016. Variational inference for Monte Carlo objectives. PMLR (Jun. 2016), 2188–2196.
