What is Distributional Kernel Regression?

Mathematical Problem Formulation

We assume a dataset: $$ \mathcal{D}=\{(x_1,y_1),\ldots,(x_N,y_N)\}, $$ where \(x_1,\ldots,x_N\in\mathbb{R}^{D_i}\) and \(y_1,\ldots,y_N\in\mathbb{R}^{D_o}\), with \(D_i\) and \(D_o\) being the input and output dimensionality, respectively. We want to predict the parameters of a parametric observational conditional likelihood in a non-parametric manner. This means we are interested in \(\hat{p}_\theta(y\mid x)\), where \(\hat{p}_\theta\) can be any parametric distribution, such as a normal distribution, a Poisson distribution, an exponential distribution, or many others. \(\theta\) denotes the parameters of the chosen distribution, which depend on \(x\).

Furthermore, in the style of kernel regression, we assume a kernel function \(k\), where \(k(x,\mathbf{x}_{1:N})\in\mathbb{R}^N\) returns the weights of all samples \(1:N\). Common kernel functions include the radial basis function (RBF) kernel, the linear kernel, and the Matérn kernel. Following classical kernel regression, we would obtain \(\hat{y}=\mathbf{y}_{1:N}^\top k(x,\mathbf{x}_{1:N})\), with the kernel weights normalized to sum to one. For distributional kernel regression, we propose a maximum likelihood approach for estimating \(\theta(x)\).
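To make the setup concrete, here is a minimal sketch in Python. The helper names `rbf_kernel` and `kernel_regression` and the `lengthscale` parameter are illustrative choices for this post, not part of the method itself:

```python
import numpy as np

def rbf_kernel(x, xs, lengthscale=1.0):
    """RBF weights of a query point x against all training inputs.

    x:  (D_i,) query input
    xs: (N, D_i) training inputs
    Returns an (N,) vector of non-negative weights k(x, x_i).
    """
    sq_dists = np.sum((xs - x) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def kernel_regression(x, xs, ys, lengthscale=1.0):
    """Classical kernel regression: a weighted average of the targets."""
    w = rbf_kernel(x, xs, lengthscale)
    w = w / w.sum()  # normalize the weights so they sum to one
    return w @ ys
```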

Formally, we obtain \(\theta(x)\) as the solution to the following problem: $$ 0 = \dfrac{\partial}{\partial \theta}\left(\sum_{i=1}^N k(x,x_i)\log\hat{p}_\theta(y_i\mid x)\right). $$ For many distributions the above equation can be solved exactly, as the examples below show.
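When no closed form is available, the weighted likelihood can still be maximised numerically. A minimal sketch, reusing the hypothetical `rbf_kernel` helper from above and assuming a user-supplied `log_pdf(theta, y)` that returns the log-density of `y` under parameters `theta`:

```python
from scipy.optimize import minimize

def weighted_mle(x, xs, ys, log_pdf, theta0, lengthscale=1.0):
    """Estimate theta(x) by maximising the kernel-weighted log-likelihood."""
    w = rbf_kernel(x, xs, lengthscale)

    def neg_weighted_ll(theta):
        # Negative of the kernel-weighted log-likelihood from the equation above.
        return -sum(wi * log_pdf(theta, yi) for wi, yi in zip(w, ys))

    return minimize(neg_weighted_ll, theta0).x
```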

Gaussian Observation Likelihood

The density of a normal distribution is: $$ \hat{p}_{\mu,\sigma}(y\mid x) = \dfrac{1}{\sigma (x)\sqrt{2\pi}}\mathrm{exp}\left( -\dfrac{1}{2}\left(\dfrac{y-\mu (x)}{\sigma (x)}\right)^2\right). $$

In log-space, this becomes: $$ \log \hat{p}_{\mu,\sigma}(y\mid x) = - \log\left(\sigma(x)\sqrt{2\pi}\right) - \dfrac{1}{2}\left(\dfrac{y-\mu(x)}{\sigma (x)}\right)^2. $$

This is sufficient information to solve the MLE problem: $$ \mu(x) = \dfrac{\sum_{i=1}^N k(x,x_i) y_i}{\sum_{i=1}^N k(x,x_i)}, $$ $$ \sigma(x) = \sqrt{\dfrac{\sum_{i=1}^N k(x,x_i)\left(y_i-\mu(x)\right)^2}{\sum_{i=1}^N k(x,x_i)}}. $$ This represents a non-parametric approximation of the parameters \(\mu\) and \(\sigma\) of the normal distribution. Similarly, we can express the covariance matrix for the multivariate case: $$ \Sigma(x)=\dfrac{1}{\sum_{i=1}^N k(x,x_i)}\sum_{i=1}^N \left[ k(x,x_i) \left(y_i-\mu(x)\right)\left(y_i-\mu(x)\right)^\top\right]. $$
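In code, these closed-form estimates are just kernel-weighted moments. A sketch, again assuming the `rbf_kernel` helper from above:

```python
import numpy as np

def gaussian_params(x, xs, ys, lengthscale=1.0):
    """Kernel-weighted MLE of mu(x) and sigma(x) for scalar targets."""
    w = rbf_kernel(x, xs, lengthscale)
    w = w / w.sum()
    mu = w @ ys                          # weighted mean
    sigma = np.sqrt(w @ (ys - mu) ** 2)  # weighted standard deviation
    return mu, sigma

def gaussian_cov(x, xs, ys, lengthscale=1.0):
    """Kernel-weighted covariance Sigma(x) for multivariate targets (N, D_o)."""
    w = rbf_kernel(x, xs, lengthscale)
    w = w / w.sum()
    mu = w @ ys                                # (D_o,) weighted mean
    centred = ys - mu                          # (N, D_o) residuals
    return (w[:, None] * centred).T @ centred  # (D_o, D_o) weighted covariance
```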

Poisson Observation Likelihood

The probability mass function of the Poisson distribution is: $$ \hat{p}_\lambda (y\mid x) = \dfrac{\lambda (x)^{y} \mathrm{exp}\left(-\lambda (x)\right)}{y!}. $$

Solving the MLE problem for the Poisson likelihood gives an estimate for \(\lambda(x)\): $$ \lambda(x) = \dfrac{\sum_{i=1}^N k(x,x_i)y_i}{\sum_{i=1}^N k(x,x_i)}. $$
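In code, the Poisson estimate is simply the kernel-weighted mean of the observed counts (a sketch with the same assumed `rbf_kernel` helper):

```python
def poisson_rate(x, xs, ys, lengthscale=1.0):
    """Kernel-weighted MLE of the Poisson rate lambda(x)."""
    w = rbf_kernel(x, xs, lengthscale)
    return (w @ ys) / w.sum()  # weighted mean of the observed counts
```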

Bernoulli Observation Likelihood

The probability mass function of the Bernoulli distribution (with \(y\in\{0,1\}\)) is: $$ \hat{p}_\rho(y\mid x) = \rho(x) y + (1-\rho(x)) (1-y). $$

Solving the MLE problem with the Bernoulli likelihood gives us an estimate for \(\rho(x)\): $$ \rho(x) = \dfrac{\sum_{i=1}^N k(x,x_i)y_i}{\sum_{i=1}^N k(x,x_i)}. $$
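The Bernoulli estimate has the same form: with binary labels, it is the kernel-weighted fraction of positive outcomes. A sketch under the same assumptions as above:

```python
def bernoulli_prob(x, xs, ys, lengthscale=1.0):
    """Kernel-weighted MLE of the success probability rho(x)."""
    w = rbf_kernel(x, xs, lengthscale)
    return (w @ ys) / w.sum()  # weighted fraction of labels equal to one
```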