What is Distributional Kernel Regression?

Mathematical Problem Formulation

We assume a dataset: $$ \mathcal{D}=\{(x_1,y_1),\ldots,(x_N,y_N)\}, $$ where \(x_1,\ldots,x_N\in\mathbb{R}^{D_i}\) and \(y_1,\ldots,y_N\in\mathbb{R}^{D_o}\), with \(D_i\) and \(D_o\) being the input and output dimensionality, respectively. We want to predict the parameters of a parametric observational conditional likelihood in a non-parametric manner. This means we are interested in \(\hat{p}_\theta(y\mid x)\), where \(\hat{p}_\theta\) can be any parametric distribution, such as a normal distribution, a Poisson distribution, an exponential distribution, or many others. \(\theta\) denotes the parameters of the chosen distribution, which depend on \(x\).

Furthermore, in the style of kernel regression, we assume a kernel function \(k\), where \(k(x,\mathbf{x}_{1:N})\in\mathbb{R}^N\) returns the weights of all samples \(1:N\). Common kernel functions include the radial basis function (RBF) kernel, the linear kernel, and the Matérn kernel. Following classical kernel regression, we would obtain \(\hat{y}=\mathbf{y}_{1:N}^\top k(x,\mathbf{x}_{1:N})\), with the kernel weights normalized to sum to one. For distributional kernel regression, we propose a maximum likelihood approach for estimating \(\theta(x)\).
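To make the setup concrete, here is a minimal sketch in Python. The helper names `rbf_kernel` and `kernel_regression` and the `lengthscale` parameter are illustrative choices for this post, not part of the method itself:

```python
import numpy as np

def rbf_kernel(x, xs, lengthscale=1.0):
    """RBF weights of a query point x against all training inputs.

    x:  (D_i,) query input
    xs: (N, D_i) training inputs
    Returns an (N,) vector of non-negative weights k(x, x_i).
    """
    sq_dists = np.sum((xs - x) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def kernel_regression(x, xs, ys, lengthscale=1.0):
    """Classical kernel regression: a weighted average of the targets."""
    w = rbf_kernel(x, xs, lengthscale)
    w = w / w.sum()  # normalize the weights so they sum to one
    return w @ ys
```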

Formally, we obtain \(\theta(x)\) as the solution to the following problem: $$ 0 = \dfrac{\partial}{\partial \theta}\left(\sum_{i=1}^N k(x,x_i)\log\hat{p}_\theta(y_i\mid x)\right). $$ For many distributions the above equation can be solved exactly, as the examples below show.
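When no closed form is available, the weighted likelihood can still be maximised numerically. A minimal sketch, reusing the hypothetical `rbf_kernel` helper from above and assuming a user-supplied `log_pdf(theta, y)` that returns the log-density of `y` under parameters `theta`:

```python
from scipy.optimize import minimize

def weighted_mle(x, xs, ys, log_pdf, theta0, lengthscale=1.0):
    """Estimate theta(x) by maximising the kernel-weighted log-likelihood."""
    w = rbf_kernel(x, xs, lengthscale)

    def neg_weighted_ll(theta):
        # Negative of the kernel-weighted log-likelihood from the equation above.
        return -sum(wi * log_pdf(theta, yi) for wi, yi in zip(w, ys))

    return minimize(neg_weighted_ll, theta0).x
```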

Gaussian Observation Likelihood

The density of a normal distribution is: $$ \hat{p}_{\mu,\sigma}(y\mid x) = \dfrac{1}{\sigma (x)\sqrt{2\pi}}\mathrm{exp}\left( -\dfrac{1}{2}\left(\dfrac{y-\mu (x)}{\sigma (x)}\right)^2\right). $$

In log-space, this becomes: $$ \log \hat{p}_{\mu,\sigma}(y\mid x) = - \log\left(\sigma(x)\sqrt{2\pi}\right) - \dfrac{1}{2}\left(\dfrac{y-\mu(x)}{\sigma (x)}\right)^2. $$

This is sufficient information to solve the MLE problem: $$ \mu(x) = \dfrac{\sum_{i=1}^N k(x,x_i) y_i}{\sum_{i=1}^N k(x,x_i)}, $$ $$ \sigma(x) = \sqrt{\dfrac{\sum_{i=1}^N k(x,x_i)\left(y_i-\mu(x)\right)^2}{\sum_{i=1}^N k(x,x_i)}}. $$ This represents a non-parametric approximation of the parameters \(\mu\) and \(\sigma\) of the normal distribution. Similarly, we can express the covariance matrix for the multivariate case: $$ \Sigma(x)=\dfrac{1}{\sum_{i=1}^N k(x,x_i)}\sum_{i=1}^N \left[ k(x,x_i) \left(y_i-\mu(x)\right)\left(y_i-\mu(x)\right)^\top\right]. $$
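In code, these closed-form estimates are just kernel-weighted moments. A sketch, again assuming the `rbf_kernel` helper from above:

```python
import numpy as np

def gaussian_params(x, xs, ys, lengthscale=1.0):
    """Kernel-weighted MLE of mu(x) and sigma(x) for scalar targets."""
    w = rbf_kernel(x, xs, lengthscale)
    w = w / w.sum()
    mu = w @ ys                          # weighted mean
    sigma = np.sqrt(w @ (ys - mu) ** 2)  # weighted standard deviation
    return mu, sigma

def gaussian_cov(x, xs, ys, lengthscale=1.0):
    """Kernel-weighted covariance Sigma(x) for multivariate targets (N, D_o)."""
    w = rbf_kernel(x, xs, lengthscale)
    w = w / w.sum()
    mu = w @ ys                                # (D_o,) weighted mean
    centred = ys - mu                          # (N, D_o) residuals
    return (w[:, None] * centred).T @ centred  # (D_o, D_o) weighted covariance
```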

Poisson Observation Likelihood

The probability mass function of the Poisson distribution is: $$ \hat{p}_\lambda (y\mid x) = \dfrac{\lambda (x)^{y} \mathrm{exp}\left(-\lambda (x)\right)}{y!}. $$

Solving the MLE problem for the Poisson likelihood gives an estimate for \(\lambda(x)\): $$ \lambda(x) = \dfrac{\sum_{i=1}^N k(x,x_i)y_i}{\sum_{i=1}^N k(x,x_i)}. $$
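In code, the Poisson estimate is simply the kernel-weighted mean of the observed counts (a sketch with the same assumed `rbf_kernel` helper):

```python
def poisson_rate(x, xs, ys, lengthscale=1.0):
    """Kernel-weighted MLE of the Poisson rate lambda(x)."""
    w = rbf_kernel(x, xs, lengthscale)
    return (w @ ys) / w.sum()  # weighted mean of the observed counts
```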

Bernoulli Observation Likelihood

The probability mass function of the Bernoulli distribution (with \(y\in\{0,1\}\)) is: $$ \hat{p}_\rho(y\mid x) = \rho(x) y + (1-\rho(x)) (1-y). $$

Solving the MLE problem with the Bernoulli likelihood gives us an estimate for \(\rho(x)\): $$ \rho(x) = \dfrac{\sum_{i=1}^N k(x,x_i)y_i}{\sum_{i=1}^N k(x,x_i)}. $$
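The Bernoulli estimate has the same form: with binary labels, it is the kernel-weighted fraction of positive outcomes. A sketch under the same assumptions as above:

```python
def bernoulli_prob(x, xs, ys, lengthscale=1.0):
    """Kernel-weighted MLE of the success probability rho(x)."""
    w = rbf_kernel(x, xs, lengthscale)
    return (w @ ys) / w.sum()  # weighted fraction of labels equal to one
```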