
Beta Kernel Process

The Beta Kernel Process (BKP) provides a flexible and computationally efficient nonparametric framework for modeling spatially varying binomial probabilities.
Let $\vec{x} = (x_1, x_2, \ldots, x_d) \in \mathcal{X} \subset \mathbb{R}^d$ denote a $d$-dimensional input, and suppose the success probability surface $\pi(\vec{x}) \in [0,1]$ is unknown. At each location $\vec{x}$, the observed data is modeled as

$$ y(\vec{x}) \sim \mathrm{Binomial}(m(\vec{x}), \pi(\vec{x})), $$

where $y(\vec{x})$ is the number of successes out of $m(\vec{x})$ independent trials.
The full dataset comprises $n$ observations $\mathcal{D}_n = \{(\vec{x}_i, y_i, m_i)\}_{i=1}^n$, where we write $y_i = y(\vec{x}_i)$ and $m_i = m(\vec{x}_i)$ for brevity.
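To make the data model concrete, the following sketch simulates binomial observations over a one-dimensional input space. The probability surface `pi_true`, the sample size, and the trial counts are all illustrative choices, not part of the BKP specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_true(x):
    # Hypothetical latent success-probability surface on [0, 1].
    return 0.5 + 0.4 * np.sin(2 * np.pi * x)

n = 30
x = rng.uniform(0.0, 1.0, size=n)   # input locations x_i
m = rng.integers(5, 20, size=n)     # trial counts m_i
y = rng.binomial(m, pi_true(x))     # successes y_i ~ Binomial(m_i, pi(x_i))
```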


Prior

In line with the Bayesian paradigm, we assign a Beta prior to the unknown probability function:

$$ \pi(\vec{x}) \sim \mathrm{Beta}(\alpha_0(\vec{x}), \beta_0(\vec{x})), $$

where $\alpha_0(\vec{x}) > 0$ and $\beta_0(\vec{x}) > 0$ are spatially varying shape parameters.


Posterior

Let $k: \mathcal{X} \times \mathcal{X} \to [0,1]$ denote a user-defined kernel function measuring the similarity between input locations.
Through kernel-based Bayesian updating, the BKP model yields a closed-form posterior distribution for $\pi(\vec{x})$:

$$ \pi(\vec{x}) \mid \mathcal{D}_n \sim \mathrm{Beta}\left(\alpha_n(\vec{x}), \beta_n(\vec{x})\right), $$

where

$$ \alpha_n(\vec{x}) = \alpha_0(\vec{x}) + \sum_{i=1}^{n} k(\vec{x}, \vec{x}_i)\, y_i = \alpha_0(\vec{x}) + \vec{k}(\vec{x})^\top \vec{y}, $$

$$ \beta_n(\vec{x}) = \beta_0(\vec{x}) + \sum_{i=1}^{n} k(\vec{x}, \vec{x}_i)\, (m_i - y_i) = \beta_0(\vec{x}) + \vec{k}(\vec{x})^\top (\vec{m} - \vec{y}), $$

and $\vec{k}(\vec{x}) = [k(\vec{x}, \vec{x}_1), \ldots, k(\vec{x}, \vec{x}_n)]^\top$ is the vector of kernel weights, with $\vec{y} = (y_1, \ldots, y_n)^\top$ and $\vec{m} = (m_1, \ldots, m_n)^\top$ collecting the observed counts.
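As a minimal sketch of this update, the snippet below computes $\alpha_n(\vec{x})$ and $\beta_n(\vec{x})$ for a one-dimensional input, assuming a Gaussian kernel with lengthscale `ell` and a constant Beta(1, 1) prior; both are illustrative choices, since the kernel and prior are user-specified.

```python
import numpy as np

def kernel_weights(x_new, X, ell=0.1):
    # Gaussian kernel k(x, x_i) = exp(-(x - x_i)^2 / (2 ell^2)); values in (0, 1].
    return np.exp(-((x_new - X) ** 2) / (2 * ell ** 2))

def bkp_posterior(x_new, X, y, m, alpha0=1.0, beta0=1.0, ell=0.1):
    # Kernel-weighted conjugate update:
    #   alpha_n = alpha_0 + k(x)^T y,  beta_n = beta_0 + k(x)^T (m - y).
    k = kernel_weights(x_new, X, ell)
    return alpha0 + k @ y, beta0 + k @ (m - y)
```

Evaluating `bkp_posterior` over a grid of new locations traces out the full posterior surface.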


Posterior summaries

Based on the posterior distribution above, the posterior mean is

$$ \widehat{\pi}_n(\vec{x}) = \frac{\alpha_n(\vec{x})}{\alpha_n(\vec{x}) + \beta_n(\vec{x})}, $$

which serves as a smooth estimator of the latent success probability.

The corresponding posterior variance is

$$ s^2_n(\vec{x}) = \frac{\widehat{\pi}_n(\vec{x})\{1 - \widehat{\pi}_n(\vec{x})\}}{\alpha_n(\vec{x}) + \beta_n(\vec{x}) + 1}, $$

which provides a local measure of epistemic uncertainty.
These posterior summaries can be used to visualize prediction quality across the input space, particularly highlighting regions with sparse data coverage.
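Continuing the sketch above, both summaries follow directly from the Beta parameters; the helper below is illustrative, not any package API.

```python
def bkp_mean_var(alpha_n, beta_n):
    # Posterior mean and variance of Beta(alpha_n, beta_n).
    pi_hat = alpha_n / (alpha_n + beta_n)
    s2 = pi_hat * (1.0 - pi_hat) / (alpha_n + beta_n + 1.0)
    return pi_hat, s2
```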


Binary classification

For binary classification, the posterior mean can be thresholded to produce hard predictions:

$$ \widehat{y}(\vec{x}) = \begin{cases} 1 & \text{if } \widehat{\pi}_n(\vec{x}) > \pi_0, \\ 0 & \text{otherwise}, \end{cases} $$

where $\pi_0 \in (0,1)$ is a user-specified threshold, typically set to $0.5$.
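In code, this rule is a single comparison; `pi0 = 0.5` below is the conventional default rather than a requirement.

```python
def bkp_classify(pi_hat, pi0=0.5):
    # Hard prediction: 1 if the posterior mean exceeds the threshold pi0.
    return int(pi_hat > pi0)
```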


Dirichlet Kernel Process

The Dirichlet Kernel Process (DKP) naturally extends the BKP framework to multi-class responses by replacing the binomial likelihood with a multinomial model and the Beta prior with a Dirichlet prior.

Let the response at input $\vec{x} \in \mathcal{X} \subset \mathbb{R}^d$ be
$$ \vec{y}(\vec{x}) = \left(y_1(\vec{x}), \ldots, y_q(\vec{x})\right), $$
where $y_s(\vec{x})$ denotes the count of class $s$ out of $m(\vec{x}) = \sum_{s=1}^q y_s(\vec{x})$ total trials. Assume
$$ \vec{y}(\vec{x}) \sim \mathrm{Multinomial}(m(\vec{x}), \vec{\pi}(\vec{x})), $$
with class probabilities
$$ \vec{\pi}(\vec{x}) = (\pi_1(\vec{x}), \ldots, \pi_q(\vec{x})), \quad \sum_{s=1}^q \pi_s(\vec{x}) = 1. $$
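As with the binomial case, a short simulation makes the data model concrete; the three-class probability surface below is a hypothetical example.

```python
import numpy as np

rng = np.random.default_rng(1)

def pi_true(x):
    # Hypothetical class probabilities (q = 3) varying smoothly with x.
    p = np.stack([x, 1.0 - x, 0.5 * np.ones_like(x)], axis=-1)
    return p / p.sum(axis=-1, keepdims=True)

n = 30
x = rng.uniform(0.0, 1.0, size=n)   # input locations x_i
m = rng.integers(5, 20, size=n)     # trial counts m(x_i)
Y = np.stack([rng.multinomial(mi, pi) for mi, pi in zip(m, pi_true(x))])
# Y has shape (n, q); row i sums to m(x_i).
```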


Prior

A Dirichlet prior is placed on $\vec{\pi}(\vec{x})$:
$$ \vec{\pi}(\vec{x}) \sim \mathrm{Dirichlet}(\vec{\alpha}_0(\vec{x})), $$
where $\vec{\alpha}_0(\vec{x}) = (\alpha_{0,1}(\vec{x}), \ldots, \alpha_{0,q}(\vec{x}))$ is the vector of prior concentration parameters.


Posterior

Given training data $\mathcal{D}_n = \{(\vec{x}_i, \vec{y}_i)\}_{i=1}^n$, define the response matrix
$$ \vec{Y} = [\vec{y}_1, \ldots, \vec{y}_n]^\top \in \mathbb{R}^{n \times q}. $$

The kernel-smoothed conjugate posterior distribution becomes
$$ \vec{\pi}(\vec{x}) \mid \mathcal{D}_n \sim \mathrm{Dirichlet}\left(\vec{\alpha}_n(\vec{x})\right), \quad \text{with} \quad \vec{\alpha}_n(\vec{x}) = \vec{\alpha}_0(\vec{x}) + \vec{k}(\vec{x})^\top \vec{Y}. $$
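A sketch of this update, reusing the Gaussian kernel from the BKP example and an illustrative uniform prior $\alpha_{0,s} = 1$ for every class:

```python
import numpy as np

def dkp_posterior(x_new, X, Y, alpha0=1.0, ell=0.1):
    # Gaussian kernel weights k(x), shape (n,).
    k = np.exp(-((x_new - X) ** 2) / (2 * ell ** 2))
    # alpha_n(x) = alpha_0(x) + k(x)^T Y, shape (q,).
    return alpha0 + k @ Y
```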


Posterior mean

The posterior mean
$$ \widehat{\pi}_{n,s}(\vec{x}) = \frac{\alpha_{n,s}(\vec{x})}{\sum_{s'=1}^q \alpha_{n,s'}(\vec{x})}, \quad s = 1, \ldots, q, $$
provides a smooth estimate of the class probabilities.


Categorical classification

For classification tasks, labels are assigned by the maximum a posteriori (MAP) decision rule:
$$ \widehat{y}(\vec{x}) = \mathrm{argmax}_{s \in \{1,\ldots,q\}}\; \widehat{\pi}_{n,s}(\vec{x}). $$
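Putting the last two steps together, a hypothetical helper returns the posterior mean probabilities and the MAP label (0-based index here, versus the 1-based $s$ in the formula above):

```python
import numpy as np

def dkp_classify(alpha_n):
    # Posterior mean class probabilities and MAP class index.
    pi_hat = alpha_n / alpha_n.sum()
    return pi_hat, int(np.argmax(pi_hat))
```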