KDE-diffusion

Kernel density estimation is a statistical method to estimate the probability density function that governs the distribution of a random variable from discrete observations of that variable. The variable may have more than one component, i.e. be described by several coordinates.

An instructive, two-dimensional example is population density, which is derived from discrete locations, such as people’s home addresses or animals encountered at specific places in the wild. A technical use case is the determination of a spatially resolved particle flux as measured by a detector array that is sensitive to rare, individual impacts.

Kernel density estimation basically works like this: Bin the discrete observations in a histogram. This is straightforward and takes little computation time. Then smooth the data over the bins/grid with an image filter that adds adequate blur. The shape of the filter function is referred to as the “kernel” and its spatial extent as the “bandwidth”. The trick is to find the optimal filter size, one that does not smear out the data too much, but also averages over the artifacts that are due to the discrete nature of the input.
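The two steps described above can be sketched in a few lines of NumPy. This is only an illustration of the bin-then-blur idea in one dimension, not this library's algorithm: the function name `binned_kde` is hypothetical, and the bandwidth is simply picked by hand rather than selected automatically.

```python
import numpy as np

def binned_kde(samples, bins=256, bandwidth=0.3):
    """Histogram the samples, then blur the bins with a Gaussian kernel."""
    # Step 1: bin the discrete observations. density=True scales the
    # histogram so that its area is 1.
    counts, edges = np.histogram(samples, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    width = edges[1] - edges[0]

    # Step 2: build a Gaussian kernel on the same grid spacing.
    # The bandwidth is in data units and chosen by hand (an assumption).
    half = int(np.ceil(4 * bandwidth / width))
    offsets = np.arange(-half, half + 1) * width
    kernel = np.exp(-0.5 * (offsets / bandwidth)**2)
    kernel /= kernel.sum()

    # Step 3: smooth the histogram by convolving it with the kernel.
    density = np.convolve(counts, kernel, mode='same')
    return centers, density

# Synthetic test data: 10,000 draws from a standard normal distribution.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)
centers, density = binned_kde(samples)
```

The sketch assumes the kernel is shorter than the grid, which holds for the values used here. The quality of the result hinges entirely on the `bandwidth` argument, which is exactly the parameter that a proper estimator has to choose for us.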

This library provides the adaptive kernel density estimator based on linear diffusion processes for one-dimensional and two-dimensional input data as outlined in the 2010 paper by Botev et al. The reference implementation for 1d and 2d, in Matlab, was provided by the paper’s first author, Zdravko Botev. This is a re-implementation in Python, with added test coverage.

The diffusion-inspired method is particularly fast: orders of magnitude faster than, for instance, SciPy’s Gaussian kernel estimator, the estimators provided by Scikit-Learn, and most of those in KDEpy. The exception is KDEpy’s FFTKDE, which uses a very similar algorithm, but has no automatic bandwidth selection in dimensions higher than one.

Automatic bandwidth selection is key, however. Without it, one may as well just apply a Gaussian filter and manually tune its size, i.e. the bandwidth, until the results look pleasing to the human eye. Selecting the bandwidth from the data is what makes kernel density estimation a non-parametric method, one that avoids making possibly misguided assumptions about the nature of the data.
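To see why manual tuning is unsatisfactory, consider the following sketch. It smooths the same bimodal sample with two hand-picked bandwidths and reports how deep the valley between the two modes appears. The helper names `smooth_histogram` and `valley_ratio`, the synthetic data, and the two bandwidth values are all assumptions made for illustration. None of this is part of the library.

```python
import numpy as np

def smooth_histogram(samples, bandwidth, bins=512, limits=(-8.0, 8.0)):
    """Bin the samples on a fixed grid and blur with a Gaussian kernel."""
    counts, edges = np.histogram(samples, bins=bins, range=limits,
                                 density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    width = edges[1] - edges[0]
    # Truncate the kernel at four bandwidths, capped so that it stays
    # shorter than the grid.
    half = min(int(np.ceil(4 * bandwidth / width)), (bins - 1) // 2)
    offsets = np.arange(-half, half + 1) * width
    kernel = np.exp(-0.5 * (offsets / bandwidth)**2)
    kernel /= kernel.sum()
    return centers, np.convolve(counts, kernel, mode='same')

def valley_ratio(centers, density):
    """Density at the grid point nearest x = 0, relative to the maximum."""
    middle = np.argmin(np.abs(centers))
    return density[middle] / density.max()

# Synthetic bimodal sample: two well-separated normal distributions.
rng = np.random.default_rng(seed=1)
samples = np.concatenate([rng.normal(-1.0, 0.4, 2500),
                          rng.normal(+1.0, 0.4, 2500)])

ratios = {}
for bandwidth in (0.2, 1.5):
    centers, density = smooth_histogram(samples, bandwidth)
    ratios[bandwidth] = valley_ratio(centers, density)
    print(f'bandwidth {bandwidth}: valley/peak ratio = {ratios[bandwidth]:.2f}')
```

With the narrow kernel the two modes remain clearly separated, whereas the wide kernel merges them into a single hump. Both results come from the same data, so which one is reported depends only on the bandwidth the user happened to pick.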