# KDE-diffusion¶

Kernel density estimation is a statistical method to infer the true probability density function that governs the distribution of a random variable from discrete observations of that same entity. The variable may have more than one component, i.e. be described by several coordinates.

An instructive, two-dimensional example is population density, which is derived from discrete locations, such as people’s home addresses or animals encountered at specific places in the wild. A technical use case is the determination of a spatially-resolved particle flux as measured by a detector array that is sensitive to rare, individual impacts.

Kernel density estimation basically works like this: Bin the discrete observations in a histogram. This is straightforward and takes little computation time. Then smooth the data over the bins/grid with an image filter that adds adequate blur. The shape of the filter function is referred to as the “kernel” and its spatial extent as the “bandwidth”. The trick is to find the optimal filter size, one that does not smear out the data too much, but also averages over the artifacts that are due to the discrete nature of the input.
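The two stages, binning and smoothing, can be sketched with NumPy and SciPy alone. The bandwidth here is fixed by hand, which is exactly the step the diffusion method automates; the sample data and the value of `sigma` are made up for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Discrete observations: 1000 draws from a standard normal distribution.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=1000)

# Stage 1: bin the observations in a histogram on a regular grid.
counts, edges = np.histogram(x, bins=256, range=(-5, 5), density=True)

# Stage 2: blur the binned data with a Gaussian image filter. The filter
# width (in grid cells) plays the role of the bandwidth and is simply
# guessed here, rather than selected automatically.
sigma = 5
density = gaussian_filter1d(counts, sigma)
```

Too small a `sigma` leaves binning artifacts in place; too large a one smears out genuine features of the distribution.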

This library provides the adaptive kernel density estimator based on linear diffusion processes for one-dimensional and two-dimensional input data as outlined in the 2010 paper by Botev et al. The reference implementation for 1d and 2d, in Matlab, was provided by the paper’s first author, Zdravko Botev. This is a re-implementation in Python, with added test coverage.

The diffusion-inspired method is particularly fast: orders of magnitude faster than, for instance, SciPy's Gaussian kernel density estimator or the estimators provided by Scikit-Learn, and faster than most of KDEpy's, except for FFTKDE, which uses a very similar algorithm but offers no automatic bandwidth selection in dimensions higher than one.

Automatic bandwidth selection, however, is key. Without it, one may as well apply a Gaussian filter and manually tune its size, i.e. the bandwidth, until the results look pleasing to the human eye. Bandwidth selection is what makes kernel density estimation a non-parametric method, so that we avoid making possibly misguided assumptions about the nature of the data.

## Installation¶

The library is available on PyPI and can be readily installed via pip:

```shell
pip install KDE-diffusion
```


Run `pip uninstall KDE-diffusion` to remove the package from your system.

The library requires NumPy and SciPy, which pip will install automatically if they are missing.

## Usage¶

Code example for one-dimensional input data:

```python
# Sample data points from normal distribution.
from numpy.random import normal
x = normal(size=1000)

# Estimate density within ±5 standard deviations.
from kde_diffusion import kde1d
(density, grid, bandwidth) = kde1d(x, n=256, limits=5)

# Calculate actual density on same grid.
from scipy.stats import norm
actual = norm.pdf(grid)

# Plot estimated and actual density.
from matplotlib import pyplot
figure = pyplot.figure()
axes = figure.add_subplot()
axes.plot(grid, density, label='estimated', linewidth=5, color='goldenrod')
axes.plot(grid, actual,  label='actual', linestyle='--', color='black')
axes.legend()
pyplot.show()
```


Code example for two-dimensional input data:

```python
# Sample data points from normal distribution.
from numpy.random import normal
x = normal(size=1000)
y = normal(size=1000)

# Estimate density within ±5 standard deviations.
from kde_diffusion import kde2d
(density, grid, bandwidth) = kde2d(x, y, n=256, limits=5)

# Display estimated density as image.
from matplotlib import pyplot
figure = pyplot.figure()
axes = figure.add_subplot()
axes.imshow(density.T)
pyplot.show()
```


Note that the density is returned in matrix index order, i.e. with the first index referring to the x-axis and the second to the y-axis. (NumPy calls this 'ij' indexing, as opposed to the 'xy' or Cartesian indexing of plots and images.) This is the common convention for 2d histograms and kernel density estimates in scientific computing. Images, however, are universally indexed the other way around: y before x. This is why the density in the example is transposed before being displayed.
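The index order can be checked with NumPy's 2d-histogram function, which follows the same matrix ('ij') convention; the deliberately skewed sample below is made up for illustration.

```python
import numpy as np

# Many distinct x positions, but every y value exactly zero: the spread
# must then show up along the first index of the 2d histogram if that
# index indeed runs along x.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=1000)
y = np.zeros(1000)

counts, xedges, yedges = np.histogram2d(x, y, bins=16,
                                        range=[(-5, 5), (-5, 5)])

# All mass sits in a single column (one y bin), spread over many rows
# (x bins). Transposing puts y first, the layout image routines expect.
image = counts.T
```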

## Implementation¶

This Python library was developed from the existing Matlab implementation for 1d and 2d, which served as the primary reference (albeit possibly in earlier versions than the ones now stored at those locations). The original paper was a secondary source, consulted mostly for the nomenclature and the general idea behind the method.

The Matlab implementation was first rewritten in Python to give the exact same result, including intermediate steps, for the same test case: a univariate normal sampling. See the verification folder in the source-code repository for details.

Subsequently, the Python code was refactored to blend in better with the existing software ecosystem: it leverages SciPy's forward and backward discrete cosine transforms for one or n dimensions, dct/dctn and idct/idctn, as well as NumPy's histogram and histogram2d, instead of the custom versions the Matlab reference employs.
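These SciPy building blocks can be exercised in isolation: a round trip through the forward and backward transform must reproduce the binned data. This is merely a sanity check of the components named above, not the library's code.

```python
import numpy as np
from scipy.fft import dctn, idctn

# Stand-in for binned 2d data on a regular grid.
rng = np.random.default_rng(seed=0)
binned = rng.random((32, 32))

# Forward discrete cosine transform (type II) and its inverse (type III).
transformed = dctn(binned, norm='ortho')
recovered = idctn(transformed, norm='ortho')

assert np.allclose(recovered, binned)
```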

The reference uses a cosine transformation with a weight for the very first component that is different from the one in any of the four types of the transformation supported by SciPy. There is an easy work-around for that, which is used in the current code. It should however be possible to rewrite the algorithm in a more elegant way, one that avoids the work-around altogether.

The Matlab implementation also bins the data somewhat differently in the 1d and the 2d case. This minor inconsistency was removed in the Python implementation. The change is arguably insignificant as far as the final results are concerned, but it is a deviation nonetheless. In practical use, based on a handful of tests, both implementations yield indiscernible results.

The 2d density is returned in matrix index order ('ij' indexing in NumPy's terminology), in which the first index (the matrix row) refers to the x-coordinate and the second index (the matrix column) to y. This is consistent with the results returned by other kernel density estimators, such as SciPy's, as well as NumPy's 2d-histogram function. When saving or displaying the 2d density as an image, a different memory layout is expected and the index order has to be reversed: y before x. This comes down to a simple transposition, i.e. adding .T in the code.

In very broad strokes, the method is this:

• Bin the data on a regular grid.

• Transform the binned data to Fourier space; this leaves Gaussian kernels intact.

• Gaussians are also elementary solutions of the diffusion equation.

• Leverage this correspondence to define a condition for optimal smoothing.

• Find the optimum by iteration in Fourier space.

• Smooth the transformed data with the optimized Gaussian kernel.

• Reverse the transformation to obtain the density estimate.
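Assuming a fixed, hand-picked bandwidth, the smoothing part of these steps can be sketched in code. The automatic bandwidth selection, i.e. the fixed-point iteration the library actually performs in Fourier space, is omitted here; the value of `t` is simply guessed for illustration.

```python
import numpy as np
from scipy.fft import dct, idct

# Bin the data on a regular grid.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=1000)
counts, edges = np.histogram(x, bins=256, range=(-5, 5), density=True)

# Transform to (cosine) Fourier space.
transformed = dct(counts, norm='ortho')

# Smoothing with a Gaussian kernel corresponds to evolving the diffusion
# equation for a "time" t: each mode k is damped by exp(-(pi*k)**2 t/2).
# Here t is a guess; the library would find it by iteration in this space.
k = np.arange(256)
t = 1e-3
smoothed = transformed * np.exp(-(np.pi * k) ** 2 * t / 2)

# Reverse the transformation to obtain the density estimate.
density = idct(smoothed, norm='ortho')
```

Note that the k = 0 mode is left untouched, so the total probability mass is preserved by the smoothing.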

## API¶

Code documentation of the public application programming interface provided by this library.

`kde1d(x, n=1024, limits=None)`

Estimates the 1d density from discrete observations.

The input is a list/array x of numbers that represent discrete observations of a random variable. They are binned on a grid of n points within the data limits, if specified, or within the limits given by the values’ range. n will be coerced to the next highest power of two if it isn’t one to begin with.
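The coercion to the next highest power of two can be reproduced as follows; this is a hypothetical helper for illustration, not the library's actual code.

```python
from math import ceil, log2

def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    return 2 ** ceil(log2(n))

print(next_power_of_two(1000))  # 1024
print(next_power_of_two(256))   # 256
```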

The limits may be given as a tuple (xmin, xmax) or a single number denoting the upper bound of a range centered at zero. If any of those values are None, they will be inferred from the data.

After binning, the function determines the optimal bandwidth according to the diffusion-based method. It then smooths the binned data over the grid using a Gaussian kernel with a standard deviation corresponding to that bandwidth.

Returns the estimated density and the grid upon which it was computed, as well as the optimal bandwidth value the algorithm determined. Raises ValueError if the algorithm did not converge.

`kde2d(x, y, n=256, limits=None)`

Estimates the 2d density from discrete observations.

The input is two lists/arrays x and y of numbers that represent discrete observations of a random variable with two coordinate components. The observations are binned on a grid of n×n points. n will be coerced to the next highest power of two if it isn’t one to begin with.

Data limits may be specified as a tuple of tuples denoting ((xmin, xmax), (ymin, ymax)). If any of the values are None, they will be inferred from the data. Each tuple, or even both of them, may also be replaced by a single value denoting the upper bound of a range centered at zero.
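The accepted shorthands could be normalized as sketched below. This is a hypothetical helper, purely to illustrate the shapes the parameter may take, and not the library's actual code; None entries would be inferred from the data by the library.

```python
def normalize_limits_2d(limits):
    """Expand limit shorthands into ((xmin, xmax), (ymin, ymax)).

    Hypothetical helper for illustration only.
    """
    if limits is None:
        return ((None, None), (None, None))
    if isinstance(limits, (int, float)):
        # Single number: symmetric range about zero for both axes.
        return ((-limits, limits), (-limits, limits))
    normalized = []
    for axis_limits in limits:
        if axis_limits is None:
            normalized.append((None, None))
        elif isinstance(axis_limits, (int, float)):
            # Single number per axis: symmetric range about zero.
            normalized.append((-axis_limits, axis_limits))
        else:
            normalized.append(tuple(axis_limits))
    return tuple(normalized)
```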

After binning, the function determines the optimal bandwidth according to the diffusion-based method. It then smooths the binned data over the grid using a Gaussian kernel with a standard deviation corresponding to that bandwidth.

Returns the estimated density and the grid (along each of the two axes) upon which it was computed, as well as the optimal bandwidth values (per axis) that the algorithm determined. Raises ValueError if the algorithm did not converge or x and y are not the same length.