Scalable Kernel Learning and
Gaussian Processes

Kernel methods, such as Gaussian processes, have had an exceptionally consequential impact on machine learning theory and practice. However, these methods face two fundamental open questions: (i) how to flexibly learn an appropriate kernel function from data, and (ii) how to scale inference and learning, which naively require O(n^3) computations and O(n^2) storage for n training points.

This scaling typically limits GP applicability to at most a few thousand datapoints. Moreover, standard approaches to GP scalability typically only work for up to about fifty thousand training points, and sacrifice too much model structure for kernel learning.

We view kernel learning and scalability as two sides of one coin. We most want flexible models on large datasets, which often contain the information necessary to learn rich statistical representations through kernel learning -- enabling strong predictive performance and new scientific insights into our modelling problems.

There has been exciting recent work targeted at addressing these limitations.

On this page, we have code, tutorials, demos, and resources for our work on scalable kernel learning. We introduce approaches for scalable inference and for automatically learning kernel functions. Several of these approaches have the inductive biases of deep learning architectures, while retaining a probabilistic non-parametric representation, and allowing for O(n) training and O(1) testing (compared to O(n^3) and O(n^2) for standard GPs). Our approaches to scalability are largely grounded in exploiting algebraic structure in kernel matrices, in combination with local kernel interpolation, for efficient and highly accurate numerical linear algebra operations.

Moreover, we view kernel methods and deep learning as complementary rather than competing approaches; several of the methods below combine the two.

Many of these themes are described in my talk "Scalable Gaussian Processes for Scientific Discovery".

We always welcome constructive feedback. Please feel free to reach out with bug reports or suggestions. To understand this material, you must have a basic understanding of Gaussian processes, as described in the first five chapters of GPML (particularly chapters 2, 4, 5), or in the first two chapters of my PhD thesis.

All of the code here builds upon the GPML library. Before using this code, I suggest following some of the tutorials on the GPML website.

To jump to the various sections:

[Spectral Mixture Kernels | Kronecker Inference | Kronecker non-Gaussian Likelihoods | KISS-GP | Deep Kernel Learning | Stochastic Variational Deep Kernel Learning | GP-LSTM | Datasets]

Andrew Gordon Wilson

November 2, 2016

The spectral mixture (SM) kernel is an expressive basis for all stationary covariance functions. It is formed by modelling the spectral density (Fourier transform) of a kernel as a scale-location Gaussian mixture. One can analytically take the inverse Fourier transform to recover the closed form expression for this kernel. Learning the kernel amounts to learning the hyperparameters of this kernel function, using the same techniques as one would learn the hyperparameters of an RBF kernel, but with special care for how these parameters are initialized. This kernel has been used to discover a wide range of flexible statistical representations, enabling long range crime prediction, time series, image and video extrapolation, as well as new interpretable scientific insights into our modelling problems.

Key Reference:

[arXiv, PDF, Supplementary, Slides, Video Lecture, Typo correction, BibTeX]

Typo correction regarding the spectral mixture kernel for multidimensional inputs

Our other papers discussing spectral mixture kernels:

PhD Thesis, January 2014

[PDF, BibTeX]

[PDF, BibTeX, Code, Slides]

To appear at the

[PDF, Supplement, arXiv, BibTeX, Theme Song]

Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods

[PDF, Supplement, Code]

[PDF, BibTeX]

Andrew Gordon Wilson

Technical Report, University of Cambridge. June 2012.

[PDF]

William Herlands, Andrew Gordon Wilson, Seth Flaxman, Daniel Neill, Wilbert van Panhuis, and Eric P. Xing

[PDF, BibTeX]

[PDF, arXiv, BibTeX]

[PDF, arXiv, Supplement, BibTeX]

Code:

- GPML v3.6 [older version]

- covSM.m

- covSMfast.m [only works if the inputs are 1D]

- initSMhypers.m

- initSMhypersadvanced.m

- smspect.m

- empspect.m

- plotspectral.m

- testSMairline.m

- testSMCO2.m

- testSMCO2_adv.m

- testSMaudio_adv.m

- airline dataset

- CO2 dataset

- audio dataset

- everything for this demo (except GPML) in one zip

The spectral mixture kernel has a simple form. For one-dimensional inputs,

k(\tau) = \sum_{q=1}^{Q} w_q \exp(-2\pi^2 \tau^2 v_q) \cos(2\pi \tau \mu_q), where \tau = x - x'.

Here the w_q are the mixture weights, the \mu_q are the spectral means (frequencies), and the v_q are the spectral variances. For multidimensional inputs, each mixture component becomes a product over input dimensions (see the typo correction above). In some papers (and code) the hyperparameters are represented on a log scale, so check the conventions of the implementation you are using.

One can learn the hyperparameters by maximizing the log marginal likelihood objective (with training data x, y), using the same techniques as for an RBF kernel. Or one can integrate away these hyperparameters using sampling techniques.
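As a concrete reference point, here is a minimal sketch (not one of the tutorial files) of setting up and training a GP with this kernel in GPML. It assumes GPML's covSM parametrization, hyp.cov = [log(w); log(m(:)); log(sqrt(v(:)))], and that x, y, xs are your training inputs, training targets, and test inputs.

```
Q = 10;                          % number of mixture components
D = size(x, 2);                  % input dimension
w = ones(Q, 1) * std(y)^2 / Q;   % weights (signal variances)
m = rand(D, Q);                  % spectral means (frequencies) -- initialize these carefully!
v = ones(D, Q);                  % spectral variances (squared inverse length-scales)
hyp.cov = [log(w); log(m(:)); log(sqrt(v(:)))];
hyp.lik = log(0.1 * std(y));     % noise standard deviation

covfunc = {@covSM, Q};
hyp = minimize(hyp, @gp, -200, @infExact, [], covfunc, @likGauss, x, y);  % learn hyperparameters
[mu, s2] = gp(hyp, @infExact, [], covfunc, @likGauss, x, y, xs);          % predict at xs
```

The random initialization of the frequencies here is only a placeholder; the initialization scripts discussed below do a much better job.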

Step 1: Download the above files.

Step 2: Run testSMairline.m to produce the following time series extrapolation results

(I have included an example run result in the demo zip above.)

Step 3: Try the same experiment with different random seeds. Do you find consistent results?

Replace 'covSM.m' with 'covSMfast.m'. How does it affect the runtime of testSMairline?

[The newer versions of covSM.m are fast and you may not notice a difference].

Step 4: The spectral mixture kernel is flexible, capable of discovering many powerful representations, but the learning surface for a spectral mixture kernel is often highly multimodal, making initialization very important for good predictive success. Thankfully there are several initialization routines for the spectral mixture kernel which are reasonably simple and work on a variety of datasets with different properties.

There are four key hyperparameters: the number of mixture components Q, and, for each component, a weight, a mean, and a variance. The variances can be thought of as inverse length-scales, the means as frequencies, and the weights as signal variances.

Step 5: Set the number of components Q = 1. Hard-code the variance to zero, the frequency to 2, the weight to 1, and plot a draw from a GP with this spectral mixture kernel. Notice that it's a strictly periodic function with a period of 0.5. Now take a draw with the variance set to 1. The resulting draws should become quasiperiodic.
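If you would like to generate such draws programmatically rather than by modifying the tutorial scripts, here is an illustrative sketch (it assumes the covSM parametrization above):

```
x = linspace(0, 3, 500)';                 % inputs at which to draw the function
Q = 1; w = 1; mu = 2; v = 1e-8;           % variance ~ 0: the draw is strictly periodic, period 1/mu = 0.5
hyp.cov = [log(w); log(mu); log(sqrt(v))];
K = feval(@covSM, Q, hyp.cov, x);         % SM kernel covariance matrix over the inputs
[V, E] = eig((K + K') / 2);               % symmetrize, then form a matrix square root
f = V * sqrt(max(E, 0)) * randn(numel(x), 1);   % draw f ~ N(0, K)
plot(x, f);
% now set v = 1 and repeat: the draws become quasiperiodic
```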

Step 6: We can gain more insights into the parameters of the spectral mixture kernel by looking at its associated spectral density. Plot the (log) spectral densities, with help from the smspect.m and plotspectral.m scripts. Run empspect.m to compute the empirical spectral density of the data, and compare it to the learned spectral mixture spectral density.

Notice that the empirical spectrum looks like a very noisy version of the learned spectral mixture spectral density, with major frequency peaks in the same places [we will use this later for initialization].

Step 7: The SM kernel can match a wide range of different covariance functions corresponding to different settings of its hyperparameters. However, learning a good set of hyperparameters can be difficult, because the learning objective is typically quite multimodal in the frequency parameters. On the airline dataset, try looking at the log marginal likelihood as a function of the frequency parameters, fixing the variances and weights.
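One simple way to visualize this multimodality (a sketch under the same covSM assumptions as above, with x and y holding the airline inputs and targets) is to profile the log marginal likelihood over the frequency of a single-component SM kernel, holding the weight and variance fixed:

```
Q = 1; w = std(y)^2; v = 1; sn = 0.1 * std(y);
freqs = linspace(0.01, 0.5, 200);            % candidate frequencies to profile
nlml = zeros(size(freqs));
for k = 1:numel(freqs)
    hyp.cov = [log(w); log(freqs(k)); log(sqrt(v))];
    hyp.lik = log(sn);
    nlml(k) = gp(hyp, @infExact, [], {@covSM, Q}, @likGauss, x, y);  % negative log marginal likelihood
end
plot(freqs, -nlml); xlabel('frequency'); ylabel('log marginal likelihood');
```

The many local optima in this one-dimensional slice are a good illustration of why initialization matters so much.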

Step 8: In step 2 we used the script initSMhypers.m to initialize the spectral mixture kernel. It's a relatively simple procedure, where we initialize the weights, frequencies, and variances by sampling from simple distributions; in particular, the frequencies are drawn from a uniform distribution bracketed by the parameter Fs.

We can then evaluate the marginal likelihood a number of times (e.g. 100), sampling from this initialization script each time, and start optimizing the marginal likelihood with the initialization that has the highest marginal likelihood [this is what we do in the next section on image inpainting]. Alternatively, we can try a smaller number (e.g. 10) of short optimization runs, and then a longer run, starting with the initialization that had the highest marginal likelihood after the short runs. Try both approaches and see which you prefer! In particular, in testSMairline.m, try playing with the trade-off between the number of random initializations and the length of each optimization run.

Step 9: In Step 8, we considered a simple initialization strategy. It won't always work nicely, particularly if you can't find a good setting for Fs. Take a look at initSMhypersadvanced.m, which instead uses the empirical spectral density (see empspect.m) to guide the initialization.

Notice that all of the major peaks in the empirical spectrum appear to be in the right places. We can use this property to find a good initialization for the frequency parameters.

Step 10: The empirical spectral density is often too noisy to use to directly create a good kernel function. However, it often has its sharp peaks near the right frequencies. We can treat the empirical spectral density as observed, and then directly fit a mixture of Gaussians to this empirical spectral density, using a standard Gaussian mixture density estimator (by taking samples from the empirical inverse CDF of the spectral density). We can then use the fitted parameters as an initialization for the spectral mixture kernel hyperparameters, as sketched below.
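Here is a rough sketch of that idea (illustrative only, not the contents of initSMhypersadvanced.m). It assumes vectors s and p holding the frequencies and empirical spectral density returned by empspect.m, training targets y, and the Statistics Toolbox function fitgmdist:

```
p = p / trapz(s, p);                          % normalize the empirical spectrum to a density
cdfp = cumtrapz(s, p);                        % empirical CDF over frequency
[cdfu, iu] = unique(cdfp);                    % interp1 needs strictly increasing abscissae
samples = interp1(cdfu, s(iu), rand(1e4, 1)); % inverse-CDF sampling of frequencies
gm = fitgmdist(samples, Q, 'RegularizationValue', 1e-6);   % fit a Q-component Gaussian mixture
mu0 = gm.mu(:);                               % initial spectral means (frequencies)
v0 = squeeze(gm.Sigma); v0 = v0(:);           % initial spectral variances
w0 = gm.ComponentProportion(:) * std(y)^2;    % initial weights, scaled to the signal variance
hyp.cov = [log(w0); log(mu0); log(sqrt(v0))];
```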

Step 11: In addition to fitting the empirical spectral density, script

Step 12: Try the included audio dataset, and look at an initialization that works well for these data. In general, try writing your own initialization scripts. There are many ways to initialize the spectral mixture kernel, and I haven't found a magic bullet (but you might!). There are some more advanced ideas in our work on change surfaces. I have found initialization gets easier with more data. If you are interested in using the SM kernel, I highly recommend following the next tutorial, which considers new initialization scripts for more data and multidimensional inputs.

When inputs are on a multidimensional grid (e.g., a Cartesian product across grid dimensions), and the kernel decomposes as a product across grid dimensions, then the data covariance matrix decomposes as a Kronecker product of smaller covariance matrices, each formed by a kernel operating on a single grid dimension.

Consider, for example, modelling a 1000 x 1000 pixel image, using a GP with an RBF kernel. The inputs are pixel locations (in R^2), and the outputs are pixel intensities. There are therefore 1 million datapoints, and the corresponding 10^6 x 10^6 covariance matrix could not normally even be represented on a computer. However, in this case, the inputs are on a fully connected two dimensional grid: every horizontal pixel coordinate has an associated vertical pixel coordinate. Moreover, the RBF (or, more generally, ARD) kernel decomposes as a product of RBF kernels acting on each input dimension. This decomposition corresponds to a soft a priori assumption of independence across input dimensions, but it does not rule out posterior correlations, and in fact, provides a useful inductive bias.

Therefore the 10^6 x 10^6 covariance matrix decomposes into a Kronecker product of two 10^3 x 10^3 covariance matrices, enabling exact inference and learning at a drastically reduced cost: we only ever need to store and decompose the small Kronecker factors, never the full covariance matrix.
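To see why this helps, here is a small illustrative sketch (not part of the tutorial code) of solving the GP linear system (K + \sigma^2 I)^{-1} y when K = kron(K1, K2), without ever forming K. It assumes K1 (n1 x n1), K2 (n2 x n2), targets y of length n1*n2 (ordered consistently with the Kronecker product), and a noise variance sn2:

```
[Q1, E1] = eig((K1 + K1') / 2);         % eigendecompose the small Kronecker factors
[Q2, E2] = eig((K2 + K2') / 2);
e = kron(diag(E1), diag(E2));           % eigenvalues of kron(K1, K2)
n1 = size(K1, 1); n2 = size(K2, 1);
Y = reshape(y, n2, n1);
A = Q2' * Y * Q1;                       % multiply by kron(Q1, Q2)' using small matrix products
A = reshape(A(:) ./ (e + sn2), n2, n1); % divide by the eigenvalues of (K + sn2*I)
alpha = Q2 * A * Q1';                   % multiply by kron(Q1, Q2)
alpha = alpha(:);                       % alpha = (kron(K1, K2) + sn2*eye(n1*n2)) \ y
```

The only expensive operations are eigendecompositions of the 10^3 x 10^3 factors and a handful of small matrix products, rather than anything involving a 10^6 x 10^6 matrix.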

In this section, we will demo a method which allows Kronecker inference even in the case of partial grid structure: suppose we have missing pixels in an image, or water covering up part of a map in a spatiotemporal statistics problem. Moreover, we will combine this scalable and exact inference with spectral mixture kernels, with demos on image inpainting problems.

The main reference for this section is:

[PDF, BibTeX, Code, Slides]

Other references which use this methodology include:

PhD Thesis, January 2014

[PDF, BibTeX]

[PDF, Supplement, Code]

William Herlands, Andrew Gordon Wilson, Seth Flaxman, Daniel Neill, Wilbert van Panhuis, and Eric P. Xing

[PDF, BibTeX]

Video Resource:

Scalable Gaussian Processes for Scientific Discovery (EPFL, February 2016)

Audio Resource:

Abstract, Slides, and Audio, for my November 2014 Oxford lecture "Kernel Methods for Large Scale Representation Learning"

1) Download GPML v3.5 and add the files to the Matlab (or Octave) path.

2) For the tutorials, the only required external files are 'vincent.m', 'spectral_init.m', 'test_covGrid.m', 'test_infGrid.m', and the datasets. 'covSMfast.m' is an optional faster implementation of covSM. All tutorial files are available in this zip file. Unzip these tutorial files into one directory.

covGrid is used to generate a covariance matrix with Kronecker product structure, which can be exploited by infGrid. This test shows the generated covariance matrix is identical to what would be obtained using standard methods.

infGrid exploits Kronecker structure for scalable inference and learning. In the case where the inputs are on a grid, with no missing inputs, the inference and learning is also exact. In this demo, the data only has partial grid structure: some grid locations are unobserved, and the method handles these missing entries while still exploiting the Kronecker algebra.
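As a rough sketch of the interface (based on the usage in the Hickory demo further down this page; see test_covGrid.m and test_infGrid.m for the authoritative details), a Gaussian-likelihood version looks roughly like this, where xg is a cell array of per-dimension grid coordinates and Idx indexes the observed grid cells:

```
cov = {@covGrid, {@covSEiso, @covSEiso}, xg};   % Kronecker (grid) structured covariance
hyp.cov = log([1 1 1 1]);                       % two covSEiso factors: [ell1 sf1 ell2 sf2]
hyp.lik = log(0.1); hyp.mean = 0;
hyp = minimize(hyp, @gp, -100, @infGrid, @meanConst, cov, @likGauss, Idx, y);
[Ef, Varf] = gp(hyp, @infGrid, @meanConst, cov, @likGauss, Idx, y, Idx);
```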

Exercise 2: Try changing

Exercise 3: Cycle between the three test cases, cs = 1, 2, 3.

Some example output:

Note the learned spectral mixture kernels. It would be difficult to know how to hard-code kernels like these a priori. The learned structure is interpretable and enables good predictions and extrapolation.

Exercise 2: Study the initialisation script spectral_init.m. This is a basic and robust initialisation for spectral mixture kernels. We sample initial parameters from several basic distributions multiple times, and then start the optimization run from the initialization with the highest marginal likelihood. One of the critical parameters here is Fs, which is used to bracket the uniform distribution on frequencies used to sample initial frequency parameters. This parameter must be adjusted intelligently to the problem at hand, as described in the previous tutorial. Also note that we use a product of 1D spectral mixture kernels (covSMfast.m), and initialize parameters separately from each input dimension. Can you think of other initialisations which might offer even better performance? For an in depth initialisation procedure, see

When we use a non-Gaussian observation model (for example, a Student-t noise model, or a Poisson process with a GP intensity, or classification), then GP inference becomes approximate. Popular approaches to approximate GP inference include MCMC (such as Elliptical Slice Sampling), and deterministic approximations (Laplace, EP, Variational Methods). In these settings, we are no longer simply taking the Kronecker decomposition of (K + \sigma^2 I), but rather typically encounter (K + A), where A is a diagonal matrix with non-constant entries, which cannot undergo a straightforward eigendecomposition via Kronecker algebra.

We show how to approach these linear systems for scalable inference and learning by exploiting the Kronecker structure in K:

Seth Flaxman, Andrew Gordon Wilson, Daniel Neill, Hannes Nickisch, and Alexander J. Smola

To appear at the

[PDF, Supplement, Video Lecture, BibTeX]

This paper also uses spectral mixture kernels, to make long range crime rate forecasts, as part of a Poisson process (and negative Binomial process) observation model, where the GP is modelling the rate function. The spectral mixture kernels are able to automatically discover interpretable structure in the data, which helps inform new policy decisions, as well as make crime predictions on unprecedented time scales.

Here we include a simple demo on Hickory data.

Files

Download infGrid_Laplace.m

Contents

Hickory data

Author: Seth Flaxman 4 March 2015

clear all, close all
% This comes from a dataset giving the locations of trees by type in a forest in
% Michigan [D. J. Gerrard 1969]. I obtained it from the R package spatstat
% as follows (run this code in R):
%   library(spatstat)
%   hicks <- split(lansing)$hickory
%   write.table(data.frame(hicks),"hickories.csv", row.names=F,col.names=F,sep=",")
xy = csvread('hickories.csv');
plot(xy(:,1),xy(:,2),'.')

% create the "computational grid"
n1 = 32; n2 = 32;
x1 = linspace(0,1.0001,n1+1);
x2 = linspace(-0.0001,1,n2+1);
xg = {(x1(1:n1)+x1(2:n1+1))'/2, (x2(1:n2)+x2(2:n2+1))'/2};
i = ceil((xy(:,1)-x1(1))/(x1(2)-x1(1)));
j = ceil((xy(:,2)-x2(1))/(x2(2)-x2(1)));
counts = full(sparse(i,j,1));

% setup the GP
cv = {@covProd,{{@covMask,{1,@covSEiso}},{@covMask,{2,@covSEisoU}}}};
cvGrid = {@covGrid, { @covSEiso, @covSEisoU}, xg};
hyp0.cov = log([.1 1 .1]);
y = counts(:);
X = covGrid('expand', cvGrid{3});
Idx = (1:length(y))';
lik = {@likPoisson, 'exp'};
hyp0.mean = .5;

Laplace approximation without grid structure

tic
hyp1 = minimize(hyp0, @gp, -100, @infLaplace, @meanConst, cv, lik, X, y);
toc
exp(hyp1.cov)

Function evaluation 0; Value 1.114380e+03

Function evaluation 5; Value 1.104834e+03

Function evaluation 8; Value 1.098415e+03

Function evaluation 12; Value 1.097589e+03

Function evaluation 16; Value 1.097445e+03

Function evaluation 18; Value 1.097417e+03

Function evaluation 23; Value 1.096718e+03

Function evaluation 26; Value 1.096703e+03

Function evaluation 30; Value 1.096424e+03

Function evaluation 33; Value 1.096419e+03

Function evaluation 35; Value 1.096419e+03

Function evaluation 37; Value 1.096419e+03

Function evaluation 46; Value 1.096419e+03 Elapsed time is 300.589011 seconds. ans = 0.0815 0.6522 0.0878

Laplace approximation with grid structure and Kronecker inference

tic
hyp2 = minimize(hyp0, @gp, -100, @infGrid_Laplace, @meanConst, cvGrid, lik, Idx, y);
toc
exp(hyp2.cov)

Function evaluation 0; Value 1.151583e+03

Function evaluation 12; Value 1.124633e+03

Function evaluation 14; Value 1.116309e+03

Function evaluation 17; Value 1.115676e+03

Function evaluation 20; Value 1.115373e+03

Function evaluation 23; Value 1.115308e+03 Elapsed time is 41.778548 seconds. ans = 0.2346 0.5824 0.1100

Model comparison

The models learned slightly different hyperparameters. Are the likelihoods different?

fprintf('Log-likelihood learned w/out Kronecker: %.04f\n', gp(hyp1, @infLaplace, @meanConst, cv, lik, X, y));
fprintf('Log-likelihood learned with Kronecker: %.04f\n', gp(hyp2, @infLaplace, @meanConst, cv, lik, X, y));
% and what about the likelihood we actually calculated in infGrid_Laplace,
% using the Fiedler bound?
fprintf('Log-likelihood with Kronecker + Fiedler bound: %.04f\n', gp(hyp2, @infGrid_Laplace, @meanConst, cvGrid, lik, Idx, y));
% let's compare our posterior predictions -- to save time, we'll use infGrid_Laplace to
% make predictions in both cases (this won't hurt the accuracy, as we explain in the paper)
[Ef1 Varf1 fmu1 fs1 ll1] = gp(hyp1, @infGrid_Laplace, @meanConst, cvGrid, lik, Idx, y, Idx, y);
[Ef2 Varf2 fmu2 fs2 ll2] = gp(hyp2, @infGrid_Laplace, @meanConst, cvGrid, lik, Idx, y, Idx, y);
figure(1); imagesc([0 1], [0 1], (reshape(Ef1,32,32)')); set(gca,'YDir','normal'); colorbar;
figure(2); imagesc([0 1], [0 1], (reshape(Ef2,32,32)')); set(gca,'YDir','normal'); colorbar;

Log-likelihood learned w/out Kronecker: 1096.4188
Log-likelihood learned with Kronecker: 1100.3819
Log-likelihood with Kronecker + Fiedler bound: 1115.3084

Kronecker + grid enables a finer resolution

Now let's exploit the grid structure to get a finer resolution map. We'll also switch covariances, just for fun.

hyp.cov = log([.3 1 .2 1]);
lik = {@likPoisson, 'exp'};
hyp.lik = [];
mean = {@meanConst};
hyp.mean = .5;
n1 = 64; n2 = 64;
x1 = linspace(0,1.0001,n1+1);
x2 = linspace(-0.0001,1,n2+1);
xg = {(x1(1:n1)+x1(2:n1+1))'/2, (x2(1:n2)+x2(2:n2+1))'/2};
% binning
y = histcn(xy,x1,x2);
i = ceil((xy(:,1)-x1(1))/(x1(2)-x1(1)));
j = ceil((xy(:,2)-x2(1))/(x2(2)-x2(1)));
counts = full(sparse(i,j,1));
y = counts(:);
xi = (1:numel(y))';
cov = {@covGrid, { {@covMaterniso,5}, {@covMaterniso,5}}, xg};
opt.cg_maxit = 400; opt.cg_tol = 1e-2;
inf_method = @(varargin) infGrid_Laplace(varargin{:},opt);
tic
hyp2 = minimize(hyp, @gp, -100, inf_method, mean, cov, lik, xi, y(:));
toc
exp(hyp2.cov)
% let's compare our posterior predictions
% since we don't care about variance, we can save memory by doing this:
[post nlZ dnlZ] = infGrid_Laplace(hyp2, mean, cov, lik, xi, y(:));
post.L = @(x) 0*x;
ymu = gp(hyp2, @infGrid_Laplace, mean, cov, lik, xi, post, xi);
figure(3), imagesc([0 1], [0 1], reshape(ymu,size(counts))'), hold on;
set(gca,'YDir','normal');
plot(xy(:,1),xy(:,2),'w.');
colorbar
hold off

Function evaluation 0; Value 1.954834e+03

Function evaluation 6; Value 1.946841e+03

Function evaluation 8; Value 1.940877e+03

Function evaluation 11; Value 1.940012e+03

Function evaluation 14; Value 1.938218e+03

Function evaluation 17; Value 1.936884e+03

Function evaluation 20; Value 1.934900e+03

Function evaluation 22; Value 1.933195e+03 Elapsed time is 187.843026 seconds. ans = 0.3051 0.7512 0.1636 0.7512


Structure exploiting approaches (such as Kronecker and Toeplitz methods) are extremely scalable and accurate, but they require the inputs to lie on a grid.

Conversely, inducing point methods are popular for their general purpose "out of the box" applicability, without requiring any special structure in the data. However, these methods are limited by requiring a small number of inducing points (m << n) for computational tractability, which sacrifices accuracy and expressive power.

With KISS-GP, we combine the scalability and accuracy of structure exploiting methods with the general applicability of inducing point approaches. These approaches are O(n + g(m)) for inference and learning, and O(1) per test prediction, where m is the number of inducing points, and g(m) ~ m. The ability to use a large number of inducing points (m on the order of n, or more) is what allows KISS-GP to remain near-exact in its accuracy.

The basic idea behind KISS-GP is to create a multidimensional grid of latent inducing points, and exploit the resulting Kronecker, Toeplitz, and Circulant structure in the covariance matrix evaluated over all pairs of these inducing points. We then perform local kernel interpolation to go back into the observed input space, which may have arbitrary structure. We can view this approach as creating an approximate kernel, from a given user-specified kernel, which admits fast numerical linear algebra operations for inference and learning with Gaussian processes. The quality of this approximation is often close to exact within numerical precision, given the ability of KISS-GP to use many inducing points.

In short, given a user-specified kernel k, KISS-GP forms an approximate kernel by interpolating the covariance matrix of k evaluated on a grid of inducing points, as written below.
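For reference, this structured kernel interpolation (SKI) approximation can be written as

K_{X,X} \approx W K_{U,U} W^T, \quad k_{SKI}(x, z) = w_x^T K_{U,U} w_z,

where U is the grid of m inducing points, K_{U,U} is the (Kronecker and Toeplitz structured) covariance matrix over that grid, and W (with rows w_x^T) is a sparse matrix of local interpolation weights, with only a handful of non-zero entries per row.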

The full details are described in these papers:

[PDF, Supplement, arXiv, BibTeX, Theme Song, Video Lecture]

arXiv pre-print, 2015

(See KISS-GP and Deep Kernel Learning for more empirical demonstrations).

[arXiv, PDF, BibTeX, Music]

Video Lecture: KISS-GP Introduction at ICML 2015

Tutorial:

Step 1: Download kissgpdemo.zip

Step 2: Add all folders to your path.

The key files are covGrid.m and infGrid.m.

Read the documentation in these files closely: it will tell you how to switch on/off Toeplitz structure, Circulant structure, BTTB structure, Kronecker structure, etc., and how to switch interpolation strategies.

Step 3: Run all the files in /test

Step 4: Run the files in /examples. In constructing the grid, pay close attention to whether single braces { or double braces {{ are used. Double braces enable the BTTB functionality.

Step 5: Run the supervised projection example. The helper function covGrid('create') is extremely valuable for creating grids of inducing points based on standard input data. For any problem with greater than 1D inputs, I would highly recommend using covGrid('create'), and getting practice with its use.

One of the defining features of KISS-GP is performing local kernel interpolation. You can see the routines for local kernel interpolation, including multidimensional interpolation, in

For higher dimensional inputs, and KISS-GP with kernel learning, see Deep Kernel Learning. This is the code I would recommend using for good performance on multidimensional input problems, with a reasonable number

KISS-GP allows us to naturally create and exploit algebraic structure for highly scalable inference and learning, without requiring input data on a grid. However, to exploit Kronecker structure, the inducing points in KISS-GP must live in a low dimensional space (e.g. D<5). The paper "Thoughts on massively scalable Gaussian processes" discusses how to learn supervised projections of the data inputs into this low dimensional space, focusing on linear projections.

We can also use very flexible non-linear projections of the kernel inputs, such as deep feedforward or convolutional networks. Note that in regression, for example, deep feed-forward neural networks will nearly always project into a one dimensional output space.

By learning a deep transformation of the kernel inputs, we are both learning a very flexible (and also interpretable) kernel, and a salient representation of the data.

We test deep kernel learning on a wide range of regression problems (the entire UCI repository), with a variety of different input dimensionalities, data sizes, and properties, showing the general applicability and strong predictive performance of the approach.

The basic idea is to start with a base kernel k(x, x' | \theta), with hyperparameters \theta, and to transform its inputs using a non-linear function g(x, w) given by a deep architecture (e.g., a feedforward or convolutional network) with weights w.

We take the deep kernel to be k(g(x, w), g(x', w) | \theta).

We then replace the exact deep kernel with the approximate KISS-GP approximation for this kernel. We then learn both the network weights w and the kernel hyperparameters \theta jointly, by maximizing the Gaussian process marginal likelihood.
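Concretely, for regression with Gaussian noise, all of the parameters {w, \theta, \sigma^2} are learned jointly by maximizing the standard GP log marginal likelihood with the deep kernel plugged in:

\log p(y | X, w, \theta, \sigma^2) = -\frac{1}{2} y^T (K_{deep} + \sigma^2 I)^{-1} y - \frac{1}{2} \log |K_{deep} + \sigma^2 I| - \frac{n}{2} \log 2\pi,

where [K_{deep}]_{ij} = k(g(x_i, w), g(x_j, w) | \theta), and KISS-GP supplies the fast approximate linear algebra for the solve and log determinant.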

We often find using spectral mixture base kernels provides a good boost in performance.

[PDF, arXiv, BibTeX]

arXiv pre-print, 2015

(See KISS-GP and Deep Kernel Learning for more empirical demonstrations).

[arXiv, PDF, BibTeX, Music]

Here we include the code to implement Deep Kernel Learning, and to reproduce many of the results in the paper. It requires GPML 3.6 and Caffe as dependencies. The GPML version used is included in the zip.

We provide 3 examples (under `examples/`):

- UCI Kin40K regression (section 5.1 of the paper)
- MNIST digit magnitude extraction (section 5.3)
- Olivetti face orientation extraction (section 5.2)

The modified caffe library is under `caffe/`. Please refer to the installation instructions for prerequisites and configurations. Run the following commands for installation:

make
make matcaffe

We take the Olivetti face task as an example.

cd caffe
unzip examples/face/data/face_images.zip
sh examples/face/create_face.sh   # create databases for caffe CNN training and DKL training

Databases will be created under `caffe/examples/face/data`.

Under `caffe/`, run:

sh examples/face/train_face.sh

Please refer to the tutorial for CNN configurations, etc.

After training you will see the trained CNN model `caffe/examples/face/face_iter_6000.caffemodel`. We have also provided a pre-trained model `caffe/examples/face/face_iter_6000.caffemodel.example`, so that you can skip this step and go straight to DKL model training.

The DKL code is under `examples/face`, where `dkl_rbf_face.m` uses the RBF kernel as the base kernel, and `dkl_sm_face.m` uses the spectral mixture kernel as the base kernel. Training parameters are configured at the beginning of the respective files, including the pre-trained CNN model for initialization, the number of training iterations, the learning rate, the grid size, etc.

Under `examples/face`, run the models using MATLAB:

run ../../gpml/startup.m   % setup gpml
run dkl_rbf_face.m         % or dkl_sm_face.m

- For preparing the MNIST dataset, run

```
cd caffe
sh examples/mnist/data/get_mnist.sh
sh examples/mnist/create_mnist.sh
```

- For preparing the Kin40k dataset, simply run `sh examples/kin40k/create_kin40k.sh` under `caffe/`

- Use a GPU for the DNN part to speed up training

We may want to use Gaussian processes and KISS-GP with stochastic gradient descent and mini-batches, for further scalability. However, the Gaussian process marginal likelihood does not factorize over datapoints. Here we introduce stochastic variational deep kernel learning (SV-DKL), which combines deep kernel learning and KISS-GP with stochastic variational inference, enabling mini-batch training and extending deep kernel learning to classification.
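The key mechanism is a variational lower bound on the marginal likelihood, defined through a set of inducing variables u. Schematically (this is the generic stochastic variational GP bound, shown to convey the idea rather than the exact SV-DKL objective):

\log p(y) \ge \sum_{i=1}^{n} E_{q(f_i)}[ \log p(y_i | f_i) ] - KL( q(u) || p(u) ).

Because the first term is a sum over datapoints, it can be estimated without bias from mini-batches, which is what makes stochastic gradient training possible.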

To appear in

[arXiv, PDF]

Here we provide code to reproduce results in the paper. We focus on classification problems.

The GP-LSTM

Many applications in speech, robotics, finance, and biology deal with sequential data, where ordering matters and recurrent structures are common. However, this structure cannot be easily captured by standard kernel functions. To model such structure, we propose expressive closed-form kernel functions for Gaussian processes. The resulting model, GP-LSTM, fully encapsulates the inductive biases of long short-term memory (LSTM) recurrent networks, while retaining the non-parametric probabilistic advantages of Gaussian processes. We learn the properties of the proposed kernels by optimizing the Gaussian process marginal likelihood using a new provably convergent semi-stochastic procedure and exploit the structure of these kernels for fast and scalable training and prediction. We demonstrate state-of-the-art performance on several benchmarks, and investigate a consequential autonomous driving application, where the predictive uncertainties provided by GP-LSTM are uniquely valuable.

For scalability, we introduce a new provably convergent semi-stochastic training procedure for kernel learning via GP marginal likelihoods, and combine this procedure with KISS-GP.

Learning Scalable Deep Kernels with Recurrent Structure
Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, Eric P. Xing
arXiv pre-print


The codebase contains tutorial examples. It also provides an alternative implementation of Deep Kernel Learning using Keras with Theano or Tensorflow, and shows how to call KISS-GP functionality from Python. However, it is still in alpha testing, and may not be as stable as the Caffe based Deep Kernel Learning code.

In several papers, we have compared against all UCI regression datasets:

À la carte - learning fast kernels

[PDF, BibTeX]

[PDF, arXiv, BibTeX]

We provide these data here (kindly compiled by Zichao Yang) with the normalization and pre-processing that we used, so that you have a way to test whether this code is working for you: can you reproduce the results in Table 1 of Deep Kernel Learning?

These data also provide an easy way to benchmark your own methods on all of these problems. If you use the data for this purpose, please provide a reference to this webpage. Since the code for many of the methods we used is available (e.g. DKL), please verify some of these results using the code here before taking the results in the table for granted in your own comparison. You may, for instance, have forgotten to run the pre-processing or normalization routines, or the pre-processing may have changed between posting these data and the writing of those papers. You may also not be applying the code on this page correctly on other problems, so testing on these data provides a sanity check.

Many thanks to my wonderful collaborators. Particular thanks to Hannes Nickisch, Zhiting Hu, and Maruan Al-Shedivat, for their contributions to the implementations of KISS-GP, Deep Kernel Learning, and the GP-LSTM, respectively.