Efficient Hadamard multitask model #1910
Replies: 2 comments 4 replies
-
Have you tried using botorch's own Multitask GP here? https://botorch.org/api/models.html#botorch.models.multitask.MultiTaskGP In general, I believe you can exploit botorch's acquisition optimization to fix dimensions such as the task dimension in what you're trying to set up. Maybe try asking in the botorch discussions instead if the botorch implementation isn't efficient enough for you.
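A minimal sketch of that route, assuming a Hadamard-style setup where the last input column is the task index (the toy data, kernel defaults, and the EI choice below are placeholders, not anything from the original post):

```python
import torch
from botorch.models.multitask import MultiTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

# Toy Hadamard-style data: 2 design variables, last column = task index.
train_X = torch.rand(20, 2, dtype=torch.double)
tasks = torch.randint(0, 2, (20, 1)).to(train_X)
train_X = torch.cat([train_X, tasks], dim=-1)
train_Y = torch.sin(6.0 * train_X[:, :1]) + 0.1 * torch.randn(20, 1, dtype=torch.double)

model = MultiTaskGP(train_X, train_Y, task_feature=-1)
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_mll(mll)  # older botorch versions: fit_gpytorch_model(mll)

acqf = ExpectedImprovement(model, best_f=train_Y.max())

# Fix the task dimension (column 2) to task 0, so the optimizer only
# searches over the two actual design variables.
bounds = torch.tensor([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]], dtype=torch.double)
candidate, value = optimize_acqf(
    acqf, bounds=bounds, q=1, num_restarts=5, raw_samples=32,
    fixed_features={2: 0.0},
)
```

With `fixed_features`, the task column never varies during candidate optimization, so you only search over the actual design variables.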
-
I dug a little deeper and found that memory usage is normal as long as simple kernel structures are used. However, as soon as kernels are combined, memory usage increases sharply. I am not sure whether this extra memory is really needed or whether a lazy tensor evaluation is executed inefficiently somewhere -- hence, I'm opening this discussion as a potential issue. Below is an example that demonstrates the problem. It is largely based on the Hadamard example, with only minor differences; in the code, I've marked the point from which onwards the original logic is used.
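A minimal sketch of such a setup, assuming a summed data kernel as the "combined" kernel (the Hadamard example otherwise unchanged):

```python
import torch
import gpytorch

class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # Combining kernels (here: a sum of two scaled kernels) is what
        # triggers the sharp memory increase; a single RBFKernel is fine.
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        ) + gpytorch.kernels.ScaleKernel(gpytorch.kernels.MaternKernel())
        self.task_covar_module = gpytorch.kernels.IndexKernel(num_tasks=2, rank=1)

    def forward(self, x, i):
        # --- from here on, the original Hadamard example logic is used ---
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        covar_i = self.task_covar_module(i)
        covar = covar_x.mul(covar_i)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar)
```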
-
Hi all,
I am currently trying to set up a multitask model and I'm wondering how to execute it efficiently. The code is based on this gpytorch example, where the covariance matrix is constructed using an IndexKernel as follows:
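For reference, the covariance construction in that example's `forward` looks roughly like this:

```python
def forward(self, x, i):
    mean_x = self.mean_module(x)
    # input-input covariance
    covar_x = self.covar_module(x)
    # task-task covariance from the IndexKernel
    covar_i = self.task_covar_module(i)
    # Hadamard product of the two
    covar = covar_x.mul(covar_i)
    return gpytorch.distributions.MultivariateNormal(mean_x, covar)
```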
I am puzzled about the line `covar = covar_x.mul(covar_i)`, which in contrast to `covar_x` and `covar_i` returns a `NonLazyTensor`. Now the issue is that I'm using the model in combination with botorch's acquisition functions, which, to evaluate a set of candidate points, treat these points as independent "t-batches" (see here). This means that, in order to evaluate `N` candidates, `N` covariance matrices are constructed, each covering the training data + 1 of the candidate points, even though only the `N` marginal distributions of the candidate points are needed in the end. My expectation was that this would be handled efficiently through lazy tensor evaluation; however, that does not seem to be the case, and it destroys the computation performance. I've also tried to explicitly use a lazy multiplication (see the sketch below), but this apparently does not solve the problem, as the computational burden gets shifted to the `root_decomposition()` calls in `MulLazyTensor`, which I haven't really grasped yet. I'm currently stuck and help would be much appreciated =)
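The lazy multiplication attempt, roughly (a sketch; constructing the `MulLazyTensor` directly instead of calling `.mul()`):

```python
from gpytorch.lazy import MulLazyTensor

# in forward(), instead of covar = covar_x.mul(covar_i):
covar = MulLazyTensor(covar_x, covar_i)
```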