---
title: "GPU programming"
engine: knitr
---
## Installing CUDA.jl
On a cluster such as Sockeye, where it is not possible to access GPU nodes
with internet access, precompilation is more involved than in the non-CUDA
setup covered in [the earlier page on precompilation](05_pkg_cache.qmd).
We provide a workaround, `pkg_gpu.nf`, which offers the same functionality
as `pkg.nf` but is slower, since all precompilation has to occur on the login
node.
First, add the package as before:
```{julia}
#| output: false
ENV["JULIA_PKG_PRECOMPILE_AUTO"] = 0 # hold off precompilation, since we are on the login node
using Pkg
Pkg.activate("experiment_repo/julia_env")
Pkg.add("CUDA")
```
Next, use the GPU precompilation script:
```{bash}
cd experiment_repo
./nextflow run nf-nest/pkg_gpu.nf
```
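
Once precompilation completes, you can sanity-check the installation from a job running on a GPU node (a minimal sketch; it will fail on the login node, which has no GPU):

```{julia}
#| eval: false
using CUDA

CUDA.functional()   # true when a GPU is visible and the runtime works
CUDA.versioninfo()  # prints driver and runtime details

# Tiny end-to-end test: move data to the device and reduce it
sum(CuArray(Float32[1, 2, 3]))
```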
## Running Nextflow processes requiring a GPU
Here is an example of a workflow using GPUs:
```{groovy}
#| eval: false
#| file: experiment_repo/nf-nest/examples/gpu.nf
```
We run it using the same command as usual:
```{bash}
cd experiment_repo
./nextflow run nf-nest/examples/gpu.nf -profile cluster
```
## GPU kernel development
One way to leverage GPUs is
[array programming](https://cuda.juliagpu.org/stable/usage/array/), as
demonstrated in the example above. When a problem cannot be cast in array
form, an alternative is to write a custom GPU kernel.
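
As a reminder of what array-style GPU code looks like, here is a minimal CUDA.jl sketch (the array sizes are illustrative; it assumes a CUDA-capable GPU):

```{julia}
#| eval: false
using CUDA

x = CUDA.rand(Float32, 10_000)
y = CUDA.rand(Float32, 10_000)

# Broadcasting fuses into a single GPU kernel; no explicit kernel code needed
z = 2f0 .* x .+ y

# Reductions are also offloaded to the device
s = sum(z)
```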
Designing custom GPU kernels is especially attractive in Julia, in large
part thanks to
[KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl),
which allows the same code to target both CPU and GPU backends.
Since error messages are easier to interpret during CPU development,
it is useful to be able to test both CPU and GPU targets.
Compared to CPU development in Julia, the main constraint when writing GPU
kernels is that the kernel body must not perform heap allocations.
Seasoned Julia developers often already avoid allocating in inner loops
due to garbage collection costs.
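
To illustrate, here is a minimal KernelAbstractions.jl sketch (the kernel name and arguments are our own): the same kernel source runs on the CPU backend for debugging and on a CUDA backend for production, and its body performs no heap allocations:

```{julia}
#| eval: false
using KernelAbstractions

# One kernel definition targets both CPU and GPU backends.
# Note: no heap allocations inside the kernel body.
@kernel function saxpy!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] = a * x[i] + y[i]
end

# CPU run: easier to debug, clearer error messages
x = rand(Float32, 1024)
y = zeros(Float32, 1024)
saxpy!(CPU())(y, 2f0, x; ndrange = length(y))
KernelAbstractions.synchronize(CPU())

# GPU run: identical kernel; only the backend and array types change
# using CUDA
# xd, yd = CuArray(x), CUDA.zeros(Float32, 1024)
# saxpy!(CUDABackend())(yd, 2f0, xd; ndrange = length(yd))
# KernelAbstractions.synchronize(CUDABackend())
```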
For a concrete example of KernelAbstractions.jl in action,
see [these kernels](https://github.com/alexandrebouchard/sais-gpu/blob/main/kernels.jl)
used to implement [Sequential Annealed Importance Sampling](https://arxiv.org/pdf/2408.12057).