The DPC++ Compiler compiles C++ and SYCL* source files with code for both CPU and a wide range of compute accelerators such as GPU and FPGA.
- Prerequisites
- Build DPC++ toolchain
- Build DPC++ toolchain with libc++ library
- Build DPC++ toolchain with support for NVIDIA CUDA
- Build DPC++ toolchain with support for HIP AMD
- Build DPC++ toolchain with support for HIP NVIDIA
- Build DPC++ toolchain with support for ESIMD CPU Emulation
- Build DPC++ toolchain with support for runtime kernel fusion
- Build Doxygen documentation
- Deployment
- Use DPC++ toolchain
- C++ standard
- Known Issues and Limitations
- Find More
- git - Download
- cmake version 3.20 or later - Download
- python - Download
- ninja - Download
- C++ compiler
  - See LLVM's host compiler toolchain requirements
Alternatively, you can use a Docker image that has everything you need for building pre-installed:
docker run --name sycl_build -it -v /local/workspace/dir/:/src ghcr.io/intel/llvm/ubuntu2204_base /bin/bash
This command will start a terminal session, from which you can proceed with the instructions below. See Docker BKMs for more info on Docker commands.
Throughout this document DPCPP_HOME denotes the path to the local directory created as DPC++ workspace. It might be useful to create an environment variable with the same name.
Linux:
export DPCPP_HOME=~/sycl_workspace
mkdir $DPCPP_HOME
cd $DPCPP_HOME
git clone https://github.com/intel/llvm -b sycl
Windows (64-bit):
Open a developer command prompt using one of two methods:
- Open the Start menu and search for "x64 Native Tools Command Prompt for VS XXXX", where XXXX is the version of the installed Visual Studio.
- Press Win-R, type "cmd", press Enter, then run
"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat" x64
set DPCPP_HOME=%USERPROFILE%\sycl_workspace
mkdir %DPCPP_HOME%
cd %DPCPP_HOME%
git clone --config core.autocrlf=false https://github.com/intel/llvm -b sycl
The easiest way to get started is to use the buildbot configure and compile scripts.
If you want to configure CMake manually, the up-to-date reference for variables is in these files.
Linux:
python $DPCPP_HOME/llvm/buildbot/configure.py
python $DPCPP_HOME/llvm/buildbot/compile.py
Windows (64-bit):
python %DPCPP_HOME%\llvm\buildbot\configure.py
python %DPCPP_HOME%\llvm\buildbot\compile.py
You can use the following flags with configure.py (full list of available flags can be found by launching the script with --help):
- --werror -> treat warnings as errors when compiling LLVM
- --cuda -> use the CUDA backend (see Nvidia CUDA)
- --hip -> use the HIP backend (see HIP)
- --hip-platform -> select the platform used by the HIP backend, AMD or NVIDIA (see HIP AMD or see HIP NVIDIA)
- --enable-esimd-emulator -> enable ESIMD CPU emulation (see ESIMD CPU emulation)
- --enable-all-llvm-targets -> build compiler (but not a runtime) with all supported targets
- --shared-libs -> Build shared libraries
- -t -> Build type (Debug or Release)
- -o -> Path to build directory
- --cmake-gen -> Set build system type (e.g. --cmake-gen "Unix Makefiles")
You can use the following flags with compile.py (full list of available flags can be found by launching the script with --help):
- -o -> Path to build directory
- -t, --build-target -> Build target (e.g., clang or llvm-spirv). Default is deploy-sycl-toolchain
- -j, --build-parallelism -> Number of threads to use for compilation
Note that configure.py and compile.py do not share any flag data. If you configured your build to be placed in a non-default directory using the -o flag, you must pass the same flag and the same path to compile.py as well. This allows you, for example, to configure several different builds and then build only the one that is needed at the moment.
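As a minimal sketch of the point above, a small shell helper can take the build directory as a single argument so that both scripts always receive the same -o path. The function name dpcpp_build and the example directory are hypothetical; the flags are the ones listed above, and DPCPP_HOME is assumed to be set as described earlier.

```shell
# Hypothetical helper (not part of the DPC++ repository).
# Usage: dpcpp_build <build-dir> <Debug|Release>
# configure.py: -o is the build directory, -t is the build type.
# compile.py does not remember the configure step, so -o is repeated.
dpcpp_build() {
  build_dir="$1"
  build_type="$2"
  python "$DPCPP_HOME/llvm/buildbot/configure.py" -o "$build_dir" -t "$build_type" &&
    python "$DPCPP_HOME/llvm/buildbot/compile.py" -o "$build_dir"
}

# Example (commented out so that sourcing this file builds nothing):
# dpcpp_build "$DPCPP_HOME/build-debug" Debug
```

Keep in mind that -t means build type for configure.py but build target for compile.py, so the helper passes it only to the configure step.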
There is experimental support for building and linking DPC++ runtime with libc++ library instead of libstdc++. To enable it the following CMake options should be used.
Linux:
-DSYCL_USE_LIBCXX=ON \
-DSYCL_LIBCXX_INCLUDE_PATH=<path to libc++ headers> \
-DSYCL_LIBCXX_LIBRARY_PATH=<path to libc++ and libc++abi libraries>
You can also use the configure script to enable it:
python $DPCPP_HOME/llvm/buildbot/configure.py --use-libcxx \
--libcxx-include <path to libc++ headers> \
--libcxx-library <path to libc++ and libc++abi libraries>
python $DPCPP_HOME/llvm/buildbot/compile.py
To enable support for CUDA devices, follow the instructions for the Linux or Windows DPC++ toolchain, but add the --cuda flag to configure.py. Note that the CUDA backend has Windows support; Windows Subsystem for Linux (WSL) is not needed to build and run the CUDA backend.
Enabling this flag requires an installation of at least CUDA 10.2 on the system; refer to NVIDIA CUDA Installation Guide for Linux or NVIDIA CUDA Installation Guide for Windows.
Errors may occur if DPC++ is built with a toolkit version which is higher than the CUDA driver version. In order to check that the CUDA driver and toolkits match, use the CUDA executable deviceQuery which is usually found in $CUDA_INSTALL_DIR/cuda/extras/demo_suite/deviceQuery.
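The check described above can be scripted. The sketch below assumes that deviceQuery prints a line of the form "CUDA Driver Version / Runtime Version 11.7 / 11.6" and that the reported runtime version corresponds to the installed toolkit; both are assumptions about the tool's output rather than something this guide specifies. The function names are hypothetical, and sort -V requires GNU coreutils or a compatible sort.

```shell
# Hypothetical helper: read deviceQuery output on stdin and print
# "<driver version> <runtime version>" (assumed output format, see above).
cuda_versions() {
  sed -n 's|.*CUDA Driver Version / Runtime Version *\([0-9.]*\) */ *\([0-9.]*\).*|\1 \2|p'
}

# Hypothetical helper: succeed when the toolkit version ($2) is not
# higher than the driver version ($1).
toolkit_not_newer() {
  driver="$1"
  toolkit="$2"
  newest="$(printf '%s\n%s\n' "$driver" "$toolkit" | sort -V | tail -n 1)"
  [ "$newest" = "$driver" ]
}

# Example (commented out; needs a CUDA installation):
# set -- $("$CUDA_INSTALL_DIR/cuda/extras/demo_suite/deviceQuery" | cuda_versions)
# toolkit_not_newer "$1" "$2" || echo "CUDA toolkit is newer than the driver" >&2
```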
NOTE: An installation of at least CUDA 11.6 is recommended because there is a known issue with some math builtins when using -O1/O2/O3 Optimization options for CUDA toolkits prior to 11.6 (This is due to a bug in earlier versions of the CUDA toolkit: see this issue).
An installation of at least CUDA 11.0 is required to fully utilize Turing (SM 75) devices and to enable Ampere (SM 80) core features.
The CUDA backend should work on Windows or Linux operating systems with any GPU compatible with SM 50 or above. The default SM for the NVIDIA CUDA backend is 5.0. Users can specify lower values, but some features may not be supported.
Non-standard CUDA location:
If the CUDA toolkit is installed in a non-default location on your system, two considerations must be made.
Firstly, do not add the toolkit to your standard environment variables (PATH, LD_LIBRARY_PATH), as to do so will create conflicts with OpenCL headers.
Secondly, set the CUDA_LIB_PATH environment variable and pass the CMake variable CUDA_TOOLKIT_ROOT_DIR as follows:
CUDA_LIB_PATH=/path/to/cuda/toolkit/lib64/stubs CC=gcc CXX=g++ python $DPCPP_HOME/llvm/buildbot/configure.py --cuda --cmake-opt="-DCUDA_TOOLKIT_ROOT_DIR=/path/to/cuda/toolkit"
CUDA_LIB_PATH=/path/to/cuda/toolkit/lib64/stubs CC=gcc CXX=g++ python $DPCPP_HOME/llvm/buildbot/compile.py
$DPCPP_HOME/llvm/build/bin/clang++ -std=c++17 -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda --cuda-path=/path/to/cuda/toolkit *.cpp -o a.out
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$DPCPP_HOME/llvm/build/lib ./a.out
There is experimental support for DPC++ for HIP on AMD devices. Because this is still experimental and there is no continuous integration for it yet, there are no guarantees for supported platforms or configurations.
To enable support for HIP devices, follow the instructions for the Linux DPC++ toolchain, but add the --hip flag to configure.py.
Enabling this flag requires an installation of ROCm on the system; for instructions on how to install it, refer to AMD ROCm Installation Guide for Linux.
The DPC++ build assumes that ROCm is installed in /opt/rocm. If it is installed somewhere else, the directory must be provided through the CMake variable SYCL_BUILD_PI_HIP_ROCM_DIR, which can be passed using the --cmake-opt option of configure.py as follows:
python $DPCPP_HOME/llvm/buildbot/configure.py --hip \
--cmake-opt=-DSYCL_BUILD_PI_HIP_ROCM_DIR=/usr/local/rocm
Currently, this has only been tried on Linux, with ROCm 4.2.0 or 4.3.0 and using the MI50 (gfx906) and MI100 (gfx908) devices.
LLD is necessary for the AMDGPU compilation chain. The AMDGPU backend generates a standard ELF relocatable code object that can be linked by lld to produce a standard ELF shared code object which can be loaded and executed on an AMDGPU target. The LLD project is enabled by default when configuring for HIP. For more details on building LLD refer to LLD Build Guide.
There is experimental support for DPC++ for HIP on NVIDIA devices. Because this is still experimental and there is no continuous integration for it yet, there are no guarantees for supported platforms or configurations.
This is a compatibility feature and the CUDA backend should be preferred to run on NVIDIA GPUs.
To enable support for HIP NVIDIA devices, follow the instructions for the Linux DPC++ toolchain, but add the --hip and --hip-platform NVIDIA flags to configure.py.
Enabling this flag requires HIP to be installed, more specifically HIP NVCC, as well as CUDA to be installed, see NVIDIA CUDA Installation Guide for Linux.
Currently, this has only been tried on Linux, with ROCm 4.2.0 or 4.3.0, with CUDA 11, and using a GeForce 1060 device.
There is experimental support in DPC++ for ESIMD CPU emulation.
This feature supports ESIMD CPU emulation using the CM_EMU library from the CM Emulation project. The library package is generated from source code downloaded from its open source project and installed in your deploy directory during the toolchain build.
To enable support for ESIMD CPU emulation, follow the instructions for the Linux DPC++ toolchain, but add the --enable-esimd-emulator flag.
Enabling this flag requires the following packages to be installed.
- Ubuntu 22.04
  - libva-dev / 2.7.0-2
  - libffi-dev / 3.3-4
  - libtool
- RHEL 8.*
  - libffi
  - libffi-devel
  - libva
  - libva-devel
Currently, this feature has been tested and verified on an Ubuntu 22.04 environment.
Support for the experimental SYCL extension for user-driven kernel fusion at runtime is enabled by default.
To disable support for this feature, follow the instructions for the Linux DPC++ toolchain, but add the --disable-fusion flag.
Kernel fusion is not yet supported on the Windows platform.
Building Doxygen documentation is similar to building the product itself. First, the following tools need to be installed:
- doxygen
- graphviz
- sphinx
Then you'll need to add the following options to your CMake configuration command:
-DLLVM_ENABLE_DOXYGEN=ON
After the CMake cache is generated, build the documentation with the doxygen-sycl target. The output is placed in the $DPCPP_HOME/llvm/build/tools/sycl/doc/html directory.
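With the buildbot scripts, the two steps above might look like the following sketch. It assumes the --cmake-opt option of configure.py and the -t (build target) option of compile.py that appear earlier in this guide; the function name build_sycl_docs is hypothetical.

```shell
# Hypothetical helper: configure with Doxygen enabled, then build the
# documentation target. DPCPP_HOME is assumed to be set.
build_sycl_docs() {
  python "$DPCPP_HOME/llvm/buildbot/configure.py" --cmake-opt=-DLLVM_ENABLE_DOXYGEN=ON &&
    python "$DPCPP_HOME/llvm/buildbot/compile.py" -t doxygen-sycl
}

# Example (commented out; needs doxygen, graphviz and sphinx installed):
# build_sycl_docs
```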
TODO: add instructions how to deploy built DPC++ toolchain.
To run DPC++ applications on OpenCL devices, OpenCL implementation(s) must be present in the system.
To run DPC++ applications on Level Zero devices, Level Zero implementation(s) must be present in the system. You can find the link to the Level Zero spec in the following section Find More.
The Level Zero RT for GPU, OpenCL RT for GPU, OpenCL RT for CPU, FPGA emulation RT and TBB runtime which are needed to run DPC++ application on Intel GPU or Intel CPU devices can be downloaded using links in the dependency configuration file and installed following the instructions below. The same versions are used in PR testing.
Linux:
- Extract the archive. For example, for the archives oclcpuexp_<cpu_version>.tar.gz and fpgaemu_<fpga_version>.tar.gz you would run the following commands:
  # Extract OpenCL FPGA emulation RT
  mkdir -p /opt/intel/oclfpgaemu_<fpga_version>
  cd /opt/intel/oclfpgaemu_<fpga_version>
  tar zxvf fpgaemu_<fpga_version>.tar.gz
  # Extract OpenCL CPU RT
  mkdir -p /opt/intel/oclcpuexp_<cpu_version>
  cd /opt/intel/oclcpuexp_<cpu_version>
  tar -zxvf oclcpu_rt_<cpu_version>.tar.gz
- Create ICD file pointing to the new runtime (requires root access)
  # OpenCL FPGA emulation RT
  echo /opt/intel/oclfpgaemu_<fpga_version>/x64/libintelocl_emu.so > /etc/OpenCL/vendors/intel_fpgaemu.icd
  # OpenCL CPU RT
  echo /opt/intel/oclcpuexp_<cpu_version>/x64/libintelocl.so > /etc/OpenCL/vendors/intel_expcpu.icd
- Extract or build TBB libraries using links in the dependency configuration file. For example, for the archive oneapi-tbb-<tbb_version>-lin.tgz:
  mkdir -p /opt/intel
  cd /opt/intel
  tar -zxvf oneapi-tbb*lin.tgz
- Copy files from or create symbolic links to TBB libraries in OpenCL RT folder:
  # OpenCL FPGA emulation RT
  ln -s /opt/intel/oneapi-tbb-<tbb_version>/lib/intel64/gcc4.8/libtbb.so /opt/intel/oclfpgaemu_<fpga_version>/x64
  ln -s /opt/intel/oneapi-tbb-<tbb_version>/lib/intel64/gcc4.8/libtbbmalloc.so /opt/intel/oclfpgaemu_<fpga_version>/x64
  ln -s /opt/intel/oneapi-tbb-<tbb_version>/lib/intel64/gcc4.8/libtbb.so.12 /opt/intel/oclfpgaemu_<fpga_version>/x64
  ln -s /opt/intel/oneapi-tbb-<tbb_version>/lib/intel64/gcc4.8/libtbbmalloc.so.2 /opt/intel/oclfpgaemu_<fpga_version>/x64
  # OpenCL CPU RT
  ln -s /opt/intel/oneapi-tbb-<tbb_version>/lib/intel64/gcc4.8/libtbb.so /opt/intel/oclcpuexp_<cpu_version>/x64
  ln -s /opt/intel/oneapi-tbb-<tbb_version>/lib/intel64/gcc4.8/libtbbmalloc.so /opt/intel/oclcpuexp_<cpu_version>/x64
  ln -s /opt/intel/oneapi-tbb-<tbb_version>/lib/intel64/gcc4.8/libtbb.so.12 /opt/intel/oclcpuexp_<cpu_version>/x64
  ln -s /opt/intel/oneapi-tbb-<tbb_version>/lib/intel64/gcc4.8/libtbbmalloc.so.2 /opt/intel/oclcpuexp_<cpu_version>/x64
- Configure library paths (requires root access)
  echo /opt/intel/oclfpgaemu_<fpga_version>/x64 > /etc/ld.so.conf.d/libintelopenclexp.conf
  echo /opt/intel/oclcpuexp_<cpu_version>/x64 >> /etc/ld.so.conf.d/libintelopenclexp.conf
  ldconfig -f /etc/ld.so.conf.d/libintelopenclexp.conf
Windows (64-bit):
- If you need the OpenCL runtime for Intel GPU as well, then update/install it first. Do it before installing the OpenCL runtime for Intel CPU, as the installer of the OpenCL runtime for Intel GPU may re-write some important files or settings and make an existing OpenCL runtime for Intel CPU stop working properly.
- Extract the archive with the OpenCL runtime for Intel CPU and/or for Intel FPGA emulation using links in the dependency configuration file. For example, to c:\oclcpu_rt_<cpu_version>.
- Extract the archive with the TBB runtime or build it from sources using links in the dependency configuration file. For example, to c:\oneapi-tbb-<tbb_version>.
- Run Command Prompt as Administrator. To do that, click the Start button, type Command Prompt, right-click it, then click Run As Administrator, then click Yes to confirm.
- In the opened window run install.bat provided with the extracted files to install the runtime to the system and set up environment variables. So, if the extracted files are in the c:\oclcpu_rt_<cpu_version>\ folder, then type the command:
  # Install OpenCL FPGA emulation RT
  # Answer Y to clean previous OCL_ICD_FILENAMES configuration and ICD records cleanup
  c:\oclfpga_rt_<fpga_version>\install.bat c:\oneapi-tbb-<tbb_version>\redist\intel64\vc14
  # Install OpenCL CPU RT
  # Answer N for ICD records cleanup
  c:\oclcpu_rt_<cpu_version>\install.bat c:\oneapi-tbb-<tbb_version>\redist\intel64\vc14
Ahead of time compilation requires an ahead of time compiler available in PATH. There is an AOT compiler for each device type:
- GPU, Level Zero and OpenCL runtimes are supported,
- CPU, OpenCL runtime is supported,
- Accelerator (FPGA or FPGA emulation), OpenCL runtime is supported.
- Linux
  There are two ways to obtain the GPU AOT compiler ocloc:
  - (Ubuntu) Download and install intel-ocloc_***.deb package from intel/compute-runtime releases. This package should have the same version as Level Zero / OpenCL GPU runtimes installed on the system.
  - (other distros) ocloc is a part of Intel® software packages for general purpose GPU capabilities.
- Windows
  - GPU AOT compiler ocloc is a part of Intel® oneAPI Base Toolkit (Intel® oneAPI DPC++/C++ Compiler component). Make sure that the following path to the ocloc binary is available in the PATH environment variable: <oneAPI installation location>/compiler/<version>/windows/lib/ocloc
- CPU AOT compiler opencl-aot is enabled by default. For more, see opencl-aot documentation.
- Accelerator AOT compiler aoc is a part of Intel® oneAPI Base Toolkit (Intel® oneAPI DPC++/C++ Compiler component). Make sure that these binaries are available in the PATH environment variable:
  - aoc from <oneAPI installation location>/compiler/<version>/<OS>/lib/oclfpga/bin
  - aocl-ioc64 from <oneAPI installation location>/compiler/<version>/<OS>/bin
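Since each AOT compiler has to be reachable through PATH, a quick check before building can save a failed compile. The sketch below relies on the POSIX command -v builtin; the function name missing_tools is hypothetical, and the tool names in the example are the ones listed above.

```shell
# Hypothetical helper: print every given tool that is not found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Example (commented out): list which AOT compilers are not on PATH.
# missing_tools ocloc opencl-aot aoc aocl-ioc64
```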
To verify that the built DPC++ toolchain is working correctly, run:
Linux:
python $DPCPP_HOME/llvm/buildbot/check.py
Windows (64-bit):
python %DPCPP_HOME%\llvm\buildbot\check.py
Make sure that psutil package is installed. If no OpenCL GPU/CPU runtimes are available, the corresponding tests are skipped.
If CUDA support has been built, it is tested only if there are CUDA devices available.
If testing with HIP for AMD, the lit tests will use gfx906 as the default architecture. It is possible to change it by adding -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=<target> to the CMake variable SYCL_CLANG_EXTRA_FLAGS.
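One way to set that variable is through the --cmake-opt option of configure.py used elsewhere in this guide. The following sketch only builds the option string for a given architecture; the function name hip_lit_arch_opt is hypothetical, and gfx908 is just an example value.

```shell
# Hypothetical helper: print the CMake definition that makes the lit tests
# target the given AMD architecture (for example gfx908).
hip_lit_arch_opt() {
  printf '%s' "-DSYCL_CLANG_EXTRA_FLAGS=-Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=$1"
}

# Example (commented out; the quotes keep the space inside one argument):
# python "$DPCPP_HOME/llvm/buildbot/configure.py" --hip --cmake-opt="$(hip_lit_arch_opt gfx908)"
```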
Follow instructions from the link below to build and run tests: README
Khronos* SYCL* conformance test suite (CTS) is intended to validate implementation conformance to the Khronos* SYCL* specification. The DPC++ compiler is expected to pass a significant number of tests, and it keeps improving.
Follow the Khronos* SYCL* CTS instructions from the README file to obtain the test sources and to learn how to build and execute the tests.
A simple DPC++ or SYCL* program consists of the following parts:
- Header section
- Allocating buffer for data
- Creating SYCL queue
- Submitting command group to SYCL queue which includes the kernel
- Waiting for the queue to complete the work
- Using a host accessor to retrieve the result from the device and verify the data
- The end
Create a file simple-sycl-app.cpp with the following C++/SYCL code:
#include <sycl/sycl.hpp>
#include <iostream>
int main() {
  // Creating buffer of 4 elements to be used inside the kernel code
  sycl::buffer<size_t, 1> Buffer(4);
  // Creating SYCL queue
  sycl::queue Queue;
  // Size of index space for kernel
  sycl::range<1> NumOfWorkItems{Buffer.size()};
  // Submitting command group(work) to queue
  Queue.submit([&](sycl::handler &cgh) {
    // Getting write only access to the buffer on a device.
    sycl::accessor Accessor{Buffer, cgh, sycl::write_only};
    // Executing kernel
    cgh.parallel_for<class FillBuffer>(
        NumOfWorkItems, [=](sycl::id<1> WIid) {
          // Fill buffer with indexes.
          Accessor[WIid] = WIid.get(0);
        });
  });
  // Getting read only access to the buffer on the host.
  // Implicit barrier waiting for queue to complete the work.
  sycl::host_accessor HostAccessor{Buffer, sycl::read_only};
  // Check the results
  bool MismatchFound = false;
  for (size_t I = 0; I < Buffer.size(); ++I) {
    if (HostAccessor[I] != I) {
      std::cout << "The result is incorrect for element: " << I
                << " , expected: " << I << " , got: " << HostAccessor[I]
                << std::endl;
      MismatchFound = true;
    }
  }
  if (!MismatchFound) {
    std::cout << "The results are correct!" << std::endl;
  }
  return MismatchFound;
}
To build simple-sycl-app, add bin and lib to the search paths:
Linux:
export PATH=$DPCPP_HOME/llvm/build/bin:$PATH
export LD_LIBRARY_PATH=$DPCPP_HOME/llvm/build/lib:$LD_LIBRARY_PATH
Windows (64-bit):
set PATH=%DPCPP_HOME%\llvm\build\bin;%PATH%
set LIB=%DPCPP_HOME%\llvm\build\lib;%LIB%
and run the following command:
clang++ -fsycl simple-sycl-app.cpp -o simple-sycl-app.exe
When building for CUDA or HIP NVIDIA, use the CUDA target triple as follows:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
simple-sycl-app.cpp -o simple-sycl-app-cuda.exe
When building for HIP AMD, use the AMD target triple and specify the target architecture with -Xsycl-target-backend --offload-arch=<arch> as follows:
clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa \
-Xsycl-target-backend --offload-arch=gfx906 \
simple-sycl-app.cpp -o simple-sycl-app-amd.exe
The target architecture may also be specified for the CUDA backend, with -Xsycl-target-backend --cuda-gpu-arch=<arch>. Specifying the architecture is necessary if an application aims to use newer hardware features, such as native atomic operations or tensor core operations.
Moreover, it is possible to pass specific options to CUDA ptxas (such as --maxrregcount=<n> for limiting the register usage or --verbose for printing generation statistics) using the -Xcuda-ptxas flag.
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
simple-sycl-app.cpp -o simple-sycl-app-cuda.exe \
-Xcuda-ptxas --maxrregcount=128 -Xcuda-ptxas --verbose \
-Xsycl-target-backend --cuda-gpu-arch=sm_80
To build simple-sycl-app ahead of time for GPU, CPU or Accelerator devices, specify the target architecture. The examples provided use a supported alias for the target, representing a full triple. Additional details can be found in the Users Manual.
- -fsycl-targets=spir64_gen for GPU,
- -fsycl-targets=spir64_x86_64 for CPU,
- -fsycl-targets=spir64_fpga for Accelerator.
Multiple target architectures are supported.
E.g., this command builds simple-sycl-app for GPU and CPU devices in ahead of time mode:
clang++ -fsycl -fsycl-targets=spir64_gen,spir64_x86_64 simple-sycl-app.cpp -o simple-sycl-app-aot.exe
Additionally, the user can pass specific options of the AOT compiler to the DPC++ compiler using the -Xsycl-target-backend option; see Device code formats for more. To find available options, execute:
- ocloc compile --help for GPU,
- opencl-aot --help for CPU,
- aoc -help -sycl for Accelerator.
The simple-sycl-app.exe application doesn't specify a SYCL device for execution, so the SYCL runtime will use default_selector logic to select one of the accelerators available in the system. In this case, the behavior of the default_selector can be altered using the ONEAPI_DEVICE_SELECTOR environment variable: setting cuda:* forces the usage of the CUDA backend (if available), hip:* forces the usage of the HIP backend (if available), and opencl:* forces the usage of the OpenCL backend.
ONEAPI_DEVICE_SELECTOR=cuda:* ./simple-sycl-app-cuda.exe
The default is the OpenCL backend if available.
NOTE: nvptx64-nvidia-cuda is usable with -fsycl-targets if clang was built with the cmake option SYCL_ENABLE_PLUGINS=cuda.
Linux & Windows (64-bit):
./simple-sycl-app.exe
The results are correct!
NOTE: Currently, when the application has been built with the CUDA target, the CUDA backend must be selected at runtime using the ONEAPI_DEVICE_SELECTOR environment variable.
ONEAPI_DEVICE_SELECTOR=cuda:* ./simple-sycl-app-cuda.exe
NOTE: DPC++/SYCL developers can specify the SYCL device for execution using device selectors (e.g. sycl::cpu_selector_v, sycl::gpu_selector_v, Intel FPGA selector(s)) as explained in the following section Code the program for a specific GPU.
DPC++ applications can be built with CMake by simply using DPC++ as the C++ compiler and by adding the SYCL specific flags. For example, assuming clang++ is on the PATH, a minimal CMakeLists.txt file for the sample above would be:
# Modifying the compiler should be done before the project line
set(CMAKE_CXX_COMPILER "clang++")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl")
project(simple-sycl-app LANGUAGES CXX)
add_executable(simple-sycl-app simple-sycl-app.cpp)
NOTE: compiling SYCL programs requires passing the SYCL flags to clang++ for both the compilation and linking stages, so using add_compile_options to pass the SYCL flags is not enough on its own. They should also be passed to add_link_options, or more simply the SYCL flags can just be added to CMAKE_CXX_FLAGS.
NOTE: When linking a SYCL application, clang++ will implicitly link it against libsycl.so, so there is no need to add -lsycl to target_link_libraries in the CMake.
To assist in finding a specific SYCL compatible device out of all that may be available, a "device selector" may be used. A "device selector" is a ranking function (C++ Callable) that will give an integer ranking value to all the devices on the system. It can be passed to sycl::queue, sycl::device and sycl::platform constructors. The highest ranking device is then selected. SYCL has built-in device selectors for selecting a generic GPU, CPU, or accelerator device, as well as one for a default device. Additionally, a user can define their own as a function, lambda, or functor class. A device selector that returns a negative value "rejects" a device, ensuring it is not selected; among devices scored 0 or higher, the one with the highest score is selected, with ties resolved by an internal algorithm (see Section 4.6.1 of the SYCL 2020 specification).
The example below illustrates how to use a device selector to create device and queue objects bound to an Intel GPU device:
#include <sycl/sycl.hpp>
#include <iostream>
#include <string>
int main() {
  // Rank devices: accept only GPUs whose name contains "HD Graphics NEO".
  auto NEOGPUDeviceSelector = [](const sycl::device &Device) {
    using namespace sycl::info;
    const std::string DeviceName = Device.get_info<device::name>();
    bool match = Device.is_gpu() &&
                 (DeviceName.find("HD Graphics NEO") != std::string::npos);
    return match ? 1 : -1;
  };
  try {
    sycl::queue Queue(NEOGPUDeviceSelector);
    sycl::device Device(NEOGPUDeviceSelector);
  } catch (sycl::exception &E) {
    // Thrown when no device gets a non-negative score.
    std::cout << E.what() << std::endl;
  }
}
The device selector below selects an NVIDIA device only, and won't execute if there is none.
int CUDASelector(const sycl::device &Device) {
  using namespace sycl::info;
  const std::string DriverVersion = Device.get_info<device::driver_version>();
  if (Device.is_gpu() && (DriverVersion.find("CUDA") != std::string::npos)) {
    std::cout << " CUDA device found " << std::endl;
    return 1;
  }
  return -1;
}
Currently, the DPC++ toolchain relies on having a recent OpenCL implementation on the system in order to link applications to the DPC++ runtime. The OpenCL implementation is not used at runtime if only the CUDA backend is used in the application, but must be installed.
The OpenCL implementation provided by the CUDA SDK is OpenCL 1.2, which is too old to link with the DPC++ runtime and lacks some symbols.
We recommend installing the low level CPU runtime, following the instructions in the next section.
Instead of installing the low level CPU runtime, it is possible to build and install the Khronos ICD loader, which contains all the symbols required.
- DPC++ runtime and headers require C++17 at least.
- DPC++ compiler builds apps as C++17 apps by default. Higher versions of standard are supported as well.
- DPC++ device compiler fails if the same kernel was used in different translation units.
- SYCL 2020 support work is in progress.
- 32-bit host/target is not supported.
- DPC++ works only with OpenCL low level runtimes which support out-of-order queues.
- On Windows linking DPC++ applications with the /MTd flag is known to cause crashes.
- Backend is only supported on Linux
- The only combination tested is Ubuntu 22.04 with CUDA 11.7 using a Titan RTX GPU (SM 75), but it should work on any GPU compatible with SM 50 or above
- The NVIDIA OpenCL headers conflict with the OpenCL headers required for this project and may cause compilation issues on some platforms
- sycl::sqrt is not correctly rounded by default as the SYCL specification allows lower precision. When porting from CUDA it may be helpful to use -fsycl-fp32-prec-sqrt to use the correctly rounded square root; this is significantly slower but matches the default precision used by nvcc. This clang++ flag is equivalent to the nvcc -prec-sqrt flag, except that it defaults to false.
- No Opt (O0) uses the IPSCCP compiler pass by default, although the IPSCCP pass can be switched off at O0 using the -mllvm -use-ipsccp-nvptx-O0=false flag at the user's discretion. The reason that the IPSCCP pass is used by default even at O0 is that there is currently an unresolved issue with the nvvm-reflect compiler pass: this pass is used to pick the correct branches depending on the SM version, which can be optionally specified by the --cuda-gpu-arch flag. If the arch flag is not specified by the user, the default value, SM 50, is used. Without the execution of the IPSCCP pass at -O0 when using a low SM version, dead instructions which require a higher SM version can remain. Since corresponding issues occur in other backends, future work will aim for a universal solution to these issues.
- Requires a ROCm compatible operating system; for full details of the operating systems supported by ROCm, please refer to the ROCm Supported Operating Systems.
- Has only been tried with ROCm 4.2.0 and 4.3.0.
- Has only been tested using the MI50 (gfx906) and MI100 (gfx908) devices.
- Support is still experimental, so not all of the tests are currently passing and many of the built-in functions are not yet implemented.
- Additionally, there is no continuous integration yet, so no guarantee can be made for supported platforms or configurations.
- Global offsets are currently not supported.
*Other names and brands may be claimed as the property of others.