Replies: 3 comments
-
Another idea is to simplify the SYCL headers for the device compilation pass. I've just run the following experiment: I compiled the handler_copy_core.cpp file standalone with
The modified version simply contains the whole body of all
The downside of this approach is that it may ultimately lead to creating two versions of the SYCL headers: one optimised for host compilation and another one optimised for device compilation, which carries an extra maintenance cost.
-
Level Zero v1.3 or higher supports dynamic module linking. This means the DPC++ compiler no longer needs to provide a fully linked module. For optimization purposes it's better to know the import/export dependencies between the translation units, but we can save a lot of link time if we simply skip linking the device code into a single module. This can be an intermediate step towards using thinLTO for device code linking.
-
I noticed we spend a lot of time instantiating all the data storage types for the host types (vector, bitset, ...). We could reduce compilation time a lot by either hiding the implementation of the host API types behind
-
Reduce DPC++ headers code size:
- Make "work-group size rounding" opt-in, i.e. enabled only when explicitly requested by the user. This feature speeds up execution of work-size ranges which are not divisible by the "recommended" work-group size. Today this feature is enabled by default and impacts all applications, i.e. even applications which do not require work-size rounding. From my perspective this sounds like a workaround for a performance bug in the Intel GPU driver, which had better be off by default.
- Provide specialized headers. Today a single umbrella header provides all SYCL APIs. In addition to that, we can provide specialized headers which include a sub-set of the functionality and can be parsed much faster. I expect a typical SYCL application uses a limited set of SYCL features, not all of them. There should be some way to split the functionality into multiple headers (e.g. <math-builtins>, <vec-type>, <reductions>, <spec_constants>, <marray>, <assert-extension>, etc.)
- Move more functionality into the runtime library. Templated APIs such as reductions must be implemented in headers and we have to parse them. At the same time, we can provide specializations for some frequently used types in the runtime library (e.g. reductions over buffers of floats).
Today we link all available device code into a single module and (maybe) split it into multiple modules. This is done using LLVM means, which consume a lot of memory and work very slowly. We could apply a thinLTO approach and import only the required dependencies into each TU. This enables parallel execution, requires significantly less memory and removes unnecessary steps for particular code-split strategies (e.g. "split per source").
Research whether the compilation phases can be unified for all targets. For instance, have a single parser, sema and IR-gen, and perform the separation at the LLVM IR level (before or after lowering the AST to LLVM IR).
Any other ideas are welcome.