Replies: 3 comments
-
Another idea is to simplify the SYCL headers for the device compilation pass. I've just run the following experiment: I compiled the handler_copy_core.cpp file standalone with
The modified version simply contains the whole body of all
The downside of this approach is that it may ultimately lead to creating two versions of the SYCL headers: one optimised for host compilation and another one optimised for device compilation, which carries an extra maintenance cost.
-
Level Zero v1.3 or higher supports dynamic module linking. This means the DPC++ compiler no longer needs to provide a fully linked module. For optimization purposes it's better to know the import/export dependencies between the translation units, but we can save a lot of link time if we simply skip linking the device code into a single module. This can be an intermediate step towards using thinLTO for device code linking.
-
I noticed we spend a lot of time instantiating all the data storage types for the host types (vector, bitset, ...). We could reduce compilation time a lot by either hiding the implementation of the host API types behind
-
Reduce DPC++ headers code size:
- Make "work-group size rounding" opt-in, i.e. enabled only when explicitly requested by the user. This feature speeds up execution of work-size ranges which are not divisible by the "recommended" work-group size. Today this feature is enabled by default and impacts all applications, i.e. even applications which do not require work-size rounding. From my perspective this sounds like a workaround for a performance bug in the Intel GPU driver, which had better be off by default.
- Provide specialized headers. Today a single umbrella header provides all SYCL APIs. In addition to that, we can provide specialized headers which include a sub-set of the functionality and can be parsed much faster. I expect a typical SYCL application uses a limited set of SYCL features, not all of them. There should be some way to split the functionality into multiple headers (e.g. <math-builtins>, <vec-type>, <reductions>, <spec_constants>, <marray>, <assert-extension>, etc.)
- Move more functionality into the runtime library. Templated APIs such as reductions must be implemented in headers and we have to parse them. At the same time, we can provide specializations for some frequently used types in the runtime library (e.g. reductions over buffers of floats).
Today we link all available device code into a single module and (maybe) split it into multiple modules. This is done using LLVM means, which consume a lot of memory and work very slowly. We could apply a thinLTO approach and import only the required dependencies into each TU. This enables parallel execution, requires significantly less memory and removes unnecessary steps for particular code-split strategies (e.g. "split per source").
Research whether the compilation phases can be unified for all targets. For instance, have a single parser, sema and IR-gen, and perform the separation at the LLVM IR level (before or after lowering the AST to LLVM IR).
Any other ideas are welcome.