mxp: Mixed precision extension #294

Open
cmpfeil wants to merge 84 commits into main
Conversation

@cmpfeil (Contributor) commented Jan 23, 2025

Mixed-precision extension to gtensor

Provides the template functions mxp::adapt (and mxp::adapt_device), inspired by and extending gt::adapt.
An mxp::mxp_span (derived from gt::gtensor_span) is returned, enabling mixed-precision computations in gtensor kernels, i.e., computations where the compute precision differs from the data/storage precision.
From the user's perspective, simply replacing gt::adapt with mxp::adapt is sufficient.

Example

const std::vector<float> x(n, 1.0 / 8. / 1024. / 1024. / 3.); // = eps(float) / 3 
/* */ std::vector<float> y(n, 1.0);

auto gt_x = gt::adapt<1>(x.data(), n); 
auto gt_y = gt::adapt<1>(y.data(), n); 

gt_y = gt_y + gt_x + gt_x; // y remains unchanged, filled with 1.0 

// Still having the data in single precision, but doing computations in double instead
auto mxp_x = mxp::adapt<1, double>(x.data(), n); 
auto mxp_y = mxp::adapt<1, double>(y.data(), n); 

mxp_y = mxp_y + mxp_x + mxp_x; // y filled with 1.000000119
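
(Why these numbers: x = eps(float) / 3 ≈ 3.97e-8, so in single precision each 1.0 + x rounds straight back to 1.0 because x < eps(float) / 2, and y never changes. Accumulating in double keeps 1.0 + 2x, and converting that result back to float rounds it up to the nearest representable value 1 + eps(float) ≈ 1.000000119, since 2x ≈ 0.67 * eps(float) exceeds eps(float) / 2.)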

Features for this PR

  • implicit kernels
  • explicit kernels
  • kernels on complex data
  • implicit kernels using .view( ...placeholders... )
  • low precision emulation
  • GPU

Right now only one test and minimal source code are added, mainly with the purpose of opening the PR and clarifying the questions below.

Questions

  • Preferred way to call: via mxp::adapt<...>(...), gt::mxp::adapt<...>(...), or gt::mxp_adapt<...>(...)?
  • How to include: with #include <gtensor/gtensor.h> but only if the flag GTENSOR_ENABLE_MXP is set (like for FP16)? Or a separate explicit include via #include <gtensor/mxp.h>?
  • Location & naming of source files (there will be two more): just inside include/gtensor, all starting with mxp_?

@bd4 (Contributor) commented Jan 24, 2025

  1. I prefer gt::mxp_adapt(...) (see the sketch below).
  2. I prefer #include <gtensor/mxp.h>. The difference from FP16 is that it uses a different adapt function as the entry point, so a CMake/C define is not needed.
  3. I am fine with include/gtensor/mxp_*.h
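
For illustration, roughly how 1. and 2. would look together from the user side (a sketch only, assuming the template arguments of mxp::adapt from the PR description carry over unchanged):

#include <gtensor/mxp.h>                              // separate explicit include (2.)

auto mxp_x = gt::mxp_adapt<1, double>(x.data(), n);   // entry point named gt::mxp_adapt (1.)
auto mxp_y = gt::mxp_adapt<1, double>(y.data(), n);
mxp_y = mxp_y + mxp_x + mxp_x;                        // compute in double, store in float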

@germasch what do you think?

@cmpfeil (Author) commented Feb 6, 2025

Thanks for the recommendations; they have been incorporated now.
The pipeline passes for host, CUDA, and HIP, but I can't make much of the SYCL build error (and I don't have the environment, nor do I manage to reproduce it). Any ideas?

I think content-wise this PR is complete; it also includes a number of tests.
Ideas for what could still be added, but wasn't needed so far:
  • A method mxp_span<...>::as_gtensor_span() to be used, e.g., for gt::copy
  • operator[](shape_type) should probably also be redefined for mxp_span (but is it used for defining (explicit) kernels?)
  • Combination with the gt half-precision types would be another PR

Any input is more than welcome!

@bd4 (Contributor) commented Feb 6, 2025

I'll see if I can reproduce the SYCL CI error with the latest oneAPI toolkit. The non-debug build does pass.

Going forward, it is possible to run small GPU code on a computer with a Gen9 integrated GPU, and on Gen11 with the right env vars for FP64 emulation. This covers most Intel processors, including laptops.

@cmpfeil (Author) commented Feb 12, 2025

@bd4 did you manage to reproduce the error?

I don't manage to build even vanilla gtensor with SYCL, as the

#include "level_zero/ze_api.h"
#include "level_zero/zes_api.h"

in include/gtensor/backend_sycl_compat.h are not found. Maybe I need to set some environment variables?

@bd4 (Contributor) commented Feb 13, 2025

@bd4 did you manage to reproduce the error?

Sorry for the delay, I am able to reproduce with the latest 2025 oneAPI. It looks like it's related to the kernel name in SYCL: somehow an extra '"' ends up in the name, and that fails to compile. I'll see if I can wrap my head around what to change to fix it.

I don't manage to build even vanilla gtensor with SYCL, as the

#include "level_zero/ze_api.h"
#include "level_zero/zes_api.h"

in include/gtensor/backend_sycl_compat.h are not found. Maybe I need to set some environment variables?

Once Level Zero and oneAPI are installed, you need to do the following:

# Gen11 integrated gpu only
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1

# assuming default install path
source /opt/intel/oneapi/setvars.sh

@cmpfeil (Author) commented Feb 14, 2025

Ok, so I found the problem with the SYCL debug build:
It happens when one wants to emulate a mantissa length of 34 bits via gt::mxp_truncated_mantissa_t<double, 34>. The second template parameter of mxp_truncated_mantissa_t is a std::uint8_t, and according to cppreference

... std::uint8_t may be unsigned char, ...

Then, I guess, the value 34, corresponding to the character ", confuses the compiler when determining where the kernel name ends. Would you agree that this is a bug on the compiler side?

To check this assumption and to get the build passing for now, the last "debugging" commit 1da0506 simply skips the tests with template parameters higher than 30. Anyhow, to properly avoid it, it should be sufficient to use, e.g., std::uint16_t for the template parameter instead.
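
For illustration, the kind of change meant here (a sketch only; the parameter names are made up and the actual declaration in the PR may differ):

// Before: the bit count is a std::uint8_t, so the value 34 ends up in the generated
// SYCL kernel name as the character '"' (ASCII 34) and breaks the debug build.
//   template <typename T, std::uint8_t MantissaBits> struct mxp_truncated_mantissa_t;

// Workaround sketch: widen the non-type template parameter so it is emitted as a number.
template <typename T, std::uint16_t MantissaBits>
struct mxp_truncated_mantissa_t { /* emulate T with a MantissaBits-bit mantissa */ };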

Note that the pipeline still fails at the moment, but now this is due to some actual tests not passing with the SYCL build.

@bd4 (Contributor) commented Feb 14, 2025

Good find! I saw the quote in the error's template name, and it was not immediately obvious where it came from.

Then, I guess, the value 34, corresponding to the character ", confuses the compiler when determining where the kernel name ends. Would you agree that this is a bug on the compiler side?

I agree it seems like a bug, but I have been wrong before on this front. I'll work on a minimal reproducer, should be pretty straightforward.

To check this assumption and to get the build passing for now, the last "debugging" commit 1da0506 simply skips the tests with template parameters higher than 30. Anyhow, to properly avoid it, it should be sufficient to use, e.g., std::uint16_t for the template parameter instead.

Note that the pipeline still fails at the moment, but now this is due to some actual tests not passing with the SYCL build.

It looks like a lot of those tests fail with a 1 != 1 issue, where the type conversion is not happening in the same way as on other platforms.

@cmpfeil (Author) commented Feb 18, 2025

Good find! I saw the quote in the error's template name, and it was not immediately obvious where it came from.

Thanks - you gave me the necessary hint by pointing out the extra " character.

I agree it seems like a bug, but I have been wrong before on this front. I'll work on a minimal reproducer, should be pretty straightforward.

With your instructions above I finally managed to reproduce the SYCL debug build error locally. I didn't dig any deeper into why it only happens in the debug build, or why the error apparently only shows up for the device tests. But it is indeed very easy to reproduce outside of gtensor: take the example code from SYCL Basic Code: parallel_for and replace parallel_for with parallel_for<kname>, where

#include <cstdint>

template <std::uint8_t N> struct Something {};
using kname = Something<34>;

gives the corresponding compilation error via icpx -fsycl main.cxx (complete file below).

It looks like a lot of those tests fail with a 1 != 1 issue, where the type conversion is not happening in the same way as on other platforms.

Gonna have a look into this next.

// main.cxx
#include <CL/sycl.hpp>

// -------------------------------------------------------------------------- //
/* Example from https://www.intel.com/content/www/us/en/docs/sycl/introduction/
                  latest/03-sycl-basic-code-parallel-for.html [2025/02/18]
   Only modification: parallel_for --> parallel_for<kname>
   Error reproduced via
   $ icpx -fsycl main.cxx
   [ Intel(R) oneAPI DPC++/C++ Compiler 2025.0.0 (2025.0.0.20241008) ] */

#include <cstdint>
template <std::uint8_t N> struct Something {}; 
using kname = Something<34>;
// -------------------------------------------------------------------------- //

using namespace sycl;

int main() {

    queue Q;                        // The queue, Q, is the object that
                                    // submits the task to a device.
    int const size = 10; 
    buffer<int> A{ size };          // The buffer, A, is the memory used to
                                    // transfer data between host and device.

    Q.submit([&](handler& h) {      // The handler, h, is the object that contains
                                    // the parallel_for function to be used.

        accessor A_acc(A, h);       // The accessor, A_acc, is the object that
                                    // efficiently accesses the buffer elements.

        h.parallel_for<kname>(range<1>(size), [=](id<1> indx) {
            A_acc[indx] = 77; 
            });

        });

    host_accessor result(A);        // host_accessor is the object that allows
                                    // the host to access the buffer memory.

    for (int i = 0; i < size; i++)  // Print output
        std::cout << result[i] << " "; std::cout << "\n";
    return 0;

}

@cmpfeil (Author) commented Feb 18, 2025

It looks like a lot of those tests fail with a 1 != 1 issue, where the type conversion is not happening in the same way as on other platforms.

Turns out all of the failing tests were due to normal (non-mxp) gtensor kernels achieving more accuracy than expected with the SYCL build. The reason is that the default -fp-model=fast allows optimizations which are not value-safe on floating-point data. With any of the other options -fp-model={strict,precise,consistent} the tests also pass for the SYCL build.

So, obviously, testing the normal gtensor kernels for value-safety was a bit over-ambitious (I didn't take into account that value-unsafe optimizations might of course also improve the result). Hence, I will remove those non-mxp "reference tests" and keep only the tests for the mxp kernels. Then the pipeline should pass (I hope).
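
For reference, this is roughly how the value-safe behaviour can be requested when compiling the reproducer above directly with icpx (illustrative command line, not the project's CMake setup):

# the default -fp-model=fast permits value-unsafe floating-point optimizations
icpx -fsycl -fp-model=precise main.cxx   # strict or consistent work as well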

@cmpfeil marked this pull request as ready for review on February 20, 2025, 15:43
@cmpfeil (Author) commented Feb 20, 2025

I think content-wise this PR is complete, and now it also passes the pipeline.

Would you agree that this is a bug on the compiler side?

I reported it in an intel/llvm issue. Let's see what comes out of it.

Anyhow, to properly avoid it, it should be sufficient to use, e.g., std::uint16_t for the template parameter instead.

Did this to pass the pipeline.
