mxp: Mixed precision extension #294

Open
cmpfeil wants to merge 84 commits into main
Conversation

@cmpfeil (Contributor) commented Jan 23, 2025

Mixed-precision extension to gtensor

Provides the template functions mxp::adapt (and mxp::adapt_device), inspired by and extending gt::adapt.
An mxp::mxp_span (derived from gt::gtensor_span) is returned, enabling mixed-precision computations in gtensor kernels, i.e., computations where the compute precision differs from the data/storage precision.
From the user's perspective, simply replacing gt::adapt with mxp::adapt is sufficient.

Example

const std::vector<float> x(n, 1.0 / 8. / 1024. / 1024. / 3.); // = eps(float) / 3 
/* */ std::vector<float> y(n, 1.0);

auto gt_x = gt::adapt<1>(x.data(), n); 
auto gt_y = gt::adapt<1>(y.data(), n); 

gt_y = gt_y + gt_x + gt_x; // y remains unchanged, filled with 1.0 

// Still having the data in single precision, but doing computations in double instead
auto mxp_x = mxp::adapt<1, double>(x.data(), n); 
auto mxp_y = mxp::adapt<1, double>(y.data(), n); 

mxp_y = mxp_y + mxp_x + mxp_x; // y filled with 1.000000119
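
(Why these numbers: x = eps(float) / 3 ≈ 3.97e-8, so in single precision each 1.0 + x rounds straight back to 1.0 because x < eps(float) / 2, and y never changes. Accumulating in double keeps 1.0 + 2x, and converting that result back to float rounds it up to the nearest representable value 1 + eps(float) ≈ 1.000000119, since 2x ≈ 0.67 * eps(float) exceeds eps(float) / 2.)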

Features for this PR

  • implicit kernels
  • explicit kernels
  • kernels on complex data
  • implicit kernels using .view( ...placeholders... )
  • low precision emulation
  • GPU

Right now only one test and minimal source code are added, mainly with the purpose of opening the PR and clarifying the questions below.

Questions

  • Preferred way to call: via mxp::adapt<...>(...), gt::mxp::adapt<...>(...), or gt::mxp_adapt<...>(...)?
  • How to include: with #include <gtensor/gtensor.h> but only if the flag GTENSOR_ENABLE_MXP is set (like for FP16)? Or a separate explicit include via #include <gtensor/mxp.h>?
  • Location & naming of source files (there will be two more): just inside include/gtensor, all starting with mxp_?

@bd4 (Contributor) commented Jan 24, 2025

  1. I prefer gt::mxp_adapt(...) (see the sketch below).
  2. I prefer #include <gtensor/mxp.h>. The difference from FP16 is that it uses a different adapt function as the entry point, so a CMake/C define is not needed.
  3. I am fine with include/gtensor/mxp_*.h
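
For illustration, roughly how 1. and 2. would look together from the user side (a sketch only, assuming the template arguments of mxp::adapt from the PR description carry over unchanged):

#include <gtensor/mxp.h>                              // separate explicit include (2.)

auto mxp_x = gt::mxp_adapt<1, double>(x.data(), n);   // entry point named gt::mxp_adapt (1.)
auto mxp_y = gt::mxp_adapt<1, double>(y.data(), n);
mxp_y = mxp_y + mxp_x + mxp_x;                        // compute in double, store in float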

@germasch what do you think?

@cmpfeil (Author) commented Feb 6, 2025

Thanks for the recommendations; they have been incorporated now.
The pipeline passes for host, CUDA, and HIP, but I can't make much of the SYCL build error (and I don't have the environment, nor do I manage to reproduce it). Any ideas?

I think content-wise this PR is complete; it also includes a number of tests.
Ideas for what could still be added, but wasn't needed so far:
  • A method mxp_span<...>::as_gtensor_span() to be used, e.g., for gt::copy
  • operator[](shape_type) should probably also be redefined for mxp_span (but is it used for defining (explicit) kernels?)
  • Combination with the gt half-precision types would be another PR

Any input is more than welcome!

@bd4 (Contributor) commented Feb 6, 2025

I'll see if I can reproduce the SYCL CI error with the latest oneAPI toolkit. The non-debug build does pass.

Going forward, it is possible to run small GPU code on a computer with a Gen9 integrated GPU, and on Gen11 with the right env vars for FP64 emulation. This covers most Intel processors, including laptops.

@cmpfeil (Author) commented Feb 12, 2025

@bd4 did you manage to reproduce the error?

I don't manage to build even vanilla gtensor with SYCL, as the

#include "level_zero/ze_api.h"
#include "level_zero/zes_api.h"

in include/gtensor/backend_sycl_compat.h are not found. Maybe I need to set some environment variables?

@bd4 (Contributor) commented Feb 13, 2025

@bd4 did you manage to reproduce the error?

Sorry for the delay, I am able to reproduce with the latest 2025 oneAPI. It looks like it's related to the kernel name in SYCL: somehow an extra '"' ends up in the name, and that fails to compile. I'll see if I can wrap my head around what to change to fix it.

I don't manage to build even vanilla gtensor with SYCL, as the

#include "level_zero/ze_api.h"
#include "level_zero/zes_api.h"

in include/gtensor/backend_sycl_compat.h are not found. Maybe I need to set some environment variables?

Once Level Zero and oneAPI are installed, you need to do the following:

# Gen11 integrated gpu only
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1

# assuming default install path
source /opt/intel/oneapi/setvars.sh

@cmpfeil (Author) commented Feb 14, 2025

Ok, so I found the problem with the SYCL debug build:
It happens when one wants to emulate a mantissa length of 34 bits via gt::mxp_truncated_mantissa_t<double, 34>. The second template parameter of mxp_truncated_mantissa_t is a std::uint8_t, and according to cppreference

... std::uint8_t may be unsigned char, ...

Then, I guess, the value 34, corresponding to the character ", confuses the compiler when determining where the kernel name ends. Would you agree that this is a bug on the compiler side?

To check this assumption and to get the build passing for now, the last "debugging" commit 1da0506 simply skips the tests with template parameters higher than 30. Anyhow, to properly avoid it, it should be sufficient to use, e.g., std::uint16_t for the template parameter instead.
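
For illustration, the kind of change meant here (a sketch only; the parameter names are made up and the actual declaration in the PR may differ):

// Before: the bit count is a std::uint8_t, so the value 34 ends up in the generated
// SYCL kernel name as the character '"' (ASCII 34) and breaks the debug build.
//   template <typename T, std::uint8_t MantissaBits> struct mxp_truncated_mantissa_t;

// Workaround sketch: widen the non-type template parameter so it is emitted as a number.
template <typename T, std::uint16_t MantissaBits>
struct mxp_truncated_mantissa_t { /* emulate T with a MantissaBits-bit mantissa */ };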

Note that the pipeline still fails at the moment, but now this is due to some actual tests not passing with the SYCL build.

@bd4 (Contributor) commented Feb 14, 2025

Good find! I saw the quote in the error's template name, and it was not immediately obvious where it came from.

Then, I guess, the value 34, corresponding to the character ", confuses the compiler when determining where the kernel name ends. Would you agree that this is a bug on the compiler side?

I agree it seems like a bug, but I have been wrong before on this front. I'll work on a minimal reproducer, should be pretty straightforward.

To check this assumption and to get the build passing for now, the last "debugging" commit 1da0506 simply skips the tests with template parameters higher than 30. Anyhow, to properly avoid it, it should be sufficient to use, e.g., std::uint16_t for the template parameter instead.

Note that the pipeline still fails at the moment, but now this is due to some actual tests not passing with the SYCL build.

It looks like a lot of those tests fail with a 1 != 1 issue, where the type conversion is not happening in the same way as on other platforms.

@cmpfeil (Author) commented Feb 18, 2025

Good find! I saw the quote in the error's template name, and it was not immediately obvious where it came from.

Thanks - you gave me the necessary hint by pointing out the extra " character.

I agree it seems like a bug, but I have been wrong before on this front. I'll work on a minimal reproducer, should be pretty straightforward.

With your instructions above I finally managed to reproduce the SYCL debug build error locally. I didn't dig any deeper into why it only happens in the debug build, or why the error apparently only shows up for the device tests. But it is indeed very easy to reproduce outside of gtensor: take the example code from SYCL Basic Code: parallel_for and replace parallel_for with parallel_for<kname>, where

#include <cstdint>

template <std::uint8_t N> struct Something {};
using kname = Something<34>;

gives the corresponding compilation error via icpx -fsycl main.cxx (complete file below).

It looks like a lot of those tests fail with a 1 != 1 issue, where the type conversion is not happening in the same way as on other platforms.

Gonna have a look into this next.

// main.cxx
#include <CL/sycl.hpp>

// -------------------------------------------------------------------------- //
/* Example from https://www.intel.com/content/www/us/en/docs/sycl/introduction/
                  latest/03-sycl-basic-code-parallel-for.html [2025/02/18]
   Only modification: parallel_for --> parallel_for<kname>
   Error reproduced via
   $ icpx -fsycl main.cxx
   [ Intel(R) oneAPI DPC++/C++ Compiler 2025.0.0 (2025.0.0.20241008) ] */

#include <cstdint>
template <std::uint8_t N> struct Something {}; 
using kname = Something<34>;
// -------------------------------------------------------------------------- //

using namespace sycl;

int main() {

    queue Q;                        // The queue, Q, is the object that
                                    // submits the task to a device.
    int const size = 10; 
    buffer<int> A{ size };          // The buffer, A, is the memory used to
                                    // transfer data between host and device.

    Q.submit([&](handler& h) {      // The handler, h, is the object that contains
                                    // the parallel_for function to be used.

        accessor A_acc(A, h);       // The accessor, A_acc, is the object that
                                    // efficiently accesses the buffer elements.

        h.parallel_for<kname>(range<1>(size), [=](id<1> indx) {
            A_acc[indx] = 77; 
            });

        });

    host_accessor result(A);        // host_accessor is the object that allows
                                    // the host to access the buffer memory.

    for (int i = 0; i < size; i++)  // Print output
        std::cout << result[i] << " "; std::cout << "\n";
    return 0;

}

@cmpfeil (Author) commented Feb 18, 2025

It looks like a lot of those tests fail with a 1 != 1 issue, where the type conversion is not happening in the same way as on other platforms.

Turns out all of the failing tests were due to normal (non-mxp) gtensor kernels achieving more accuracy than expected with the SYCL build. The reason is that the default -fp-model=fast allows optimizations which are not value-safe on floating-point data. With any of the other options -fp-model={strict,precise,consistent} the tests also pass for the SYCL build.

So, obviously, testing the normal gtensor kernels for value-safety was a bit over-ambitious (I didn't take into account that value-unsafe optimizations might of course also improve the result). Hence, I will remove those non-mxp "reference tests" and keep only the tests for the mxp kernels. Then the pipeline should pass (I hope).
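
For reference, this is roughly how the value-safe behaviour can be requested when compiling the reproducer above directly with icpx (illustrative command line, not the project's CMake setup):

# the default -fp-model=fast permits value-unsafe floating-point optimizations
icpx -fsycl -fp-model=precise main.cxx   # strict or consistent work as well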

@cmpfeil marked this pull request as ready for review on February 20, 2025, 15:43
@cmpfeil (Author) commented Feb 20, 2025

I think content-wise this PR is complete, and now it also passes the pipeline.

Would you agree that this is a bug on the compiler side?

I reported it in an intel/llvm issue. Let's see what comes out of it.

Anyhow, to properly avoid it, it should be sufficient to use, e.g., std::uint16_t for the template parameter instead.

Did this to pass the pipeline.
