Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add intrinsics for FEAT_SME_MOP4 #381

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
297 changes: 297 additions & 0 deletions main/acle.md
Original file line number Diff line number Diff line change
Expand Up @@ -444,6 +444,8 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin
* Added `svdot[_n_f16_mf8]_fpm` and `svdot[_n_f32_mf8]_fpm`.
* Added Guarded Control Stack (GCS) at
[**Beta**](#current-status-and-anticipated-changes) quality level.
* Added [**Beta**](#current-status-and-anticipated-changes) support
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to change to alpha

for quarter-tile outer product intrinsics.

### References

Expand Down Expand Up @@ -2370,6 +2372,17 @@ support for the SME double precision floating-point outer product
(FEAT_SME_F64F64) instructions and if their associated intrinsics are
available. This implies that `__ARM_FEATURE_SME` is nonzero.

#### Quarter-tile outer product intrinsics

The specification for SME is in
[**Beta** state](#current-status-and-anticipated-changes) and may change or be
extended in the future.

`__ARM_FEATURE_SME_MOP4` is defined to `1` if there is hardware
support for the SME quarter-tile outer product(FEAT_SME_MOP4)
instructions and if their associated intrinsics are
available. This implies that `__ARM_FEATURE_SME2` is nonzero.

## Floating-point model

These macros test the floating-point model implemented by the compiler
Expand Down Expand Up @@ -2566,6 +2579,7 @@ be found in [[BA]](#BA).
| [`__ARM_FEATURE_SME_F8F16`](#modal-8-bit-floating-point-extensions) | Modal 8-bit floating-point extensions | 1 |
| [`__ARM_FEATURE_SME_F8F32`](#modal-8-bit-floating-point-extensions) | Modal 8-bit floating-point extensions | 1 |
| [`__ARM_FEATURE_SME_I16I64`](#16-bit-to-64-bit-integer-widening-outer-product-intrinsics) | 16-bit to 64-bit integer widening outer product intrinsics (FEAT_SME_I16I64) | 1 |
| [`__ARM_FEATURE_SME_MOP4`](#quarter-tile-outer-product-intrinsics) | quarter-tile outer product intrinsics (FEAT_SME_MOP4) | 1 |
| [`__ARM_FEATURE_SME_LOCALLY_STREAMING`](#scalable-matrix-extension-sme) | Support for the `arm_locally_streaming` attribute | 1 |
| [`__ARM_FEATURE_SME_LUTv2`](#lookup-table-extensions) | Lookup table extensions (FEAT_SME_LUTv2) | 1 |
| [`__ARM_FEATURE_SSVE_FP8DOT2`](#modal-8-bit-floating-point-extensions) | Modal 8-bit floating-point extensions | 1 |
Expand Down Expand Up @@ -11411,6 +11425,7 @@ Bitwise exclusive NOR population count outer product and accumulate/subtract
__arm_streaming __arm_inout("za");
```


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove that

#### BFMLA, BFMLS, FMLA, FMLS (single)

Multi-vector floating-point fused multiply-add/subtract
Expand Down Expand Up @@ -13723,6 +13738,288 @@ Multi-vector 8-bit floating-point multiply-add long.
__arm_streaming __arm_inout("za");
```

### SME2 mop4 intrinsics

The intrinsics in this section are defined by the header file
[`<arm_sme.h>`](#arm_sme.h) when `__ARM_FEATURE_SME2` and
`__ARM_FEATURE_SME_MOP4` are defined. Individual intrinsics may have
additional target feature requirements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can replace to something like this:
These intrinsics have an additional ‘_{1,2}x{1,2}’ suffix in non-overloaded names to indicate the order between single and multi-vector arguments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that wording confusing. This suffix is not related to order of operands. Rather it tells which of the 2 vector operands are single and which are multi vectors. Maybe it could be like this:

These intrinsics use a _{1,2}x{1,2} suffix in non-overloaded names. The first number indicates whether the first vector argument is a single vector (1) or a pair of vectors (2). The second number follows the same rule for the second vector argument.

These intrinsics use an additional '_{1,2}x{1,2}' suffix in
non-overloaded names to indicate which vector argument is a vector register pair.

#### FMOP4A (non-FP8), BFMOP4A, SMOP4A, UMOP4A

``` c
// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x1]_za32[_f32_f32](uint64_t tile, svfloat32_t zn,
svfloat32_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x2]_za32[_f32_f32](uint64_t tile, svfloat32x2_t zn,
svfloat32x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x1]_za32[_f32_f32](uint64_t tile, svfloat32x2_t zn,
svfloat32_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x2]_za32[_f32_f32](uint64_t tile, svfloat32_t zn,
svfloat32x2_t zm)
__arm_streaming __arm_inout("za");
```

#### FMOP4S (non-FP8), BFMOP4S, SMOP4S, UMOP4S

``` c
// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x1]_za32[_f32_f32](uint64_t tile, svfloat32_t zn,
svfloat32_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x2]_za32[_f32_f32](uint64_t tile, svfloat32x2_t zn,
svfloat32x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x1]_za32[_f32_f32](uint64_t tile, svfloat32x2_t zn,
svfloat32_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f16_f16] (only if __ARM_FEATURE_SME_F16F16 != 0)
// _za16[_bf16_bf16] (only if __ARM_FEATURE_SME_B16B16 != 0)
// _za32[_f16_f16]
// _za32[_bf16_bf16]
// _za32[_s16_s16]
// _za32[_u16_u16]
// _za32[_s8_s8]
// _za32[_u8_u8]
// _za64[_f64_f64] (only if __ARM_FEATURE_SME_F64F64 != 0)
// _za64[_s16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x2]_za32[_f32_f32](uint64_t tile, svfloat32_t zn,
svfloat32x2_t zm)
__arm_streaming __arm_inout("za");
```

#### SUMOP4A

``` c
// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x1]_za32[_s8_u8](uint64_t tile, svint8_t zn,
svuint8_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x2]_za32[_s8_u8](uint64_t tile, svint8x2_t zn,
svuint8x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x1]_za32[_s8_u8](uint64_t tile, svint8x2_t zn,
svuint8__t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x2]_za32[_s8_u8](uint64_t tile, svint8_t zn,
svuint8x2_t zm)
__arm_streaming __arm_inout("za");
```

#### SUMOP4S

``` c
// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x1]_za32[_s8_u8](uint64_t tile, svint8_t zn,
svuint8_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x2]_za32[_s8_u8](uint64_t tile, svint8x2_t zn,
svuint8x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x1]_za32[_s8_u8](uint64_t tile, svint8x2_t zn,
svuint8__t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_s16_u16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x2]_za32[_s8_u8](uint64_t tile, svint8_t zn,
svuint8x2_t zm)
__arm_streaming __arm_inout("za");
```

#### USMOP4A

``` c
// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x1]_za32[_u8_s8](uint64_t tile, svuint8_t zn,
svint8_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x2]_za32[_u8_s8](uint64_t tile, svuint8x2_t zn,
svint8x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_2x1]_za32[_u8_s8](uint64_t tile, svuint8x2_t zn,
svint8__t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4a[_1x2]_za32[_u8_s8](uint64_t tile, svuint8_t zn,
svint8x2_t zm)
__arm_streaming __arm_inout("za");
```

#### USMOP4S

``` c
// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x1]_za32[_u8_s8](uint64_t tile, svuint8_t zn,
svint8_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x2]_za32[_u8_s8](uint64_t tile, svuint8x2_t zn,
svint8x2_t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_2x1]_za32[_u8_s8](uint64_t tile, svuint8x2_t zn,
svint8__t zm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za64[_u16_s16] (only if __ARM_FEATURE_SME_I16I64 != 0)
void svmop4s[_1x2]_za32[_u8_s8](uint64_t tile, svuint8_t zn,
svint8x2_t zm)
__arm_streaming __arm_inout("za");
```

#### FMOP4A (FP8)

``` c
// Variants are also available for:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/[_f8]/[_mf8]/
It is missing the _fpm intrinsic at the end.
And should these intrinsics be also consistent with the non-FP8.

// _za16[_f8] (only if __ARM_FEATURE_SME_F8F16 != 0)
// _za32[_f8] (only if __ARM_FEATURE_SME_F8F32 != 0)
void svmop4a[_1x1]_za32[_f8](uint64_t tile, svmfloat8_t zn,
svmfloat8_t zm, fpm_t fpm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f8] (only if __ARM_FEATURE_SME_F8F16 != 0)
// _za32[_f8] (only if __ARM_FEATURE_SME_F8F32 != 0)
void svmop4a[_2x2]_za32[_f8](uint64_t tile, svmfloat8x2_t zn,
svmfloat8x2_t zm, fpm_t fpm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f8] (only if __ARM_FEATURE_SME_F8F16 != 0)
// _za32[_f8] (only if __ARM_FEATURE_SME_F8F32 != 0)
void svmop4a[_2x1]_za32[_f8](uint64_t tile, svmfloat8x2_t zn,
svmfloat8_t zm, fpm_t fpm)
__arm_streaming __arm_inout("za");

// Variants are also available for:
// _za16[_f8] (only if __ARM_FEATURE_SME_F8F16 != 0)
// _za32[_f8] (only if __ARM_FEATURE_SME_F8F32 != 0)
void svmop4a[_1x2]_za32[_f8](uint64_t tile, svmfloat8_t zn,
svmfloat8x2_t zm, fpm_t fpm)
__arm_streaming __arm_inout("za");
```

# M-profile Vector Extension (MVE) intrinsics

The M-profile Vector Extension (MVE) [[MVE-spec]](#MVE-spec) instructions provide packed Single
Expand Down