Full SME(1) instruction support and STREAMING Groups #415

FinnWilkinson · 2024-06-12T10:50:19Z

This PR implements all available SME (version 1) instructions that are contained within LLVM 14.0.5. Specifically, this is Version 2021-06 of the Armv9-A A64 ISA.

No FP16 or BF16 instructions are supported, as C++17 lacks native types for them. All Quad-Word instruction variants are emulated using 64-bit data types.
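
As a rough illustration of emulating a quad-word element with 64-bit types, the sketch below holds one 128-bit element as two 64-bit halves. `QuadWord` and `loadQuadWord` are hypothetical names for this example only and are not taken from the PR.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// One 128-bit element held as two 64-bit halves (illustrative layout only).
using QuadWord = std::array<uint64_t, 2>;

// Copies 16 bytes from a buffer into a pair of 64-bit values.
inline QuadWord loadQuadWord(const uint8_t* src) {
  assert(src != nullptr);
  QuadWord q{};
  std::memcpy(q.data(), src, 2 * sizeof(uint64_t));
  return q;
}
```

Which half ends up holding the low-order bytes depends on host endianness, so real code would need to fix that convention.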

In addition, new STREAMING_SVE and STREAMING_PREDICATE groups have been introduced (along with corresponding decode logic). These allow a different pipeline / latency configuration for these instructions when SVE Streaming Mode (the context mode in which SME instructions are executed) is enabled. This allows a co-processor style implementation of SME to be modelled within SimEng: additional latency / reduced throughput can be configured to mimic an offload penalty, and different execution or LD/STR hardware can be modelled for said co-processor compared to the main core.
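
A minimal sketch of the group-swapping idea, assuming a much smaller group list than SimEng really has. The `Group` enum and `toStreamingGroup` are hypothetical names for illustration and do not mirror the actual decode logic in this PR.

```cpp
#include <cstdint>

// Hypothetical, simplified group list; the real InstructionGroups differ.
enum class Group : uint8_t {
  INT_SIMPLE,
  SVE,
  PREDICATE,
  STREAMING_SVE,
  STREAMING_PREDICATE
};

// Returns the group to schedule with, given the group found at decode time
// and whether SVE Streaming Mode is currently enabled.
inline Group toStreamingGroup(Group decoded, bool streamingEnabled) {
  if (!streamingEnabled) return decoded;
  switch (decoded) {
    case Group::SVE:
      return Group::STREAMING_SVE;
    case Group::PREDICATE:
      return Group::STREAMING_PREDICATE;
    default:
      return decoded;  // Other groups are unaffected by streaming mode
  }
}
```

A core model config could then give the STREAMING groups their own ports and latencies to approximate an offload penalty.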

Add STREAMING Group support
Add execution logic and regression tests for all missing SME instructions

FinnWilkinson · 2024-07-09T11:18:52Z

#rerun tests

src/include/simeng/arch/aarch64/Architecture.hh

src/lib/arch/aarch64/Architecture.cc

src/include/simeng/arch/aarch64/InstructionGroups.hh

src/lib/arch/aarch64/Instruction_address.cc

src/lib/arch/aarch64/Instruction_execute.cc

ABenC377 · 2024-08-30T10:41:47Z

test/regression/aarch64/instructions/sme.cc

+      CHECK_MAT_COL(ARM64_REG_ZAS3, i, uint32_t,
+                    fillNeon<uint32_t>(inter32, (SVL / 8)));
+    } else {
+      // Even cols, all elements


Should this be Odd cols?

Several possible occurrences of the same issue throughout

No, throughout the test file some SME tests use the same predicate patterns. Two predicates are always used: p0 and p1.

p0 is always set to all true using ptrue p0.d (for example)

p1 is always set to the pattern {ON, OFF, ON, OFF...} using zip1 p1.s, p0.s, p1.s (for example)

Hence, when using p1 as the predicate, the rows, columns, or individual elements per row/col that are updated will always be the even ones. This is also reflected in the test initialisation with inter32[i] = (i % 2 == 0) ? i : 65; where only the even vector elements are set to increasing values, and the odd elements keep the test-default of 65.

So what does i represent here?

i is the row or column element index (e.g. from a loop such as int i = 0; i < sliceElements; i++, with sliceElements sometimes fixed to a constant representing the number of elements for the max SVL of 2048 bits), depending on whether the Horizontal or Vertical instruction variant is being used.

So in the above, if i is even (index 0, 2, 4 etc) then the row/column element is set to i to reflect the use of index z1.s, #0, #1 in the test. If i is odd then the value remains unchanged; for this test a value of 65.
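
The expected-value pattern described in this thread can be sketched as follows. `expectedSlice` is a hypothetical helper written for this example; the test file itself builds its arrays inline.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the expected contents of one row/column when only even-indexed
// elements are active (predicate pattern {ON, OFF, ON, OFF, ...}).
// `fillValue` is the value the test pre-loads into every element.
inline std::vector<uint32_t> expectedSlice(std::size_t sliceElements,
                                           uint32_t fillValue) {
  std::vector<uint32_t> out(sliceElements, fillValue);
  for (std::size_t i = 0; i < sliceElements; i += 2) {
    // Active (even) elements take the value produced by index z1.s, #0, #1
    out[i] = static_cast<uint32_t>(i);
  }
  return out;
}
```

With four elements and a fill value of 65, the result is {0, 65, 2, 65}.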

I'm happy, but leaving for @jj16791 to resolve

jj16791

Some comments and I agree with several of Alex's comments. I think it would be good to get the ARM SME/SVE loops as part of our functional verification checks to help test these new instructions. I assume it would have to be done somewhere private though (not sure if we already have that guarantee in the upcoming CI/CD pipelines)?

CMakeLists.txt

configs/a64fx_SME.yaml

docs/sphinx/assets/instruction_groups_AArch64.png

src/lib/arch/aarch64/Instruction_address.cc

src/lib/arch/aarch64/Instruction.cc

jj16791 · 2024-10-26T09:49:51Z

src/lib/arch/aarch64/Architecture.cc

@@ -188,6 +188,20 @@ uint8_t Architecture::predecode(const uint8_t* ptr, uint16_t bytesAvailable,
    newInsn.setExecutionInfo(getExecutionInfo(newInsn));
    // Cache the instruction
    iter = decodeCache_.insert({insn, newInsn}).first;
+  } else {


I have no data on this but doing this process for every AArch64 instruction may have a detrimental effect on performance. Will need to run through the new CI/CD when it's ready to determine this. Would argue that such a change shouldn't be merged until we know there's no significant performance regression.

For non Predicate / SVE instructions the overhead is 3 if statements and a function call which should be minor. But yes, a performance regression test for this would be good.

Not sure on what an alternative solution could be though if the performance impact is significant...

I'll leave this comment unresolved so we remember to look into this
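
To make the cost being discussed concrete, here is a rough sketch of a decode cache whose hit path only does extra work for SVE / Predicate instructions whose streaming state has changed. `CachedInsn`, `DecodeCache`, and `lookup` are hypothetical and heavily simplified; they are not SimEng's predecode implementation.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical cached decode result; not SimEng's Instruction class.
struct CachedInsn {
  bool isSveOrPredicate = false;    // Only these can move to a STREAMING group
  bool decodedInStreaming = false;  // Streaming state when last (re)grouped
  int regroupCount = 0;             // How often the slow path has run
};

class DecodeCache {
 public:
  // `isSveOrPredicate` stands in for the result of a full decode on a miss.
  CachedInsn& lookup(uint32_t encoding, bool isSveOrPredicate,
                     bool streamingNow) {
    auto iter = cache_.find(encoding);
    if (iter == cache_.end()) {
      // Cache miss: "decode" and remember the current streaming state
      CachedInsn insn;
      insn.isSveOrPredicate = isSveOrPredicate;
      insn.decodedInStreaming = streamingNow;
      iter = cache_.insert({encoding, insn}).first;
    } else if (iter->second.isSveOrPredicate &&
               iter->second.decodedInStreaming != streamingNow) {
      // Cache hit, but the streaming state changed: regroup the instruction
      iter->second.decodedInStreaming = streamingNow;
      iter->second.regroupCount++;
    }
    return iter->second;
  }

 private:
  std::unordered_map<uint32_t, CachedInsn> cache_;
};
```

In this sketch, a hit on any other instruction type costs only the failed condition checks, which is the overhead a performance regression test would need to measure.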

src/lib/arch/aarch64/Instruction_decode.cc

jj16791 · 2024-10-26T10:06:51Z

src/include/simeng/arch/aarch64/helpers/neon.hh

@@ -568,9 +568,14 @@ RegisterValue vecUMaxP(srcValContainer& sourceValues) {
  const T* n = sourceValues[0].getAsVector<T>();
  const T* m = sourceValues[1].getAsVector<T>();

+  // Concatenate the vectors


Have you double-checked the ordering of the concatenation? Ran it on ookami and I think these may be the wrong way round but worth double checking in case I've made a mistake

As per the spec:

This instruction creates a vector by concatenating the vector elements of the first source SIMD&FP register after the vector elements of the second source SIMD&FP register...

i.e. N is concatenated onto the end of M (M:N)

I think with "what the spec says" vs "observed values", the latter should probably be taken as the truth. So it's worth someone else double-checking that the values I've observed do go against what the spec says

This is very odd and confusing... I've also checked on Ookami and Isambard-AI with the following asm program:

movi v0.16b, #0
movi v1.16b, #1
movi v2.16b, #2
umaxp v0.16b, v1.16b, v2.16b
mov w12, v0.s[0]
mov w13, v0.s[3]

Which after executing yields the following:

v0.b = {1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2}

v0.s = {16843009, 16843009, 33686018, 33686018}

Which means the concatenation is v1:v2, NOT v2:v1.
I double-checked that gdb doesn't display vector registers "in reverse" (i.e. the left-most element is in fact v0[0] and not v0[15]) using the final two instructions. Their results were:

w12 = 16843009

w13 = 33686018

So yes, on hardware the concatenation is seemingly vn:vm.

However, the spec and its pseudo code for UMAXP doesn't align with this... From this page, the pseudo code is as follows:

CheckFPAdvSIMDEnabled64();
constant bits(datasize) operand1 = V[n, datasize];
constant bits(datasize) operand2 = V[m, datasize];
bits(datasize) result;
constant bits(2*datasize) concat = operand2:operand1;
integer element1;
integer element2;
integer max;
for e = 0 to elements-1
    element1 = UInt(Elem[concat, 2*e, esize]);
    element2 = UInt(Elem[concat, (2*e)+1, esize]);
    max = Max(element1, element2);
    Elem[result, e, esize] = max<esize-1:0>;
V[d, datasize] = result;

Where it is clear that the concatenation according to this is vm:vn....

In this instance, we should probably go with hardware. But it is quite annoying that the spec doesn't align with hardware on this, and that updating our code in line with the spec still fixed the issue that was occurring!
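
For reference, a small standalone sketch that reproduces the hardware result quoted above for the 16-byte form. `umaxp16b` is a hypothetical helper written for this example, not the vecUMaxP helper in neon.hh.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Pairwise unsigned maximum over 16 byte lanes, ordered to match the observed
// hardware values: adjacent pairs of `n` fill the low half of the result and
// adjacent pairs of `m` fill the high half.
inline std::array<uint8_t, 16> umaxp16b(const std::array<uint8_t, 16>& n,
                                        const std::array<uint8_t, 16>& m) {
  std::array<uint8_t, 16> out{};
  for (std::size_t e = 0; e < 8; e++) {
    out[e] = std::max(n[2 * e], n[2 * e + 1]);
    out[e + 8] = std::max(m[2 * e], m[2 * e + 1]);
  }
  return out;
}
```

Feeding it a vector of all 1s as `n` and all 2s as `m` gives eight 1s followed by eight 2s, the same as the v0.b contents observed on hardware.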

src/lib/arch/aarch64/Instruction_execute.cc

jj16791 · 2024-10-26T10:18:55Z

test/regression/aarch64/instructions/sme.cc

+      CHECK_MAT_COL(ARM64_REG_ZAS3, i, uint32_t,
+                    fillNeon<uint32_t>(inter32, (SVL / 8)));
+    } else {
+      // Even cols, all elements


Several possible occurrences of the same issue throughout

dANW34V3R

Haven't finished the review but posting comments to prevent overlaps

CMakeLists.txt

src/include/simeng/arch/aarch64/Architecture.hh

src/lib/arch/aarch64/Instruction_decode.cc

src/include/simeng/Register.hh

configs/a64fx_SME.yaml

dANW34V3R

LOOTS of new instructions, well done for grinding through them. Bring on SAIL

src/include/simeng/arch/aarch64/Instruction.hh

dANW34V3R · 2024-10-28T16:02:44Z

src/lib/arch/riscv/Instruction_decode.cc

+  // Identify subgroup type
+  if (isInstruction(InsnType::isBranch))
+    group = InstructionGroups::BRANCH;
+  else if (isInstruction(InsnType::isLoad))
+    group += 8;
+  else if (isInstruction(InsnType::isStore))
+    group += 9;
+  else if (isInstruction(InsnType::isDivide))
+    group += 7;
+  else if (isInstruction(InsnType::isMultiply))
+    group += 6;
+  else if (isInstruction(InsnType::isShift) ||
+           isInstruction(InsnType::isConvert))
+    group += 5;
+  else if (isInstruction(InsnType::isLogical))
+    group += 4;
+  else if (isInstruction(InsnType::isCompare))
+    group += 3;
+  else
+    group += 2;  // Default return is {Data type}_SIMPLE_ARTH
+


Would a switch statement on instructionIdentifier not be viable? Would be slightly nicer to read

I don't think a switch would work here, or it would decrease readability if it did. instructionIdentifier is just a uint32_t, so the associated value used in the switch would have no meaning. With the current implementation, the if clause names what is being checked against (i.e. isMultiply)
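
A rough sketch of why the if-chain reads naturally, under the assumption that the InsnType values are bit flags tested against the identifier. The flag values and the free-function form of `isInstruction` below are made up for this example; the offsets are the ones from the diff above.

```cpp
#include <cstdint>

// Assumed bit-flag layout, for illustration only; the real InsnType values
// live in the RISC-V Instruction header and may differ.
enum class InsnType : uint32_t {
  isLoad = 1u << 0,
  isStore = 1u << 1,
  isMultiply = 1u << 2,
  isDivide = 1u << 3
};

// Returns true if `flag` is set in the instruction's identifier bits.
inline bool isInstruction(uint32_t identifier, InsnType flag) {
  return (identifier & static_cast<uint32_t>(flag)) != 0;
}

// Offset added to the base group. Flags are checked in priority order, so an
// identifier with several bits set still maps to exactly one subgroup.
inline uint16_t subgroupOffset(uint32_t identifier) {
  if (isInstruction(identifier, InsnType::isLoad)) return 8;
  if (isInstruction(identifier, InsnType::isStore)) return 9;
  if (isInstruction(identifier, InsnType::isDivide)) return 7;
  if (isInstruction(identifier, InsnType::isMultiply)) return 6;
  return 2;  // Default: {Data type}_SIMPLE_ARTH
}
```

A switch over the raw identifier would need a case per bit combination, whereas each if names the property it tests.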

dANW34V3R · 2024-10-28T16:07:28Z

src/lib/arch/aarch64/Instruction_execute.cc

+          }
+          results_[row] = {outRow, 256};


I may be misremembering but I think we said we would stop using this implicit registerValue initialisation in favour of an explicit one. If that is the case this should be updated throughout

e.g. results_[row] = RegisterValue(outRow, 256)

There are 158 instances of using {...} to initialise a RegisterValue. Possibly more if they contain a line break due to a long-named helper function being used.

If this is something we want to move away from then I think it should be a separate PR, and the {...} constructor should be prohibited for this class (if possible)

Separate PR is sensible. Something to discuss as a team when we next meet

To add to this, there will also be 10s-100s of extra uses inside helper functions and possibly RV64 also

This isn't a problem in RV as I was requested not to use full explicit initialisation. Although it was in a worse state 5a1652f

This isn't a big deal, would just be nice to have a consistent style
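
On the "if possible" question, one standard C++ option is to mark the constructor `explicit`, which rejects copy-list-initialisation. The sketch below uses a hypothetical `Value` class as a stand-in; it is not the RegisterValue class and says nothing about what the team decided.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for RegisterValue; only the constructor shape matters.
class Value {
 public:
  // `explicit` makes `Value v = {ptr, 256};` and `slot = {ptr, 256};`
  // ill-formed, so callers must name the type at the call site.
  explicit Value(const uint8_t* data, std::size_t bytes)
      : data_(data), bytes_(bytes) {
    assert(data != nullptr);
  }

  const uint8_t* data() const { return data_; }
  std::size_t size() const { return bytes_; }

 private:
  const uint8_t* data_;
  std::size_t bytes_;
};

// Explicit construction still works as before.
inline Value makeRow(const uint8_t* outRow) { return Value(outRow, 256); }
```

With this in place, every remaining `{...}` use would become a compile error, which makes the clean-up mechanical.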

dANW34V3R · 2024-10-28T16:11:21Z

src/lib/arch/aarch64/Instruction_execute.cc

            memoryData_[index] =
-                RegisterValue((char*)mdata.data(), md_size * 4);
-            md_size = 0;
+                RegisterValue((char*)memData.data(), memData.size() * 4);


sizeof(uint32_t) might be better than 4 but not that important. Applied throughout

Again, if this is something we want to enforce then it should be its own PR as this occurs extremely often in this file. An instruction's opcode defines the data type used so we know what the multiplicand should be

Similar also occurs in instruction_address
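
If the magic multipliers were ever replaced, one option would be a small helper that derives the byte count from the element type. `byteSize` is a hypothetical name for this sketch and does not exist in the codebase.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bytes held by a vector of elements of type T, derived from the
// element type instead of a hard-coded multiplier such as 4 or 8.
template <typename T>
std::size_t byteSize(const std::vector<T>& memData) {
  return memData.size() * sizeof(T);
}
```

A call such as byteSize(memData) would then replace memData.size() * 4 for a std::vector<uint32_t>.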

src/lib/arch/aarch64/Instruction_execute.cc

FinnWilkinson added enhancement New feature or request 0.9.7 Part of SimEng Release 0.9.7 labels Jun 12, 2024

FinnWilkinson self-assigned this Jun 12, 2024

FinnWilkinson force-pushed the additional-sme-support branch from 531ebd0 to 7974237 Compare August 9, 2024 15:58

FinnWilkinson marked this pull request as ready for review August 28, 2024 13:41

FinnWilkinson requested review from dANW34V3R, jj16791, JosephMoore25 and ABenC377 August 28, 2024 13:41

ABenC377 requested changes Aug 30, 2024

FinnWilkinson changed the title ~~[WIP] Full SME(1) instruction support and STREAMING Groups~~ Full SME(1) instruction support and STREAMING Groups Sep 2, 2024

jj16791 requested changes Oct 26, 2024

dANW34V3R reviewed Oct 28, 2024

FinnWilkinson force-pushed the additional-sme-support branch from 4ad3b6e to aa40d88 Compare October 29, 2024 14:46

FinnWilkinson added 14 commits November 6, 2024 16:43

Added STREAMING versions of relevant aarch64 instruction groups.

cba8cff

Removed un-used macros from AArch64 Instruction decode.

34c1153

Moved aarch64 getGroup logic to instruction_decode.

687d2a9

Moved riscv getGroup logic to instruction_decode.

49fa390

Updated unit tests after changing getGroup logic.

4d1acc9

Added new AArch64 groups to model config and updated integration test.

d13b7cc

Added streaming mode enabled helper functions.

60aeecc

Added STREAMING group logic to instruction_decode, and logic to change instruction group to STREAMING if SM mode is different to when instruction was first decoded.

89e6b6b

Fixed minor issues with new streaming groups and updated SME example config file.

e671cc3

Re-wrote checkStreamingGroup function.

813b013

Added unit tests for new AArch64 STREAMING groups functionality.

4e7c429

Updated aarch64 groups diagram in docs.

cae1005

Added SME instruction FMOPS (S and D) support and regression tests.

8352b5a

Added SME instruction SMOPA (S and D) support and regression tests.

b7a991e

FinnWilkinson added 27 commits November 6, 2024 16:43

Fix jenkins build error.

26adf0d

Added SME instructions SUMOPA and SUMOPS (S and D) support and regression tests.

377dd99

Updated SUMOPA and SUMOPS tests.

7903d46

Added SME instructions USMOPA and USMOPS (S and D) support and regression tests.

93c3b6c

Fix jenkins build error pt2.

e12ccf1

Implemented SME STR instruction and regression test.

d26ef3a

Fixed execution logic for vertical ST1D and ST1W SME stores.

3adc299

Implemented SME ST1B and ST1H (H and V) instruction logic.

e06387b

Implemented SME LD1B and LD1H (H and V) instruction logic.

4cfe0eb

Added SME LD1B and LD1H regression tests.

0a3fc93

Updated ST1D and ST1W SME regression tests.

a713f44

Added SME ST1B and ST1H regression tests.

fac70b5

Implemented SME MOVA (Tile to Vec, horizontal) instructions and regression test (B, H, S, D)

e906dd1

Implemented SME MOVA (Tile to Vec, vertical) instructions and regression test (B, H, S, D)

8c2a6bc

Implemented SME MOV (Tile to Vec, vertical and horizontal) instruction alias and regression tests (B, H, S, D)

532f9af

Implemented SME MOVA/MOV (Vec to Tile, vertical and horizontal) instructions and aliases and regression tests (B, H, S, D)

3b4de2e

Implemented SME LDR instruction and regression tests.

fb58957

Implemented SME ADDHA and ADDVA (S and D) instructions and regression tests.

c194858

Updated ADDHA test to make more specific.

7e5e32c

Corrected ADDVA execution logic.

e664cc7

Updated ADDVA test to make more specific.

a064e9b

Added SME MOVA (tile to vec, vec to tile) Quad-word instructions and regression tests.

ffed626

Implemented SME ST1Q and LD1Q (V and H) instructions and regression tests.

66e54fd

Removed werror.

762588b

NEON instruction logic fixes.

59d7887

Attended PR comments.

32948cf

Switched order of concatonation for NEON UMAXP instruction to match Hardware.

5945bae

FinnWilkinson force-pushed the additional-sme-support branch from 91c4336 to 5945bae Compare November 6, 2024 16:44

Fixed LD1W (into ZA, 32-bit) buffer overflow error.

e15f354

ABenC377 approved these changes Nov 11, 2024
