-
Notifications
You must be signed in to change notification settings - Fork 11.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WebAssembly] Support multiple .init_array
fragments when writing Wasm objects
#111008
base: main
Are you sure you want to change the base?
Conversation
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
@llvm/pr-subscribers-backend-webassembly @llvm/pr-subscribers-mc Author: George Stagg (georgestagg) ChangesFor WebAssembly objects, only functions listed in the first symbol in the Consider the C program: #import <stdio.h>
void init1() { puts("init1"); }
void init2() { puts("init2"); }
void init3() { puts("init3"); }
typedef void (*f)(void);
__attribute__((section(".init_array"))) void (*p_init1)(void) = &init1;
__attribute__((section(".init_array"))) f p_init2[2] = { &init2, &init3 };
int main() {
puts("main");
return 0;
} Compiling for Linux results in each function running before
In previous LLVM versions (<=18, I think?) this also used to throw an error message: This makes things difficult for certain tools that rely on writing multiple objects in the This PR makes two changes to
This picks up any functions listed in later objects in the However, a related issue is revealed by changing the functions in the example above to read: void init1() { puts("1"); }
void init2() { puts("2"); }
void init3() { puts("3"); } Then, with our patched LLVM,
The symbols in the Is this a mistake? AFAICT I don't think there's an appropriate kind for these symbols to be written to the Interestingly, you can also reveal this problem in an unpatched LLVM by writing a Wasm object file that has no data section but does supply an void init1() {}
__attribute__((section(".init_array"))) void (*p_init1)(void) = &init1;
int main() { return 0; }
So, this leads me to the related second change:
With the two changes above, the modified test program above compiles with Emscripten and runs on Wasm with the expected behaviour:
Finally, I have tried to include a unit test. Hopefully it looks reasonable -- it's based on copying similar test files in that directory. Full diff: https://github.com/llvm/llvm-project/pull/111008.diff 2 Files Affected:
diff --git a/llvm/lib/MC/WasmObjectWriter.cpp b/llvm/lib/MC/WasmObjectWriter.cpp
index f25dc92fa235a2..034e058690744a 100644
--- a/llvm/lib/MC/WasmObjectWriter.cpp
+++ b/llvm/lib/MC/WasmObjectWriter.cpp
@@ -1769,6 +1769,11 @@ uint64_t WasmObjectWriter::writeOneObject(MCAssembler &Asm,
WS.setIndex(InvalidIndex);
continue;
}
+ // Contents of .init_array sections are handled elsewhere.
+ if (WS.isDefined() &&
+ WS.getSection().getName().starts_with(".init_array")) {
+ continue;
+ }
LLVM_DEBUG(dbgs() << "adding to symtab: " << WS << "\n");
uint32_t Flags = 0;
@@ -1853,49 +1858,54 @@ uint64_t WasmObjectWriter::writeOneObject(MCAssembler &Asm,
if (EmptyFrag.getKind() != MCFragment::FT_Data)
report_fatal_error(".init_array section should be aligned");
- const MCFragment &AlignFrag = *EmptyFrag.getNext();
- if (AlignFrag.getKind() != MCFragment::FT_Align)
- report_fatal_error(".init_array section should be aligned");
- if (cast<MCAlignFragment>(AlignFrag).getAlignment() !=
- Align(is64Bit() ? 8 : 4))
- report_fatal_error(".init_array section should be aligned for pointers");
-
- const MCFragment &Frag = *AlignFrag.getNext();
- if (Frag.hasInstructions() || Frag.getKind() != MCFragment::FT_Data)
- report_fatal_error("only data supported in .init_array section");
-
- uint16_t Priority = UINT16_MAX;
- unsigned PrefixLength = strlen(".init_array");
- if (WS.getName().size() > PrefixLength) {
- if (WS.getName()[PrefixLength] != '.')
+ const MCFragment *nextFrag = EmptyFrag.getNext();
+ while (nextFrag != nullptr) {
+ const MCFragment &AlignFrag = *nextFrag;
+ if (AlignFrag.getKind() != MCFragment::FT_Align)
+ report_fatal_error(".init_array section should be aligned");
+ if (cast<MCAlignFragment>(AlignFrag).getAlignment() !=
+ Align(is64Bit() ? 8 : 4))
report_fatal_error(
- ".init_array section priority should start with '.'");
- if (WS.getName().substr(PrefixLength + 1).getAsInteger(10, Priority))
- report_fatal_error("invalid .init_array section priority");
- }
- const auto &DataFrag = cast<MCDataFragment>(Frag);
- const SmallVectorImpl<char> &Contents = DataFrag.getContents();
- for (const uint8_t *
- P = (const uint8_t *)Contents.data(),
- *End = (const uint8_t *)Contents.data() + Contents.size();
- P != End; ++P) {
- if (*P != 0)
- report_fatal_error("non-symbolic data in .init_array section");
- }
- for (const MCFixup &Fixup : DataFrag.getFixups()) {
- assert(Fixup.getKind() ==
- MCFixup::getKindForSize(is64Bit() ? 8 : 4, false));
- const MCExpr *Expr = Fixup.getValue();
- auto *SymRef = dyn_cast<MCSymbolRefExpr>(Expr);
- if (!SymRef)
- report_fatal_error("fixups in .init_array should be symbol references");
- const auto &TargetSym = cast<const MCSymbolWasm>(SymRef->getSymbol());
- if (TargetSym.getIndex() == InvalidIndex)
- report_fatal_error("symbols in .init_array should exist in symtab");
- if (!TargetSym.isFunction())
- report_fatal_error("symbols in .init_array should be for functions");
- InitFuncs.push_back(
- std::make_pair(Priority, TargetSym.getIndex()));
+ ".init_array section should be aligned for pointers");
+
+ const MCFragment &Frag = *AlignFrag.getNext();
+ nextFrag = Frag.getNext();
+ if (Frag.hasInstructions() || Frag.getKind() != MCFragment::FT_Data)
+ report_fatal_error("only data supported in .init_array section");
+
+ uint16_t Priority = UINT16_MAX;
+ unsigned PrefixLength = strlen(".init_array");
+ if (WS.getName().size() > PrefixLength) {
+ if (WS.getName()[PrefixLength] != '.')
+ report_fatal_error(
+ ".init_array section priority should start with '.'");
+ if (WS.getName().substr(PrefixLength + 1).getAsInteger(10, Priority))
+ report_fatal_error("invalid .init_array section priority");
+ }
+ const auto &DataFrag = cast<MCDataFragment>(Frag);
+ const SmallVectorImpl<char> &Contents = DataFrag.getContents();
+ for (const uint8_t *
+ P = (const uint8_t *)Contents.data(),
+ *End = (const uint8_t *)Contents.data() + Contents.size();
+ P != End; ++P) {
+ if (*P != 0)
+ report_fatal_error("non-symbolic data in .init_array section");
+ }
+ for (const MCFixup &Fixup : DataFrag.getFixups()) {
+ assert(Fixup.getKind() ==
+ MCFixup::getKindForSize(is64Bit() ? 8 : 4, false));
+ const MCExpr *Expr = Fixup.getValue();
+ auto *SymRef = dyn_cast<MCSymbolRefExpr>(Expr);
+ if (!SymRef)
+ report_fatal_error(
+ "fixups in .init_array should be symbol references");
+ const auto &TargetSym = cast<const MCSymbolWasm>(SymRef->getSymbol());
+ if (TargetSym.getIndex() == InvalidIndex)
+ report_fatal_error("symbols in .init_array should exist in symtab");
+ if (!TargetSym.isFunction())
+ report_fatal_error("symbols in .init_array should be for functions");
+ InitFuncs.push_back(std::make_pair(Priority, TargetSym.getIndex()));
+ }
}
}
diff --git a/llvm/test/MC/WebAssembly/init-array.ll b/llvm/test/MC/WebAssembly/init-array.ll
new file mode 100644
index 00000000000000..d19ddcf10d5d9c
--- /dev/null
+++ b/llvm/test/MC/WebAssembly/init-array.ll
@@ -0,0 +1,56 @@
+; RUN: llc -mcpu=mvp -filetype=obj %s -o - | obj2yaml | FileCheck %s
+
+target triple = "wasm32-unknown-unknown"
+
+@p_init1 = hidden global ptr @init1, section ".init_array", align 4
+@p_init2 = hidden global ptr @init2, section ".init_array", align 4
+
+define hidden void @init1() #0 { ret void }
+define hidden void @init2() #0 { ret void }
+
+
+; CHECK: - Type: IMPORT
+; CHECK-NEXT: Imports:
+; CHECK-NEXT: - Module: env
+; CHECK-NEXT: Field: __linear_memory
+; CHECK-NEXT: Kind: MEMORY
+; CHECK-NEXT: Memory:
+; CHECK-NEXT: Minimum: 0x0
+; CHECK-NEXT: - Module: env
+; CHECK-NEXT: Field: __indirect_function_table
+; CHECK-NEXT: Kind: TABLE
+; CHECK-NEXT: Table:
+; CHECK-NEXT: Index: 0
+; CHECK-NEXT: ElemType: FUNCREF
+; CHECK-NEXT: Limits:
+; CHECK-NEXT: Minimum: 0x0
+; CHECK-NEXT: - Type: FUNCTION
+; CHECK-NEXT: FunctionTypes: [ 0, 0 ]
+; CHECK-NEXT: - Type: CODE
+; CHECK-NEXT: Functions:
+; CHECK-NEXT: - Index: 0
+; CHECK-NEXT: Locals: []
+; CHECK-NEXT: Body: 0B
+; CHECK-NEXT: - Index: 1
+; CHECK-NEXT: Locals: []
+; CHECK-NEXT: Body: 0B
+; CHECK-NEXT: - Type: CUSTOM
+; CHECK-NEXT: Name: linking
+; CHECK-NEXT: Version: 2
+; CHECK-NEXT: SymbolTable:
+; CHECK-NEXT: - Index: 0
+; CHECK-NEXT: Kind: FUNCTION
+; CHECK-NEXT: Name: init1
+; CHECK-NEXT: Flags: [ VISIBILITY_HIDDEN ]
+; CHECK-NEXT: Function: 0
+; CHECK-NEXT: - Index: 1
+; CHECK-NEXT: Kind: FUNCTION
+; CHECK-NEXT: Name: init2
+; CHECK-NEXT: Flags: [ VISIBILITY_HIDDEN ]
+; CHECK-NEXT: Function: 1
+; CHECK-NEXT: InitFunctions:
+; CHECK-NEXT: - Priority: 65535
+; CHECK-NEXT: Symbol: 0
+; CHECK-NEXT: - Priority: 65535
+; CHECK-NEXT: Symbol: 1
+; CHECK-NEXT: ...
|
@sbc100 who should we ask to review this? |
From git blame looks like @sbc100 would be the most appropriate reviewer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable. I suppose that C++ and rust always generate just a single fragment which is why we haven't seen issues with this before?
llvm/lib/MC/WasmObjectWriter.cpp
Outdated
@@ -1769,6 +1769,11 @@ uint64_t WasmObjectWriter::writeOneObject(MCAssembler &Asm, | |||
WS.setIndex(InvalidIndex); | |||
continue; | |||
} | |||
// Contents of .init_array sections are handled elsewhere. | |||
if (WS.isDefined() && | |||
WS.getSection().getName().starts_with(".init_array")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is it possible to define a symbol within this section? Can you give and example? It seems like it should be just a sequence of pointers, but I guess somehow you can have symbol that points into that sequence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the following C code does so:
#import <stdio.h>
void init1() { puts("init1"); }
__attribute__((section(".init_array"))) void (*p_init[3])(void) = { &init1, &init1, &init1 };
int main() { return 0; }
Under (e.g.) Linux, a list of pointers is written to the .init_array
section and a symbol p_init
points to the sequence.
$ clang -c symbol.c -o symbol.o
$ obj2yaml symbol.o
[...]
Symbols:
[...]
- Name: p_init
Type: STT_OBJECT
Section: .init_array
Binding: STB_GLOBAL
Size: 0x18
However, with a WebAssembly object this does not make as much sense. WasmObjectWriter.cpp
converts the sequence of pointers in .init_array
section into entries in the WASM_INIT_FUNC
subsection of the custom linking
section instead. There they are no longer written as a list of pointers, but indexes into a table.
So, this particular change makes it so that the p_init
symbol is not written into the WebAssembly object file. In the current version a symbol is actually written, but it is broken in any case since it points to the incorrect section (leading to the invalid data segment index: 0
bug shown in the OP).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we error out int this maybe? It seems like if somebody had a usage of p_init
somewhere it would simply not work. How about something like "symbols within .init_array
sections are not supported"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we error out int this maybe? It seems like if somebody had a usage of
p_init
somewhere it would simply not work. How about something like "symbols within .init_array sections are not supported"?
A reasonable argument, but I think it would get in the way.
It seems to me the p_init
symbol is just a side-effect of this pattern in C, where the real intention is just to add the list of function pointers to the .init_array
section. We don't actually need to reference p_init
anywhere, but if we throw an error here we unfortunately break the pattern.
Ultimately what I am interested in is some Rust code that uses a similar pattern. Essentially:
unsafe extern "C" fn __ctor() {
[...]
}
#[cfg_attr(link_section = ".init_array")]
static __CTOR: unsafe extern "C" fn() = __ctor;
So I would be grateful if we could continue to allow for this use case and not error out, even if the side-effect symbol cannot be referenced.
Another option is to keep the current situation, but it means somehow otherwise handling the invalid data segment index
error when using obj2yaml
in the unit test for this PR. I think by avoiding emitting the symbols we are doing no worse than before, and perhaps a little better in that regard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen though if there is a reference to the p_init
symbol though, either within the object itself, or in some other object. Would this result in a relocation with against a non-existent symbol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about landing this PR without this part of the patch and then following up with another PR for e.g. "Handle symbols in init_array sections".
IIUC the test in this PR doesn't contain such symbols anyway so we would likely want a separate test anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsafe extern "C" fn __ctor() { [...] } #[cfg_attr(link_section = ".init_array")] static __CTOR: unsafe extern "C" fn() = __ctor;
In this example which is the symbol that lives in the .init_array
section? I guess it would be __CTOR
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen though if there is a reference to the p_init symbol though, either within the object itself, or in some other object. Would this result in a relocation with against a non-existent symbol?
Looks like in both cases you get an error at link time:
wasm-ld: error: symbol.o: invalid data relocation: init1
I see, that's probably not ideal.
In this example which is the symbol that lives in the
.init_array
section? I guess it would be __CTOR?
Yes. Though Rust mangles the name.
What do you think about landing this PR without this part of the patch and then following up with another PR for e.g. "Handle symbols in init_array sections".
I think I'm OK with that, a fresh set of eyes on it in a new PR might be good in any case. I'll push a commit to remove this part of the PR.
BTW viewing this diff with "ignore whitespace" helped a lot to see what was going on. Regarding the test case, can it be written in assembly instead? |
I guess so, let me throw together a quick C++ example and see how it currently handles the constructors...
First, compile and test this C++ code:
#include <iostream>
using namespace std;
class Init1 {
public: Init1() { cout << "Init1" << endl; }
};
class Init2 {
public: Init2() { cout << "Init2" << endl; }
};
Init1 init1a;
Init1 init1b;
Init2 init2;
int main() {
cout << "main" << endl;
return 0;
}
Let's make sure the
Yes, but just with a single function. I suppose the constructor functions have been coalesced into a single function. Let's take a look in
Indeed, the global constructors are combined into a single function. Note that LLVM generates code using Yes, at least for simple C++.
Yes, apologies. The diff is smaller than it looks due to the change in indentation!
Unfortunately, I'm not sure how to express this in WebAssembly text format directly. There is an open issue suggesting it is not currently possible to define custom section content (WebAssembly/design#1153) in the text format. I'm also not entirely sure how I would express the test, since the bug occurs in the code that converts the I would be willing to change the test if it is possible, but would need specific help in how to proceed. |
Ping |
I think thinking you could convert the I think you can get a starting point using |
Ahhh, gotcha! Yes, done and pushed. |
783024e
to
d5495ac
Compare
For WebAssembly objects, only functions listed in the first symbol in the
.init_array
section are invoked.Consider the C program:
Compiling for Linux results in each function running before
main()
. But for WebAssembly (here via Emscripten) only the first function is invoked:In previous LLVM versions (<=18, I think?) this also used to throw an error message:
fatal error: error in backend: only one .init_array section fragment supported
.This makes things difficult for certain tools that rely on writing multiple objects in the
.init_array
section for life-before-main initialisation functions.This PR makes two changes to
WasmObjectWriter
:.init_array
section into start functions, (adding them to anInitFuncs
vector for writing out later), now loops through all the subsequent fragments of the section until there are none remaining.This picks up any functions listed in later objects in the
.init_array
section, and they are now successfully invoked before main.However, a related issue is revealed by changing the functions in the example above to read:
Then, with our patched LLVM,
The symbols in the
.init_array
section are written to theWASM_SYMBOL_TABLE
in the customlinking
section, with kindWASM_SYMBOL_TYPE_DATA
. But, these symbols do not reference content in the DATA section, and so the written offset makes no sense.Is this a mistake? AFAICT I don't think there's an appropriate kind for these symbols to be written to the
WASM_SYMBOL_TABLE
subsection with. In fact, symbols for the functions referenced by the.init_array
sections are independently written to a differentWASM_INIT_FUNC
subsection in any case.Interestingly, you can also reveal this problem in an unpatched LLVM by writing a Wasm object file that has no data section but does supply an
.init_array
function:So, this leads me to the related second change:
.init_array
section to theWASM_SYMBOL_TABLE
subsection of thelinking
custom section, because these are handled elsewhere and written to theWASM_INIT_FUNCS
subsection instead.With the two changes above, the modified test program above compiles with Emscripten and runs on Wasm with the expected behaviour:
Finally, I have tried to include a unit test. Hopefully it looks reasonable -- it's based on copying similar test files in that directory.