HeapHogLoadTest_5m crash vmState=0x00000000 #19081
Comments
Looks like a dup of #16029 |
I ran a 30x Grinder job for HeapHogLoadTest_5m_0, and got no failures. |
One more successful Grinder job (30x). https://openj9-jenkins.osuosl.org/job/Grinder/3369 |
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.system_aarch64_mac_Nightly_testList_0/533
|
I ran a 20x Grinder job and got 2 failures. I see the following messages from the stack walker in those failures.
Failure 1: (output_1721268474258)
There is no compiled code for
Failure 2: (output_17212724215180)
No javacore file is available for Failure 2. |
I ran two Grinder jobs (30x each) for the same test with Java 17 on AArch64 Linux, and got no failures. |
I ran two 30x Grinder jobs (internal 42237 and 42238) on AArch64 macOS with JVM_OPTIONS |
Issue #19456 (HeapHogLoadTest with Java 22 on AArch64 macOS) might be a dup of this issue. |
Call stack is unavailable in the javacore file:
|
2 failures in a 40x Grinder job. |
The Java stack cannot be walked in the core file from the failure above; jdmpview fails with a CorruptDataException, as shown below.
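For context, the stack walk here is attempted with jdmpview, OpenJ9's DDR-based dump analyzer. A session of roughly the following shape is what produces the failure; the core file name and thread address below are hypothetical placeholders, not values from this issue:

```
$ jdmpview -core core.dmp        # open the core file (name is a placeholder)
> info thread all                # list the threads captured in the dump
> !stack 0x00000001234ABCD0      # walk one thread's Java stack (address is a placeholder)
com.ibm.j9ddr.CorruptDataException
```

When the walker hits a corrupt frame, DDR surfaces it as a CorruptDataException rather than printing the remaining frames.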
|
50x Grinder job with the option
On the other hand, jobs with the following options failed:
|
Disabling compilation of |
|
I thought I would be able to reproduce the failure by
On the other hand, a Grinder job with |
The failure disappears when running with |
A 50x Grinder job finished successfully by disabling the optimization for |
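For anyone reproducing these runs, OpenJ9's standard way to keep the JIT away from specific methods is the `-Xjit:exclude` filter. The method pattern below is purely illustrative and is not the filter used in this thread:

```
# Illustrative only: excludes all matching methods from JIT compilation
java -Xjit:exclude={java/lang/invoke/*} MyTest
```

Comparing a run with such an exclude filter against an unfiltered run is one way to confirm whether compiled code for a given area is involved in a failure.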
I ran a 50x Grinder job (internal 43420) hoping PR #20173 fixed the problem, but the job failed in the same way as before. |
openjdk17_j9_extended.system_aarch64_mac(
50x grinder - passed |
One more 40x Grinder on AIX: https://openj9-jenkins.osuosl.org/job/Grinder/4057/ |
openjdk17_j9_extended.system_aarch64_mac(
50x internal grinder - 13/50 failed |
The failure rate in the previous comment (13/50) is much higher than before. I ran these Grinder jobs using binaries from weekly tests:
Version outputs:
|
I ran two other Grinder jobs with a change in #19081 (comment) that disables the optimization for |
|
@nbhuiyan Any suggestions on what to try next in investigating this problem?
Summary of the current status:
|
It seems like some of the assumptions made in |
I shared some trace files, javacore files, and a jitdump file with @nbhuiyan via internal file sharing. |
I have been able to reproduce this locally. Having examined some of my locally generated core dumps, I first looked for something platform-specific that could explain why this issue only shows up on aarch64, and found signs that an FFI call could have been involved and gone wrong. Upon further investigation, I realized I was headed down the wrong path, and that the issue seemed to be within the
In my local setup, here is the Java call stack of a crash:
The topmost method, where the crash occurs, is interpreted. I also encounter a corrupt data issue while reading this:
Bytecode index 10 of the method above is an
The caller of that interpreted method
Prior to dispatching into
Coming back to `openj9/runtime/vm/BytecodeInterpreter.hpp`, lines 9247 to 9257 at c113b72:
If we took the
Notice how the
I still do not know why disabling the recognized call transformer, or disabling just the linkToVirtual transformer, fixes this issue. I also do not know why this problem is only seen on aarch64. |
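Since the discussion centers on the linkToVirtual recognized-call transformer, here is a minimal, self-contained sketch of the kind of MethodHandle call that is linked through `MethodHandle.linkToVirtual` under the covers. This is not the failing test; the class name and target method are chosen purely for illustration:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class LinkToVirtualDemo {
    public static void main(String[] args) throws Throwable {
        // A direct method handle to a virtual method. Invocations of such
        // handles are dispatched through MethodHandle.linkToVirtual, the
        // path that the recognized call transformer optimizes.
        MethodHandle lengthMH = MethodHandles.lookup()
                .findVirtual(String.class, "length", MethodType.methodType(int.class));
        int len = (int) lengthMH.invokeExact("HeapHog");
        System.out.println(len); // prints 7
    }
}
```

Running a handle-heavy loop like this with and without the transformer disabled would be one way to try narrowing down whether the transformed call path itself misbehaves.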
Last week I worked with @babsingh to try to determine the exact location of the segfault origin in the bytecode interpreter, using some of the core files generated from my local setup. Unfortunately, we could not obtain that information from those cores. Therefore, we decided to re-run with
`openj9/runtime/vm/BytecodeInterpreter.hpp`, line 9253 at c113b72
I chose to return with
This confirms the possible cause of the segfault I described towards the end of my previous comment: that the
Other things I have tried since my last update:
Therefore, I think this issue requires investigation by the VM team to determine how we end up with the incorrect |
@babsingh, given the results of the investigation with @nbhuiyan, will you continue investigating this, or will somebody else on the VM team? |
|
Failure was reproducible with |
I'm fairly certain that this rules out the JNI paths, though it may be worth confirming that the command line option guarantees no call to |
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_extended.system_aarch64_mac_Nightly_testList_2/455/
HeapHogLoadTest_5m_0
https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_extended.system_aarch64_mac_Nightly_testList_2/455/system_test_output.tar.gz