dump with amdgpu plugin causes kernel crash #2450

Open
rst0git opened this issue Jul 21, 2024 · 5 comments
Labels: bug-crash, gpu/amd, kernel, no-auto-close

Comments

@rst0git
Member

rst0git commented Jul 21, 2024

Running criu dump for a simple ROCm application (HelloWorld.cpp) on Ubuntu 22.04 (6.5.0-44-generic kernel) causes a kernel crash. The problem occurs with CRIU installed from both the master and criu-dev branches.

HelloWorld.cpp:

#include <hip/hip_runtime.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <fstream>
#include <unistd.h>

#define SAMPLE_VERSION "HIP-Examples-Application-v1.0"
#define SUCCESS 0
#define FAILURE 1

using namespace std;

__global__ void helloworld(char* in, char* out)
{
	int num = hipThreadIdx_x + hipBlockDim_x * hipBlockIdx_x;
	out[num] = in[num] + 1;
}

int main(int argc, char* argv[])
{

	hipDeviceProp_t devProp;
	hipGetDeviceProperties(&devProp, 0);
	cout << " System minor " << devProp.minor << endl;
	cout << " System major " << devProp.major << endl;
	cout << " agent prop name " << devProp.name << endl;

	while (1) {
		/* Initialize the host input/output and allocate the device buffers for the kernel */
		const char* input = "GdkknVnqkc";
		size_t strlength = strlen(input);
		cout << "input string:" << endl;
		cout << input << endl;
		char *output = (char*) malloc(strlength + 1);

		char* inputBuffer;
		char* outputBuffer;

		hipMalloc((void**)&inputBuffer, (strlength + 1) * sizeof(char));
		hipMalloc((void**)&outputBuffer, (strlength + 1) * sizeof(char));

		hipMemcpy(inputBuffer, input, (strlength + 1) * sizeof(char), hipMemcpyHostToDevice);

		hipLaunchKernelGGL(
				helloworld,
				dim3(1),
				dim3(strlength),
				0, 0,
				inputBuffer, outputBuffer
				);

		hipMemcpy(output, outputBuffer,(strlength + 1) * sizeof(char), hipMemcpyDeviceToHost);

		hipFree(inputBuffer);
		hipFree(outputBuffer);

		// Add the terminating null character to the end of output.
		output[strlength] = '\0';

		cout << "\noutput string:" << endl;
		cout << output << endl;

		free(output);
		sleep(1);
	}

	std::cout<<"Passed!\n";
	return SUCCESS;
}
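For anyone checking that the reproducer behaves correctly before a dump (or after a restore, once that works): the kernel adds 1 to every input byte, so the input string should come back as HelloWorld. Below is a minimal host-only sketch of that check. It is not part of the original reproducer; the names reference_output and output_matches are hypothetical, and it only mirrors the arithmetic in the helloworld kernel above.

```cpp
#include <cstddef>
#include <string>

// Host-side reference for the helloworld kernel: out[i] = in[i] + 1 for every character.
std::string reference_output(const std::string& input) {
    std::string out(input.size(), '\0');
    for (std::size_t i = 0; i < input.size(); ++i) {
        out[i] = static_cast<char>(input[i] + 1);
    }
    return out;
}

// Hypothetical helper: compare the string copied back from the GPU with the host reference.
bool output_matches(const std::string& input, const std::string& device_output) {
    return device_output == reference_output(input);
}
```

In the reproducer this could be called as output_matches(input, output) right after the terminating null character is written, to flag a wrong result instead of only printing it.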

A similar problem also occurs with kernel version 6.8.0-38-generic.

A similar problem occurs on RHEL 9.4 (5.14.0-427.26.1.el9_4.x86_64) when criu dump exits with an error (e.g., when the --shell-job option is not specified). This causes an immediate system reboot.

@avagin
Member

avagin commented Aug 7, 2024

@fxkamd could you take a look at this?

dayatsin-amd removed their assignment Aug 7, 2024
@dayatsin-amd
Contributor

@fdavid-amd is investigating

@fdavid-amd
Contributor

Could you attach your HelloWorld.cpp?

@rst0git
Member Author

rst0git commented Aug 9, 2024

> Could you attach your HelloWorld.cpp?

There is a dropdown in #2450 (comment) with the code of HelloWorld.cpp. This code is based on the example from the HIP-Examples repository.


github-actions bot commented Sep 9, 2024

A friendly reminder that this issue had no activity for 30 days.

rst0git added the no-auto-close label and removed the stale-issue label Sep 9, 2024