Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Matrix-free cuda #204

Draft
wants to merge 34 commits into
base: master
Choose a base branch
from
Draft

[WIP] Matrix-free cuda #204

wants to merge 34 commits into from

Conversation

Rombur
Copy link
Collaborator

@Rombur Rombur commented Jun 14, 2019

This PR is very much a WIP, there is no need to look at it except for @masterleinad :) The things that need to be done in order of priority:

  • Make this test work in parallel: the good news is that it works in serial the bad news is that we actually never ran that test in parallel and so even the matrix-based behaves in a strange way (i.e. only the first step decreases the error). In matrix-free, there is a segfault on rank 1 here in locally_relevant_global_diag. locally_relevant_global_diag is computed on the host because we can't get the diagonal from matrix-free with cuda.
  • I have only tested in serial with 1 eigenvector, so we need to check what happens with 2 eigenvectors. There is no need to use more eigenvectors because we know that lanczos is very brittle.
  • Right now we don't use material_property when using cuda matrix-free, so we need to add that. This can be done in another PR
  • I have had to make a few changes in the interface of several functions. Which means that only one test compiles, the others are commented. We need to uncomment them and make them pass again.
  • We need to rebase on master.

This is all the work that still needs to be done on this PR. Once it's done we can do some cleaning and split this PR in smaller ones. If you can tackle the first two points that would be fantastic but feel free to do more if you are bored 😄

@masterleinad
Copy link
Collaborator

AFAICT the only problem left is with the coarse solver, i.e. replacing it by the identity gives the same results for 1 and 2 MPI processes.

@masterleinad
Copy link
Collaborator

I see the same test failure locally as ci reports:

/var/jenkins/workspace/mfmg-experimental/tests/test_hierarchy_device.cu(308): [NON-XML-CHAR-0x1B][1;31;49merror: in "amgx/_0": difference{3.16667} between conv_rate{1} and ref_solution{0.23999999999999999} exceeds 5%
Failure occurred in a following context:
    mesh_evaluator_type = matrix_based; [NON-XML-CHAR-0x1B][0;39;49m

@codecov-io
Copy link

codecov-io commented Jul 1, 2019

Codecov Report

Merging #204 into master will decrease coverage by 0.54%.
The diff coverage is 64.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #204      +/-   ##
==========================================
- Coverage   89.66%   89.11%   -0.55%     
==========================================
  Files          57       58       +1     
  Lines        3250     3270      +20     
==========================================
  Hits         2914     2914              
- Misses        336      356      +20
Impacted Files Coverage Δ
include/mfmg/dealii/dealii_mesh_evaluator.hpp 33.33% <ø> (ø) ⬆️
tests/test_hierarchy_helpers.hpp 100% <ø> (+5.63%) ⬆️
include/mfmg/common/mesh_evaluator.hpp 100% <ø> (ø) ⬆️
include/mfmg/common/lanczos.templates.hpp 99.41% <ø> (ø) ⬆️
include/mfmg/common/hierarchy.hpp 66.88% <0%> (-10.22%) ⬇️
tests/material_property.hpp 86.44% <86.44%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 12e35a2...d09730e. Read the comment docs.

@masterleinad
Copy link
Collaborator

It seems like we need to update deal.II again such that dealii::CUDAWrappers::get_quadrature_point<dim, double> returns Point<dim>. I will also look into more device support for dealii::Point<dim> within deal.II.

@Rombur
Copy link
Collaborator Author

Rombur commented Jul 5, 2019

I guess someone jumped the gun in deal.II moving from Tensor to Point...

@masterleinad
Copy link
Collaborator

I guess someone jumped the gun in deal.II moving from Tensor to Point...

Just for the record: the main problem is that you would expect get_quadrature_point() to return a Point-like object such that you can use it instead of a Point. Unfortunately, converting a Tensor to a Point is currently not possible in device code.
On the other hand, we really just need to use a more recent deal.II version for this pull request to work.

@masterleinad
Copy link
Collaborator

I pushed 19.07.0 built using c7f7cca4763d84e0fb02f5ee63bccf54b802ddff.

@Rombur Rombur closed this Jul 5, 2019
@Rombur Rombur reopened this Jul 5, 2019
@Rombur
Copy link
Collaborator Author

Rombur commented Jul 5, 2019

Still does not pass

@masterleinad
Copy link
Collaborator

All the device test are passing.

@Rombur
Copy link
Collaborator Author

Rombur commented Jul 5, 2019

yeah but for that the Device tests to pass, we had to update deal.II and now another test breaks

@masterleinad
Copy link
Collaborator

But that is due to updating deal.II and not this pull request.

@masterleinad
Copy link
Collaborator

I just checked that the test fails for the previous commit as well.

@masterleinad
Copy link
Collaborator

I just checked that the test fails for the previous commit as well.

Seems like e4fe1b806f6d086218cae85c58342815c7cef55a is guilty.

@masterleinad
Copy link
Collaborator

The corresponding pull request is https://github.com/dealii/dealii/pull/8128/files. I am really confused about how that can make any difference for the lanczos test (and only that test).

@masterleinad
Copy link
Collaborator

Reverting the changes in that pull request (on top of master) really makes the test pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants