interpretation of gpu_processes
#2413
Replies: 10 comments
-
Proposal:
Example 1:
-
The proposal does look complete, but I think there is a possibility that we might end up under-utilizing some resources. I might be wrong, so let me pose a question: let's say I have 4 nodes on a machine where each node has 4 cpus and 4 gpus. I have an application that starts one cpu process per unique node (to use the gpus on that node), which spawns N gpu processes, each of which uses 1 gpu. I want to run my application with 4 gpu processes. Case 1: if there are 4 gpus available on a node, then we use 1 cpu and 4 gpus on that node. If we specify
Do you agree? Let me know if I am missing something. I don't have a concrete solution at the moment unfortunately, but I wanted to point out that corner case. I will keep thinking about a solution if you agree with my assessment. Also, I think that's how the Princeton application works, based on my discussions with Wenjie@Princeton. We only used their application on Titan, which had 1 gpu per node, so we did not encounter this problem. I can confirm with them on Wednesday.
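To make the corner case concrete, here is a minimal sketch of the request in question as a dict-style description (the executable name and key semantics are assumptions for illustration, not confirmed RP behavior):

```python
# Hypothetical description for the use case above.
cud = {
    'executable'   : './my_app',   # placeholder for the application launcher
    'cpu_processes': 1,            # one CPU "master" process per node
    'gpu_processes': 4,            # spawns 4 GPU workers, 1 GPU each
}

# Corner case: if a node has only 2 free GPUs, should this request wait for a
# fully free node, be split across nodes, or fail?
```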
-
Hey Vivek, I think there are several misunderstandings. First, this ticket is about the unit description. Unless we re-introduce the auto-add, no decision we make here can have any influence on efficiency and resource utilization. The CUD is about the user specifying how many processes, threads and GPUs it needs, and there is no space for us to interpret this one way or the other, resulting in a different number than the user specified. If the user specifies it needs 4 cpu processes and 2 gpu processes, then that is what it gets.

Second, scheduling and placement of those resource requests can lead to more or less efficient application execution: distributing things over nodes will be less efficient than placing them on the same node. But that is what the scheduler has to address. In either case, the scheduler will not be able to change the number of cores or gpus needed.

Third, your formulations:

To your use case (and let me know if that CUD is correct or not):

```python
cud = {'cpu_processes'   : 2,
       'gpu_processes'   : 2,
       'gpu_process_type': MPI}
```

This CU will always result in 4 processes, two of which will have 1 GPU assigned. Whether those processes end up on the same node or on different nodes depends on the scheduler and its state. Yes, it may result in everything sitting on the same node, rendering 2 GPUs of that node unusable at the moment, but that can only be addressed via a different scheduler, not via a different interpretation of the CUD.
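For reference, the same request expressed through the Python API might look as follows (a sketch assuming the RP v1 `ComputeUnitDescription` attribute names; the executable is a placeholder and exact spellings may differ between versions):

```python
import radical.pilot as rp

cud = rp.ComputeUnitDescription()
cud.executable       = '/bin/foo'   # placeholder executable
cud.cpu_processes    = 2            # two plain CPU processes
cud.gpu_processes    = 2            # two processes, 1 GPU assigned to each
cud.gpu_process_type = rp.MPI       # the GPU processes form an MPI job

# -> 4 processes in total; their node placement is decided by the agent
#    scheduler, not by the description.
```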
-
This topic was discussed on the RCT call. There seemed to be consensus that
-
New proposal: introduce a new, alternative CU description format which is centered around the resource requirements of individual processes:

```python
{
    'executable'  : '/bin/foo',
    'process_type': rp.MPI,
    'processes'   : [{
                        'count': 1,
                        'cores': 4,
                        'gpus' : 2,
                        'mem'  : '2GB'
                     },
                     {
                        'count': 5,
                        'cores': 4,
                        'mem'  : '8GB'
                     }]
}
```

This way we know how many processes to spawn (6 in this case), but leave resource usage and mapping to the application. The first process may use the GPU devices from the process itself, or from another process spawned on the second core, or from a thread spawned on the 4th core; similarly, the 2GB of memory may be allocated by the process created by RP, or by a thread spawned by the application.
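For illustration, a scheduler (or the application itself) could derive the total resource footprint from such a description as sketched below; this helper is hypothetical and not part of RP:

```python
# Proposed per-process description format, as in the example above.
cud = {
    'executable'  : '/bin/foo',
    'process_type': 'MPI',
    'processes'   : [{'count': 1, 'cores': 4, 'gpus': 2, 'mem': '2GB'},
                     {'count': 5, 'cores': 4, 'mem': '8GB'}],
}

# Each entry requests `count` processes with the given per-process resources.
n_procs = sum(p['count'] for p in cud['processes'])                     # 6
n_cores = sum(p['count'] * p['cores'] for p in cud['processes'])        # 24
n_gpus  = sum(p['count'] * p.get('gpus', 0) for p in cud['processes'])  # 2

print(n_procs, n_cores, n_gpus)   # -> 6 24 2
```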
-
All resource keys (cores, gpus, mem, lfs) will be evaluated as the requirement per process (or per process count, to be precise). Correct?
-
So, based on our last meeting and on what @andre-merzky described ("The CUD is about the user specifying how many processes, threads, and GPUs it needs"), I would like to mention a few things:

To answer the question of whether we can run 2 applications (executables) on the same GPU device: the answer is YES.

How?
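One way this can work, as a hedged sketch (the executables are placeholders, and this may differ from the mechanism intended above): both processes are simply launched with visibility of the same physical device, e.g. via `CUDA_VISIBLE_DEVICES`:

```python
import os
import subprocess

# Both child processes only see GPU 0, so they share the same device.
env = dict(os.environ, CUDA_VISIBLE_DEVICES='0')

p_a = subprocess.Popen(['./app_a'], env=env)   # placeholder executables
p_b = subprocess.Popen(['./app_b'], env=env)
p_a.wait()
p_b.wait()

# Whether sharing is efficient depends on the device and on facilities such
# as NVIDIA MPS; the point is only that co-location on one GPU is possible.
```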
-
closed for v2
-
This is still an ongoing discussion.
-
Connected to #1891
-
The handling of `gpu_processes` in RP.v1 is confusing: depending on (a) agent and resource configuration and (b) the specified `cpu_processes`, `gpu_processes` will or will not trigger the use of additional CPU cores. This needs to be clarified and simplified. Specifically, such requests need to be handled uniformly in the agent schedulers.

Below is a proposal for a better definition of these task description attributes. There has been a Slack discussion with additional proposals and concerns; those will be added to this ticket as well.
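As a minimal sketch of the ambiguity (assuming the v1 attribute names; this is an illustration, not a claim about any specific RP version):

```python
# Hypothetical v1-style request: GPU processes, but no CPU processes.
cud = {
    'executable'   : '/bin/foo',
    'cpu_processes': 0,
    'gpu_processes': 2,
}

# Depending on agent and resource configuration, the two GPU processes may or
# may not each occupy an additional CPU core; this is the inconsistency the
# proposal aims to remove.
```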