interpretation of gpu_processes
#2413
Replies: 10 comments
-
Proposal:
Example 1:
-
The proposal does look complete, but I think there is a possibility that we might end up under-utilizing some resources. I might be wrong, so let me pose a question: let's say I have 4 nodes on a machine where each node has 4 cpus and 4 gpus. I have an application that starts one cpu process per unique node (to use the gpus on that node), which spawns N gpu processes, each of which uses 1 gpu. I want to run my application with 4 gpu processes. Case 1: if there are 4 gpus available on a node, then we use 1 cpu and 4 gpus on that node. If we specify
Do you agree? Let me know if I am missing something. I don't have a concrete solution at the moment unfortunately, but I wanted to point out that corner case. I will keep thinking about a solution if you agree with my assessment. Also, I think that's how the Princeton application works, based on my discussions with Wenjie@Princeton. We only used their application on Titan, which had 1 gpu per node, so we did not encounter this problem. I can confirm with them on Wednesday.
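To make the corner case concrete, here is a minimal sketch of the request in question as a dict-style description (the executable name and key semantics are assumptions for illustration, not confirmed RP behavior):

```python
# Hypothetical description for the use case above.
cud = {
    'executable'   : './my_app',   # placeholder for the application launcher
    'cpu_processes': 1,            # one CPU "master" process per node
    'gpu_processes': 4,            # spawns 4 GPU workers, 1 GPU each
}

# Corner case: if a node has only 2 free GPUs, should this request wait for a
# fully free node, be split across nodes, or fail?
```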
-
Hey Vivek, I think there are several misunderstandings. First, this ticket is about the unit description. Unless we re-introduce the auto-add, no decision we make here can have any influence on efficiency and resource utilization. The CUD is about the user specifying how many processes, threads and GPUs it needs, and there is no space for us to interpret this one way or the other, resulting in a different number than the user specified. If the user specifies it needs 4 cpu processes and 2 gpu processes, then that is what it gets.

Second, scheduling and placement of those resource requests can lead to more or less efficient application execution: distributing things over nodes will be less efficient than placing them on the same node. But that is what the scheduler has to address. In either case, the scheduler will not be able to change the number of cores or gpus needed.

Third, your formulations:

To your use case (and let me know if that CUD is correct or not):

```python
cud = {'cpu_processes'   : 2,
       'gpu_processes'   : 2,
       'gpu_process_type': MPI}
```

This CU will always result in 4 processes, two of which will have 1 GPU assigned. Whether those processes end up on the same node or on different nodes depends on the scheduler and its state. Yes, it may result in everything sitting on the same node, rendering 2 GPUs of that node unusable at the moment, but that can only be addressed via a different scheduler, not via a different interpretation of the CUD.
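For reference, the same request expressed through the Python API might look as follows (a sketch assuming the RP v1 `ComputeUnitDescription` attribute names; the executable is a placeholder and exact spellings may differ between versions):

```python
import radical.pilot as rp

cud = rp.ComputeUnitDescription()
cud.executable       = '/bin/foo'   # placeholder executable
cud.cpu_processes    = 2            # two plain CPU processes
cud.gpu_processes    = 2            # two processes, 1 GPU assigned to each
cud.gpu_process_type = rp.MPI       # the GPU processes form an MPI job

# -> 4 processes in total; their node placement is decided by the agent
#    scheduler, not by the description.
```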
-
This topic was discussed on the RCT call. There seemed to be consensus that
-
New proposal: introduce a new, alternative CU description format which is centered around the resource requirements of individual processes:

```python
{
    'executable'  : '/bin/foo',
    'process_type': rp.MPI,
    'processes'   : [{
                        'count': 1,
                        'cores': 4,
                        'gpus' : 2,
                        'mem'  : '2GB'
                     },
                     {
                        'count': 5,
                        'cores': 4,
                        'mem'  : '8GB'
                     }]
}
```

This way we know how many processes to spawn (6 in this case), but leave resource usage and mapping to the application. The first process may use the GPU devices from the process itself, or from another process spawned on the second core, or from a thread spawned on the 4th core; similarly, the 2GB of memory may be allocated by the process created by RP, or by a thread spawned by the application.
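For illustration, a scheduler (or the application itself) could derive the total resource footprint from such a description as sketched below; this helper is hypothetical and not part of RP:

```python
# Proposed per-process description format, as in the example above.
cud = {
    'executable'  : '/bin/foo',
    'process_type': 'MPI',
    'processes'   : [{'count': 1, 'cores': 4, 'gpus': 2, 'mem': '2GB'},
                     {'count': 5, 'cores': 4, 'mem': '8GB'}],
}

# Each entry requests `count` processes with the given per-process resources.
n_procs = sum(p['count'] for p in cud['processes'])                     # 6
n_cores = sum(p['count'] * p['cores'] for p in cud['processes'])        # 24
n_gpus  = sum(p['count'] * p.get('gpus', 0) for p in cud['processes'])  # 2

print(n_procs, n_cores, n_gpus)   # -> 6 24 2
```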
-
All resource keys (cores, gpus, mem, lfs) will be evaluated as the requirement per process (or per process count, to be precise). Correct?
-
So, based on our last meeting and on what @andre-merzky described ("The CUD is about the user specifying how many processes, threads, and GPUs it needs"), I would like to mention a few things:

To answer the question of whether we can run 2 applications (executables) on the same GPU device: the answer is YES.

How?
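One way this can work, as a hedged sketch (the executables are placeholders, and this may differ from the mechanism intended above): both processes are simply launched with visibility of the same physical device, e.g. via `CUDA_VISIBLE_DEVICES`:

```python
import os
import subprocess

# Both child processes only see GPU 0, so they share the same device.
env = dict(os.environ, CUDA_VISIBLE_DEVICES='0')

p_a = subprocess.Popen(['./app_a'], env=env)   # placeholder executables
p_b = subprocess.Popen(['./app_b'], env=env)
p_a.wait()
p_b.wait()

# Whether sharing is efficient depends on the device and on facilities such
# as NVIDIA MPS; the point is only that co-location on one GPU is possible.
```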
-
closed for v2
-
This is still an ongoing discussion.
-
Connected to #1891
-
The handling of `gpu_processes` in RP.v1 is confusing: depending on (a) agent and resource configuration and (b) the specified `cpu_processes`, `gpu_processes` will or will not trigger the use of additional CPU cores. This needs to be clarified and simplified. Specifically, such requests need to be handled uniformly in the agent schedulers.

Below is a proposal for a better definition of these task description attributes. There has been a Slack discussion with additional proposals and concerns; those will be added to this ticket as well.
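As a minimal sketch of the ambiguity (assuming the v1 attribute names; this is an illustration, not a claim about any specific RP version):

```python
# Hypothetical v1-style request: GPU processes, but no CPU processes.
cud = {
    'executable'   : '/bin/foo',
    'cpu_processes': 0,
    'gpu_processes': 2,
}

# Depending on agent and resource configuration, the two GPU processes may or
# may not each occupy an additional CPU core; this is the inconsistency the
# proposal aims to remove.
```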