
Platform to run this application #13

Open

bharathkumar192 opened this issue Nov 19, 2023 · 6 comments

@bharathkumar192

Where do I run this application? I am currently using Windows with a single GPU, so in one terminal I am running the server and in two separate terminals I am running the examples. If the wait argument is given, the program doesn't move forward; if I remove the wait argument, it moves, but the UI doesn't show the allocation or anything else.

Some sort of clarification would be very much appreciated.

@Spico197
Owner

Hi there, thank you very much for using this software~ Currently I have only tested the functionality on Linux (Ubuntu distribution), but it should work on Windows.

I'm not sure about the wait argument you are using. Is it the WatchClient.wait() method?
You mentioned there's only one GPU in the environment. Was there any GPU consumption, or were any processes running on that GPU? Currently, a GPU can be allocated to a client only if it is completely free:

if len(gpu.processes) <= 0 \
and gpu.utilization <= 10 \
and (float(gpu.memory_used) / float(gpu.memory_total) <= 1e-3 or gpu.memory_used < 50):
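
For reference, here is a minimal standalone sketch of an equivalent check built on pynvml (illustrative only, not the project's actual code path; the thresholds are copied from the condition above):

import pynvml  # pip install nvidia-ml-py

def gpu_totally_free(index: int) -> bool:
    # Free means: no compute processes, utilization <= 10%, and (almost)
    # no memory in use (< 0.1% of total or < 50 MiB), mirroring the
    # condition quoted above.
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # bytes
        used_mib = mem.used / 1024 ** 2
        return (
            len(procs) == 0
            and util <= 10
            and (mem.used / mem.total <= 1e-3 or used_mib < 50)
        )
    finally:
        pynvml.nvmlShutdown()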

Besides, was the client registered on the server successfully? You may find the relevant information in the printed log.

@bharathkumar192
Author

bharathkumar192 commented Nov 19, 2023

If you have a while, we can connect regarding this. I am really looking forward to contributing to this repo and adding it to my thesis, so do let me know if you can spare some time for this. My email: [email protected]

Hey, so how I am running your application is:
Terminal 1: running watchmen.server
Terminal 2: python single_card_mnist.py --id="single" --cuda=0 --wait --wait_mode="queue"
Terminal 3: python single_card_mnist.py --id="single_schedule" --cuda=0 --wait --wait_mode="schedule"

So when I run all three of the above terminals:

[screenshot]

It stays in that state for a while, even after the terminal says the training is completed and the final accuracy is printed; only after quite a while does the other terminal's process start. (I presume a duplicate process is being created.) The same happens for the other terminal as well; a duplicate process being triggered is what I assume. Correct me if I'm wrong, but the frontend is not being updated even though the terminal says the training is completed.

I would also like to know more about --wait_mode="queue" and --wait_mode="schedule": what difference does it make?

And hey, I have no idea about the wait argument that you mentioned in the README file.

[screenshot]

@Spico197
Owner

Hi there, thanks for your valuable feedback!

  • Ideally, the server checks GPU status every second and tries to assign available GPUs to clients every 5 seconds. The client pings the server every 10 seconds to see if there are available GPUs.
  • Clients in queue mode wait until the specific GPU is free to use (in this example, cuda:0). Clients in schedule mode may be assigned another GPU (one GPU from 0, 2, and 3); see the sketch below.
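
To make the difference concrete, here is a hedged client-side sketch; the keyword arguments mirror my reading of the README's WatchClient example and the single_card_mnist.py flags, so treat the exact names as assumptions and check the example script for the real signature:

from watchmen import WatchClient

# Queue mode (assumed kwargs): block until cuda:0 itself is free.
client = WatchClient(id="single", gpus=[0],
                     server_host="127.0.0.1", server_port=62333)
client.wait()

# Schedule mode (hypothetical mode/req_gpu_num kwargs): request any one
# GPU out of 0, 2, and 3, and let the server pick which one to grant.
# client = WatchClient(id="single_schedule", gpus=[0, 2, 3],
#                      mode="schedule", req_gpu_num=1,
#                      server_host="127.0.0.1", server_port=62333)
# client.wait()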

Test case 1: queue vs. schedule

[screenshots: test case 1 results]

Test case 2: queue and schedule on the same GPU

[screenshots: test case 2 results]

it stays in that state for a while even after the terminal says the training is completed

I've understood what's going on from your point of view: you said the scheduling server would still be waiting even though the job had finished. However, a client is not designed or required to send an "I'm finished" signal to the server, so the server waits until queue_timeout is triggered.

Here's why we need the queue_timeout mechanism: a job may take time to start (downloading datasets, pretrained models, etc.), so the GPU is not occupied immediately when the job starts. We therefore set a queue_timeout to make sure the server waits long enough for the client to load models or datasets onto the GPU, which is what indicates it's running. When a job finishes, the server still waits for another queue_timeout (10 minutes by default), and this causes the time gap.
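
In other words, the timing rule amounts to something like this simplified, hypothetical sketch (not the server's actual code):

import time

QUEUE_TIMEOUT = 600  # 10 minutes by default, per the explanation above

def may_reassign(gpu_freed_at: float) -> bool:
    # A freed GPU stays reserved for QUEUE_TIMEOUT seconds so a freshly
    # started job that is still downloading data or loading models does
    # not lose the GPU before it ever touches it. The flip side is the
    # gap you observed: a finished job's GPU also stays reserved as long.
    return time.time() - gpu_freed_at > QUEUE_TIMEOUT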

A possible solution may be adding another client status, RUNNING: when the GPU becomes available again after a job has run on it, the server could skip the queue_timeout and directly assign the GPU to the next job.

@Spico197
Owner

Or maybe we could optimize the logic in watchmen/server/check_work.py to make an instant GPU assignment without waiting out the queue time of the last job.
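
A rough sketch of what that could look like (purely illustrative; the status names and fields are hypothetical, not watchmen's actual types):

from enum import Enum

class ClientStatus(Enum):
    WAITING = "waiting"
    RUNNING = "running"    # proposed: client reports it has started using the GPU
    FINISHED = "finished"

def can_skip_timeout(prev_status: ClientStatus, gpu_now_free: bool) -> bool:
    # If the last job on this GPU reached RUNNING and the GPU is free
    # again, the job evidently finished, so the server could reassign
    # immediately instead of waiting out queue_timeout.
    return prev_status is ClientStatus.RUNNING and gpu_now_free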

@bharathkumar192
Author

Thanks for your great explanation and for taking the time to test my scenario.

Some more queries from my side:

  1. What scheduling have you implemented in the application?
  2. I am very willing to test the application using multiple GPUs (can you suggest a platform where you run it?).
  3. Do you have multiple GPUs in your system, or are you running the application on an instance in AWS or another cloud?
  4. When I run it on an EC2 instance, the URL opens (I used flask-ngrok), but neither the processes nor the other details are reflected; instead, I see a red Error with empty table values.

Help me out with this.

And hey, thanks once again for your time. Really grateful! Let me know if I can be of any help to you.

@Spico197
Owner

Spico197 commented Nov 20, 2023

  1. Well, the implementation is rather primitive here. It just loops and checks whether the GPUs are available for a job:

for client_id, client in cc.work_queue.items():
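
Conceptually, the loop amounts to this simplified, hypothetical sketch (attribute names are made up; the real logic lives in watchmen/server/check_work.py):

def check_work(work_queue, free_gpus):
    # First come, first served: walk clients in arrival order and grant
    # GPUs greedily to the first waiting client whose request fits.
    for client_id, client in work_queue.items():
        usable = [g for g in client.gpus if g in free_gpus]
        if len(usable) >= client.req_gpu_num:
            client.available_gpus = usable[:client.req_gpu_num]
            for g in client.available_gpus:
                free_gpus.remove(g)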

2-3. I'm running the experiments on a local cluster in my lab. However, I suggest not renting GPUs from cloud providers if you care about the cost. You could hook or mock the functions below to build a testing environment, which is the least costly way to test new functions.

from watchmen.listener import (
    is_single_gpu_totally_free,
    check_gpus_existence,
    check_req_gpu_num,
    GPUInfo,
)
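
For example, a low-cost way to fake hardware is to patch those functions with unittest.mock (a sketch; the patch targets assume the server looks the functions up through watchmen.listener at call time):

from unittest import mock

import watchmen.listener as listener

# Pretend the requested GPUs exist and are idle so server-side logic can
# be exercised without real hardware. Adjust the patch targets if the
# server imports these functions into its own namespace instead.
with mock.patch.object(listener, "check_gpus_existence", return_value=True), \
     mock.patch.object(listener, "is_single_gpu_totally_free", return_value=True):
    ...  # start the pieces under test here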

  4. Do you mean the website UI shows up fine, but the status is Error? That means the frontend is not connecting to the backend. You could try curl http://localhost:62333/api to test the connectivity.
