
Platform to run this application #13

Open

bharathkumar192 opened this issue Nov 19, 2023 · 6 comments

@bharathkumar192

Where do I run this application? I am currently using Windows with a single GPU, so in one terminal I am running the server and in two separate terminals I am running the examples. If the wait argument is given, the program doesn't move forward; if I remove the wait argument, it moves, but the UI doesn't show the allocation or anything else.

Some sort of clarification would be very much appreciated.

@Spico197
Owner

Hi there, thank you very much for using this software~ Currently I have only tested the functionality on Linux (Ubuntu distribution), but it should work on Windows.

I'm not sure about the wait argument you are using. Is it the WatchClient.wait() method?
You mentioned there's only one GPU in the environment. Was there any GPU consumption, or were any processes running on that GPU? Currently, a GPU can be allocated to a client only if it is completely free:

if len(gpu.processes) <= 0 \
and gpu.utilization <= 10 \
and (float(gpu.memory_used) / float(gpu.memory_total) <= 1e-3 or gpu.memory_used < 50):
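
For reference, here is a minimal standalone sketch of an equivalent check built on pynvml (illustrative only, not the project's actual code path; the thresholds are copied from the condition above):

import pynvml  # pip install nvidia-ml-py

def gpu_totally_free(index: int) -> bool:
    # Free means: no compute processes, utilization <= 10%, and (almost)
    # no memory in use (< 0.1% of total or < 50 MiB), mirroring the
    # condition quoted above.
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # bytes
        used_mib = mem.used / 1024 ** 2
        return (
            len(procs) == 0
            and util <= 10
            and (mem.used / mem.total <= 1e-3 or used_mib < 50)
        )
    finally:
        pynvml.nvmlShutdown()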

Besides, was the client registered on the server successfully? You may find the relevant information in the printed log.

@bharathkumar192
Author

bharathkumar192 commented Nov 19, 2023

If you have a while, we can connect regarding this. I am really looking forward to contributing to this repo and adding it to my thesis, so do let me know if you can spare some time for this. My email: [email protected]

Hey, so how I am running your application is:
Terminal 1: running watchmen.server
Terminal 2: python single_card_mnist.py --id="single" --cuda=0 --wait --wait_mode="queue"
Terminal 3: python single_card_mnist.py --id="single_schedule" --cuda=0 --wait --wait_mode="schedule"

So when I run all three of the above terminals:

[screenshot]

It stays in that state for a while, even after the terminal says the training is completed and the final accuracy is printed; only after quite a while does the other terminal's process start. (I presume a duplicate process is being created.) The same happens for the other terminal as well; a duplicate process being triggered is what I assume. Correct me if I'm wrong, but the frontend is not being updated even though the terminal says the training is completed.

I would also like to know more about --wait_mode="queue" and --wait_mode="schedule": what difference does it make?

And hey, I have no idea about the wait argument that you mentioned in the README file.

[screenshot]

@Spico197
Owner

Hi there, thanks for your valuable feedback!

  • Ideally, the server checks GPU status every second and tries to assign available GPUs to clients every 5 seconds. The client pings the server every 10 seconds to see if there are available GPUs.
  • Clients in queue mode wait until the specific GPU is free to use (in this example, cuda:0). Clients in schedule mode may be assigned another GPU (one GPU from 0, 2, and 3); see the sketch below.
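
To make the difference concrete, here is a hedged client-side sketch; the keyword arguments mirror my reading of the README's WatchClient example and the single_card_mnist.py flags, so treat the exact names as assumptions and check the example script for the real signature:

from watchmen import WatchClient

# Queue mode (assumed kwargs): block until cuda:0 itself is free.
client = WatchClient(id="single", gpus=[0],
                     server_host="127.0.0.1", server_port=62333)
client.wait()

# Schedule mode (hypothetical mode/req_gpu_num kwargs): request any one
# GPU out of 0, 2, and 3, and let the server pick which one to grant.
# client = WatchClient(id="single_schedule", gpus=[0, 2, 3],
#                      mode="schedule", req_gpu_num=1,
#                      server_host="127.0.0.1", server_port=62333)
# client.wait()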

Test case 1: queue vs. schedule

[screenshots: test case 1 results]

Test case 2: queue and schedule on the same GPU

[screenshots: test case 2 results]

it stays in that state for a while even after the terminal says the training is completed

I've understood what's going on from your point of view: you said the scheduling server would still be waiting even though the job had finished. However, a client is not designed or required to send an "I'm finished" signal to the server, so the server waits until queue_timeout is triggered.

Here's why we need the queue_timeout mechanism: a job may take time to start (downloading datasets, pretrained models, etc.), so the GPU is not occupied immediately when the job starts. We therefore set a queue_timeout to make sure the server waits long enough for the client to load models or datasets onto the GPU, which is what indicates it's running. When a job finishes, the server still waits for another queue_timeout (10 minutes by default), and this causes the time gap.
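
In other words, the timing rule amounts to something like this simplified, hypothetical sketch (not the server's actual code):

import time

QUEUE_TIMEOUT = 600  # 10 minutes by default, per the explanation above

def may_reassign(gpu_freed_at: float) -> bool:
    # A freed GPU stays reserved for QUEUE_TIMEOUT seconds so a freshly
    # started job that is still downloading data or loading models does
    # not lose the GPU before it ever touches it. The flip side is the
    # gap you observed: a finished job's GPU also stays reserved as long.
    return time.time() - gpu_freed_at > QUEUE_TIMEOUT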

A possible solution may be adding another client status, RUNNING: when the GPU becomes available again after a job has run on it, the server could skip the queue_timeout and directly assign the GPU to the next job.

@Spico197
Owner

Or maybe we could optimize the logic in watchmen/server/check_work.py to make an instant GPU assignment without waiting out the queue time of the last job.
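
A rough sketch of what that could look like (purely illustrative; the status names and fields are hypothetical, not watchmen's actual types):

from enum import Enum

class ClientStatus(Enum):
    WAITING = "waiting"
    RUNNING = "running"    # proposed: client reports it has started using the GPU
    FINISHED = "finished"

def can_skip_timeout(prev_status: ClientStatus, gpu_now_free: bool) -> bool:
    # If the last job on this GPU reached RUNNING and the GPU is free
    # again, the job evidently finished, so the server could reassign
    # immediately instead of waiting out queue_timeout.
    return prev_status is ClientStatus.RUNNING and gpu_now_free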

@bharathkumar192
Author

Thanks for your great explanation and for taking the time to test my scenario.

Some more queries from my side:

  1. What scheduling have you implemented in the application?
  2. I am very willing to test the application using multiple GPUs (can you suggest a platform where you run it?).
  3. Do you have multiple GPUs in your system, or are you running the application on an instance in AWS or another cloud?
  4. When I run it on an EC2 instance, the URL opens (I used flask-ngrok), but neither the processes nor the other details are reflected; instead, I see a red Error with empty table values.

Help me out with this.

And hey, thanks once again for your time. Really grateful! Let me know if I can be of any help to you.

@Spico197
Owner

Spico197 commented Nov 20, 2023

  1. Well, the implementation is rather primitive here. It just loops and checks whether the GPUs are available for a job:

for client_id, client in cc.work_queue.items():
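
Conceptually, the loop amounts to this simplified, hypothetical sketch (attribute names are made up; the real logic lives in watchmen/server/check_work.py):

def check_work(work_queue, free_gpus):
    # First come, first served: walk clients in arrival order and grant
    # GPUs greedily to the first waiting client whose request fits.
    for client_id, client in work_queue.items():
        usable = [g for g in client.gpus if g in free_gpus]
        if len(usable) >= client.req_gpu_num:
            client.available_gpus = usable[:client.req_gpu_num]
            for g in client.available_gpus:
                free_gpus.remove(g)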

2-3. I'm running the experiments on a local cluster in my lab. However, I suggest not renting GPUs from cloud providers if you care about the cost. You could hook or mock the functions below to build a testing environment, which is the least costly way to test new functions.

from watchmen.listener import (
    is_single_gpu_totally_free,
    check_gpus_existence,
    check_req_gpu_num,
    GPUInfo,
)
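
For example, a low-cost way to fake hardware is to patch those functions with unittest.mock (a sketch; the patch targets assume the server looks the functions up through watchmen.listener at call time):

from unittest import mock

import watchmen.listener as listener

# Pretend the requested GPUs exist and are idle so server-side logic can
# be exercised without real hardware. Adjust the patch targets if the
# server imports these functions into its own namespace instead.
with mock.patch.object(listener, "check_gpus_existence", return_value=True), \
     mock.patch.object(listener, "is_single_gpu_totally_free", return_value=True):
    ...  # start the pieces under test here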

  4. Do you mean the website UI shows up fine, but the status is Error? That means the frontend is not connecting to the backend. You could try curl http://localhost:62333/api to test the connectivity.
