Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add timing tracker to fms-hf-tuning #378

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

willmj
Copy link
Collaborator

@willmj willmj commented Oct 21, 2024

Description of the change

This change will add another tracker to fms-hf-tuning which will, per process, log the training time. Additionally, if tuning is run using accelerate_launch.py, it will log the entire run time from beginning to end (including pre and post processing).

Here is an example log of a run with two processes:

{"data": {"end_time": "2024-10-21T19:32:36.657076", "start_time": "2024-10-21T19:31:21.727124", "timestamp": "2024-10-21T19:32:36.657152", "total_duration_seconds": 74.929952, "train_runtime": 73.5228}, "name": "training_timing"}
{"data": {"end_time": "2024-10-21T19:32:36.674714", "start_time": "2024-10-21T19:31:21.724035", "timestamp": "2024-10-21T19:32:36.674749", "total_duration_seconds": 74.950679, "train_runtime": 74.8826}, "name": "training_timing"}
{"data": {"end_time": "2024-10-21T19:33:23.181481", "start_time": "2024-10-21T19:23:54.401312", "total_duration_seconds": 568.780169}, "name": "total_runtime"}

Questions:

  • Is it helpful to have timer be per process or should there just be one train_runtime?
  • Is there a better way to use the tracker for total runtime without passing it in the launch command?

Related issue number

How to verify the PR

Run these changes in a dev branch on any model and view the output directory once training is complete.

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
Copy link

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

Signed-off-by: Will Johnson <[email protected]>
@willmj willmj changed the title 1371 feat: Add timing tracker to fms-hf-tuning Oct 21, 2024
@github-actions github-actions bot added the feat label Oct 21, 2024
@ashokponkumar
Copy link
Collaborator

@willmj should we consider using https://github.com/foundation-model-stack/hf-resource-scanner for tracking?

Cc: @ChanderG

@willmj
Copy link
Collaborator Author

willmj commented Oct 22, 2024

Yes definitely, thanks for the suggestion @ashokponkumar - I had not heard of this scanner. I can look into this implementation.

@anhuong
Copy link
Collaborator

anhuong commented Oct 23, 2024

@will Johnson few questions for you...

  • Do we need to create a new log file just for timing information? I thought the initial idea was to add this to the existing log file, would need to check with SW to ensure this wouldn’t break things for them since they pull the loss from this file to add to a graph in the UI
  • Do we need to create our own method of timing the training time? Is this in order to include the pre and post-processing that happens? The current trainer already calculates how long the training time takes and outputs it at the end of training. Could we just take this value and add it to the log?
  • I don’t think we need to also run https://github.com/foundation-model-stack/hf-resource-scanner since other metrics like memory consumption should be captured in AIM. This could be a future improvement as well if we want to gather other metrics and log them during training

@will
Copy link

will commented Oct 24, 2024

@will Johnson few questions for you...

By putting a space in between will and Johnson it sends an alert to me and not Mr Johnson

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants