
Training should avoid prints and return status to the caller #281

Open
danmcp opened this issue Oct 17, 2024 · 2 comments
danmcp (Member) commented Oct 17, 2024

Today, run_training prints messages to stdout and has a return type of None. This has worked okay when called from the CLI, but it isn't ideal: in general, a library should leave the display of output up to its client/caller. Even in the CLI case, if the CLI wanted to change formatting, add translations, or otherwise control what the user sees, it couldn't do so for training today. If the caller were a REST API, the problem would be bigger still, since the API would need to return the result/state to its own caller. Suggested general rules to follow:

  • The training library shouldn't use print. In some cases, logging should be used instead, at appropriate levels, for helpful info/debug output. In other cases, the results should be returned to the caller, which decides what to do with them.
    • Note: INFO logging should be user friendly and not so verbose that a user wouldn't want to leave it on all the time
  • run_training should return the result rather than print it
  • Longer term, a callback could be useful for reporting the interim status of training
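A minimal sketch of what the first two rules could look like. Note this is illustrative only: `TrainingResult`, its field names, the logger name, and `run_training`'s signature are all assumptions, not the library's actual API.

```python
import logging
from dataclasses import dataclass, field
from typing import Optional

# Library-side logger: the library emits log records but never prints;
# the caller decides whether and how to display them.
logger = logging.getLogger("training")

@dataclass
class TrainingResult:
    """Hypothetical return type; all field names here are illustrative."""
    succeeded: bool
    epochs_completed: int
    final_loss: Optional[float] = None
    metrics: dict = field(default_factory=dict)

def run_training(num_epochs: int = 3) -> TrainingResult:
    losses = []
    for epoch in range(num_epochs):
        loss = 1.0 / (epoch + 1)  # stand-in for a real training step
        losses.append(loss)
        # INFO-level progress goes to the log stream, not stdout
        logger.info("epoch %d finished, loss=%.3f", epoch, loss)
    return TrainingResult(
        succeeded=True,
        epochs_completed=num_epochs,
        final_loss=losses[-1],
        metrics={"loss_history": losses},
    )

# The caller (CLI, REST API, ...) owns presentation of the result:
result = run_training()
print(f"training finished: final loss {result.final_loss:.3f}")
```

With this shape, the CLI can format or translate the output freely, and a REST API can serialize the same result object back to its own caller.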
mairin (Member) commented Oct 21, 2024

This was discussed at a refinement meeting today. Oleg asked what kind of information the training library needs to pass, and to which consumers.

Mustafa:

  • State of the job. Is it broken or still running?
  • Want access to the logs of the job.
  • Want some form of consumable metrics, similar to what we do in JSON output.

These are things you want, as a consumer of the library, to be able to directly access and display to the user in order to make informed decisions. It is especially nice with a distributed training job: the user or library has a central point to call for the status/logs/metrics of a running job, which makes things easier down the line. Earlier on, in the OpenShift AI work, when we tried to set up distributed training we relied on Ray pretty heavily because of the state management it provided: a single centralized point while the job was running, which could be displayed to users via a Python library on their side.
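A rough sketch of that kind of central handle covering the three requests above (job state, logs, metrics). Every name here is hypothetical, a possible shape rather than an existing API in this library or in Ray:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

class JobState(Enum):
    """Hypothetical job lifecycle states."""
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

@dataclass
class TrainingJobStatus:
    """Hypothetical handle a consumer could poll for state/logs/metrics."""
    state: JobState = JobState.PENDING
    log_lines: List[str] = field(default_factory=list)
    metrics: Dict[str, float] = field(default_factory=dict)

    def is_done(self) -> bool:
        # "Is it broken or still running?" answered from one place
        return self.state in (JobState.SUCCEEDED, JobState.FAILED)

# The training side updates the handle as the job progresses...
status = TrainingJobStatus()
status.state = JobState.RUNNING
status.log_lines.append("step 10: loss=0.92")
status.metrics["loss"] = 0.92
status.state = JobState.SUCCEEDED

# ...and any consumer (CLI, REST API, dashboard) reads from it.
print(status.state.value, status.metrics["loss"])
```

The point of the single object is exactly the "central point" described above: state, logs, and consumable metrics travel together, so each consumer can render them however it likes.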

ktam3 commented Oct 21, 2024

Summary 10/21 -

  • Additional design discussion is needed between the eng/runtime team and the model/eval (training) team. To be discussed during the eng/runtime design meeting, or in a one-off meeting for this.

cc. @Maxusmusti @RobotSail @JamesKunstle @cdoern


3 participants