RFC: Parameter Server Training for the Keras JAX backend (Keras-JAX-PST) #20753

Open
SeeForTwo opened this issue Jan 13, 2025 · 0 comments
Labels
type:feature The user is asking for a new feature.

@SeeForTwo

We are considering implementing Parameter Server Training (PST) for the Keras JAX backend, with the aim of providing a scalable and performant solution for asynchronous ML training. In PST, the training cluster contains M workers and N parameter servers, and the master copies of the training variables (including embeddings) are placed on the parameter servers. (Background: Scaling Distributed Machine Learning with the Parameter Server)
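
To make the worker and parameter-server roles concrete, here is a minimal, single-process sketch in plain Python. It is not a proposed Keras API: `ParameterServer`, `pull`, `push`, `worker_gradients`, and `train` are hypothetical names, the model is a toy `y = w * x + b`, and asynchrony is simulated by letting every worker pull before any worker pushes.

```python
import random


class ParameterServer:
    """Holds the master copy of one shard of the training variables."""

    def __init__(self, variables, learning_rate=0.05):
        self.variables = dict(variables)
        self.learning_rate = learning_rate

    def pull(self):
        # Hand out a snapshot so workers never mutate the master copy directly.
        return dict(self.variables)

    def push(self, gradients):
        # Apply a gradient as soon as it arrives; there is no barrier across workers.
        for name, grad in gradients.items():
            self.variables[name] -= self.learning_rate * grad


def worker_gradients(params, batch):
    """Mean-squared-error gradients for the toy model y = w * x + b."""
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = params["w"] * x + params["b"] - y
        grad_w += 2.0 * error * x / len(batch)
        grad_b += 2.0 * error / len(batch)
    return {"w": grad_w, "b": grad_b}


def train(num_workers=4, rounds=300, seed=0):
    rng = random.Random(seed)
    # N = 2 parameter servers, each owning one variable (hypothetical placement).
    servers = {"w": ParameterServer({"w": 0.0}), "b": ParameterServer({"b": 0.0})}
    for _ in range(rounds):
        # All M workers pull before any of them pushes, so every push after the
        # first in a round was computed from stale parameters.
        snapshots = []
        for _ in range(num_workers):
            params = {}
            for server in servers.values():
                params.update(server.pull())
            snapshots.append(params)
        for params in snapshots:
            # Each worker draws its own batch from the target y = 3x + 1.
            xs = [rng.uniform(-1.0, 1.0) for _ in range(8)]
            batch = [(x, 3.0 * x + 1.0) for x in xs]
            grads = worker_gradients(params, batch)
            for name, server in servers.items():
                server.push({name: grads[name]})
    return {name: server.variables[name] for name, server in servers.items()}
```

Calling `train()` returns the final `w` and `b` held by the two servers. In a real deployment the workers and servers would be separate processes communicating over the network, and the staleness would come from actual timing rather than a fixed schedule.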

The advantages of PST include:

  • Large embeddings are sharded across multiple parameter servers. This enables the use of embeddings that exceed the local memory available to a single device or the HBM available to all accelerators.
  • Training can be scaled across multiple CPUs (data parallelism) for increased speed even without accelerator hardware (GPUs, TPUs).
  • PST uses asynchronous training, which is robust to individual worker failures, preemptions, and restarts, and is potentially more performant when workers run with low availability guarantees.
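
The embedding sharding in the first point can be sketched as follows. This is an illustration under stated assumptions, not the planned design: `EmbeddingShard` and `ShardedEmbedding` are hypothetical names, rows are placed by `row_id % num_servers`, and rows are zero-initialized on first use for simplicity.

```python
class EmbeddingShard:
    """One parameter server's slice of a large embedding table."""

    def __init__(self, dim):
        self.dim = dim
        self.rows = {}  # row id -> embedding vector, created lazily

    def _row(self, row_id):
        # Zero initialization is an assumption made to keep the sketch short.
        return self.rows.setdefault(row_id, [0.0] * self.dim)

    def lookup(self, row_id):
        # Return a copy so callers cannot modify the master row.
        return list(self._row(row_id))

    def apply_gradient(self, row_id, grad, learning_rate):
        row = self._row(row_id)
        for i, g in enumerate(grad):
            row[i] -= learning_rate * g


class ShardedEmbedding:
    """Routes lookups and sparse updates to the shard that owns each row."""

    def __init__(self, num_servers, dim):
        self.shards = [EmbeddingShard(dim) for _ in range(num_servers)]

    def shard_for(self, row_id):
        # Modulo placement is assumed here; other partitioning schemes are possible.
        return self.shards[row_id % len(self.shards)]

    def lookup(self, row_ids):
        return [self.shard_for(r).lookup(r) for r in row_ids]

    def apply_gradients(self, row_ids, grads, learning_rate=0.1):
        # Only the rows that appear in the batch are touched on any server.
        for r, g in zip(row_ids, grads):
            self.shard_for(r).apply_gradient(r, g, learning_rate)
```

Because each server stores only the rows it owns, the full table never has to fit in the memory of a single machine, and a worker only exchanges the rows its batch actually references.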

We hope to make using PST a convenient option in an end-to-end recommendation solution with Keras.

We want your feedback on whether this would be of value to you.

Please comment below.

@mehtamansi29 added the type:feature and keras-team-review-pending labels on Jan 15, 2025
@divyashreepathihalli removed the keras-team-review-pending label on Jan 16, 2025