Add the Flyte agent to provision and manage K8s (data) service for deep learning (GNN) use cases #3004

Open. 2 commits to merge into base: master

@shuyingliang shuyingliang commented Dec 14, 2024

Why are the changes needed?

Graph Neural Networks (GNNs) are critical for understanding complex relationships across LinkedIn's professional networks. However, training these models at scale involves intricate data loading, sampling, and processing across multiple nodes and GPUs. The missing piece is infrastructure that determines how and where to run the Kubernetes data services that do this work, and that keeps them scalable and reliable alongside the training or inference processes.

To simplify this complex orchestration pipeline, we decided to use the Flyte agent framework to provision and manage the data services for the GNN use case.

What changes were proposed in this pull request?

This PR adds a Flyte agent that creates, updates, and deletes the K8s StatefulSet and Service backing the data service.
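To make the create/update/delete lifecycle concrete, here is a minimal sketch in plain Python. It is not the code in this PR and does not use the flytekit agent API or the Kubernetes client. `build_manifests`, `FakeCluster`, and `DataServiceAgentSketch` are hypothetical names, the image and port are placeholders, and the "cluster" is an in-memory dict. Only the manifest field names follow the standard Kubernetes StatefulSet and Service schemas.

```python
from dataclasses import dataclass, field
from typing import Dict


def build_manifests(name: str, image: str, replicas: int, port: int) -> Dict[str, dict]:
    """Return a StatefulSet and a headless Service that share one label selector."""
    labels = {"app": name}
    stateful_set = {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "serviceName": name,  # ties pod DNS names to the headless Service
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "ports": [{"containerPort": port}]}
                    ]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "clusterIP": "None",  # headless: each pod gets a stable DNS record
            "selector": labels,
            "ports": [{"port": port, "targetPort": port}],
        },
    }
    return {"StatefulSet": stateful_set, "Service": service}


@dataclass
class FakeCluster:
    """In-memory stand-in for the Kubernetes API (hypothetical, illustration only)."""
    objects: Dict[tuple, dict] = field(default_factory=dict)

    def apply(self, manifest: dict) -> str:
        key = (manifest["kind"], manifest["metadata"]["name"])
        action = "updated" if key in self.objects else "created"
        self.objects[key] = manifest
        return action

    def delete(self, kind: str, name: str) -> bool:
        return self.objects.pop((kind, name), None) is not None


class DataServiceAgentSketch:
    """Hypothetical agent-like wrapper exposing create-or-update, get, and delete."""

    def __init__(self, cluster: FakeCluster):
        self.cluster = cluster

    def create(self, name: str, image: str, replicas: int, port: int) -> Dict[str, str]:
        manifests = build_manifests(name, image, replicas, port)
        return {kind: self.cluster.apply(m) for kind, m in manifests.items()}

    def get(self, name: str) -> str:
        present = ("StatefulSet", name) in self.cluster.objects
        return "RUNNING" if present else "NOT_FOUND"

    def delete(self, name: str) -> None:
        for kind in ("StatefulSet", "Service"):
            self.cluster.delete(kind, name)


if __name__ == "__main__":
    agent = DataServiceAgentSketch(FakeCluster())
    print(agent.create("gnn-data", "example.com/gnn-data-service:latest", 2, 40000))
    print(agent.get("gnn-data"))
    agent.delete("gnn-data")
    print(agent.get("gnn-data"))
```

Calling `create` a second time with the same name takes the update path, which is why create and update can share a single entry point in this sketch. A real agent would send these manifests to the Kubernetes API and report status from the live StatefulSet rather than from a dict.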

How was this patch tested?

  • The same code (with company-internal environments and setup removed) has been running in production alongside the training jobs: MPIJob for deep learning GNN training and TFJob for offline inference.
  • It was also tested in a local sandbox.

Setup process

pip install flytekitplugins-k8sdataservice

Screenshots

[Screenshot 2024-11-11 at 3 48 18 PM]

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Docs link

Blog from Flyte community sync

codecov bot commented Dec 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.46%. Comparing base (f99d50e) to head (944a500).
Report is 4 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3004       +/-   ##
===========================================
+ Coverage   51.08%   90.46%   +39.38%     
===========================================
  Files         201      100      -101     
  Lines       21231     4920    -16311     
  Branches     2731        0     -2731     
===========================================
- Hits        10846     4451     -6395     
+ Misses       9787      469     -9318     
+ Partials      598        0      -598     

@pingsutw pingsutw (Member) left a comment

This is amazing!!! Leaving some minor comments.

@shuyingliang shuyingliang force-pushed the shuliang/k8sdataservice branch 3 times, most recently from 737222a to ec99598 Compare December 20, 2024 04:53
…In internal things removed

Signed-off-by: Shuying Liang <[email protected]>
@shuyingliang shuyingliang force-pushed the shuliang/k8sdataservice branch 2 times, most recently from b9c4dd1 to a0c5d8e Compare December 20, 2024 05:06
Signed-off-by: Shuying Liang <[email protected]>
@shuyingliang shuyingliang force-pushed the shuliang/k8sdataservice branch from a0c5d8e to ec6d4c1 Compare December 20, 2024 05:10