Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Utilization] Set up Job level profile, and test detail mechanism #6264

Merged
merged 17 commits into from
Feb 7, 2025

Conversation

yangw-dev
Copy link
Contributor

@yangw-dev yangw-dev commented Feb 7, 2025

Description

show job profile information such as duration, cpu, gpu count etc.
show the total statisctics table for the whole job.
show detected testsand tests' segmented Statistics Table

Currently we only show max aggregate data,

the max agg: we get the max of a utilization percentage during the 5 sec collect interval.
the avg agg: we get the avg of all utilization percentage during the 5 sec collect interval

Statistics Table

show all hardware's max aggregated data for avg, 10p,90p and 90% utilizaiton spike info.
Will add analysis seciton for the statistics in next step

Demo

working vercel demo link:
Job with GPU: https://torchci-qrqc2igy4-fbopensource.vercel.app/utilization/13143119394/36681447812/1
Job without GPU: https://torchci-qrqc2igy4-fbopensource.vercel.app/utilization/12937937547/36088234580/1

Job Profile Details

workflow and job information, duration, hardware count information
a statisctics table shows how the whole job's utilization behaves.

testProfile

View Tests

Two options to view detected test details:

  • list view: click on test based on test name
  • chart view: click on test segment in chart

Choose single test from list view

listView

choose singl test from chart view

chartview

Other

remove .curve(d3.curveBasis) from the svg since it does not render svg accurate

Next step

add more test related rendering and table for test improvement recommendation

@yangw-dev yangw-dev requested a review from huydhn February 7, 2025 00:19
Copy link

vercel bot commented Feb 7, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
torchci ✅ Ready (Inspect) Visit Preview Feb 7, 2025 10:15pm

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 7, 2025
@huydhn
Copy link
Contributor

huydhn commented Feb 7, 2025

The page looks good, but there are some redundant information that I'm not sure if we want to show to its user:

  • Workflow ID, job ID, and run attempt
  • Collect interval
  • The resource name looks weird, i.e. cpu(max) or gpu_0(max) instead of cpu or gpu_0
  • What do you think about using median (p50) instead of average? This seems more inline with other columns
  • In Detected Python test details, the highlight (pink) is a bit hard to view given the many colors there
  • The Single Test Details section looks more intuitive than the Detected Python test details section IMO, so I think we should move Single Test Details up
  • The test duration unit looks wrong, I think it's in minutes, not second
  • Some python processes can be filtered out I think and we can just show those python test/run_test.py ... processes. This is a small issue though as it's easier to just show everything

@yangw-dev
Copy link
Contributor Author

yangw-dev commented Feb 7, 2025

The page looks good, but there are some redundant information that I'm not sure if we want to show to its user:

  • Workflow ID, job ID, and run attempt
  • Collect interval
  • The resource name looks weird, i.e. cpu(max) or gpu_0(max) instead of cpu or gpu_0
  • What do you think about using median (p50) instead of average? This seems more inline with other columns
  • In Detected Python test details, the highlight (pink) is a bit hard to view given the many colors there
  • The Single Test Details section looks more intuitive than the Detected Python test details section IMO, so I think we should move Single Test Details up
  • The test duration unit looks wrong, I think it's in minutes, not second
  • Some python processes can be filtered out I think and we can just show those python test/run_test.py ... processes. This is a small issue though as it's easier to just show everything

Thx for the feedback, will work on some of those!

  • I may still keep those ID since it's part of the job information, found it's easy to grab info in ui compare to pick on url.
  • I can remove the collect interval, since this is internal info
  • I can add median(50), the reason I want to keep average cuz it can also compare to median to tell you how skewed the data is
  • will fix the pink to make it more visible
  • already fixed the test duration (sorry was a bug)
  • [next step] the filter functionality will be introduced in next step
  • [next step] the Detected Python test details can be used to filter out tests, so may still keep it there, but may make it smaller

@yangw-dev yangw-dev closed this Feb 7, 2025
@yangw-dev yangw-dev reopened this Feb 7, 2025
Copy link
Contributor

@clee2000 clee2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make more sense if the "We detected (24) tests on python_CMD level, click on the test name to see the location of the test" section and the "Click on the graph chart to see the test details." section were combined. For example, clicking on a test in the "We detected (24) tests on python_CMD level, click on the test name to see the location of the test:" would show the location of the test and set the "Selected Test Details" section. Or clicking on the "Single Test Details" would also set the display in the "We detected (24) tests on python_CMD level, click on the test name to see the location of the test:" section

@yangw-dev
Copy link
Contributor Author

I think it would make more sense if the "We detected (24) tests on python_CMD level, click on the test name to see the location of the test" section and the "Click on the graph chart to see the test details." section were combined. For example, clicking on a test in the "We detected (24) tests on python_CMD level, click on the test name to see the location of the test:" would show the location of the test and set the "Selected Test Details" section. Or clicking on the "Single Test Details" would also set the display in the "We detected (24) tests on python_CMD level, click on the test name to see the location of the test:" section

my next step is make two flows :
I will have a toggle feature that, you can see list of test or show the full graph of test

  1. list of test would have a show test button, click on it -> show the single test details
  2. switch to the whole graph section, click on rect -> show the single test details

but this will be next step after this

torchci/components/common/SimpleDropList.tsx Outdated Show resolved Hide resolved
torchci/components/common/SimpleDropList.tsx Outdated Show resolved Hide resolved
@yangw-dev yangw-dev merged commit 1faa4b5 into main Feb 7, 2025
6 checks passed
@yangw-dev yangw-dev deleted the wedUtilSummary branch February 7, 2025 22:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants