CT Tool for Ariane-133: Snapshot generation halt #76

Open
val-terry opened this issue Jan 30, 2025 · 0 comments

Hello, I have a question about the Ariane-133 design (from MacroPlacement). I ran the CT tool for 11 days (Intel Xeon, 132GB RAM, 3 collect jobs, no GPUs), but it appears to have stopped generating snapshots after day two. On Tensorboard, the losses also plateaued around day two. Is there a reason for this? Additionally, is there any way to tell when the tool is done? It seems to have finished generating snapshots but kept running. Should it stop on its own at some point? Lastly, why was the checkpoints directory empty (no checkpoints were created, even after 31k steps)?

Thank you so much!

snapshot results taken on 01/26/25:
Image

Tensorboard Results:
plateau occurs after 1.782 days:
Image

all jobs ended manually on day 11:
Image

Image

Image

Image

Image
