
Add note about naming files in parallel jobs #8941

Open
wants to merge 1 commit into base: master

Conversation

nokite

@nokite nokite commented Sep 9, 2024

Description

Added a note about naming files passed to persist_to_workspace in parallel jobs: the names are fixed, and you cannot use parallelism environment variables to make file names unique and identifiable per run.
Proven by this error message:

Error locating workspace root directory: stat /tmp/deploy_logs_$CIRCLE_NODE_INDEX: no such file or directory

Reasons

This behavior is not explained, so one has to figure it out by trial and error.

Furthermore, there does not seem to be a way to achieve the goal at all: saving files from multiple parallel runs of a job under unique names that don't conflict with each other.

Furthermore, the error message could be improved to help diagnose the problem. Currently it says The specified paths did not match any files in /tmp/deploy_logs, but it doesn't mention which paths were used: was the path resolved to deploy_logs_0, or is it literally deploy_logs_$CIRCLE_NODE_INDEX?

Content Checklist

Please follow our style when contributing to CircleCI docs. Our style guide is here: https://circleci.com/docs/style/style-guide-overview.

Please take a moment to check through the following items when submitting your PR (this is just a guide, so not all items will be relevant for every PR) 😸:

  • Break up walls of text by adding paragraph breaks.
  • Consider if the content could benefit from more structure, such as lists or tables, to make it easier to consume.
  • Keep the title between 20 and 70 characters.
  • Consider whether the content would benefit from more subsections (h2-h6 headings) to make it easier to consume.
  • Check all headings h1-h6 are in sentence case (only first letter is capitalized).
  • Is there a "Next steps" section at the end of the page giving the reader a clear path to what to read next?
  • Include relevant backlinks to other CircleCI docs/pages.

@nokite nokite requested review from a team as code owners September 9, 2024 09:53
@rosieyohannan
Contributor

Hey @nokite! Thank you for the PR. It would be great to get a better understanding of what you want to get working here, any chance you can share more?

Parallelism is generally used for splitting up work across execution environments, rather than running the same work multiple times, generally not writing to the same files from the various parallel running jobs. So, it would be useful to understand more to get this addition to the docs right, or potentially offer a different way to achieve what's needed.

@nokite
Author

nokite commented Sep 9, 2024

> Hey @nokite! Thank you for the PR. It would be great to get a better understanding of what you want to get working here, any chance you can share more?
>
> Parallelism is generally used for splitting up work across execution environments, rather than running the same work multiple times, generally not writing to the same files from the various parallel running jobs. So, it would be useful to understand more to get this addition to the docs right, or potentially offer a different way to achieve what's needed.

Absolutely, I'll try to explain my goals. Thanks for replying!

I use parallelism in order to build and deploy a product in multiple variants in reasonable time. By passing $CIRCLE_NODE_INDEX, I tell it which set of variants to build in each parallel run/instance. Each parallel run builds a couple of variants (just enough to stay below the 3h runtime limit).
Any of these variants may occasionally fail to build or deploy. I need to know which ones failed (and with what error message) at the end of the whole workflow.
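A setup like the one described might look roughly like this (the job name, parallelism count, and build script are hypothetical, invented for illustration):

```yaml
# Hypothetical sketch: one parallel run per set of variants.
# $CIRCLE_NODE_INDEX tells each run which set to build.
jobs:
  build-variants:
    parallelism: 8
    steps:
      - checkout
      - run:
          name: Build the variant set for this node
          command: ./build_variants.sh --set "$CIRCLE_NODE_INDEX"
```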

I would be happy to achieve this in any way.

My assumption was that a good way would be to:

  • create a log (or rather a file with results) in each parallel run
  • save these logs in the workspace
  • have a dependent job at the end that merges all the logs and saves them as an artifact

So the issue I ran into is that you can't save these unique logs in the workspace, as the filename has to be hardcoded - so it can't vary between parallel runs (by using the index, for instance).
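As a sketch of the approach described above, the step that fails could look like this (paths are assumptions based on the error message earlier in the thread; the point is that persist_to_workspace does not interpolate environment variables):

```yaml
# Works: environment variables ARE interpolated in run steps
- run: ./deploy.sh > "/tmp/deploy_logs_$CIRCLE_NODE_INDEX"
# Fails: $CIRCLE_NODE_INDEX is taken literally here, so the
# workspace root /tmp/deploy_logs_$CIRCLE_NODE_INDEX is not found
- persist_to_workspace:
    root: /tmp/deploy_logs_$CIRCLE_NODE_INDEX
    paths:
      - "*"
```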

@rosieyohannan
Contributor

Hey @nokite, thank you!

Sorry if I'm wrong here but it sounds like you might be able to simplify things. In CircleCI, parallelism as a feature (configuring a number of parallel execution environments and telling CircleCI how to split work across them) is generally reserved for splitting a test suite.

What you describe sounds to me like you want concurrent jobs in a workflow, so you can configure your sets of variants to build in separate jobs and then create a workflow where those jobs run concurrently: https://circleci.com/docs/concurrency/#concurrency-in-workflows. Then you would be able to see in the UI/wherever which failed/built/deployed for each job.
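A minimal sketch of that suggestion, with invented job names for the variant sets:

```yaml
# Hypothetical sketch: concurrent jobs in a workflow.
# Jobs with no dependencies between them run at the same time.
workflows:
  build-all-variants:
    jobs:
      - build-variant-set-a
      - build-variant-set-b
      - build-variant-set-c
```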

Basically "parallel" and "concurrent" can mean largely the same thing, but in CircleCI parallelism is a specific testing-focussed feature, whereas concurrency is about running jobs that do work at the same time across multiple execution environments.

Please let me know if I misunderstood and oversimplified things here!

@nokite
Author

nokite commented Sep 10, 2024

@rosieyohannan thanks, I think we're on the same page.
The point where we might be thinking differently is that I believe CircleCI's parallelism has potential outside the realm of testing. It's a powerful tool, and there's no need to restrict its purpose, in my opinion.

Technically, I believe my usage of parallelism fits its purpose.
I'm splitting work of the same type across parallel runners. The only difference between the work in each run is an index that helps split the work. Each build uses the same codebase but a slightly different configuration, and a few of the files it builds differ. Internally we call the same command, with only one parameter differing between runs.
(So I think it does pretty much the same as your example with tests - which runs different sets of tests in the same codebase, using some system for splitting the work between the runners.)

Regarding what you suggested - I agree it's totally valid, and that's how I started.
Initially I had a workflow config where I called the job a large number of times - via multiple entries in the workflow section. That resulted in a really, really long and repetitive config.yml.

It bugged me that the only difference between the separate calls was an index (which I passed as a parameter). The job itself was still defined only once - as there was no reason to duplicate it.

Then I figured out that what I was doing was basically parallelism - same type of work called with an index. So I refactored the config and used parallelism, which felt right (and awesome 🙂). It reduced the number of repetitive lines dramatically, and I liked the different way it was shown in the CircleCI UI. I could see everything at a glance (fitting on a single screen), and could easily switch between runs.

It ended up being a long comment unfortunately, but I hope I managed to illustrate why I see parallelism as the right tool for the job. Let me know!

@gordonsyme
Member

gordonsyme commented Sep 10, 2024

Hi @nokite,

Thanks for reporting this. The confusion, I think, comes from what info is available at which point in the process, e.g. env-vars aren't available for interpolation when we're processing config.

I think we could make the different phases and what's available in each phase clearer in the docs.

To get on to your root issue, there are a few different options that might get you unblocked:

  • You could use your original approach but with a matrix parameter so you don't have to manually add the job into the workflow multiple times.
  • You can keep the parallelism approach, but instead of persisting /tmp/deploy_logs_$CIRCLE_NODE_INDEX to the workspace, save the log files to /tmp/deploy_logs/$CIRCLE_NODE_INDEX.log and then persist /tmp/deploy_logs:

    ```yaml
    - persist_to_workspace:
        root: "/tmp/deploy_logs"
        paths:
          - "*"
    ```
  • You can store the logs as artifacts instead. In this case you could write each log to the same fixed filename and store that file as an artifact; artifacts are uploaded into a unique space per container in the parallel job.

(edited to fix config snippet)
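The artifacts option above could be sketched like this (the script and log path are invented for illustration; the per-container separation of artifact storage is what avoids collisions):

```yaml
# Each parallel node writes to the same fixed filename; artifacts are
# stored per container, so the files don't overwrite each other.
- run: ./deploy.sh > /tmp/deploy.log
- store_artifacts:
    path: /tmp/deploy.log
```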

@nokite
Author

nokite commented Sep 17, 2024

@gordonsyme Thanks for your suggestions, I appreciate your time.
I liked the second idea - using paths: "*". I did not know that this was possible! Excellent! I am using it now 🔥
(note for anyone else reading this - I used it as an array - with a new line: paths: \n - "*")
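Putting the pieces from this thread together, the working setup could look roughly like this (the run step is an assumed example; the persist_to_workspace form is the corrected one from above):

```yaml
# Each parallel run writes a uniquely named log into one folder,
# then the whole folder is persisted to the workspace.
- run:
    command: ./deploy.sh > "/tmp/deploy_logs/${CIRCLE_NODE_INDEX}.log"
- persist_to_workspace:
    root: /tmp/deploy_logs
    paths:
      - "*"
```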

P.S.
Matrix seems a bit too verbose for this use case, if I understand it correctly. I'd have to list all the parameters that define the array/matrix of (many) concurrent jobs. And in my case those parameters are not really meaningful - they're just indexes. It does seem like a great solution when the parameters are meaningful though (like in os: [docker, linux, macos] | node-version: ["14.17.6", "16.9.0"]).
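For comparison, the matrix alternative mentioned above might be sketched like this (job and parameter names are hypothetical; each listed value produces one concurrent job invocation):

```yaml
# Hypothetical matrix sketch: one job per index value, without
# repeating the job entry in the workflow by hand.
workflows:
  build-all:
    jobs:
      - build:
          matrix:
            parameters:
              set-index: ["0", "1", "2", "3"]
```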

I can't comment on artifacts, I would have to try it out with my setup.

@gordonsyme
Member

@nokite awesome, glad you're sorted :)

> (note for anyone else reading this - I used it as an array - with a new line: paths: \n - "*")

That'll teach me to write out config off the top of my head 😅, I'll edit my first reply so the correct form is out there for anyone else who comes across it.

@nokite
Author

nokite commented Sep 17, 2024

As for the PR, I'm OK if you close it and handle the update on your side, as you have a better understanding. I can suggest the following:

  • There are some considerations about parallelism. Environment variables like $CIRCLE_NODE_INDEX cannot be used in the persist_to_workspace step to define dynamic file names. To save files with dynamic names, save them all into one folder and persist that folder with paths: - "*". This way, each parallel run can give its file a unique name and avoid conflicts when saving it.
