MegaLinter speed optimization #3461

wesley-dean-flexion · 2024-04-02T20:06:04Z

Overview

I was working with a team on a project that involved a SaaS solution in the Sales space, one that's quite Forceful, for what it's worth. The engagement was primarily about cost-cutting, specifically the amount of billable time GitHub Actions was using. The organization's 6,000-minute monthly quota was being eaten up before the end of the month, so the team was asked to look for ways to cut back on the billable time spent running GitHub Actions.

Methodology

First, we looked at the Elapsed Time column as reported by the GitHub Comment reporter.

(this image came from the MegaLinter repository (docs/assets/images/) and not the project being optimized; it's provided as an example to show the columns in sample output)

The linters that took longer to run received the most attention. Linters that only took a second or two received no attention.

Then, we looked at the number under the Found column. If there were hundreds or thousands of findings consistently across many runs and the number didn't go down, we quickly concluded that the results of the scanner weren't being used and the linter could likely be disabled. This wasn't intended as a commentary on the quality of the work or the quality of the scanner -- we took the pragmatic perspective of what yielded the most benefit for the least cost (so, a Return on Investment decision).

Changes

These are a few of the changes we made to bring the costs down:

Switch the trigger from push to pull_request

Initially, the team had MegaLinter trigger when developers pushed code up to the repo. This was helpful in keeping the code in compliance with their coding style guidelines; however, it meant that every time anyone did anything, MegaLinter ran. We changed it to only run when a Pull Request is generated to merge branches into main or master or when manually initiated.

This change was made in the project's .github/workflows/megalinter.yml file:

---
name: "MegaLinter"

# yamllint disable-line rule:truthy
on:
  pull_request:
    branches:
      - main
      - master
  workflow_dispatch:

This was a huge reduction in the number of times MegaLinter ran, especially when APPLY_FIXES was set.

We also found that having APPLY_FIXES set meant that every time a linter fixed something, the developer would have to pull from the repo after MegaLinter finished up in order to pick up the most recent changes; when they didn't they would receive messages saying that their (local) branch was out of date.
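
For context, the sketch below shows where these settings usually live. It assumes the project's workflow follows the stock MegaLinter template, which defines them in a top-level env block; the values shown are the template's defaults, not necessarily what this team used.

```yaml
# .github/workflows/megalinter.yml (sketch; assumes the stock MegaLinter workflow template)
env:
  # "all" lets every linter that supports it apply fixes; "none" turns fixes off
  APPLY_FIXES: all
  # which event is allowed to trigger the fix commit (pull_request, push, all)
  APPLY_FIXES_EVENT: pull_request
  # "commit" pushes fixes to the branch; "pull_request" opens a separate PR instead
  APPLY_FIXES_MODE: commit
```

With APPLY_FIXES_MODE set to commit, each fix lands as a new commit on the branch, which is why a local checkout ends up behind the remote until the developer pulls.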

Switch to a smaller flavor

The team had been using a flavor that included more than they actually needed. For reference, the full v7.10.0 image is about 3.34 GB. The flavor they had been using included 4 linters that, while relevant to the project, weren't being used. Because the project adopted MegaLinter after several years of development rather than from the start, there were a bunch (!!!) of linter findings that the team had no intention of addressing. The findings were reasonable and correct, but they just weren't relevant to the project in its current state. The ci_light flavor included the tools they wanted to use (e.g., GitLeaks, Grype, Secretlint, Trivy, and TruffleHog, among others). We contemplated using the security flavor (1.05 GB) but opted not to go in that direction as there was no IaC in the repo, so there was no need to run KICS, Checkov, tflint, etc.

As a result, going from the flavor they had been using down to the ci_light flavor cut the size of the image being pulled from 1.45 GB down to 0.49 GB.

This change was made in the project's .github/workflows/megalinter.yml file:

      - name: MegaLinter
        id: ml
        # You can override MegaLinter flavor used to have faster performances
        # More info at https://megalinter.github.io/flavors/
        # the docker:// prefix is required when "uses" points at a container image
        uses: docker://ghcr.io/oxsecurity/megalinter-ci_light:v7.10.0

That is, they eliminated the tools they weren't using and cut the size of the image down by two thirds.

Tell GitLeaks to only scan the current commit

GitLeaks is used to detect secrets (credentials, tokens, API keys, passwords, etc.) stored in files in the repository. Generally speaking -- and this is just my personal opinion -- it's usually not great to store secrets in the source code for an application.

By default, GitLeaks detects whether the stuff being scanned is a Git project (generally a safe assumption given that it was running as a GitHub Action and had a .git/ directory). As a result, it'll scan the repository and its entire history for secrets.

Once we established that there were no secrets in the history of the repository, we made the decision to accept the risk of only having MegaLinter scan the commits it was requested to scan and not the entire history. We judged that the risk was acceptable given that the project was closed-source, only signed commits were accepted, and the main branch required approved PRs before other branches could be moved in.

This tweak cut GitLeaks runtime from 50 seconds down to 4 seconds.

To implement this decision, we configured MegaLinter to pass the --no-git flag to GitLeaks in the project's .mega-linter.yml file:

# only scan the files in this commit, not the entire history of the repo
REPOSITORY_GITLEAKS_ARGUMENTS: "--no-git"

Only scan updated files

The team's concern with only scanning updated files was that they wanted the security-related tooling to run on all the files all of the time, not just on updated files, so that as the tooling improved it could detect more potentially problematic situations anywhere in the repository.

The security-related scanners we were using were generally in the REPOSITORY_* group. Scanning the documentation for these linters showed that the ones we were using typically included the following notation:

How are identified applicable files

If this linter is active, all files will always be linted

That is, even if VALIDATE_ALL_CODEBASE was set to false, the security linters would still scan every file. The team decided that this was acceptable and updated the .mega-linter.yml file like this:

# only scan updated files
VALIDATE_ALL_CODEBASE: false
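
One thing worth checking when turning this off: MegaLinter works out which files were updated by diffing against Git history, so the checkout step needs to fetch more than the default shallow clone. The stock workflow template does this with fetch-depth: 0. A minimal sketch, assuming the workflow uses actions/checkout (the v4 pin is an assumption, not taken from the team's file):

```yaml
      - name: Checkout Code
        uses: actions/checkout@v4 # version pin is an assumption
        with:
          # full history so MegaLinter can diff and find the updated files;
          # only needed when VALIDATE_ALL_CODEBASE is false
          fetch-depth: 0
```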

Other tweaks

We made some other tweaks, such as disabling Trivy-SBOM (we weren't building anything that would consume an SBOM) and limiting the scope of the formatting linters (jsonlint, v8r, prettier, etc.). However, these changes did not yield a noticeable improvement.
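
As a rough sketch of what those tweaks can look like in .mega-linter.yml: DISABLE_LINTERS and the per-linter FILTER_REGEX_EXCLUDE variables are standard MegaLinter settings, but the excluded paths below are hypothetical and not from the team's config.

```yaml
# .mega-linter.yml (sketch; the excluded paths are hypothetical)
DISABLE_LINTERS:
  - REPOSITORY_TRIVY_SBOM # nothing in this project consumes an SBOM

# keep the formatting linters away from files nobody maintains by hand
JSON_JSONLINT_FILTER_REGEX_EXCLUDE: "(vendor/|fixtures/)"
JSON_V8R_FILTER_REGEX_EXCLUDE: "(vendor/|fixtures/)"
JSON_PRETTIER_FILTER_REGEX_EXCLUDE: "(vendor/|fixtures/)"
```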

Overall

reduced the number of MegaLinter runs on a typical day when the repository was being updated from 9 => 6 runs (33% improvement)
cut image download size from 1.45 GB => 0.49 GB (66% improvement)
cut runtime from 6:58 => 2:33 (63% improvement)
average daily runtime 1:02:42 => 0:15:18 (76% improvement)
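
The daily figures follow from the run counts and per-run times above; a quick check of the arithmetic, with times converted to seconds:

```latex
\begin{align*}
\text{before: } & 9 \times 6{:}58 = 9 \times 418\,\text{s} = 3762\,\text{s} = 1{:}02{:}42 \\
\text{after: }  & 6 \times 2{:}33 = 6 \times 153\,\text{s} = 918\,\text{s} = 0{:}15{:}18 \\
\text{per-run saving: } & 1 - \tfrac{153}{418} \approx 0.63 \\
\text{daily saving: }   & 1 - \tfrac{918}{3762} \approx 0.76
\end{align*}
```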

Does anyone have any thoughts on ways to further improve runtime performance?

nvuillam (Maintainer) · 2024-04-02T20:37:15Z

@wesley-dean-flexion this is a perfect illustration that MegaLinter is a tool, but each project owns the strategy around the tool :)

I think nobody uses 100% of the default MegaLinter configuration, but we have to start somewhere :)

I think you and your team worked enough on your configuration to optimize it very well, but if you share the table with the list of linters and their execution time, maybe we can find even more time saving ^^
