Bazel 7 unable to finalize action due to missing digest for .d files when --experimental_inmemory_dotd_files is set. #22387

Closed
luispadron opened this issue May 15, 2024 · 17 comments
Labels: awaiting-user-response (Awaiting a response from the author), more data needed, P2 (We'll consider working on this in future. (Assignee optional)), team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug

Comments

@luispadron
Contributor

Description of the bug:

When --experimental_inmemory_dotd_files is enabled (which appears to be the default, at least in Bazel 7), compile actions fail with a missing digest error for their .d files.

ERROR: Foo/BUILD.bazel:11:15: Compiling Foo.c failed: unable to finalize action: Missing digest: <number>/<number> for bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-<sha>/bin/path/to/Foo.d

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

I haven't found a way to reproduce this consistently locally. It happens on our CI machines, which are configured to:

  • not use a disk cache
  • not use remote execution
  • use a remote cache

The build failed several times during our Bazel 7 testing; after setting --noexperimental_inmemory_dotd_files, we no longer saw this issue.
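For reference, the workaround above could be kept in the workspace .bazelrc. This is a minimal sketch, assuming the flag is applied to `build` commands; disabling the flag makes Bazel write the C++ .d files to disk instead of passing them through in memory.

```
# .bazelrc sketch: write C++ .d files to disk instead of keeping them in memory.
build --noexperimental_inmemory_dotd_files
```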

Which operating system are you running Bazel on?

macOS

What is the output of bazel info release?

release 7.1.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

Yes. We never saw this issue in Bazel 6, and our flags haven't changed much in our Bazel 7 testing.

Have you found anything relevant by searching the web?

Any other information, logs, or outputs that you want to share?

No response

@sgowroji sgowroji added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label May 16, 2024
@wilwell wilwell added the P2 We'll consider working on this in future. (Assignee optional) label May 28, 2024
@wilwell wilwell removed the untriaged label May 28, 2024
@tjgq
Contributor

tjgq commented May 28, 2024

We looked for anything that could go wrong with the combination of --remote_cache, --remote_download_all and --experimental_inmemory_dotd_files, but we don't have a plausible theory yet. The only candidate is the remote cache spuriously evicting blobs, but that doesn't explain why the failure only affects .d files, and only when in-memory outputs are enabled.

Can you provide the following information:

  • The complete list of Bazel flags you're using
  • The remote cache implementation you're using
  • The --experimental_remote_grpc_log for one of the failed invocations (feel free to scrub sensitive data, but please preserve the digests, or rewrite them consistently so that they still match up across gRPC requests)

In addition, it would be helpful to know the following:

  • Can you repro this against a disk cache, or a different remote cache implementation? (e.g. a simple HTTP cache that is guaranteed to never evict any blobs on its own)
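A minimal sketch of how the requested gRPC log could be captured, assuming a gRPC remote cache (the flag does not apply to HTTP caches). The output path is a hypothetical example; the resulting file contains serialized protobuf records, not plain text.

```
# .bazelrc sketch: record details of every gRPC call made to the remote cache.
# The log path below is a hypothetical example.
build --experimental_remote_grpc_log=/tmp/bazel_remote_grpc.log
```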

@tjgq tjgq added more data needed awaiting-user-response Awaiting a response from the author labels May 28, 2024
@luispadron
Contributor Author

Thanks for investigating @tjgq

I can provide the first two now and look at the execution log when I get a chance:

  • The --announce_rc logs for our flags in CI:
INFO: Invocation ID: <ID>
INFO: Reading 'startup' options from /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc: --host_jvm_args=-Djavax.net.ssl.trustStore=Configuration/Java.cacerts, --host_jvm_args=-Djavax.net.ssl.trustStorePassword=changeit, --host_jvm_args=-DBAZEL_TRACK_SOURCE_DIRECTORIES=1, --max_idle_secs=86400, --digest_function=blake3
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'build' from /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc:
  Inherited 'common' options: --remote_header=hashfn=blake3 --lockfile_mode=update --incompatible_disallow_empty_glob --experimental_repository_downloader_retries=3 --incompatible_strict_action_env=true --spawn_strategy=local --verbose_failures --test_output=errors --max_config_changes_to_show=-1 --attempt_to_print_relative_paths --experimental_inprocess_symlink_creation --keep_going --output_filter=^//.*:((?!(SwiftLintCore|SwiftLintBuiltInRules).*).)*$ --noexperimental_inmemory_dotd_files --compilation_mode=dbg --@build_bazel_rules_swift//swift:copt=-whole-module-optimization --@build_bazel_rules_swift//swift:exec_copt=-whole-module-optimization --@rules_xcodeproj//xcodeproj:extra_common_flags=--//Bazel:is_building_in_xcode=0 --features=swift.emit_symbol_graph_extension_blocks --action_env=CACHE_EPOCH=4 --remote_download_outputs=all --config=cache_cdn_read --noremote_upload_local_results --remote_local_fallback --experimental_remote_merkle_tree_cache --experimental_guard_against_concurrent_changes --disk_cache=~/Library/Caches/bazel-cash-ios-cache --remote_build_event_upload=minimal --nolegacy_important_outputs --modify_execution_info=^(AppleLipo|BitcodeSymbolsCopy|BundleApp|BundleTreeApp|DsymDwarf|DsymLipo|GenerateAppleSymbolsFile|ObjcBinarySymbolStrip|CppArchive|CppLink|ObjcLink|ProcessAndSign|SignBinary|SwiftArchive|SwiftStdlibCopy|PackagingFramework.+|ExtendModulemap|HmapCreate)$=+no-remote,^(BundleResources|ImportedDynamicFrameworkProcessor)$=+no-remote-exec --remote_cache_compression=true --xcode_version_config=//Bazel:host_xcodes --macos_minimum_os=13.0 --host_macos_minimum_os=13.0 --config virtual_frameworks --features=-swift.vfsoverlay --@build_bazel_rules_apple//apple/build_settings:use_tree_artifacts_outputs=true --define=apple.incompatible.objc_framework_propagate_modulemap=true
INFO: Reading rc options for 'build' from /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc:
  'build' options: --flag_alias=build_config=//Bazel:build_config --flag_alias=release_variant=//Bazel:release_variant --flag_alias=xcscheme=//Bazel/apple/xcschemes:xcscheme
INFO: Found applicable config definition common:cache_cdn_read in file /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc: --remote_cache=<REDACTED>
INFO: Found applicable config definition common:virtual_frameworks in file /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc: --features apple.virtualize_frameworks
INFO: Found applicable config definition common:ci in file /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc: --remote_upload_local_results --build_metadata=ROLE=CI --announce_rc --color=no --curses=no --noshow_loading_progress --show_progress_rate_limit=15.0 --progress_report_interval=60 --disk_cache=
INFO: Found applicable config definition common:cache_grpc in file /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc: --remote_cache=grpcs://bazel-remote-vpce-service-privatelink.squarecloudservices.com --experimental_remote_cache_async=true
INFO: Found applicable config definition common:ios_release in file /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc: --config=release --ios_multi_cpus=arm64 --@build_bazel_rules_apple//apple/build_settings:use_tree_artifacts_outputs=false --config=generate_dsym --objc_enable_binary_stripping --define=apple.trim_lproj_locales=yes --features=dead_strip --features=swift.opt_uses_wmo --@build_bazel_rules_swift//swift:copt=-Xfrontend --@build_bazel_rules_swift//swift:copt=-internalize-at-link
INFO: Found applicable config definition common:release in file /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc: --build_config=release --compilation_mode=opt --//Pods/cocoapods-bazel:config=release --//Pods/cocoapods-bazel:deps_config=deps_release
INFO: Found applicable config definition common:generate_dsym in file /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc: --apple_generate_dsym --output_groups=+dsyms
INFO: Found applicable config definition common:alpha in file /Users/build/.jenkins/workspace/cash-ios/ios-builder/s/c/.bazelrc: --release_variant=alpha

@tjgq
Contributor

tjgq commented Jul 23, 2024

@luispadron Can you provide the --experimental_remote_grpc_log for a build exhibiting this failure? Otherwise, it's going to be difficult to make progress on this.

@miscott2

Hi. We're also seeing this issue, although it's very intermittent: only 6 of 46,046 builds in the last week were affected. We're using a simple HTTP cache (an nginx caching proxy in front of Artifactory) and are quite confident it's not a cache issue.

I notice that your first message assumed --remote_download_all. Our builds use a mixture of --remote_download_all and --remote_download_toplevel, and while we don't see this often, it appears to happen only in --remote_download_all builds.

Please let us know what additional information/logs would be helpful.

@tjgq
Contributor

tjgq commented Jul 30, 2024

@miscott2 I think the --experimental_remote_grpc_log for one of the failed runs would be the most useful piece of information here. (You can scrub any sensitive information, but please preserve the digests.)

@tjgq
Contributor

tjgq commented Jul 30, 2024

Oh wait, if you're using an HTTP cache, there's no gRPC log; never mind.

@NeilKetley

NeilKetley commented Sep 27, 2024

@tjgq Since, as miscott2 says, we're using HTTP rather than gRPC, is there other information that would be useful? In an earlier post you asked whether this could be reproduced with an HTTP cache; the answer very much appears to be "yes", so is there a way to gather useful information in that case?

@tjgq
Contributor

tjgq commented Sep 27, 2024

@NeilKetley What's the eviction policy for your HTTP cache? i.e., do you have any automated process that periodically removes old entries from the cache? Would you be able to confirm whether the blob in question was present in the cache at some point, but later got deleted? I'm wondering whether this might be just a special case of #18696.

@tjgq
Contributor

tjgq commented Oct 24, 2024

I'm going with the theory that this is the same as #18696, which has been fixed in 7.4.0. Please reopen if you're seeing similar failures in 7.4.0 or later.

@tjgq tjgq closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 24, 2024
@NeilKetley

@tjgq Apologies for not responding sooner. We are still trying to reproduce the issue, collect the information, and answer the questions you posed earlier. I don't think we will be able to try a later version of Bazel at this point, since the issue is happening in our live build system, which isn't really open to experimentation. We will, however, collect the requested information and hope it either confirms your suspicion or rules it out.

@miscott2

miscott2 commented Nov 4, 2024

@tjgq We (in case it wasn't clear, Neil and I are colleagues) have just deployed Bazel 7.4.0 and are looking at enabling --experimental_remote_cache_eviction_retries, so we might have more information soon. That said, if I've understood correctly, doesn't this have to be slightly different from #18696? In that bug, the issue is that Bazel is building without the bytes, so when a CAS entry is found to be missing it can't fall back to local execution because it's too late to unwind. In this case, both the original submitter and our team are seeing the issue with build without the bytes disabled (--remote_download_all), so it's less clear why Bazel would be unable to fall back to running the action locally.

We are obviously following up with our cache team about the eviction strategy, but since we're using a "dumb" HTTP cache, it's a bit complicated!

@coeuvre
Member

coeuvre commented Nov 4, 2024

I think a failure to download in-memory files always fails the build, regardless of whether --remote_download_all or --remote_download_toplevel is set. We also believe the download error from the HTTP cache reported in this issue shares the same root cause as #18696.

Looking forward to the result of your experiments!

@luispadron
Contributor Author

Sorry, I wasn't able to get more details about the root cause here. We've now updated to 7.4.0 and removed our use of --noexperimental_inmemory_dotd_files, and we haven't seen any issues so far.

Thanks for fixing this and following up!

@luispadron luispadron reopened this Dec 18, 2024
@luispadron
Contributor Author

@coeuvre Unfortunately, it looks like we just hit this again in a Bazel 7.4.1 build on CI. It happened in two back-to-back builds:

ERROR: Code/CoreLibraries/Flows/FlowUI/BUILD.bazel:9:29: Compiling Code/CoreLibraries/Flows/FlowUI/Sources/ViewControllers/Passcode/SCPasscodeScreenViewController.m failed: unable to finalize action: Missing digest: <sha> for bazel-out/ios_arm64-opt-ios-arm64-min15.0-applebin_ios-ST-2dd28d6bf8a6/bin/Code/CoreLibraries/Flows/FlowUI/_objs/FlowUI_objc/arc/SCPasscodeScreenViewController.d

@coeuvre
Member

coeuvre commented Dec 19, 2024

Did you set --experimental_remote_cache_eviction_retries? Did retrying help?

The error might still happen because of cache eviction, but combining retries with the fix mentioned above should allow builds to continue and re-upload the evicted files.
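A minimal .bazelrc sketch of that combination, assuming Bazel 7.4.0 or later (the release said to contain the #18696 fix). The retry count of 5 is the value used later in this thread, not a recommended default.

```
# .bazelrc sketch; assumes Bazel 7.4.0 or later.
# Retry the build up to 5 times if it fails with a remote cache eviction error.
build --experimental_remote_cache_eviction_retries=5
```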

@luispadron
Contributor Author

Ah, my bad, I did not have --experimental_remote_cache_eviction_retries set. Let me try that and follow up.

@luispadron
Contributor Author

We've had this enabled for a few weeks and things seemed generally stable, but we just had another failure. We are using --experimental_remote_cache_eviction_retries=5:

Code/ThirdParty/UICountingLabel/BUILD.bazel:3:29: Compiling Code/ThirdParty/UICountingLabel/Sources/UICountingLabel.m failed: unable to finalize action: Missing digest: <SHA> for bazel-out/ios_arm64-opt-ios-arm64-min16.0-applebin_ios-ST-01dfa3c2e826/bin/Code/ThirdParty/UICountingLabel/_objs/UICountingLabel_objc/arc/UICountingLabel.d

9 participants