Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move GCE config to google-guest-agent #150837

Merged
merged 4 commits into from
Feb 6, 2022
Merged

Conversation

abbradar
Copy link
Member

Motivation for this change

google-compute-engine package is currently deprecated. Move us to a newer set of packages from Google.

Things done
  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandbox = true set in nix.conf? (See Nix manual)
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 22.05 Release Notes (or backporting 21.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
    • (Release notes changes) Ran nixos/doc/manual/md-to-db.sh to update generated release notes
  • Fits CONTRIBUTING.md.

Tested in production -- works, including both OS Login and fetched keys.

@github-actions github-actions bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 6.topic: python 8.has: module (update) This PR changes an existing module in `nixos/` labels Dec 15, 2021
@abbradar abbradar force-pushed the google-guest-agent branch 2 times, most recently from 34b2394 to 04caaa5 Compare December 15, 2021 12:24
@ofborg ofborg bot added 8.has: clean-up 8.has: package (new) This PR adds a new package labels Dec 15, 2021
@ofborg ofborg bot requested a review from kalbasit December 15, 2021 12:41
@ofborg ofborg bot added 11.by: package-maintainer This PR was created by the maintainer of the package it changes 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 labels Dec 15, 2021
@abbradar abbradar force-pushed the google-guest-agent branch from 04caaa5 to 285fa69 Compare January 5, 2022 11:26
@putchar
Copy link
Contributor

putchar commented Jan 7, 2022

Hello

I tested this pr the following way

i used https://github.com/nix-community/nixos-generators + flake to create my image locally

{
  inputs = {
    nixpkgs.url = "github:abbradar/nixpkgs/google-guest-agent";
    nixos-generators = {
      url = "github:nix-community/nixos-generators";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };
  outputs = { self, nixpkgs, nixos-generators, ... }: {
    packages.x86_64-linux = {
      gce = nixos-generators.nixosGenerate {
        pkgs = nixpkgs.legacyPackages.x86_64-linux;
        format = "gce";
      };
    };
  };
}

then nix build .#gce

i pushed the tar.gz on a bucket. then made a vm image from it

both login from gcp console works (with metadata, with ssh-key)
the agent does push my public keys properly
image

also it works with
gcloud beta compute ssh --zone "europe-west1-b" "instance-1" --tunnel-through-iap --project "myproject"

@abbradar abbradar requested a review from flokli January 8, 2022 08:57
@flokli
Copy link
Contributor

flokli commented Jan 8, 2022

@abbradar how is this related to #144761?

@abbradar
Copy link
Member Author

abbradar commented Jan 8, 2022

@flokli I didn't see this one -- my patchset was done before #144761, but was used in our production privately. Thanks for the heads up!

After a brief comparison, I see these differences:

  1. I moved google-compute-engine-oslogin to google-guest-oslogin, but I don't see any reason to do this now -- it just seems plain wrong. I'll investigate;
  2. nixos/google-compute-config: fix module #144761 disables OS Login in google-guest-agent. I didn't see any issues with it back when implementing my patchset, though it was more than half a year ago and I might have missed something. OS Login seems to work now but I'll re-check in case any error messages appear;
  3. I use upstream systemd units and as little extra configuration as possible. OTOH nixos/google-compute-config: fix module #144761 might have some interesting configuration options that I missed. Need to compare.

Overall both PRs do the same thing, so I'm okay with closing this one and helping with #144761. First we need to investigate why does my PR work correctly with OS Login (or at least used to work). I also see some interesting comments re:ZFS there.

@picnoir
Copy link
Member

picnoir commented Jan 10, 2022

Overall both PRs do the same thing, so I'm okay with closing this one and helping with #144761. First we need to investigate why does my PR work correctly with OS Login (or at least used to work). I also see some interesting comments re:ZFS there.

As I was saying in 166761, I currently do not have any bandwidth to move it forward. Let's rather push your PR forward.

WRT. oslogin: the guest agent is going to try to setup the PAM module, NSS config & ssh config (see https://github.com/GoogleCloudPlatform/guest-agent/blob/main/google_guest_agent/oslogin.go#L99). It'll fail on NixOS (these paths are read-only).

Your PR is working fine because you're setting these using the NixOS module system. However, I think we should try to prevent the agent from trying to setup these as we do in #144761. We should upstream the patch allowing us to disable this config.


Apart from this, I don't see any major blocker in this PR. Ping me when you fix this and I'll properly review this PR.

@abbradar
Copy link
Member Author

abbradar commented Jan 10, 2022

After more investigation I merged two PRs while addressing concerns above. Mainly we still fully support both legacy login via metadata (if users.mutableUsers = true) and OS Login. I added another patch for guest-agent, allowing to disable some parts of accounts daemon that produced errors. With that we now have clean logs without any errors.

While testing I discovered that OS Login test started to fail on newer unstable and sometimes on development machines I tested it on. Turns out we don't always start nscd daemon, which is required for successful OS Login. I split the fix into a separate PR coming soon.

Keep in mind that nixos.patch here is temporary; I intend to submit both patches to upstream in the coming days, but you can already start reviewing this new patchset.

abbradar added a commit to abbradar/nixpkgs that referenced this pull request Jan 10, 2022
During working on NixOS#150837 I discovered that `google-oslogin` test
started failing, and so did some of my development machines. Turns out
it was because nscd doesn't start by default; rather it's wanted by
NSS lookup targets, which are not always fired up.

To quote from section on systemd.special(7) on `nss-user-lookup.target`:

> All services which provide parts of the user/group database should be
> ordered before this target, and pull it in.

Following this advice and comparing our unit to official `sssd.service`
unit (which is a similar service), we now pull NSS lookup targets from
the service, while starting it with `multi-user.target`.
@abbradar
Copy link
Member Author

Opened GoogleCloudPlatform/guest-agent#152.

@abbradar
Copy link
Member Author

I have replaced nixos.patch with links to patches submitted upstream, so no pending tasks remain for this PR before another round of reviews.

github-actions bot pushed a commit that referenced this pull request Jan 11, 2022
During working on #150837 I discovered that `google-oslogin` test
started failing, and so did some of my development machines. Turns out
it was because nscd doesn't start by default; rather it's wanted by
NSS lookup targets, which are not always fired up.

To quote from section on systemd.special(7) on `nss-user-lookup.target`:

> All services which provide parts of the user/group database should be
> ordered before this target, and pull it in.

Following this advice and comparing our unit to official `sssd.service`
unit (which is a similar service), we now pull NSS lookup targets from
the service, while starting it with `multi-user.target`.

(cherry picked from commit b451eca)
@abbradar
Copy link
Member Author

Sadly upstream decided against including our changes. Given that, I propose to simplify the patch and just disable problematic features while changing minimal number of lines. This way the patch will be easier to maintain out-of-tree. @NinjaTrappeur what do you think?

@abbradar abbradar force-pushed the google-guest-agent branch 3 times, most recently from 88bbe6c to e2b3f53 Compare January 13, 2022 15:44
@abbradar
Copy link
Member Author

Minimized and moved our changes back to nixos.patch. Ready for review.

picnoir referenced this pull request in picnoir/nixpkgs Jan 27, 2022
The python scripts we're using have been deprecated in favor of the
guest agent, see more infos at:
GoogleCloudPlatform/compute-image-packages@276e520#diff-267a2788071ee63df1443363c2ab882e7d321adc77ee53462a920229a962eabbL40

The guest agent onboard all the old python scripts as part of two new
go binaries: google_guest_agent and google_metadata_script_runner.
Both are using the same /etc/default-instance_configs.cfg
configuration file.

The configuration file we embed in this module mimicks the setup we
had with the python scripts. We'll have two services:

- The metadata script service: in charge of running the
  google compute startup/showdown scripts.
- The guest agent: in charge of syncing the guest clock with the host
  and setup the network interfaces on boot.

Co-authored-by: Mark Karpov <[email protected]>
@andre-vspry
Copy link

I'd like to see this merged. It would make upgrading to 21.11 much easier for GCP infra. Does it need more testing?

Copy link
Contributor

@flokli flokli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small styling nits to make this a bit more robust in case of upgrades.

Other than that, LGTM. I ran the tests and they still succeed.

pkgs/tools/virtualization/google-guest-configs/default.nix Outdated Show resolved Hide resolved
pkgs/tools/virtualization/google-guest-agent/default.nix Outdated Show resolved Hide resolved
@ofborg ofborg bot requested a review from flokli February 5, 2022 11:12
Copy link
Contributor

@flokli flokli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks! Should we add some release notes for this - is there things in how this behaves differently from the previous release?

Copy link
Contributor

@flokli flokli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hold on… the VM tests seem to fail now:

^[[2mserver # [    7.345607] python[856]: 169.254.169.254 - - [05/Feb/2022 20:18:47] "GET /computeMetadata/v1/oslogin/users?username=mockuser_nixos_org HTTP/1.1" 200 -^[[0m
^[[2mserver # [    7.346350] python[856]: b'{"loginProfiles": [{"name": "418156207295006328628", "posixAccounts": [{"primary": true, "username": "mockuser_nixos_org", "uid": "1009719690", "gid": "1009719690", "homeDirectory": "/home/mockuser_nixos_org", "operatingSystemType": "LINUX"}], "sshPublicKeys": {"fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48": {"key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBChdA2BmwcG49OrQN33f/sj+OHL5sJhwVl2Qim0vkUJQCry1zFpKTa9ZcDMiWaEhoAR6FGoaGI04ff7CS+1yybQ= snakeoil", "expirationTimeUsec": "1644092927255555.2", "fingerprint": "fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48"}}}]}'^[[0m
^[[2mserver # [    7.349218] python[856]: 169.254.169.254 - - [05/Feb/2022 20:18:47] "GET /computeMetadata/v1/oslogin/users?username=mockuser_nixos_org HTTP/1.1" 200 -^[[0m
^[[2mserver # [    7.350087] python[856]: b'{"loginProfiles": [{"name": "418156207295006328628", "posixAccounts": [{"primary": true, "username": "mockuser_nixos_org", "uid": "1009719690", "gid": "1009719690", "homeDirectory": "/home/mockuser_nixos_org", "operatingSystemType": "LINUX"}], "sshPublicKeys": {"fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48": {"key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBChdA2BmwcG49OrQN33f/sj+OHL5sJhwVl2Qim0vkUJQCry1zFpKTa9ZcDMiWaEhoAR6FGoaGI04ff7CS+1yybQ= snakeoil", "expirationTimeUsec": "1644092927256354.2", "fingerprint": "fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48"}}}]}'^[[0m
^[[2mserver # [    7.352931] python[856]: 169.254.169.254 - - [05/Feb/2022 20:18:47] "GET /computeMetadata/v1/oslogin/users?username=mockuser_nixos_org HTTP/1.1" 200 -^[[0m
^[[2mserver # [    7.353733] python[856]: b'{"loginProfiles": [{"name": "418156207295006328628", "posixAccounts": [{"primary": true, "username": "mockuser_nixos_org", "uid": "1009719690", "gid": "1009719690", "homeDirectory": "/home/mockuser_nixos_org", "operatingSystemType": "LINUX"}], "sshPublicKeys": {"fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48": {"key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBChdA2BmwcG49OrQN33f/sj+OHL5sJhwVl2Qim0vkUJQCry1zFpKTa9ZcDMiWaEhoAR6FGoaGI04ff7CS+1yybQ= snakeoil", "expirationTimeUsec": "1644092927299759.2", "fingerprint": "fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48"}}}]}'^[[0m
^[[2mserver # [    7.357394] python[856]: 169.254.169.254 - - [05/Feb/2022 20:18:47] "GET /computeMetadata/v1/oslogin/users?username=mockuser_nixos_org HTTP/1.1" 200 -^[[0m
client: output: 
cleanup
kill machine (pid 9)
^[[2mserver # [    7.358662] python[856]: b'{"loginProfiles": [{"name": "418156207295006328628", "posixAccounts": [{"primary": true, "username": "mockuser_nixos_org", "uid": "1009719690", "gid": "1009719690", "homeDirectory": "/home/mockuser_nixos_org", "operatingSystemType": "LINUX"}], "sshPublicKeys": {"fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48": {"key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBChdA2BmwcG49OrQN33f/sj+OHL5sJhwVl2Qim0vkUJQCry1zFpKTa9ZcDMiWaEhoAR6FGoaGI04ff7CS+1yybQ= snakeoil", "expirationTimeUsec": "1644092927307183.2", "fingerprint": "fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48"}}}]}'^[[0m
^[[2mserver # [    7.361550] python[856]: Unhandled path: /computeMetadata/v1/oslogin/groups?email=418156207295006328628^[[0m
^[[2mserver # [    7.362276] python[856]: 169.254.169.254 - - [05/Feb/2022 20:18:47] "GET /computeMetadata/v1/oslogin/groups?email=418156207295006328628 HTTP/1.1" 404 -^[[0m
^[[2mserver # [    7.363560] sshd[895]: Received disconnect from 192.168.1.1 port 58964:11: disconnected by user^[[0m
^[[2mclient # qemu-system-x86_64: terminating on signal 15 from pid 6 (/nix/store/dn4fwp0yx6nsa85cr20cwvdmg64xwmcy-python3-3.9.9/bin/python3.9)^[[0m
^[[2mserver # [    7.366536] sshd[895]: Disconnected from user mockuser_nixos_org 192.168.1.1 port 58964^[[0m
kill machine (pid 22)
^[[2mserver # qemu-system-x86_64: terminating on signal 15 from pid 6 (/nix/store/dn4fwp0yx6nsa85cr20cwvdmg64xwmcy-python3-3.9.9/bin/python3.9)^[[0m
(finished: cleanup, in 0.02 seconds)
Traceback (most recent call last):
  File "/nix/store/97nb219kaxqwj94cj596zcyzyw8d4ix7-nixos-test-driver-1.1/bin/.nixos-test-driver-wrapped", line 9, in <module>
    sys.exit(main())
  File "/nix/store/97nb219kaxqwj94cj596zcyzyw8d4ix7-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/__init__.py", line 86, in main
    driver.run_tests()
  File "/nix/store/97nb219kaxqwj94cj596zcyzyw8d4ix7-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/driver.py", line 121, in run_tests
    self.test_script()
  File "/nix/store/97nb219kaxqwj94cj596zcyzyw8d4ix7-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/driver.py", line 117, in test_script
    exec(self.tests, symbols, None)
  File "<string>", line 31, in <module>
  File "/nix/store/97nb219kaxqwj94cj596zcyzyw8d4ix7-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 559, in succeed
    raise Exception(
Exception: command `ssh mockuser_nixos_org@server 'true'` failed (exit code 254)
kill vlan (pid 8)
kill vlan (pid 7)

Full log

@abbradar
Copy link
Member Author

abbradar commented Feb 5, 2022

Ugh. I'll check this, thanks for noticing.

re release notes: I'm honestly not sure if OS Login and old login path (via adding mutable users) worked well before. If they didn't, we effectively fixed a bug. Other than that -- should have no visible changes.

@abbradar
Copy link
Member Author

abbradar commented Feb 5, 2022

It works for me when I rebase atop unstable. I think it was caused by #157112.

@ofborg ofborg bot requested a review from flokli February 6, 2022 01:03
@flokli
Copy link
Contributor

flokli commented Feb 6, 2022

RIght, with the rebase, the test succeeds again. Thanks!

@flokli flokli merged commit def482e into NixOS:master Feb 6, 2022
@andre-vspry
Copy link

Is this suitable for backporting to 21.11 also?

@flokli
Copy link
Contributor

flokli commented Feb 10, 2022

This is a bit too scary to backport IMHO. People who want this before the next release can just stick to unstable for the time being.

@jonringer
Copy link
Contributor

Agreed. If these changes expect a user to react, then it should be postponed. Stable should allow for users to determine "when" to eat the upgrade cost going between release branches, rather than some update.

@andre-vspry
Copy link

Wouldn't it only impact people running 21.11 on GCP? I'm stuck on 21.05 for GCP because the agent doesn't start and therefore doesn't get project provided ssh keys. If I build a custom image with keys then I can log in but the metadata service is borked. Unless I've made some error in deploying an image, I would say not many, if any, are running 21.11 on GCP with a working agent.

@picnoir
Copy link
Member

picnoir commented Feb 11, 2022 via email

@andre-vspry
Copy link

andre-vspry commented Feb 14, 2022

I didn't realise that the google cloud console "Equivalent command line" button omits the --metadata switch. This lead to the instance not getting ssh keys from oslogin. Since the console log was all I could see and it showed the fetch ssh keys failing, I assumed this was the issue.

My mistake. I've opened an issue for the misleading service failure. #159921

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: clean-up 8.has: module (update) This PR changes an existing module in `nixos/` 8.has: package (new) This PR adds a new package 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 11.by: package-maintainer This PR was created by the maintainer of the package it changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants