-
-
Notifications
You must be signed in to change notification settings - Fork 14.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move GCE config to google-guest-agent #150837
Conversation
34b2394
to
04caaa5
Compare
04caaa5
to
285fa69
Compare
Hello I tested this pr the following way i used https://github.com/nix-community/nixos-generators + flake to create my image locally {
inputs = {
nixpkgs.url = "github:abbradar/nixpkgs/google-guest-agent";
nixos-generators = {
url = "github:nix-community/nixos-generators";
inputs.nixpkgs.follows = "nixpkgs";
};
};
outputs = { self, nixpkgs, nixos-generators, ... }: {
packages.x86_64-linux = {
gce = nixos-generators.nixosGenerate {
pkgs = nixpkgs.legacyPackages.x86_64-linux;
format = "gce";
};
};
};
} then i pushed the tar.gz on a bucket. then made a vm image from it both login from gcp console works (with metadata, with ssh-key) also it works with |
@flokli I didn't see this one -- my patchset was done before #144761, but was used in our production privately. Thanks for the heads up! After a brief comparison, I see these differences:
Overall both PRs do the same thing, so I'm okay with closing this one and helping with #144761. First we need to investigate why does my PR work correctly with OS Login (or at least used to work). I also see some interesting comments re:ZFS there. |
As I was saying in 166761, I currently do not have any bandwidth to move it forward. Let's rather push your PR forward. WRT. oslogin: the guest agent is going to try to setup the PAM module, NSS config & ssh config (see https://github.com/GoogleCloudPlatform/guest-agent/blob/main/google_guest_agent/oslogin.go#L99). It'll fail on NixOS (these paths are read-only). Your PR is working fine because you're setting these using the NixOS module system. However, I think we should try to prevent the agent from trying to setup these as we do in #144761. We should upstream the patch allowing us to disable this config. Apart from this, I don't see any major blocker in this PR. Ping me when you fix this and I'll properly review this PR. |
285fa69
to
80dceec
Compare
After more investigation I merged two PRs while addressing concerns above. Mainly we still fully support both legacy login via metadata (if While testing I discovered that OS Login test started to fail on newer unstable and sometimes on development machines I tested it on. Turns out we don't always start nscd daemon, which is required for successful OS Login. I split the fix into a separate PR coming soon. Keep in mind that |
During working on NixOS#150837 I discovered that `google-oslogin` test started failing, and so did some of my development machines. Turns out it was because nscd doesn't start by default; rather it's wanted by NSS lookup targets, which are not always fired up. To quote from section on systemd.special(7) on `nss-user-lookup.target`: > All services which provide parts of the user/group database should be > ordered before this target, and pull it in. Following this advice and comparing our unit to official `sssd.service` unit (which is a similar service), we now pull NSS lookup targets from the service, while starting it with `multi-user.target`.
80dceec
to
7c76ca0
Compare
I have replaced |
During working on #150837 I discovered that `google-oslogin` test started failing, and so did some of my development machines. Turns out it was because nscd doesn't start by default; rather it's wanted by NSS lookup targets, which are not always fired up. To quote from section on systemd.special(7) on `nss-user-lookup.target`: > All services which provide parts of the user/group database should be > ordered before this target, and pull it in. Following this advice and comparing our unit to official `sssd.service` unit (which is a similar service), we now pull NSS lookup targets from the service, while starting it with `multi-user.target`. (cherry picked from commit b451eca)
Sadly upstream decided against including our changes. Given that, I propose to simplify the patch and just disable problematic features while changing minimal number of lines. This way the patch will be easier to maintain out-of-tree. @NinjaTrappeur what do you think? |
88bbe6c
to
e2b3f53
Compare
Minimized and moved our changes back to |
The python scripts we're using have been deprecated in favor of the guest agent, see more infos at: GoogleCloudPlatform/compute-image-packages@276e520#diff-267a2788071ee63df1443363c2ab882e7d321adc77ee53462a920229a962eabbL40 The guest agent onboard all the old python scripts as part of two new go binaries: google_guest_agent and google_metadata_script_runner. Both are using the same /etc/default-instance_configs.cfg configuration file. The configuration file we embed in this module mimicks the setup we had with the python scripts. We'll have two services: - The metadata script service: in charge of running the google compute startup/showdown scripts. - The guest agent: in charge of syncing the guest clock with the host and setup the network interfaces on boot. Co-authored-by: Mark Karpov <[email protected]>
I'd like to see this merged. It would make upgrading to 21.11 much easier for GCP infra. Does it need more testing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small styling nits to make this a bit more robust in case of upgrades.
Other than that, LGTM. I ran the tests and they still succeed.
e2b3f53
to
a1f273f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks! Should we add some release notes for this - is there things in how this behaves differently from the previous release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hold on… the VM tests seem to fail now:
^[[2mserver # [ 7.345607] python[856]: 169.254.169.254 - - [05/Feb/2022 20:18:47] "GET /computeMetadata/v1/oslogin/users?username=mockuser_nixos_org HTTP/1.1" 200 -^[[0m
^[[2mserver # [ 7.346350] python[856]: b'{"loginProfiles": [{"name": "418156207295006328628", "posixAccounts": [{"primary": true, "username": "mockuser_nixos_org", "uid": "1009719690", "gid": "1009719690", "homeDirectory": "/home/mockuser_nixos_org", "operatingSystemType": "LINUX"}], "sshPublicKeys": {"fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48": {"key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBChdA2BmwcG49OrQN33f/sj+OHL5sJhwVl2Qim0vkUJQCry1zFpKTa9ZcDMiWaEhoAR6FGoaGI04ff7CS+1yybQ= snakeoil", "expirationTimeUsec": "1644092927255555.2", "fingerprint": "fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48"}}}]}'^[[0m
^[[2mserver # [ 7.349218] python[856]: 169.254.169.254 - - [05/Feb/2022 20:18:47] "GET /computeMetadata/v1/oslogin/users?username=mockuser_nixos_org HTTP/1.1" 200 -^[[0m
^[[2mserver # [ 7.350087] python[856]: b'{"loginProfiles": [{"name": "418156207295006328628", "posixAccounts": [{"primary": true, "username": "mockuser_nixos_org", "uid": "1009719690", "gid": "1009719690", "homeDirectory": "/home/mockuser_nixos_org", "operatingSystemType": "LINUX"}], "sshPublicKeys": {"fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48": {"key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBChdA2BmwcG49OrQN33f/sj+OHL5sJhwVl2Qim0vkUJQCry1zFpKTa9ZcDMiWaEhoAR6FGoaGI04ff7CS+1yybQ= snakeoil", "expirationTimeUsec": "1644092927256354.2", "fingerprint": "fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48"}}}]}'^[[0m
^[[2mserver # [ 7.352931] python[856]: 169.254.169.254 - - [05/Feb/2022 20:18:47] "GET /computeMetadata/v1/oslogin/users?username=mockuser_nixos_org HTTP/1.1" 200 -^[[0m
^[[2mserver # [ 7.353733] python[856]: b'{"loginProfiles": [{"name": "418156207295006328628", "posixAccounts": [{"primary": true, "username": "mockuser_nixos_org", "uid": "1009719690", "gid": "1009719690", "homeDirectory": "/home/mockuser_nixos_org", "operatingSystemType": "LINUX"}], "sshPublicKeys": {"fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48": {"key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBChdA2BmwcG49OrQN33f/sj+OHL5sJhwVl2Qim0vkUJQCry1zFpKTa9ZcDMiWaEhoAR6FGoaGI04ff7CS+1yybQ= snakeoil", "expirationTimeUsec": "1644092927299759.2", "fingerprint": "fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48"}}}]}'^[[0m
^[[2mserver # [ 7.357394] python[856]: 169.254.169.254 - - [05/Feb/2022 20:18:47] "GET /computeMetadata/v1/oslogin/users?username=mockuser_nixos_org HTTP/1.1" 200 -^[[0m
client: output:
cleanup
kill machine (pid 9)
^[[2mserver # [ 7.358662] python[856]: b'{"loginProfiles": [{"name": "418156207295006328628", "posixAccounts": [{"primary": true, "username": "mockuser_nixos_org", "uid": "1009719690", "gid": "1009719690", "homeDirectory": "/home/mockuser_nixos_org", "operatingSystemType": "LINUX"}], "sshPublicKeys": {"fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48": {"key": "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBChdA2BmwcG49OrQN33f/sj+OHL5sJhwVl2Qim0vkUJQCry1zFpKTa9ZcDMiWaEhoAR6FGoaGI04ff7CS+1yybQ= snakeoil", "expirationTimeUsec": "1644092927307183.2", "fingerprint": "fa80819b5035333b770ae3a7cf7f76ada71c2739e87419509064dc4106b91d48"}}}]}'^[[0m
^[[2mserver # [ 7.361550] python[856]: Unhandled path: /computeMetadata/v1/oslogin/groups?email=418156207295006328628^[[0m
^[[2mserver # [ 7.362276] python[856]: 169.254.169.254 - - [05/Feb/2022 20:18:47] "GET /computeMetadata/v1/oslogin/groups?email=418156207295006328628 HTTP/1.1" 404 -^[[0m
^[[2mserver # [ 7.363560] sshd[895]: Received disconnect from 192.168.1.1 port 58964:11: disconnected by user^[[0m
^[[2mclient # qemu-system-x86_64: terminating on signal 15 from pid 6 (/nix/store/dn4fwp0yx6nsa85cr20cwvdmg64xwmcy-python3-3.9.9/bin/python3.9)^[[0m
^[[2mserver # [ 7.366536] sshd[895]: Disconnected from user mockuser_nixos_org 192.168.1.1 port 58964^[[0m
kill machine (pid 22)
^[[2mserver # qemu-system-x86_64: terminating on signal 15 from pid 6 (/nix/store/dn4fwp0yx6nsa85cr20cwvdmg64xwmcy-python3-3.9.9/bin/python3.9)^[[0m
(finished: cleanup, in 0.02 seconds)
Traceback (most recent call last):
File "/nix/store/97nb219kaxqwj94cj596zcyzyw8d4ix7-nixos-test-driver-1.1/bin/.nixos-test-driver-wrapped", line 9, in <module>
sys.exit(main())
File "/nix/store/97nb219kaxqwj94cj596zcyzyw8d4ix7-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/__init__.py", line 86, in main
driver.run_tests()
File "/nix/store/97nb219kaxqwj94cj596zcyzyw8d4ix7-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/driver.py", line 121, in run_tests
self.test_script()
File "/nix/store/97nb219kaxqwj94cj596zcyzyw8d4ix7-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/driver.py", line 117, in test_script
exec(self.tests, symbols, None)
File "<string>", line 31, in <module>
File "/nix/store/97nb219kaxqwj94cj596zcyzyw8d4ix7-nixos-test-driver-1.1/lib/python3.9/site-packages/test_driver/machine.py", line 559, in succeed
raise Exception(
Exception: command `ssh mockuser_nixos_org@server 'true'` failed (exit code 254)
kill vlan (pid 8)
kill vlan (pid 7)
Ugh. I'll check this, thanks for noticing. re release notes: I'm honestly not sure if OS Login and old login path (via adding mutable users) worked well before. If they didn't, we effectively fixed a bug. Other than that -- should have no visible changes. |
* Rename to google-guest-oslogin; * Fix systemd service; * Small cleanup.
a1f273f
to
524aecf
Compare
It works for me when I rebase atop unstable. I think it was caused by #157112. |
RIght, with the rebase, the test succeeds again. Thanks! |
Is this suitable for backporting to 21.11 also? |
This is a bit too scary to backport IMHO. People who want this before the next release can just stick to unstable for the time being. |
Agreed. If these changes expect a user to react, then it should be postponed. Stable should allow for users to determine "when" to eat the upgrade cost going between release branches, rather than some update. |
Wouldn't it only impact people running 21.11 on GCP? I'm stuck on 21.05 for GCP because the agent doesn't start and therefore doesn't get project provided ssh keys. If I build a custom image with keys then I can log in but the metadata service is borked. Unless I've made some error in deploying an image, I would say not many, if any, are running 21.11 on GCP with a working agent. |
I didn't realise that the google cloud console "Equivalent command line" button omits the My mistake. I've opened an issue for the misleading service failure. #159921 |
Motivation for this change
google-compute-engine
package is currently deprecated. Move us to a newer set of packages from Google.Things done
sandbox = true
set innix.conf
? (See Nix manual)nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"
. Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/
)nixos/doc/manual/md-to-db.sh
to update generated release notesTested in production -- works, including both OS Login and fetched keys.