nixos-rebuild --use-remote-sudo fails with sudo error #118655

Open
asymmetric opened this issue Apr 6, 2021 · 20 comments
Labels
0.kind: bug Something is broken 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS

Comments

@asymmetric
Contributor

asymmetric commented Apr 6, 2021

Describe the bug
On a Google Cloud instance with OS Login enabled, running

nixos-rebuild switch --target-host me@myhost --flake ..#wireguard-gateway --use-remote-sudo

fails with

sudo: you do not exist in the passwd database

There is some strange interaction with nix-copy-closure and SSH session reuse. It seems that if the first connection is established by nix-copy-closure, then some information is lost/not added to the session.

A workaround is to prefix the nixos-rebuild invocation with `NIX_SSHOPTS='-o ControlMaster=no'`.

To Reproduce
Steps to reproduce the behavior:

  1. Have a GCE instance with OS Login enabled
  2. Install NixOS
  3. Run nixos-rebuild as outlined above.
  4. To work around the failure, run NIX_SSHOPTS='-o ControlMaster=no' nixos-rebuild
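
For convenience, the reproduction command and the workaround can be combined in a small wrapper. This is only a sketch: the function name rebuild_without_mux and the DRY_RUN switch are made up for illustration, and the host and flake arguments are the placeholders from the report.

```shell
# Hypothetical helper (not part of nixos-rebuild): runs the rebuild from the
# report with SSH connection sharing disabled via NIX_SSHOPTS.
# Set DRY_RUN=1 to print the nixos-rebuild command instead of running it.
rebuild_without_mux() {
  target="$1"   # e.g. me@myhost
  flake="$2"    # e.g. ..#wireguard-gateway
  NIX_SSHOPTS='-o ControlMaster=no' \
    ${DRY_RUN:+echo} nixos-rebuild switch \
      --target-host "$target" --flake "$flake" --use-remote-sudo
}
```

With DRY_RUN=1 only the nixos-rebuild command line is printed; the NIX_SSHOPTS assignment applies to the real run but is not part of the printed text.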

Expected behavior
Switch happens correctly

Additional context

The getent command correctly returns the user:

[ext_foo_gmail_com@wireguard-gateway:~]$ getent passwd $(whoami)
ext_foo_gmail_com::2812434188:2812434188::/home/ext_foo_gmail_com:/nix/store/6kxhv6s36p5l3jylxzwvqn4qm3fjkb63-bash-interactive-4.4-p23/bin/bash

OS Login uses NSS to make the user available to the system
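
To compare what NSS returns in different SSH sessions, the getent check above can be wrapped in a function and run once in a fresh session and once in a reused one. This is a minimal sketch; the function name user_resolves is hypothetical, and whether the lookup is served by nscd depends on the host configuration.

```shell
# Returns 0 if the given name resolves through NSS (files, OS Login, ...),
# non-zero otherwise. getent exits non-zero when the key is not found.
user_resolves() {
  getent passwd "$1" >/dev/null
}
```

For example, `user_resolves "$(whoami)" && echo ok || echo missing` reports "missing" in the same situation where sudo complains that the user does not exist in the passwd database.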

Notify maintainers

Metadata

 - system: `"x86_64-linux"`
 - host os: `Linux 5.4.99, NixOS, 20.09.20210316.6557a3c (Nightingale)`
 - multi-user?: `yes`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.4pre20210308_1c0e3e4`
 - channels(root): `"nixos-20.09.3346.4d0ee90c6e2"`
 - channels(asymmetric): `""`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
@asymmetric asymmetric added the 0.kind: bug Something is broken label Apr 6, 2021
@veprbl veprbl added the 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS label Apr 6, 2021
@asymmetric
Contributor Author

asymmetric commented Apr 15, 2021

/cc @AmineChikhaoui @tewfik-ghariani in case you've encountered this while working on NixOps, and @flokli for your work on the oslogin package.

@flokli
Contributor

flokli commented Apr 15, 2021 via email

@asymmetric
Contributor Author

I just tried again, and this time it worked, so no, it doesn't happen reliably.

And yes, I've just observed an nscd crash:

Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal systemd[1]: Started Name Service Cache Daemon.
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal sudo[23379]: pam_unix(sudo:session): session closed for user root
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal sshd[23376]: Close session: user ext_foo_gmail_com from my.ip.addr.ess port 43564 id 0
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal sshd[23376]: Starting session: command for ext_foo_gmail_com from my.ip.addr.ess port 43564 id 0
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal kernel: nscd[23405]: segfault at 0 ip 00007fd031662751 sp 00007fd013bfc0d8 error 4 in libc-2.32.so[7fd03152f000+144000]
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal kernel: Code: 84 00 00 00 00 00 0f 1f 00 31 c0 c5 f8 77 c3 66 2e 0f 1f 84 00 00 00 00 00 89 f9 48 89 fa c5 f9 ef c0 83 e1 3f 83 f9 20 77 1f <c5> fd 74 0f c5 fd d7 c1 85 c0 0f 85 df 00 00 00 48 83 c7 20 83 e1
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal systemd[1]: Started Process Core Dump (PID 23411/UID 0).
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal systemd-coredump[23412]: [🡕] Process 23393 (nscd) of user 64015 dumped core.
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal systemd[1]: [email protected]: Succeeded.
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal systemd[1]: nscd.service: Main process exited, code=killed, status=11/SEGV
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal systemd[1]: nscd.service: Failed with result 'signal'.
Apr 16 08:48:17 wireguard-gateway.c.my-project-185616.internal systemd[1]: nscd.service: Consumed 81ms CPU time, received 5.4K IP traffic, sent 1.8K IP traffic.

This crash seems to happen on every operation, including e.g. nix copy --to.
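
To check how often nscd crashes around a given operation, the kernel log lines shown above can be counted with a small filter. This is a sketch under the assumption that the segfault lines look like the ones in this log; the function name count_nscd_segfaults is made up.

```shell
# Reads log lines on stdin and prints how many are nscd segfault reports,
# matching kernel messages of the form "nscd[PID]: segfault at ...".
count_nscd_segfaults() {
  # grep -c prints the count; "|| true" keeps a count of 0 from being
  # treated as a failure by the caller.
  grep -c 'nscd\[[0-9]*\]: segfault' || true
}
```

Usage could be `journalctl -k --since today | count_nscd_segfaults`, run before and after a nix copy or nixos-rebuild invocation.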

@asymmetric
Contributor Author

This is gdb on the coredump:

[nix-shell:/home/ext_foo_gmail_com]# coredumpctl debug
           PID: 37101 (nscd)
           UID: 64015 (nscd)
           GID: 64015 (nscd)
        Signal: 11 (SEGV)
     Timestamp: Sun 2021-04-18 14:15:27 UTC (10min ago)
  Command Line: nscd
    Executable: /nix/store/h68c6qvm6fwfzzj2b1q9xpi0x5qln25i-glibc-2.32-40-bin/bin/nscd
 Control Group: /system.slice/nscd.service
          Unit: nscd.service
         Slice: system.slice
       Boot ID: 86526397a7b64f00a7c4a1701fc6ed91
    Machine ID: 9b6d6ab4f9f2070a36c2eec1d5b5d347
      Hostname: wireguard-gateway.c.my-project-185616.internal
       Storage: /var/lib/systemd/coredump/core.nscd.64015.86526397a7b64f00a7c4a1701fc6ed91.37101.1618755327000000.lz4
       Message: Process 37101 (nscd) of user 64015 dumped core.

GNU gdb (GDB) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /nix/store/h68c6qvm6fwfzzj2b1q9xpi0x5qln25i-glibc-2.32-40-bin/bin/nscd...
(No debugging symbols found in /nix/store/h68c6qvm6fwfzzj2b1q9xpi0x5qln25i-glibc-2.32-40-bin/bin/nscd)
[New LWP 37112]
[New LWP 37101]
[New LWP 37105]
[New LWP 37106]
[New LWP 37107]
[New LWP 37108]
[New LWP 37109]
[New LWP 37110]
[New LWP 37111]
[New LWP 37113]
[New LWP 37114]

warning: File "/nix/store/1jn6apz0fa9h9x7rl3v6vwiymwnjznwv-glibc-2.32-40/lib/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/nix/store/c10296m7xgm3ksibcklb2xf48jr635x3-gcc-9.3.0-lib".
To enable execution of this file add
	add-auto-load-safe-path /nix/store/1jn6apz0fa9h9x7rl3v6vwiymwnjznwv-glibc-2.32-40/lib/libthread_db-1.0.so
line to your configuration file "/root/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

warning: File "/nix/store/1jn6apz0fa9h9x7rl3v6vwiymwnjznwv-glibc-2.32-40/lib/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/nix/store/c10296m7xgm3ksibcklb2xf48jr635x3-gcc-9.3.0-lib".

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `nscd'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc9bdf83751 in __strlen_avx2 () from /nix/store/1jn6apz0fa9h9x7rl3v6vwiymwnjznwv-glibc-2.32-40/lib/libc.so.6
[Current thread is 1 (LWP 37112)]
warning: File "/nix/store/hsw0sq8y0a46fxc2d7krgyr6p5dy71kd-gcc-10.2.0-lib/lib/libstdc++.so.6.0.28-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/nix/store/c10296m7xgm3ksibcklb2xf48jr635x3-gcc-9.3.0-lib".
(gdb) bt
#0  0x00007fc9bdf83751 in __strlen_avx2 () from /nix/store/1jn6apz0fa9h9x7rl3v6vwiymwnjznwv-glibc-2.32-40/lib/libc.so.6
#1  0x000055aee61d0bce in cache_addgr.isra ()
#2  0x000055aee61d1830 in addgrby ()
#3  0x000055aee61d19aa in addgrbygid ()
#4  0x000055aee61ce15b in nscd_run_worker ()
#5  0x00007fc9bdff3e9e in start_thread () from /nix/store/1jn6apz0fa9h9x7rl3v6vwiymwnjznwv-glibc-2.32-40/lib/libpthread.so.0
#6  0x00007fc9bdf2549f in clone () from /nix/store/1jn6apz0fa9h9x7rl3v6vwiymwnjznwv-glibc-2.32-40/lib/libc.so.6

@asymmetric
Contributor Author

@flokli let me know if the coredump is of any interest, and I can send it to you.

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-rebuild-on-gce-vm/12301/5

@onsails
Contributor

onsails commented Apr 26, 2021

I experience the same issue.
I am using deploy-rs to deploy configurations to google cloud instances with oslogin. AFAIK, deploy-rs deploys configurations the same way @asymmetric does manually.

Here are my observations.
I created an instance using gs://nixos-cloud-images/nixos-image-21.03.git.6bf223c82e0-x86_64-linux.raw.tar.gz image.
I add an SSH key to the service account, which has the roles/compute.osAdminLogin role for that instance.
I connect to the instance using plain ssh.
Here are the errors I see in dmesg:

...
[    7.165560] proc: Bad value for 'hidepid'
...
[    7.262941] piix4_smbus 0000:00:01.3: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
...
[    7.827855] proc: Bad value for 'hidepid'
...

Then I add myself to trustedUsers:

awk 'NR==3{print "  nix.trustedUsers = [\"'$(whoami)'\"];"}1' /etc/nixos/configuration.nix | sudo tee /etc/nixos/configuration.nix

Now I see nscd crash:

[  417.191421] show_signal_msg: 8 callbacks suppressed
[  417.191423] nscd[814]: segfault at 0 ip 00007fad17c59751 sp 00007fad083040d8 error 4 in libc-2.32.so[7fad17b26000+144000]
[  417.191430] Code: 84 00 00 00 00 00 0f 1f 00 31 c0 c5 f8 77 c3 66 2e 0f 1f 84 00 00 00 00 00 89 f9 48 89 fa c5 f9 ef c0 83 e1 3f 83 f9 20 77 1f <c5> fd 74 0f c5 fd d7 c1 85 c0 0f 85 df 00 00 00 48 83 c7 20 83 e1

However, sudo nixos-rebuild switch completes successfully.
Two more crashes:

[  468.465760] nscd[1076]: segfault at 0 ip 00007f58484ad751 sp 00007f58389570d8 error 4 in libc-2.32.so[7f584837a000+144000]
[  468.465769] Code: 84 00 00 00 00 00 0f 1f 00 31 c0 c5 f8 77 c3 66 2e 0f 1f 84 00 00 00 00 00 89 f9 48 89 fa c5 f9 ef c0 83 e1 3f 83 f9 20 77 1f <c5> fd 74 0f c5 fd d7 c1 85 c0 0f 85 df 00 00 00 48 83 c7 20 83 e1
[  486.595586] nscd[1121]: segfault at 0 ip 00007fce17812751 sp 00007fcdf9bfc0d8 error 4 in libc-2.32.so[7fce176df000+144000]
[  486.595597] Code: 84 00 00 00 00 00 0f 1f 00 31 c0 c5 f8 77 c3 66 2e 0f 1f 84 00 00 00 00 00 89 f9 48 89 fa c5 f9 ef c0 83 e1 3f 83 f9 20 77 1f <c5> fd 74 0f c5 fd d7 c1 85 c0 0f 85 df 00 00 00 48 83 c7 20 83 e1

Then I use deploy-rs to deploy my configuration which is based on nixpkgs-unstable and contains additional setting

  systemd.services.fetch-instance-ssh-keys.enable = false;

so that the instance does not fail while fetching SSH keys, which don't exist on an OS Login-enabled instance.

deploy-rs fails with:

🚀 ℹ️  [deploy] [INFO] Activating profile `system` for node `gcloud-validator`
🚀 ❓ [deploy] [DEBUG] Constructed activation command: sudo -u root /nix/store/xmdr8d1iszw7m36iy9l18d44gzrpgs09-activatable-nixos-system-unnamed-21.05.20210423.1df834e/activate-rs --debug-logs --temp-path '/tmp' activate '/nix/store/xmdr8d1iszw7m36iy9l18d44gzrpgs09-activatable-nixos-system-unnamed-21.05.20210423.1df834e' '/nix/var/nix/profiles/system' --confirm-timeout 30 --magic-rollback --auto-rollback
🚀 ❓ [deploy] [DEBUG] Constructed wait command: sudo -u root /nix/store/xmdr8d1iszw7m36iy9l18d44gzrpgs09-activatable-nixos-system-unnamed-21.05.20210423.1df834e/activate-rs --debug-logs --temp-path '/tmp' wait '/nix/store/xmdr8d1iszw7m36iy9l18d44gzrpgs09-activatable-nixos-system-unnamed-21.05.20210423.1df834e'
🚀 ℹ️  [deploy] [INFO] Creating activation waiter
sudo: you do not exist in the passwd database
🚀 ❓ [deploy] [DEBUG] Wait command ended
🚀 ❌ [deploy] [ERROR] Failed to deploy profile: Waiting over SSH resulted in a bad exit code: Some(1)

No more nscd crashes but there is a proc one:

[  703.869297] proc: Bad value for 'hidepid'

Interestingly, when I run deploy-rs on a fresh instance it sometimes applies the configuration successfully, but later fails again.

@flokli
Contributor

flokli commented Apr 27, 2021

Ignore the hidepid spam: systemd/systemd#16896 (comment)

On the failing sudo: there is some instability in nscd when used with google-oslogin, causing sudo to fail; see upstream issue GoogleCloudPlatform/guest-oslogin#33. I somewhat suspect google-oslogin itself to be the culprit, but haven't found the time to drill down into this. Any help would be appreciated.

@arianvp
Member

arianvp commented May 27, 2021

I am also trying --use-remote-sudo, but it fails with another, also cryptic, error:

$ nixos-rebuild switch --flake .#ryzen --use-remote-sudo --target-host ryzen
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helpe

@poscat0x04
Contributor

I am also trying --use-remote-sudo, but it fails with another, also cryptic, error:

$ nixos-rebuild switch --flake .#ryzen --use-remote-sudo --target-host ryzen
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helpe

This can be solved by setting NIX_SSHOPTS to -t

@06kellyjac
Member

This can be solved by setting NIX_SSHOPTS to -t

It just replies Pseudo-terminal will not be allocated because stdin is not a terminal.
Adding another -t just makes it do nothing

(Not using google-oslogin)

@rembo10
Contributor

rembo10 commented Dec 18, 2021

@06kellyjac did you ever solve this? I'm having the same problem

@06kellyjac
Member

I haven't looked very hard for a solution. I was hoping deploy-rs might handle it, but I'm not sure: serokell/deploy-rs#78

malte-christian added a commit to malte-christian/terraform-nixos that referenced this issue Apr 14, 2022
@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jun 19, 2022
@ryantm
Member

ryantm commented Jul 9, 2022

My workaround to

sudo: a terminal is required to read the password

was to use --target-host root@IP, not ideal, but it worked.

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jul 9, 2022
robbins added a commit to robbins/terraform-nixos that referenced this issue Dec 30, 2022
@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Jan 7, 2023
@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/how-to-build-nixos-system-remotely/26188/5

@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Mar 10, 2023
@con-f-use
Contributor

con-f-use commented Mar 22, 2023

This can be solved by setting NIX_SSHOPTS to -t

It just replies Pseudo-terminal will not be allocated because stdin is not a terminal. Adding another -t just makes it do nothing

(Not using google-oslogin)

So this is still a problem for me, and the Discourse link just above my post seems to indicate that it worked once, so it might be a (very old) regression.

It might be that some combination of options causes the problem. Here is what I run: nixos-rebuild switch --flake '.#maybeconfig' --builders '' --target-host myuser@targethost --use-remote-sudo --verbose. The target host does not permit root login; sudo works for myuser in principle, just not with nixos-rebuild.

And yes, it does seem to have something to do with the controlPath setting, because when I sometimes run it manually, I get the occasional:

unix_listener: cannot bind to path /tmp/nixos-rebuild.pjT31p/ssh-10.0.67.54.sOIwMFssqYkQzZeK: No such file or directory

and the verbose output says it's set by -o ControlPath=/tmp/nixos-rebuild.pjT31p/ssh-%n.

I also tried NIX_SSHOPTS=-t and -tt. It doesn't work reliably, but it is a bit better.

@mariaa144
Contributor

mariaa144 commented May 6, 2023

This can be solved by setting NIX_SSHOPTS to -t

It just replies Pseudo-terminal will not be allocated because stdin is not a terminal. Adding another -t just makes it do nothing

(Not using google-oslogin)

Adding another -t like NIX_SSHOPTS='-tt' nixos-rebuild does allow you to enter the password to the remote host. You just cannot see the prompt. Enter your password after the ssh command and the remote machine should keep going with the build.

nixos-rebuild doesn't give me any indication of the build running on the remote host. I had to use --verbose to know when to enter the password and htop on the remote host to verify something was happening.

However, I'm now getting error: getting status of '/root/[sudo] password for myuser: which is preventing the build from continuing. This happens on the nix-copy-closure --from myuser@myhost [sudo] password for myuser: step.

The only way I could get around the prompt is to set NOPASSWD for the user on the remote host. Something like this:

security.sudo.extraRules = [
  { users = [ "privileged_user" ];
    commands = [
       { command = "/run/current-system/sw/bin/nix-store" ;
         options = [ "NOPASSWD" ];
      }
    ];
  }
];

https://discourse.nixos.org/t/dont-prompt-a-user-for-the-sudo-password/9163/3

nixos-rebuild needs to be modified to provide better support for a sudo prompt by the remote host. Right now the remote prompt doesn't work.
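
To confirm that a NOPASSWD rule like the one above is in effect before running nixos-rebuild, a non-interactive sudo call over SSH can serve as a quick check. This is a sketch, not something nixos-rebuild does itself: the function name check_remote_nopasswd is hypothetical, and the nix-store path is the one from the rule above.

```shell
# Hypothetical check: succeeds only if the remote user can run nix-store via
# sudo without a password. With -n, sudo fails instead of prompting.
check_remote_nopasswd() {
  host="$1"   # e.g. myuser@myhost
  ssh "$host" sudo -n /run/current-system/sw/bin/nix-store --version >/dev/null
}
```

For example, `check_remote_nopasswd myuser@myhost || echo "sudo would prompt"` flags a host where the remote prompt problem described here would still occur for that command.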

@con-f-use
Contributor

Yeah, I read the manpage and tried that already several times. It didn't work; it just echoed the password back to me. I definitely think there's a bug somewhere in the stream handling. I also encountered a quoting error when trying to pass --builders "".

@nixos-discourse

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/remote-nixos-rebuild-works-with-build-but-not-with-switch/34741/12

@adamjames

adamjames commented Dec 3, 2024

There was an issue logged against Ansible where an SSH session would hang when trying to execute a command in a similar way, also worked around by providing the flag: ansible/ansible#66535.

There are some additional explanations at https://serverfault.com/a/706543 (mind the comments and the old bugzilla report) that might prove useful in narrowing this down. A lot of issues seem to have come about because of unexpected (additional) requests for sudo auth.

Edit: In the SSHOPTS definition

SSHOPTS="$NIX_SSHOPTS -o ControlMaster=auto -o ControlPath=$tmpDir/ssh-%n -o ControlPersist=60"

I wonder whether the ControlPersist duration is being exceeded and that isn't handled gracefully. The intermittent nature of the failure made me wonder whether it is exposed by a long-running build with long gaps between output updates, similar to the one in Ansible.
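
One way to probe the ControlPersist theory is to look at the option string that results when NIX_SSHOPTS carries its own ControlPersist value. The sketch below only mirrors the SSHOPTS line quoted above; the function name compose_sshopts is made up, and the assumption (from the OpenSSH documentation, not verified against nixos-rebuild) is that ssh uses the first value it obtains for a repeated option, so the user-supplied value would take precedence.

```shell
# Mirrors the quoted SSHOPTS definition: user options from NIX_SSHOPTS come
# first, followed by the connection-sharing options.
# tmpDir is passed in here; in the quoted line it is a temporary directory.
compose_sshopts() {
  tmpDir="$1"
  printf '%s\n' "$NIX_SSHOPTS -o ControlMaster=auto -o ControlPath=$tmpDir/ssh-%n -o ControlPersist=60"
}
```

Running `NIX_SSHOPTS='-o ControlPersist=600' nixos-rebuild ...` with a larger value and checking whether the intermittent failures go away would be one way to test the hypothesis.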
