forked from systemd/systemd
-
Notifications
You must be signed in to change notification settings - Fork 0
/
TODO
2118 lines (1639 loc) · 105 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
Bugfixes:
* Many manager configuration settings that are only applicable to user
manager or system manager can be always set. It would be better to reject
them when parsing config.
* Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
Jun 01 09:43:02 krowka systemd[1]: Unit [email protected] has alias [email protected].
External:
* Fedora: add an rpmlint check that verifies that all unit files in the RPM are listed in %systemd_post macros.
* dbus:
- natively watch for dbus-*.service symlinks (PENDING)
- teach dbus to activate all services it finds in /etc/systemd/services/org-*.service
* kernel: add device_type = "fb", "fbcon" to class "graphics"
* /usr/bin/service should actually show the new command line
* fedora: suggest auto-restart on failure, but not on success and not on coredump. also, ask people to think about changing the start limit logic. Also point people to RestartPreventExitStatus=, SuccessExitStatus=
* neither pkexec nor sudo initialize environ[] from the PAM environment?
* fedora: update policy to declare access mode and ownership of unit files to root:root 0644, and add an rpmlint check for it
* register catalog database signature as file magic
* zsh shell completion:
- <command> <verb> -<TAB> should complete options, but currently does not
- systemctl add-wants,add-requires
- systemctl reboot --boot-loader-entry=
* systemctl status should know about 'systemd-analyze calendar ... --iterations='
* If timer has just OnInactiveSec=..., it should fire after a specified time
after being started.
* write blog stories about:
- hwdb: what belongs into it, lsusb
- enabling dbus services
- how to make changes to sysctl and sysfs attributes
- remote access
- how to pass throw-away units to systemd, or dynamically change properties of existing units
- testing with Harald's awesome test kit
- auto-restart
- how to develop against journal browsing APIs
- the journal HTTP iface
- non-cgroup resource management
- dynamic resource management with cgroups
- refreshed, longer missions statement
- calendar time events
- init=/bin/sh vs. "emergency" mode, vs. "rescue" mode, vs. "multi-user" mode, vs. "graphical" mode, and the debug shell
- how to create your own target
- instantiated apache, dovecot and so on
- hooking a script into various stages of shutdown/rearly booot
Regularly:
* look for close() vs. close_nointr() vs. close_nointr_nofail()
* check for strerror(r) instead of strerror(-r)
* pahole
* set_put(), hashmap_put() return values check. i.e. == 0 does not free()!
* use secure_getenv() instead of getenv() where appropriate
* link up selected blog stories from man pages and unit files Documentation= fields
Janitorial Clean-ups:
* Rearrange tests so that the various test-xyz.c match a specific src/basic/xyz.c again
* rework mount.c and swap.c to follow proper state enumeration/deserialization
semantics, like we do for device.c now
* get rid of prefix_roota() and similar, only use chase_symlinks() and related
calls instead.
* get rid of basename() and replace by path_extract_filename()
Deprecations and removals:
* Remove any support for booting without /usr pre-mounted in the initrd entirely.
Update INITRD_INTERFACE.md accordingly.
* 2019-10 – Remove POINTINGSTICK_CONST_ACCEL references from the hwdb, see #9573
* remove cgrouspv1 support EOY 2023. As per
https://lists.freedesktop.org/archives/systemd-devel/2022-July/048120.html
and then rework cgroupsv2 support around fds, i.e. keep one fd per active
unit around, and always operate on that, instead of cgroup fs paths.
* drop support for kernels that lack ambient capabilities support (i.e. make
4.3 new baseline). Then drop support for "!!" modifier for ExecStart= which
is only supported for such old kernels.
* drop support for getrandom()-less kernels. (GRND_INSECURE means once kernel
5.6 becomes our baseline). See
https://github.com/systemd/systemd/pull/24101#issuecomment-1193966468 for
details. Maybe before that: at taint-flags/warn about kernels that lack
getrandom()/environments where it is blocked.
* drop support for LOOP_CONFIGURE-less loopback block devices, once kernel
baseline is 5.8.
* drop fd_is_mount_point() fallback mess once we can rely on
STATX_ATTR_MOUNT_ROOT to exist i.e. kernel baseline 5.8
* rework our PID tracking in services and so on, to be strictly based on pidfd,
once kernel baseline is 5.13.
* ~2023: remove support for TPM_PCR_INDEX_KERNEL_PARAMETERS_COMPAT
* H2 2023: remove support for unmerged-usr
Features:
* add support for asymmetric LUKS2 TPM based encryption. i.e. allow preparing
an encrypted image on some host given a public key belonging to a specific
other host, so that only hosts possessing the private key in the TPM2 chip
can decrypt the volume key and activate the volume. Usecase: systemd-syscfg
for a central orchestrator to generate syscfg images securely that can only
be activated on one specific host (which can be used for installing a bunch
of creds in /etc/credstore/ for example). Extending on this: allow binding
LUKS2 TPM based encryption also to the TPM2 internal clock. Net result:
prepare a syscfg image that can only be activated on a specific host that
runs a specific software in a specific time window. syscfg would be
automatically invalidated outside of it.
* maybe add a "systemd-report" tool, that generates a TPM2-backed "report" of
current system state, i.e. a combination of PCR information, local system
time and TPM clock, running services, recent high-priority log
messages/coredumps, system load/PSI, signed by the local TPM chip, to form an
enhanced remote attestation quote. Usecase: a simple orchestrator could use
this: have the report tool upload thes reports every 3min somewhere. Then
have the orchestrator collect these reports centrally over a 3min time
window, and use them to determine what which node should now start/stop what,
and generate a small syscfg for each node, that uses Uphold= to pin services
on each node. The syscfg would be encrypted using the asymmetric encryption
proposed above, so that it can only be activated on the specific host, if the
software is in a good state, and within a specific time frame. Then run a
loop on each node that sends report to orchestrator and then sysupdate to
update syscfg. Orchestrator would be stateless, i.e. operate on desired
config and collected reports in the last 3min time window only, and thus can
be trivially scaled up since all instances of the orchestrator should come to
the same conclusions given the same inputs of reports/desired workload info.
Could also be used to deliver Wireguard secrets and thus to clients, thus
permitting zero-trust networking: secrets are rolled over via syscfg updates,
and via the time window TPM logic invalidated if node doesn't keep itself
updated, or becomes corrupted in some way.
* Always measure the LUKS rootfs volume key into PCR 15, and derive the machine
ID from it securely. This would then allow us to bind secrets a specific
system securely.
* nspawn: maybe allow TPM passthrough, backed by swtpm, and measure --image=
hash into its PCR 11, so that nspawn instances can be TPM enabled, and
partake in measurements/remote attestation and such. swtpm would run outside
of control of container, and ideally would itself bind its encryption keys to
host TPM.
* tree-wide: convert as much as possible over to use sd_event_set_signal_exit(), instead
of manually hooking into SIGINT/SIGTERM
* tree-wide: convert as much as possible over to SD_EVENT_SIGNAL_PROCMASK
instead of manual blocking.
* sd-boot: for each installed OS, grey out older entries (i.e. all but the
newest), to indicate they are obsolete
* automatically propagate LUKS password credential into cryptsetup from host
(i.e. SMBIOS type #11, …), so that one can unlock LUKS via VM hypervisor
supplied password.
* add ability to path_is_valid() to classify paths that refer to a dir from
those which may refer to anything, and use that in various places to filter
early. i.e. stuff ending in "/", "/." and "/.." definitely refers to a
directory, and paths ending that way can be refused early in many contexts.
* systemd-measure: allow operating with PEM certificates in addition to PEM
public keys when signing PCR values. SecureBoot and our Verity signatures
operate with certificates already, hence I guess we should also just deal for
convencience with certificates for the PCR stuff too.
* systemd-measure: add --pcrpkey-auto as an alternative to --pcrpkey=, where it
would just use the same public key specified with --public-key= (or the one
automatically derived from --private-key=).
* tmpfiles: add new line type for setting btrfs subvolume attributes (i.e. rw/ro)
* tmpfiles: add new line type for setting fcaps
* push people to use ".sysext.raw" as suffix for sysext DDIs (DDI =
discoverable disk images, i.e. the new name for gpt disk images following the
discoverable disk spec). [Also: just ".sysext/" for directory-based sysext]
* Add "purpose" flag to partition flags in discoverable partition spec that
indicate if partition is intended for sysext, for portable service, for
booting and so on. Then, when dissecting DDI allow specifying a purpose to
use as additional search condition. Usecase: images that combined a sysext
partition with a portable service partition in one.
* On boot, auto-generate an asymmetric key pair from the TPM,
and use it for validating DDIs and credentials. Maybe upload it to the kernel
keyring, so that the kernel does this validation for us for verity and kernel
modules
* for systemd-syscfg: add a tool that can generate suitable DDIs with verity +
sig using squashfs-tools-ng's library. Maybe just systemd-repart called under
a new name with a built-in config?
* gpt-auto: generate mount units that reference partitions via
/dev/disk/by-diskseq/… so that they can't be swapped out behind our back.
* lock down acceptable encrypted credentials at boot, via simple allowlist,
maybe on kernel command line:
systemd.import_encrypted_creds=foobar.waldo,tmpfiles.extra to protect locked
down kernels from credentials generated on the host with a weak kernel
* Add support for extra verity configuration options to systemd-repart (FEC,
hash type, etc)
* chase_symlinks(): take inspiraton from path_extract_filename() and return
O_DIRECTORY if input path contains trailing slash.
* chase_symlinks(): refuse resolution if trailing slash is specified on input,
but final node is not a directory
* chase_symlinks(): add new flag that simply refuses all symlink use in a path,
then use that for accessing XBOOTLDR/ESP
* document in boot loader spec that symlinks in XBOOTLDR/ESP are not OK even if
non-VFAT fs is used.
* measure credentials picked up from SMBIOS to some suitable PCR
* measure GPT and LUKS headers somewhere when we use them (i.e. in
systemd-gpt-auto-generator/systemd-repart and in systemd-cryptsetup?)
* pick up creds from EFI vars
* sd-stub/sd-boot: write RNG seed to LINUX_EFI_RANDOM_SEED_TABLE_GUID config
table as well. (and possibly drop our efi var). Current kernels will pick up
the seed from there already, if EFI_RNG_PROTOCOL is not implemented by
firmware.
* sd-boot: include domain specific hash string in hash function for random seed
plus sizes of everything. also include DMI/SMBIOS blob
* sd-stub: invoke random seed logic the same way as in sd-boot, except if
random seed EFI variable is already set. That way, the variable set will be
set in all cases: if you just use sd-stub, or just sd-boot, or both.
* sd-boot: we probably should include all BootXY EFI variable defined boot
entries in our menu, and then suppress ourselves. Benefit: instant
compatibility with all other OSes which register things there, in particular
on other disks. Always boot into them via NextBoot EFI variable, to not
affect PCR values.
* systemd-measure tool:
- pre-calculate PCR 12 (command line) + PCR 13 (sysext) the same way we can precalculate PCR 11
* in sd-boot: load EFI drivers from a new PE section. That way, one can have a
"supercharged" sd-boot binary, that could carry ext4 drivers built-in.
* sd-bus: document that sd_bus_process() only returns messages that non of the
filters/handlers installed on the connection took possession of.
* sd-device: add an API for acquiring list of child devices, given a device
objects (i.e. all child dirents that dirs or symlinks to dirs)
* sd-device: maybe pin the sysfs dir with an fd, during the entire runtime of
an sd_device, then always work based on that.
* add small wrapper around qemu that implements sd_notify/AF_VSOCK + machined and
maybe some other stuff and boots it
* maybe add new flags to gpt partition tables for rootfs and usrfs indicating
purpose, i.e. whether something is supposed to be bootable in a VM, on
baremetal, on an nspawn-style container, if it is a portable service image,
or a sysext for initrd, for host os, or for portable container. Then hook
portabled/… up to udev to watch block devices coming up with the flags set, and
use it.
* sd-boot should look for information what to boot in SMBIOS, too, so that VM
managers can tell sd-boot what to boot into and suchlike
* PID 1 should look for an SMBIOS variable that encodes an AF_VSOCK address it
should send sd_notify() ready notifications to. That way a VMM can boot up a
system, and generically know when it finished booting.
* add "systemd-sysext identify" verb, that you can point on any file in /usr/
and that determines from which overlayfs layer it originates, which image, and with
what it was signed.
* journald: generate recognizable log events whenever we shutdown journald
cleanly, and when we migrate run → var. This way tools can verify that a
previous boot terminated cleanly, because either of these two messages must
be safely written to disk, then.
* systemd-creds: extend encryption logic to support asymmetric
encryption/authentication. Idea: add new verb "systemd-creds public-key"
which generates a priv/pub key pair on the TPM2 and stores the priv key
locally in /var. It then outputs a certificate for the pub part to stdout.
This can then be copied/taken elsewhere, and can be used for encrypting creds
that only the host on its specific hw can decrypt. Then, support a drop-in
dir with certificates that can be used to authenticate credentials. Flow of
operations is then this: build image with owner certificate, then after
boot up issue "systemd-creds public-key" to acquire pubkey of the machine.
Then, when passing data to the machine, sign with privkey belonging to one of
the dropped in certs and encrypted with machine pubkey, and pass to machine.
Machine is then able to authenticate you, and confidentiality is guaranteed.
* building on top of the above, the pub/priv key pair generated on the TPM2
should probably also one you can use to get a remote attestation quote.
* bootctl: add "gc" verb that loads all type #1 .conf files, and then removes
all files from the set of files from the ESP/XBOOTLDR matching the entry
token that are not referenced by any. Then, change kernel-install to use only
this to remove auxiliary files, and never remove them explicitly. Benefit:
resources such as initrds/kernels/dtb can be shared between entries.
* Process credentials in:
• networkd/udevd: add a way to define additional .link, .network, .netdev files
via the credentials logic.
• fstab-generator: allow defining additional fstab-like mounts via
credentials (similar: crypttab-generator, verity-generator,
integrity-generator)
• getty-generator: allow defining additional getty instances via a credential
• run-generator: allow defining additional commands to run via a credential
• resolved: allow defining additional /etc/hosts entries via a credential (it
might make sense to then synthesize a new combined /etc/hosts file in /run
and bind mount it on /etc/hosts for other clients that want to read it.
Similar, allow picking up DNS server IP addresses from credential.
• repart: allow defining additional partitions via credential
• timesyncd: pick NTP server info from credential
• portabled: read a credential "portable.extra" or so, that takes a list of
file system paths to enable on start.
• make systemd-fstab-generator look for a system credential encoding root= or
usr=
• systemd-homed: when initializing, look for a credential
systemd.homed.register or so with JSON user records to automatically
register if not registered yet. Usecase: deploy a system, and add an
account one can directly log into.
• initialize machine ID from systemd credential picked up from the ESP via
sd-stub, so that machine ID is stable even on systems where unified kernels
are used, and hence kernel cmdline cannot be modified locally
• in gpt-auto-generator: check partition uuids against such uuids supplied via
sd-stub credentials. That way, we can support parallel OS installations with
pre-built kernels.
* define a JSON format for units, separating out unit definitions from unit
runtime state. Then, expose it:
1. Add Describe() method to Unit D-Bus object that returns a JSON object
about the unit.
2. Expose this natively via Varlink, in similar style
3. Use it when invoking binaries (i.e. make PID 1 fork off systemd-executor
binary which reads the JSON definition and runs it), to address the cow
trap issue and the fact that NSS is actually forbidden in
forked-but-not-exec'ed children
4. Add varlink API to run transient units based on provided JSON definitions
* show SUPPORT_END= data in "hostnamectl" output (and thus also expose a prop
for this on dbus)
* Add SUPPORT_END_URL= field to os-release with more *actionable* information
what to do if support ended
* pam_systemd: on interactive logins, maybe show SUPPORT_END information at
login time, á la motd
* sd-boot: instead of unconditionally deriving the ESP to search boot loader
spec entries in from the paths of sd-boot binary, let's optionally allow it
to be configured on sd-boot cmdline + efi var. Usecase: embed sd-boot in the
UEFI firmware (for example, ovmf supports that via qemu cmdline option), and
use it to load stuff from the ESP.
* mount /var/ from initrd, so that we can apply sysext and stuff before the
initrd transition. Specifically:
1. There should be a var= kernel cmdline option, matching root= and usr=
2. systemd-gpt-auto-generator should auto-mount /var if it finds it on disk
3. mount.x-initrd mount option in fstab should be implied for /var
* implement varlink introspection
* we should probably drop all use of prefix_roota() and friends, and use
chase_symlinks() instead
* make persistent restarts easier by adding a new setting OpenPersistentFile=
or so, which allows opening one or more files that is "persistent" across
service restarts, hot reboot, cold reboots (depending on configuration): the
files are created empty on first invocation, and on subsequent invocations
the files are reboot. The files would be backed by tmpfs, pmem or /var
depending on desired level of persistency.
* sd-event: add ability to "chain" event sources. Specifically, add a call
sd_event_source_chain(x, y), which will automatically enable event source y
in oneshit mode once x is triggered. Use case: in src/core/mount.c implement
the /proc/self/mountinfo rescan on SIGCHLD with this: whenever a SIGCHLD is
seen, trigger the rescan defer event source automatically, and allow it to be
dispatched *before* the SIGCHLD is handled (based on priorities). Benefit:
dispatch order is strictly controlled by priorities again. (next step: chain
event sources to the ratelimit being over)
* if we fork of a service with StandardOutput=journal, and it forks off a
subprocess that quickly dies, we might not be able to identify the cgroup it
comes from, but we can still derive that from the stdin socket its output
came from. We apparently don't do that right now.
* add ability to set hostname with suffix derived from machine id at boot
* ask dracut to generate usr= on the kernel cmdline so that we don't need to
read /etc/fstab from the root fs from the initrd and do daemon-reload
* add PR_SET_DUMPABLE service setting
* homed/userdb: maybe define a "companion" dir for home directories where apps
can safely put privileged stuff in. Would not be writable by the user, but
still conceptually belong to the user. Would be included in user's quota if
possible, even if files are not owned by UID of user. Usecase: container
images that owned by arbitrary UIDs, and are owned/managed by the users, but
are not directly belonging to the user's UID. Goal: we shouldn't place more
privileged dirs inside of unprivileged dirs, and thus containers really
should not be placed inside of traditional UNIX home dirs (which are owned by
users themselves) but somewhere else, that is separate, but still close
by. Inform user code about path to this companion dir via env var, so that
container managers find it. the ~/.identity file is also a candidate for a
file to move there, since it is managed by privileged code (i.e. homed) and
not unprivileged code.
* given that /etc/ssh/ssh_config.d/ is a thing now, ship a drop-in for that
that hooks up userbdctl ssh-key stuff.
* maybe add support for binding and connecting AF_UNIX sockets in the file
system outside of the 108ch limit. When connecting, open O_PATH fd to socket
inode first, then connect to /proc/self/fd/XYZ. When binding, create symlink
to target dir in /tmp, and bind through it.
* add a proper concept of a "developer" mode, i.e. where cryptographic
protections of the root OS are weakened after interactive confirmation, to
allow hackers to allow their own stuff. idea: allow entering developer mode
only via explicit choice in boot menu: i.e. add explicit boot menu item for
it. when developer mode is entered generate a key pair in the TPM2, and add
the public part of it automatically to keychain of valid code signature keys
on subsequent boots. Then provide a tool to sign code with the key in the
TPM2. Ensure that boot menu item is only way to enter developer mode, by
binding it to locality/PCRs so that that keys cannot be generated otherwise.
* services: add support for cryptographically unlocking per-service directories
via TPM2. Specifically, for StateDirectory= (and related dirs) use fscrypt to
set up the directory so that it can only be accessed if host and app are in
order.
* TPM2: extend unlock policy to protect against version downgrades in signed
policies: policy probably must take some nvram based generation counter into
account that can only monotonically increase and can be used to invalidate
old PCR signatures. Otherwise people could downgrade to old signed PCR sets
whenever they want.
* update HACKING.md to suggest developing systemd with the ideas from:
https://0pointer.net/blog/testing-my-system-code-in-usr-without-modifying-usr.html
https://0pointer.net/blog/running-an-container-off-the-host-usr.html
* add a clear concept how the initrd can make up credentials on their own to
pass to the system when transitioning into the host OS. usecase: things like
cloud-init/ignitation and similar can parameterize the host with data they
acquire.
* sd-event: compat wd reuse in inotify code: keep a set of removed watch
descriptors, and clear this set piecemeal when we see the IN_IGNORED event
for it, or when read() returns EAGAIN or on IN_Q_OVERFLOW. Then, whenever we
see an inotify wd event check against this set, and if it is contained ignore
the event. (to be fully correct this would have to count the occurrences, in
case the same wd is reused multiple times before we start processing
IN_IGNORED again)
* systemd-fstab-generator: support addition mount specifications via kernel
cmdline. Usecase: invoke a VM, and mount a host homedir into it via
virtio-fs.
* for vendor-built signed initrds:
- make sysext run in the initrd
- sysext should pick up sysext images from /.extra/ in the initrd, and insist
on verification if in secureboot mode
- kernel-install should be able to install pre-built unified kernel images in
type #2 drop-in dir in the ESP.
- kernel-install should be able install encrypted creds automatically for
machine id, root pw, rootfs uuid, resume partition uuid, and place next to
EFI kernel, for sd-stub to pick them up. These creds should be locked to
the TPM, and bind to the right PCR the kernel is measured to.
- kernel-install should be able to pick up initrd sysexts automatically and
place them next to EFI kernel, for sd-stub to pick them up.
- systemd-fstab-generator should look for rootfs device to mount in creds
- pid 1 should look for machine ID in creds
- systemd-resume-generator should look for resume partition uuid in creds
- sd-stub: automatically pick up microcode from ESP (/loader/microcode/*)
and synthesize initrd from it, and measure it. Signing is not necessary, as
microcode does that on its own. Pass as first initrd to kernel.
* Add a new service type very similar to Type=notify, that goes one step
further and extends the protocol to cover reloads. Specifically, SIGHUP will
become the official way to reload, and daemon has to respond with sd_notify()
to report when it starts reloading, and when it is complete reloading. Care
must be taken to remove races from this model. I.e. PID 1 needs to take
CLOCK_MONOTONIC, then send SIGHUP, then wait for at least one RELOADING=1
message that comes with a newer timestamp, then wait for a READY=1 message.
while we are at it, also maybe extend the logic to require handling of some
specific SIGRT signal for setting debug log level, that carries the level via
the sigqueue() data parameter. With that we extended with minimal logic the
service runtime logic quite substantially.
* firstboot: maybe just default to C.UTF-8 locale if nothing is set, so that we
don't query this unnecessarily in entirely uninitialized
containers. (i.e. containers with empty /etc).
* beef up sd_notify() to support AV_VSOCK in $NOTIFY_SOCKET, so that VM
managers can get ready notifications from VMs, just like container managers
from their payload. Also pick up address from qemu/fw_cfg if set there.
(which has benefits, given SecureBoot and kernel cmdline are not necessarily
friends.)
* mirroring this: maybe support binding to AV_VSOCK in Type=notify services,
then passing $NOTIFY_SOCKET and $NOTIFY_GUESTCID with PID1's cid (typically
fixed to "2", i.e. the official host cid) and the expected guest cid, for the
two sides of the channel. The latter env var could then be used in an
appropriate qemu cmdline. That way qemu payloads could talk sd_notify()
directly to host service manager.
* maybe write a tool that binds an AF_VFSOCK socket, then invokes qemu,
extending the command line to enable vsock on the VM, and using fw_cfg to
configure socket address.
* sd-boot: rework random seed handling following recent kernel changes: always
pass seed to kernel, but credit only if secure boot is used
* sd-boot: also include the hyperv "vm generation id" in the random seed hash,
to cover nicely for machine clones. It's found in the ACPI tables, which
should be easily accessible from UEFI.
* sd-boot: add menu item for shutdown? or hotkey?
* sd-device has an API to create an sd_device object from a device id, but has
no api to query the device id
* sd-device should return the devnum type (i.e. 'b' or 'c') via some API for an
sd_device object, so that data passed into sd_device_new_from_devnum() can
also be queried.
* sd-event: optionally, if per-event source rate limit is hit, downgrade
priority, but leave enabled, and once ratelimit window is over, upgrade
priority again. That way we can combat event source starvation without
stopping processing events from one source entirely.
* sd-event: similar to existing inotify support add fanotify support (given
that apparently new features in this area are only going to be added to the
latter).
* sd-event: add 1st class event source for clock changes
* sd-event: add 1st class event source for timezone changes
* support uefi/http boots with sd-boot: instead of looking for dropin files in
/loader/entries/ dir, look for a file /loader/entries/SHA256SUMS and use that
as directory manifest. The file would be a standard directory listing as
generated by GNU sha256sums.
* sd-boot: maybe add support for embedding the various auxiliary resources we
look for right in the sd-boot binary. i.e. take inspiration from sd-stub
logic: allow combining sd-boot via objcopy with kernels to enumerate, .conf
files, drivers, keys to enroll and so on. Then, add whatever we find that way
to the menu. Usecase: allow building a single PE image you can boot into via
UEFI HTTP boot.
* maybe add a new UEFI stub binary "sd-http". It works similar to sd-stub, but
all it does is download a file from a http server, and execute it, after
optionally checking its hash sum. idea would be: combine this "sd-http" stub
binary with some minimal info about an URL + hash sum, plus .osrel data, and
drop it into the unified kernel dir in the ESP. And bam you have something
that is tiny, feels a lot like a unified kernel, but all it does is chainload
the real kernel. benefit: downloading these stubs would be tiny and quick,
hence cheap for enumeration.
* sysext: measure all activated sysext into a TPM PCR
* maybe add a "syscfg" concept, that is almost entirely identical to "sysext",
but operates on /etc/ instead of /usr/ and /opt/. Use case would be: trusted,
authenticated, atomic, additive configuration management primitive: drop in a
configuration bundle, and activate it, so that it is instantly visible,
comprehensively.
* systemd-dissect: show available versions inside of a disk image, i.e. if
multiple versions are around of the same resource, show which ones. (in other
words: show partition labels).
* systemd-nspawn: make boot assessment do something sensible in a
container. i.e send an sd_notify() from payload to container manager once
boot-up is completed successfully, and use that in nspawn for dealing with
boot counting, implemented in the partition table labels and directory names.
* maybe add a generator that reads /proc/cmdline, looks for
systemd.pull-raw-portable=, systemd-pull-raw-sysext= and similar switches
that take an URL as parameter. It then generates service units for
systemd-pull calls that download these URLs if not installed yet. usecase:
invoke a VM or nspawn container in a way it automatically deploys/runs these
images as OS payloads. i.e. have a generic OS image you can point to any
payload you like, which is then downloaded, securely verified and run.
* improve scope units to support creation by pidfd instead of by PID
* deprecate cgroupsv1 further (print log message at boot)
* systemd-dissect: add --cat switch for dumping files such as /etc/os-release
* per-service sandboxing option: ProtectIds=. If used, will overmount
/etc/machine-id and /proc/sys/kernel/random/boot_id with synthetic files, to
make it harder for the service to identify the host. Depending on the user
setting it should be fully randomized at invocation time, or a hash of the
real thing, keyed by the unit name or so. Of course, there are other ways to
get these IDs (e.g. journal) or similar ids (e.g. MAC addresses, DMI ids, CPU
ids), so this knob would only be useful in combination with other lockdown
options. Particularly useful for portable services, and anything else that
uses RootDirectory= or RootImage=. (Might also over-mount
/sys/class/dmi/id/*{uuid,serial} with /dev/null).
* journalctl/timesyncd: whenever timesyncd acquires a synchronization from NTP,
create a structured log entry that contains boot ID, monotonic clock and
realtime clock (I mean, this requires no special work, as these three fields
are implicit). Then in journalctl when attempting to display the realtime
timestamp of a log entry, first search for the closest later log entry
of this kinda that has a matching boot id, and convert the monotonic clock
timestamp of the entry to the realtime clock using this info. This way we can
retroactively correct the wallclock timestamps, in particular for systems
without RTC, i.e. where initially wallclock timestamps carry rubbish, until
an NTP sync is acquired.
* kernel-install:
- add --all switch for rerunning kernel-install for all installed kernels
- maybe add env var that shortcuts kernel-install for installers that want to
call it at the end only
* doc: prep a document explaining resolved's internal objects, i.e. Query
vs. Question vs. Transaction vs. Stream and so on.
* doc: prep a document explaining PID 1's internal logic, i.e. transactions,
jobs, units
* bootspec: bring UEFI and userspace enumeration of bootspec entries back into
sync, i.e. parse out architecture field in sd-boot (currently only done in
userspace)
* automatically ignore threaded cgroups in cg_xyz().
* add linker script that implicitly adds symbol for build ID and new coredump
json package metadata, and use that when logging
* systemd-dissect: show GPT disk UUID in output
* Enable RestrictFileSystems= for all our long-running services (similar:
RestrictNetworkInterfaces=)
* Add systemd-analyze security checks for RestrictFileSystems= and
RestrictNetworkInterfaces=
* cryptsetup/homed: implement TOTP authentication backed by TPM2 and its
internal clock.
* nspawn: optionally set up nftables/iptables routes that forward UDP/TCP
traffic on port 53 to resolved stub 127.0.0.54
* man: rework os-release(5), and clearly separate our extension-release.d/ and
initrd-release parts, i.e. list explicitly which fields are about what.
* sysext: before applying a sysext, do a superficial validation run so that
things are not rearranged to wildy. I.e. protect against accidental fuckups,
such as masking out /usr/lib/ or so. We should probably refuse if existing
inodes are replaced by other types of inodes or so.
* userdb: when synthesizing NSS records, pick "best" password from defined
passwords, not just the first. i.e. if there are multiple defined, prefer
unlocked over locked and prefer non-empty over empty.
* maybe add a tool inspired by the GPT auto discovery spec that runs in the
initrd and rearranges the rootfs hierarchy via bind mounts, if
enabled. Specifically in some top-level dir /@auto/ it will look for
dirs/symlinks/subvolumes that are named after their purpose, and optionally
encode a version as well as assessment counters, and then mount them into the
file system tree to boot into, similar to how we do that for the gpt auto
logic. Maybe then bind mount the original root into /.superior or something
like that (so that update tools can look there). Further discussion in this
thread:
https://lists.freedesktop.org/archives/systemd-devel/2021-November/047059.html
The GPT dissection logic should automatically enable this tool whenever we
detect a specially marked root fs (i.e introduce a new generic root gpt type
for this, that is arch independent). The also implement this in the image
dissection logic, so that nspawn/RootImage= and so on grok it. Maybe make
generic enough so that it can also work for ostrees arrangements.
* if a path ending in ".auto.d/" is set for RootDirectory=/RootImage= then do a
strverscmp() of everything inside that dir and use that. i.e. implement very
simple version control. Also use this in systemd-nspawn --image= and so on.
* homed: while a home dir is not activated generate slightly different NSS
records for it, that reports the home dir as "/" and the shell as some binary
provided by us. Then, when an SSH login happens and SSH permits it our binary
is invoked. This binary can then talk to homed and activate the homedir if
it's not around yet, prompting the user for a password. Once that succeeded
we'll switch to the real user record, i.e. home dir and shell, and our tool
exec()s the latter. Net effect: ssh'ing into a homed account will just work:
we'll neatly prompt for the homedir's password if its needed. –– Building on
this we could take this even further: since this tool will potentially have
access to the client's ssh-agent (if ssh-agent forwarding is enabled) we
could implement SSH unlocking of a homedir with that: when enrolling a new
ssh pubkey in a user record we'd ask the ssh-agent to sign some random value
with the privkey, then use that as luks key to unlock the home dir. Will not
work for ECDSA keys since their signatures contain a random component, but
will work for RSA and Ed25519 keys.
* add tiny service that decrypts encrypted user records passed via initrd
credential logic and drops them into /run where nss-systemd can pick them up,
similar to /run/host/userdb/. Usecase: drop a root user JSON record there,
and use it in the initrd to log in as root with locally selected password,
for debugging purposes. Other usecase: boot into qemu with regular user
mounted from host. maybe put this in systemd-user-sessions.service?
* drop dependency on libcap, replace by direct syscalls based on
CapabilityQuintet we already have. (This likely allows us drop drop libcap
dep in the base OS image)
* sysext: automatically activate sysext images dropped in via new sd-stub
sysext pickup logic. (must insist on verity + signature on those though)
* add concept for "exitrd" as inverse of "initrd", that we can transition to at
shutdown, and has similar security semantics. This should then take the place
of dracut's shutdown logic. Should probably support sysexts too. Care needs
to be taken that the resulting logic ends up in RAM, i.e. is copied out of
on-disk storage.
* userdbd: implement an additional varlink service socket that provides the
host user db in restricted form, then allow this to be bind mounted into
sandboxed environments that want the host database in minimal form. All
records would be stripped of all meta info, except the basic UID/name
info. Then use this in portabled environments that do not use PrivateUsers=1.
* logind introduce two types of sessions: "heavy" and "light". The former would
be our current sessions. But the latter would be a new type of session that
is mostly the same but does not pull in [email protected] or wait for it. Then,
allow configuration which type of session is desired via pam_systemd
parameters, and then make [email protected]'s session one of these "light" ones.
People could then choose to make FTP sessions and suchlike "light" if they
don't want the service manager to be started for that.
* /etc/veritytab: allow that the roothash column can be specified as fs path
including a path to an AF_UNIX path, similar to how we do things with the
keys of /etc/crypttab. That way people can store/provide the roothash
externally and provide to us on demand only.
* add high-level lockdown level for GPT dissection logic: e.g. an enum that can
be ANY (to mount anything), TRUSTED (to require that /usr is on signed
verity, but rest doesn't matter), LOCKEDDOWN (to require that everything is
on signed verity, except for ESP), SUPERLOCKDOWN (like LOCKEDDOWN but ESP not
allowed). And then maybe some flavours of that that declare what is expected
from home/srv/var… Then, add a new cmdline flag to all tools that parse such
images, to configure this. Also, add a kernel cmdline option for this, to be
honoured by the gpt auto generator.
Alternative idea: add "systemd.gpt_auto_policy=rhvs" to allow gpt-auto to
only mount root dir, /home/ dir, /var/ and /srv/, but nothing else. And then
minor extension to this, insisting on encryption, for example
"systemd.gpt_auto_policy=r+v+h" to require encryption for root and var but not
for /home/, and similar. Similar add --image-dissect-policy= to tools that
take --image= that take the same short string.
* nspawn: maybe optionally insert .nspawn file as GPT partition into images, so
that such container images are entirely stand-alone and can be updated as
one.
* we probably should extend the root verity hash of the root fs into some PCR
on boot. (i.e. maybe add a veritytab option tpm2-measure=12 or so to measure
it into PCR 12); Similar: we probably should extend the LUKS volume key of
the root fs into some PCR on boot. (i.e. maybe add a crypttab option
tpm2-measure=15 or so to measure it into PCR 15); once both are in place
update gpt-auto-discovery to generate these by default for the partitions it
discovers. Static vendor stuff should probably end up in PCR 12 (i.e. the
verity hash), with local keys in PCR 15 (i.e. the encryption volume
key). That way, we nicely distinguish resources supplied by the OS vendor
(i.e. sysext, root verity) from those inherently local (i.e. encryption key),
which is useful if they shall be signed separately.
* add a "policy" to the dissection logic. i.e. a bit mask what is OK to mount,
what must be read-only, what requires encryption, and what requires
authentication.
* in uefi stub: query firmware regarding which PCR banks are being used, store
that in EFI var. then use this when enrolling TPM2 in cryptsetup to verify
that the selected PCRs actually are used by firmware.
* rework recursive read-only remount to use new mount API
* PAM: pick up authentication token from credentials
* when mounting disk images: if IMAGE_ID/IMAGE_VERSION is set in os-release
data in the image, make sure the image filename actually matches this, so
that images cannot be misused.
* New udev block device symlink names:
/dev/disk/by-parttypelabel/<pttype>-<ptlabel>. Use case: if pt label is used
as partition image version string, this is a safe way to reference a specific
version of a specific partition type, in particular where related partitions
are processed (e.g. verity + rootfs both named "LennartOS_0.7").
* sysupdate:
- add fuzzing to the pattern parser
- support casync as download mechanism
- "systemd-sysupdate update --all" support, that iterates through all components
defined on the host, plus all images installed into /var/lib/machines/,
/var/lib/portable/ and so on.
- figure out what to do about system extensions (i.e. they need to imply an
update component, since otherwise system extenion' sysupdate.d/ files would
override the host's update files.)
- Allow invocation with a single transfer definition, i.e. with
--definitions= pointing to a file rather than a dir.
- add ability to disable implicit decompression of downloaded artifacts,
i.e. a Compress=no option in the transfer definitions
* in sd-id128: also parse UUIDs in RFC4122 URN syntax (i.e. chop off urn:uuid: prefix)
* DynamicUser= + StateDirectory= → use uid mapping mounts, too, in order to
make dirs appear under right UID.
* systemd-sysext: optionally, run it in initrd already, before transitioning
into host, to open up possibility for services shipped like that.
* maybe add a tool that displays most recent journal logs as QR code to scan
off screen and run it automatically on boot failures, emergency logs and
such. Use DRM APIs directly, see
https://github.com/dvdhrm/docs/blob/master/drm-howto/modeset.c for an example
for doing that.
* introduce /dev/disk/root/* symlinks that allow referencing partitions on the
disk the rootfs is on in a reasonably secure way. (or maybe: add
/dev/gpt-auto-{home,srv,boot,…} similar in style to /dev/gpt-auto-root as we
already have it.
* whenever we receive fds via SCM_RIGHTS make sure none got dropped due to the
reception limit the kernel silently enforces.
* add an Open= setting to service unit files that can open arbitrary file
system paths at service startup time and pass them to the service process via
our usual socket activation protocol. If passed path refers to AF_UNIX
socket: connect() to it.
* Similar, ConnectStream= which takes IP addresses and connects to them.
* Similar, Load= which takes literal data in text or base64 format, and puts it
into a memfd, and passes that. This enables some fun stuff, such as embedding
bash scripts in unit files, by combining Load= with ExecStart=/bin/bash
/proc/self/fd/3
* add a ConnectSocket= setting to service unit files, that may reference a
socket unit, and which will connect to the socket defined therein, and pass
the resulting fd to the service program via socket activation proto.
* Add a concept of ListenStream=anonymous to socket units: listen on a socket
that is deleted in the fs. Usecase would be with ConnectSocket= above.
* importd: support image signature verification with PKCS#7 + OpenBSD signify
logic, as alternative to crummy gpg
* add "systemd-analyze debug" + AttachDebugger= in unit files: The former
specifies a command to execute; the latter specifies that an already running
"systemd-analyze debug" instance shall be contacted and execution paused
until it gives an OK. That way, tools like gdb or strace can be safely be
invoked on processes forked off PID 1.
* expose MS_NOSYMFOLLOW in various places
* credentials system:
- acquire from EFI variable?
- acquire via via ask-password?
- acquire creds via keyring?
- pass creds via keyring?
- pass creds via memfd?
- acquire + decrypt creds from pkcs11?
- make systemd-cryptsetup acquire pw via creds logic
- make PAMName= acquire pw via creds logic
- make macsec/wireguard code in networkd read key via creds logic
- make gatwayd/remote read key via creds logic
- add sd_notify() command for flushing out creds not needed anymore
- make user manager instances create and use a user-specific key (the one in
/var/lib is root-only) and add --user switch to systemd-creds to use it
* add tpm.target or so which is delayed until TPM2 device showed up in case
firmware indicates there is one.
* TPM2: auto-reenroll in cryptsetup, as fallback for hosed firmware upgrades
and such
* introduce a new group to own TPM devices
* cyptsetup: add option for automatically removing empty password slot on boot
* cryptsetup: optionally, when run during boot-up and password is never
entered, and we are on battery power (or so), power off machine again
* cryptsetup: when waiting for FIDO2/PKCS#11 token, tell plymouth that, and
allow plymouth to abort the waiting and enter pw instead
* make cryptsetup lower --iter-time
* cryptsetup: allow encoding key directly in /etc/crypttab, maybe with a
"base64:" prefix. Useful in particular for pkcs11 mode.
* cryptsetup: reimplement the mkswap/mke2fs in cryptsetup-generator to use
systemd-makefs.service instead.
* cryptsetup:
- cryptsetup-generator: allow specification of passwords in crypttab itself
- support rd.luks.allow-discards= kernel cmdline params in cryptsetup generator
* when configuring loopback netif, and it fails due to EPERM, eat up error if
it happens to be set up alright already.
* at boot: check if battery above some threshold, if not power off again after explanation
* userdb: add field for ambient caps, so that a user can have CAP_WAKE_ALARM
for example. And add code that resets ambient caps for all services by
default.
* sd-bus: when connecting to some dbus server socker, set originating AF_UNIX
socket name in abstract namespace to include "description" string, and pick
it up from there in sd_bus_creds logic. i.e. we can use the socket peer
address as conduit for some minimal connection metainfo, and use it to
restore the "description" logic that kdbus used to have.
* systemd-analyze netif that explains predictable interface (or networkctl)
* Add service setting to run a service within the specified VRF. i.e. do the
equivalent of "ip vrf exec".
* change SwitchRoot() implementation in PID 1 to use pivot_root(".", "."), as
documented in the pivot_root(2) man page, so that we can drop the /oldroot
temporary dir.
* special case some calls of chase_symlinks() to use openat2() internally, so
that the kernel does what we otherwise do.
* add a new flag to chase_symlinks() that stops chasing once the first missing
component is found and then allows the caller to create the rest.
* make use of new glibc 2.32 APIs sigabbrev_np() and strerrorname_np().
* if /usr/bin/swapoff fails due to OOM, log a friendly explanatory message about it
* pid1: Move to tracking of main pid/control pid of units per pidfd
* pid1: support new clone3() fork-into-cgroup feature
* pid1: also remove PID files of a service when the service starts, not just
when it exits
* make us use dynamically fewer deps for containers in general purpose distros:
o turn into dlopen() deps:
- p11-kit-trust (always)
- kmod-libs (only when called from PID 1)
- libblkid (only in RootImage= handling in PID 1, but not elsewhere)
- libpam (only when called from PID 1)
- bzip2, xz, lz4 (always — gzip and zstd should probably stay static deps the way they are,
since they are so basic and our defaults)
o move into separate libsystemd-shared-iptables.so .so
- iptables-libs (only used by nspawn + networkd)
* seccomp: maybe use seccomp_merge() to merge our filters per-arch if we can.
Apparently kernel performance is much better with fewer larger seccomp
filters than with more smaller seccomp filters.
* systemd-path: add ESP and XBOOTLDR path. Add "private" runtime/state/cache dir enum,
mapping to $RUNTIME_DIRECTORY, $STATE_DIRECTORY and such
* All tools that support --root= should also learn --image= so that they can
operate on disk images directly. Specifically: systemctl, coredumpctl.
(Already done: bootctl, systemd-nspawn, systemd-firstboot,
systemd-repart, systemd-tmpfiles, systemd-sysusers, journalctl)
* seccomp: by default mask x32 ABI system wide on x86-64. it's on its way out
* seccomp: don't install filters for ABIs that are masked anyway for the
specific service
* busctl: maybe expose a verb "ping" for pinging a dbus service to see if it
exists and responds.
* Maybe add a separate GPT partition type to the discoverable partition spec
for "hibernate" partitions, that are exactly like swap partitions but only
activated right before hibernation and thus never used for regular swapping.