Bare Metal Infrastructure Provider Phase 1 #660

Open
smira opened this issue Oct 2, 2024 · 0 comments
smira commented Oct 2, 2024

Prerequisites

Support only Talos >= 1.8.0

Phase 0.1

Implement the metal agent as a mini-Talos with a fat base image:

  • include all firmware extensions and anything hardware-related, so we can always boot & detect hardware
  • figure out a way either to copy or (better!) share some resources and controllers between an agent and Talos
  • SideroLink
  • hardware info, including disks and network links
  • maintenance apid running only over SideroLink

Idea: add a build tag to Talos that removes everything not required for an agent, and build the Talos initramfs image with that tag. Layer the agent on top of that, for example as an extension service.

Boot all (test) QEMU VMs via PXE and run a minimal agent that reports back to the infra provider and to Omni. If the machine is allocated, PXE boot it from the Image Factory with the proper schematic and Talos version.

  • PXE server
  • Proxy DHCP (optional)
  • Metadata server (returning initial talos.config= contents)
  • Provider API for the agent to report back to (borrow ideas/implementation from Sidero Metal)
  • Inject proper labels/join token to associate the machine with the provider

Flow:

  • initially provider doesn't know about any machines
  • a machine comes to the PXE endpoint, it's unknown, so provider boots an agent
  • agent reports back to the provider: "there's a machine UUID x"
  • now provider knows about the machine, and next PXE attempt will boot Talos
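
The flow above can be sketched as a small state holder on the provider side. This is a minimal illustration, not the provider's actual implementation: the names `Provider`, `BootTarget`, `handle_pxe_request` and `handle_agent_report` are hypothetical, and the machine inventory is just an in-memory set.

```python
from dataclasses import dataclass, field
from enum import Enum


class BootTarget(Enum):
    AGENT = "agent"
    TALOS = "talos"


@dataclass
class Provider:
    # UUIDs of machines that an agent has already reported (hypothetical in-memory store).
    known_machines: set = field(default_factory=set)

    def handle_pxe_request(self, machine_uuid: str) -> BootTarget:
        """Unknown machines get the agent; machines the provider knows about get Talos."""
        if machine_uuid in self.known_machines:
            return BootTarget.TALOS
        return BootTarget.AGENT

    def handle_agent_report(self, machine_uuid: str) -> None:
        """Called when an agent reports "there's a machine UUID x"."""
        self.known_machines.add(machine_uuid)
```

For example, the first `handle_pxe_request("x")` returns `BootTarget.AGENT`; after `handle_agent_report("x")`, the next PXE attempt for `"x"` returns `BootTarget.TALOS`.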

Phase 0.2

Power management:

  • agent should discover/provision and report IPMI credentials
  • provider should reconcile power state of the machine based on what Omni says
  • in QEMU/test environment we have a mock power API we can leverage for testing
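
A rough sketch of what reconciling power state against a mock power API could look like. `MockPowerAPI` and `reconcile_power` are hypothetical names for illustration only; the real mock API used in the QEMU environment may look quite different.

```python
class MockPowerAPI:
    """In-memory stand-in for IPMI, keyed by machine UUID (hypothetical)."""

    def __init__(self):
        self._powered_on = {}

    def is_powered_on(self, machine_uuid: str) -> bool:
        return self._powered_on.get(machine_uuid, False)

    def power_on(self, machine_uuid: str) -> None:
        self._powered_on[machine_uuid] = True

    def power_off(self, machine_uuid: str) -> None:
        self._powered_on[machine_uuid] = False


def reconcile_power(api, machine_uuid: str, desired_on: bool) -> bool:
    """Drive the machine towards the desired state; return True if an action was taken."""
    if api.is_powered_on(machine_uuid) == desired_on:
        return False  # already in the desired state, nothing to do
    if desired_on:
        api.power_on(machine_uuid)
    else:
        api.power_off(machine_uuid)
    return True
```

Because the function only acts when observed and desired state differ, calling it repeatedly is safe, which is the property a reconcile loop needs.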

Omni:

  • default power state (based on allocated/not-allocated status)
  • user-provided overrides for power state/UI & API to manage power state
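
One possible way to combine the default with a user override, as a sketch. It assumes that allocated machines default to powered on and unallocated machines to powered off; the issue does not spell out that mapping, and `desired_power_on` is a hypothetical helper.

```python
from typing import Optional


def desired_power_on(allocated: bool, override: Optional[bool] = None) -> bool:
    """Return the power state the provider should reconcile towards.

    Assumption: allocated -> on, not allocated -> off, unless the user set an override.
    """
    if override is not None:
        return override
    return allocated
```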

Phase 0.3

Acceptance flow (configurable, with an option to auto-accept): the machine appears in the Machines view, but no actions are performed on it (e.g. it is not wiped, it can't be added to a cluster, and no IPMI setup or power management is done).

Omni provides some UI to accept machines, show not accepted machines, etc.

Provider knows about the acceptance status - if machine is accepted, provider can start some additional actions (in the next phases).

If the machine is not accepted, the agent should "hang" until it receives a signal that the machine was either accepted or rejected. It provisions IPMI creds only once the machine is accepted.
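
The agent-side waiting behaviour might look roughly like the sketch below. It polls for simplicity; a real agent could just as well block on a stream from the provider. `Acceptance`, `wait_for_decision` and `run_agent` are hypothetical names, and `get_status` / `provision_ipmi` are callbacks supplied by the caller.

```python
import time
from enum import Enum
from typing import Callable


class Acceptance(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


def wait_for_decision(
    get_status: Callable[[], Acceptance],
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Acceptance:
    """Block ("hang") until the machine is either accepted or rejected."""
    while True:
        status = get_status()
        if status is not Acceptance.PENDING:
            return status
        sleep(poll_interval)


def run_agent(
    get_status: Callable[[], Acceptance],
    provision_ipmi: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Provision IPMI creds only after acceptance; return True if accepted."""
    decision = wait_for_decision(get_status, sleep=sleep)
    if decision is Acceptance.ACCEPTED:
        provision_ipmi()
        return True
    return False
```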

Phase 0.4

Hardware reboot support.

Phase 0.5

Disk wipe: an initial wipe after acceptance, and another wipe after the machine is removed from a cluster.

Omni: change the "reset" flow in Omni to use the provider's wipe capability. The machine is force-rebooted over IPMI (or equivalent) and forced to PXE boot; the agent boots up and wipes the disks, after which the machine is available once again.
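
As an illustration of the ordering in that reset flow, here is a small sketch using a recording fake in place of a real BMC client. `RecordingBMC`, `set_next_boot_pxe`, `power_cycle` and `mark_for_wipe` are all hypothetical names, not an existing IPMI or Redfish library API.

```python
class RecordingBMC:
    """Fake BMC client that only records the calls made to it (hypothetical)."""

    def __init__(self):
        self.calls = []

    def set_next_boot_pxe(self, machine_uuid: str) -> None:
        self.calls.append(("set_next_boot_pxe", machine_uuid))

    def power_cycle(self, machine_uuid: str) -> None:
        self.calls.append(("power_cycle", machine_uuid))


def reset_machine(bmc, machine_uuid: str, mark_for_wipe) -> None:
    """Request a wipe, force PXE on next boot, then force-reboot the machine."""
    # Record the wipe request first, so the provider serves the agent
    # (not Talos) on the next PXE attempt.
    mark_for_wipe(machine_uuid)
    # Set the boot override before the power cycle so it applies to this reboot.
    bmc.set_next_boot_pxe(machine_uuid)
    bmc.power_cycle(machine_uuid)
```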

Phase 0.6

Redfish support.

Phase 0.7

Provider-specific configurable labels for the joining machines (e.g. dc=nyc).
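
A minimal sketch of parsing such labels from a comma-separated `key=value` string. The input format and the `parse_labels` helper are assumptions for illustration; the issue only gives `dc=nyc` as an example label.

```python
def parse_labels(raw: str) -> dict:
    """Parse "dc=nyc,rack=r1" into {"dc": "nyc", "rack": "r1"}."""
    labels = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty input and trailing commas
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"invalid label {item!r}, expected key=value")
        labels[key] = value.strip()
    return labels
```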

Phase 0.8

Discover hardware in the agent (e.g. a bnx2 NIC) and automatically build the initial set of system extensions to use, e.g. bnx2-firmware.
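
One simple way to do this would be a lookup table from detected drivers to extension names, sketched below. The table contents are illustrative: only the `bnx2` to `bnx2-firmware` pair comes from the example above, and the actual extension names would need to be checked against the published extension list.

```python
# Hypothetical mapping from detected kernel driver to system extension name.
DRIVER_TO_EXTENSION = {
    "bnx2": "bnx2-firmware",
}


def extensions_for(detected_drivers) -> list:
    """Return the deduplicated, sorted set of extensions for the detected drivers."""
    return sorted(
        {DRIVER_TO_EXTENSION[d] for d in detected_drivers if d in DRIVER_TO_EXTENSION}
    )
```

Drivers without an entry in the table are ignored, so hardware that needs no extra firmware does not affect the result.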

Phase 0.9

Support for kexec when transitioning from the agent to Talos.

Example:

  • machine is discovered, agent boots up
  • machine is accepted, agent wipes the disks and provisions IPMI creds, but waits for some timeout before the machine gets powered off/rebooted
  • if the machine is allocated within that timeout, then instead of a full reboot the agent downloads the Talos kernel args, initramfs and kernel image, and kexecs into Talos
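
The decision in this example could be expressed as a small pure function, sketched here. `next_action` and its return values are hypothetical; the actual kexec step (loading the kernel and initramfs) is out of scope for this sketch.

```python
def next_action(allocated: bool, seconds_since_ready: float, timeout: float) -> str:
    """Decide what the agent does after wiping disks and provisioning IPMI creds."""
    if seconds_since_ready >= timeout:
        return "power_off"  # timeout expired: fall back to power off/reboot
    if allocated:
        return "kexec"  # allocated within the timeout: skip the full reboot
    return "wait"  # still inside the window, keep waiting for allocation
```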
smira changed the title from "Bare Metal Infrastructure Provide Phase 1" to "Bare Metal Infrastructure Provider Phase 1" on Oct 2, 2024
utkuozdemir added a commit to utkuozdemir/sidero-omni-infra-provider-bare-metal that referenced this issue Oct 14, 2024
Add initial implementation of the Talos agent mode service.

Related to siderolabs/omni#660.

Signed-off-by: Utku Ozdemir <[email protected]>
utkuozdemir added a commit to utkuozdemir/sidero-omni-infra-provider-bare-metal that referenced this issue Oct 18, 2024
Add initial implementation of the Talos agent mode service.

Related to siderolabs/omni#660.

Signed-off-by: Utku Ozdemir <[email protected]>
utkuozdemir added a commit to utkuozdemir/sidero-omni-infra-provider-bare-metal that referenced this issue Oct 21, 2024
Add initial implementation of the Talos agent mode service.

Related to siderolabs/omni#660.

Signed-off-by: Utku Ozdemir <[email protected]>
utkuozdemir added a commit to utkuozdemir/sidero-omni-infra-provider-bare-metal that referenced this issue Oct 22, 2024
Add initial implementation of the Talos agent mode service.

Related to siderolabs/omni#660.

Signed-off-by: Utku Ozdemir <[email protected]>