# Azure ML Federated Learning Sandboxes

This page describes the different sandboxes that you can fully provision and use out-of-the-box with our real-world examples. Each sandbox has distinct properties depending on what you'd like to test.

- **Minimal sandbox**: the quickest path to a demo environment (horizontal FL only).
- **Eyes-on sandboxes**: sandboxes where you can debug your code, but the data remains accessible to the users of your subscription.
- **Eyes-off sandboxes**: sandboxes where the data is kept in storage accounts without public network access, accessible only by the computes through a vnet.
- **Private sandboxes**: eyes-off sandboxes where the Azure ML workspace and its resources are also protected behind a vnet.
- **Confidential VM sandboxes**: eyes-off sandboxes where the computes are Confidential VMs.
- **Configurable sandboxes**: the generic bicep scripts underlying our eyes-on/eyes-off sandboxes, with multiple parameters you can modify to fit your needs.

🚨 🚨 🚨 **IMPORTANT**: These sandboxes require you to be the **Owner** of an Azure resource group; the Contributor role is not enough. Depending on your subscription's admin policies, even if you can create a resource group yourself, you might not be its Owner. Without ownership, you will not be able to set the RBAC roles necessary to provision these sandboxes. Ask your subscription administrator for help.
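To check up front whether you have the Owner role on your resource group, you can list your role assignments with the az CLI (the placeholders below are yours to fill in):

```bash
# list your role assignments scoped to the resource group; look for "Owner"
az role assignment list \
  --assignee "<your user principal name or object id>" \
  --resource-group "<resource group name>" \
  --query "[].roleDefinitionName" -o tsv
```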

## Minimal sandbox

Deploy a completely open sandbox that lets you try things out in an eyes-on environment. This setup is intended for demo purposes only: the data remains accessible to the users of your subscription who open the storage accounts, and data exfiltration is possible. It supports horizontal FL scenarios only.

Deploy to Azure

### ❗ Important parameters

| Parameter | Description | Values |
| --- | --- | --- |
| `compute1SKU` | SKU of the first compute to provision. | ex: `Standard_DS4_v2` |
| `siloRegions` | List of regions used for the silos. All our samples work with 3 regions. | ex: `["australiaeast", "eastus", "westeurope"]` |
| `kaggleUsername` and `kaggleKey` | Optional: some of our samples require Kaggle credentials to download datasets; providing them here ensures the credentials get injected into the workspace secret store properly (you can also do that manually later). | |
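If you prefer the CLI over the portal button, the same parameters can be passed to an `az deployment group create` call; a sketch, assuming the minimal sandbox template lives at `./mlops/bicep/open_sandbox_setup.bicep` (check this repo for the exact path before running):

```bash
# NOTE: the template path below is an assumption -- verify it in this repo
az deployment group create \
  --template-file ./mlops/bicep/open_sandbox_setup.bicep \
  --resource-group <resource group name> \
  --parameters demoBaseName="fldemo" \
      compute1SKU="Standard_DS4_v2" \
      siloRegions='["australiaeast","eastus","westeurope"]'
```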

### Relevant Documentation

To manually reproduce this full provisioning, see the relevant documentation:

## Eyes-on sandboxes

Deploy a sandbox where the computes are located in a vnet and can communicate with one another through vnet peering for Vertical FL, while the storage accounts remain eyes-on to allow for debugging. This is a good sandbox for figuring things out on synthetic data.

These sandboxes are typical of a cross-geo federated learning scenario: the silos are all provisioned within a single tenant, but in different regions.

| Deploy | Description |
| --- | --- |
| Deploy to Azure | Eyes-on with 1 CPU compute per silo |
| Deploy to Azure | Eyes-on with 1 GPU compute per silo |
| Deploy to Azure | Eyes-on with 2 computes per silo (1 CPU, 1 GPU) |

### ❗ Important parameters

| Parameter | Description | Values |
| --- | --- | --- |
| `primarySKU` | SKU of the first compute to provision. | ex: `Standard_DS4_v2` |
| `secondarySKU` | SKU of the second compute to provision. | ex: `STANDARD_NC6` |
| `siloRegions` | List of regions used for the silos. All our samples work with 3 regions. | ex: `["australiaeast", "eastus", "westeurope"]` |
| `applyVNetPeering` | Peer the silo networks to the orchestrator network to allow for live private communication between jobs (required for Vertical FL). | `true` or `false` |
| `kaggleUsername` and `kaggleKey` | Optional: some of our samples require Kaggle credentials to download datasets; providing them here ensures the credentials get injected into the workspace secret store properly (you can also do that manually later; see the sketch after this table). | |
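If you skip `kaggleUsername`/`kaggleKey` at provisioning time, you can inject the credentials manually later by writing them into the workspace's key vault; a sketch, in which the secret names are assumptions (check what the samples actually read):

```bash
# store kaggle credentials in the workspace key vault
# (the secret names below are assumptions -- check what the samples expect)
az keyvault secret set --vault-name <workspace key vault name> --name kaggleusername --value "<kaggle username>"
az keyvault secret set --vault-name <workspace key vault name> --name kagglekey --value "<kaggle api key>"
```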

### Architecture

*Architecture schema of the eyes-on sandboxes.*

### Relevant Documentation

To manually reproduce this full provisioning, see the relevant documentation:

## Eyes-off sandboxes

Deploy a sandbox where the silo storage accounts are kept eyes-off behind a private service endpoint, accessible only by the computes through a vnet. This sandbox is typical of a cross-geo federated learning scenario: each silo is provisioned within a single tenant, but in a different region. Each silo has a distinct virtual network enabling private communication between the silo compute and the silo storage.
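Once deployed, you can sanity-check that a silo storage account indeed denies public network access with the az CLI (the actual account names depend on your `demoBaseName`):

```bash
# should print "Disabled" for an eyes-off silo storage account
az storage account show \
  --name <silo storage account name> \
  --resource-group <resource group name> \
  --query publicNetworkAccess -o tsv
```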

| Deploy | Description |
| --- | --- |
| Deploy to Azure | Eyes-off with 1 CPU compute per silo |
| Deploy to Azure | Eyes-off with 1 GPU compute per silo |
| Deploy to Azure | Eyes-off with 2 computes per silo (1 CPU, 1 GPU) |

### ❗ Important parameters

| Parameter | Description | Values |
| --- | --- | --- |
| `primarySKU` | SKU of the first compute to provision. | ex: `Standard_DS4_v2` |
| `secondarySKU` | SKU of the second compute to provision. | ex: `STANDARD_NC6` |
| `siloRegions` | List of regions used for the silos. All our samples work with 3 regions. | ex: `["australiaeast", "eastus", "westeurope"]` |
| `orchestratorEyesOn` | Sets the orchestrator network access to either public (`true`) or private (`false`, default). | `true` or `false` |
| `applyVNetPeering` | Peer the silo networks to the orchestrator network to allow for live private communication between jobs (required for Vertical FL). | `true` or `false` |
| `kaggleUsername` and `kaggleKey` | Optional: some of our samples require Kaggle credentials to download datasets; providing them here ensures the credentials get injected into the workspace secret store properly (you can also do that manually later). | |

### Architecture

*Architecture schema of the eyes-off sandboxes.*

### Relevant Documentation

To manually reproduce this full provisioning, see the relevant documentation:

## Private sandboxes

This is an eyes-off sandbox where, in addition, the Azure ML workspace and all its related resources (container registry, key vault, etc.) are also provisioned behind a vnet. All of these are made accessible to each silo through private endpoints.
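After deployment, you can check the workspace's network access setting from the CLI; a quick sketch, assuming the Azure ML CLI extension is installed (the exact output field name may vary across CLI versions):

```bash
# inspect the workspace network setting (requires the Azure ML CLI extension)
az ml workspace show \
  --name <workspace name> \
  --resource-group <resource group name> \
  --query public_network_access -o tsv
```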

| Deploy | Description |
| --- | --- |
| Deploy to Azure | Private with 1 CPU compute per silo |
| Deploy to Azure | Private with 1 GPU compute per silo |
| Deploy to Azure | Private with 2 computes per silo (1 CPU, 1 GPU) |

### ❗ Important parameters

| Parameter | Description | Values |
| --- | --- | --- |
| `primarySKU` | SKU of the first compute to provision. | ex: `Standard_DS4_v2` |
| `secondarySKU` | SKU of the second compute to provision. | ex: `STANDARD_NC6` |
| `siloRegions` | List of regions used for the silos. All our samples work with 3 regions. | ex: `["australiaeast", "eastus", "westeurope"]` |
| `workspaceNetworkAccess` | To make debugging easier, use `public` to make the Azure ML workspace accessible through its public IP in the Azure portal (default: `public`). | `public` or `private` |
| `applyVNetPeering` | Peer the silo networks to the orchestrator network to allow for live private communication between jobs (required for Vertical FL). | `true` or `false` |
| `kaggleUsername` and `kaggleKey` | Optional: some of our samples require Kaggle credentials to download datasets; providing them here ensures the credentials get injected into the workspace secret store properly (you can also do that manually later). | |

### Architecture

*Architecture schema of the private sandboxes.*

### Relevant Documentation

To manually reproduce this full provisioning, see the relevant documentation:

## Confidential sandboxes

Deploy an eyes-off sandbox where the computes leverage confidential computing to keep your training and processing within an enclave.

| Deploy | Description |
| --- | --- |
| Deploy to Azure | A sandbox with AKS clusters with confidential computes per silo and orchestrator. |

Note: to take full advantage of the confidential VMs, you will need to finalize the setup of the AKS cluster by creating an instance type and using it in your pipeline configs, as sketched below.
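For reference, Azure ML's Kubernetes compute targets jobs at an `InstanceType` custom resource on the cluster; below is a minimal sketch of creating one, assuming kubectl access to the provisioned AKS cluster (the instance type name and resource values are illustrative, tune them to your confidential VM SKU):

```bash
# connect kubectl to the AKS cluster
az aks get-credentials --name <aks cluster name> --resource-group <resource group name>

# create an instance type for Azure ML jobs to target
# (name and resource values are illustrative)
kubectl apply -f - <<EOF
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: confidentialinstance
spec:
  resources:
    requests:
      cpu: "2"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
EOF
```

You can then reference the instance type by name in the `resources` section of your pipeline configs (e.g. `instance_type: confidentialinstance`).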

### ❗ Important parameters

| Parameter | Description | Values |
| --- | --- | --- |
| `computeSKU` | VM SKU to provision in the AKS cluster (the default uses a confidential VM from the DCasv5 series). You can also use any non-confidential SKU. | ex: `Standard_DC4as_v5` |
| `siloRegions` | List of regions used for the silos. All our samples work with 3 regions. ❗ Make sure you have quota in those regions, in particular for confidential compute (see the quota check after this table). | ex: `["australiaeast", "eastus", "westeurope"]` |
| `orchestratorEyesOn` | Sets the orchestrator network access to either public (`true`) or private (`false`, default). | `true` or `false` |
| `applyVNetPeering` | Peer the silo networks to the orchestrator network to allow for live private communication between jobs (required for Vertical FL). | `true` or `false` |
| `kaggleUsername` and `kaggleKey` | Optional: some of our samples require Kaggle credentials to download datasets; providing them here ensures the credentials get injected into the workspace secret store properly (you can also do that manually later). | |
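To check your confidential compute quota in a given region before deploying, `az vm list-usage` shows per-family usage; the family filter below is an assumption, adjust it to the SKU series you pick:

```bash
# check usage/quota for the DCasv5 confidential VM family in a region
az vm list-usage --location eastus -o table | grep -i dcas
```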

### Architecture

Note: in the current sandbox, we provision only in the `eastus` region by default, to allow for capacity and quick deployment.

*Architecture schema of the sandboxes with confidential compute and vnets.*

### Relevant Documentation

To manually reproduce this full provisioning, see the relevant documentation:

## Configurable sandboxes

This section exposes the generic provisioning scripts we use to create the sandboxes above. Feel free to adapt their parameters to your own needs.

### Using the Azure Portal

| Deploy | Description |
| --- | --- |
| Deploy to Azure | Deploy a sandbox with a vnet and a public IP for the orchestrator, either eyes-on or eyes-off, with or without vnet peering. |
| Deploy to Azure | Deploy a sandbox with a vnet and a public IP for the orchestrator, using confidential computes in AKS, either eyes-on or eyes-off, with or without vnet peering. |

### Using bicep

In this section, we use bicep scripts to automatically provision a set of resources for an FL sandbox.

1. Using the az CLI, log into your Azure subscription:

   ```bash
   az login
   az account set --name <subscription name>
   ```

2. Optional: create a new resource group for the demo resources. A dedicated group makes it easier to clean up afterwards (deleting the resource group deletes all resources within it).

   ```bash
   # create a resource group for the resources
   az group create --name <resource group name> --location <region>
   ```

   Note: if you have the Owner role only in a given resource group (as opposed to the whole subscription), use that resource group instead of creating a new one.

3. Run the bicep deployment script in a resource group you own:

   ```bash
   # deploy the demo resources in your resource group
   az deployment group create --template-file ./mlops/bicep/vnet_publicip_sandbox_setup.bicep --resource-group <resource group name> --parameters demoBaseName="fldemo"
   ```

   Notes:

   - If someone has already provisioned a demo with the same name in your subscription, change the `demoBaseName` parameter to a unique value.
   - By default, only one CPU compute is created for each silo. Set the `compute2` parameter to `true` if you wish to create both CPU and GPU computes for each silo.
   - Some regions don't have enough quota to provision GPU computes. Look at the headers of the bicep script to change the region/`computeSKU`.

4. Alternatively, you can provision the confidential compute sandbox the same way:

   ```bash
   # deploy the demo resources in your resource group
   az deployment group create --template-file ./mlops/bicep/vnet_publicip_sandbox_aks_confcomp_setup.bicep --resource-group <resource group name> --parameters demoBaseName="fldemo"
   ```
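To combine several of the options above in one command, you can override the corresponding template parameters at deployment time; a sketch using parameters mentioned in the notes and tables above (adapt the names to the script you deploy):

```bash
# example: override several sandbox parameters in a single deployment
az deployment group create \
  --template-file ./mlops/bicep/vnet_publicip_sandbox_setup.bicep \
  --resource-group <resource group name> \
  --parameters demoBaseName="fldemo" \
      compute2=true \
      applyVNetPeering=true \
      orchestratorEyesOn=false
```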