[Workflow Interface] Refactor FLSpec and Runtime to enhance modularity #1363

Open: wants to merge 60 commits into base: develop

Changes from 51 commits

Commits (60)
9190c2e
local_runtime refactor initial commit
ishant162 Feb 6, 2025
c59508d
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 6, 2025
2aa528a
fix format
ishant162 Feb 6, 2025
9ec51ce
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 8, 2025
43ff9f9
Optimized code
ishant162 Feb 9, 2025
912119a
federated_runtime refactor
ishant162 Feb 10, 2025
f537e11
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 10, 2025
61f1a5d
federated_runtime refactor optimization
ishant162 Feb 10, 2025
a973959
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 10, 2025
d13ac8a
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 11, 2025
2bc3898
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 11, 2025
151484e
Updated flspec.py
ishant162 Feb 11, 2025
dcc7d61
updated aggregator.py
ishant162 Feb 11, 2025
d8a552c
update federated_runtime
ishant162 Feb 11, 2025
120878f
Merge branch 'develop' into runtime_refactor
ishant162 Feb 11, 2025
de6dbbb
updated federated_runtime
ishant162 Feb 11, 2025
ab75dca
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 12, 2025
a416b5d
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 12, 2025
4fce1b2
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 13, 2025
617a5b7
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 17, 2025
072fcfc
Updated federated_runtime.py
ishant162 Feb 17, 2025
af11aab
Incorporated internal review comments
ishant162 Feb 18, 2025
58d32af
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 18, 2025
3127f68
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 19, 2025
e7220bc
Incorporated internal review comments
ishant162 Feb 19, 2025
7732987
Incorporated review comments
ishant162 Feb 20, 2025
c4c5d9f
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 20, 2025
b984789
Incorporated review comments
ishant162 Feb 20, 2025
dc0d2af
Updated FederatedRuntime testcases
ishant162 Feb 20, 2025
de86fe6
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 20, 2025
f22a379
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 20, 2025
9a568c7
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 21, 2025
9fbcb23
update docstring
ishant162 Feb 21, 2025
b1653d3
updated and restructured code
ishant162 Feb 21, 2025
d8649b1
updating 101_MNIST
ishant162 Feb 24, 2025
56e5313
updating 101_MNIST
ishant162 Feb 24, 2025
62c5505
updating 101_MNIST
ishant162 Feb 24, 2025
b1ec130
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 24, 2025
5de433c
Incorporated review comments
ishant162 Feb 24, 2025
5143a7c
Update documentation
ishant162 Feb 24, 2025
52b702c
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 24, 2025
95076ee
Update documentation
ishant162 Feb 24, 2025
dba31f5
Incorporated review comments
ishant162 Feb 25, 2025
234a80c
Updated documentation
ishant162 Feb 25, 2025
0a51540
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 25, 2025
c49ae86
Merge branch 'develop' into runtime_refactor
payalcha Feb 25, 2025
abf6d44
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 27, 2025
3ad2e30
Review comments incorporated
ishant162 Feb 27, 2025
4fab86e
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 27, 2025
6a5fcb2
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 27, 2025
ecb7b63
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Feb 27, 2025
0e30108
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Mar 3, 2025
b72e7de
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Mar 3, 2025
c6a696b
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Mar 3, 2025
d3696d4
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Mar 4, 2025
d7b4b59
Update federated_runtime
ishant162 Mar 4, 2025
b1fc0ce
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Mar 6, 2025
f28f842
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Mar 7, 2025
3d45d8c
Updated testcase
ishant162 Mar 7, 2025
5438228
Merge branch 'securefederatedai:develop' into runtime_refactor
ishant162 Mar 7, 2025
29 changes: 19 additions & 10 deletions docs/about/features_index/workflowinterface.rst
@@ -18,14 +18,25 @@ A new OpenFL interface that gives significantly more flexibility to researchers in
There are several modifications we make in our reimagined version of this interface that are necessary for federated learning:

1. *Placement*: Metaflow's :code:`@step` decorator is replaced by placement decorators that specify where a task will run. In horizontal federated learning, there are server (or aggregator) and client (or collaborator) nodes. Tasks decorated by :code:`@aggregator` will run on the aggregator node, and :code:`@collaborator` will run on the collaborator node. These placement decorators are interpreted by *Runtime* implementations: these do the heavy lifting of figuring out how to get the state of the current task to another process or node.
2. *Runtime*: Each flow has a :code:`.runtime` attribute. The runtime encapsulates the details of the infrastructure where the flow will run. We support the LocalRuntime for simulating experiments on a local node and the FederatedRuntime to launch experiments on distributed infrastructure.
2. *Runtime*: The runtime encapsulates the details of the infrastructure where the flow will run. We support the LocalRuntime for simulating experiments on a local node and the FederatedRuntime to launch experiments on distributed infrastructure.
3. *Conditional branches*: Perform different tasks if a criterion is met
4. *Loops*: Loops are internal to a flow; this is necessary to support rounds of training where the same sequence of tasks is performed repeatedly (see the sketch below).

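A minimal sketch of a flow skeleton that puts these four ideas together is shown below. The :code:`FLSpec` import path matches the one used elsewhere in this PR; the placement-decorator module path, the class name, and the task and attribute names are illustrative assumptions rather than prescribed by the library.

.. code-block:: python

    from openfl.experimental.workflow.interface import FLSpec
    # Assumed module path for the placement decorators
    from openfl.experimental.workflow.placement import aggregator, collaborator

    class SketchFlow(FLSpec):

        @aggregator
        def start(self):
            # self.collaborators is populated automatically by the Runtime
            self.rounds = 3
            self.current_round = 0
            self.next(self.local_task, foreach='collaborators')

        @collaborator
        def local_task(self):
            self.score = 0.9  # placeholder metric computed on local data
            self.next(self.join)

        @aggregator
        def join(self, inputs):
            # 'inputs' holds the collaborator states; a real flow would aggregate them here
            self.current_round += 1
            # Conditional branch + internal loop: repeat until all rounds are done
            if self.current_round < self.rounds:
                self.next(self.local_task, foreach='collaborators')
            else:
                self.next(self.end)

        @aggregator
        def end(self):
            print('Flow complete')
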
How to use it?
==============

Let's start with the basics. A flow is intended to define the entirety of federated learning experiment. Every flow begins with the :code:`start` task and concludes with the :code:`end` task. At each step in the flow, attributes can be defined, modified, or deleted. Attributes get passed forward to the next step in the flow, which is defined by the name of the task passed to the :code:`next` function. In the line before each task, there is a **placement decorator**. The placement decorator defines where that task will be run. The OpenFL Workflow Interface adopts the conventions set by Metaflow, that every workflow begins with start and concludes with the end task. In the following example, the aggregator begins with an optionally passed in model and optimizer. The aggregator begins the flow with the start task, where the list of collaborators is extracted from the runtime (:code:`self.collaborators = self.runtime.collaborators`) and is then used as the list of participants to run the task listed in self.next, aggregated_model_validation. The model, optimizer, and anything that is not explicitly excluded from the next function will be passed from the start function on the aggregator to the aggregated_model_validation task on the collaborator. Where the tasks run is determined by the placement decorator that precedes each task definition (:code:`@aggregator` or :code:`@collaborator`). Once each of the collaborators (defined in the runtime) complete the aggregated_model_validation task, they pass their current state onto the train task, from train to local_model_validation, and then finally to join at the aggregator. It is in join that an average is taken of the model weights, and the next round can begin.
Let's start with the basics. A flow is intended to define the entirety of a federated learning experiment. Every flow begins with the :code:`start` task and concludes with the
:code:`end` task. At each step in the flow, attributes can be defined, modified, or deleted. Attributes get passed forward to the next step in the flow, which is defined by
the name of the task passed to the :code:`next` function.
In the line before each task, there is a **placement decorator**. The placement decorator defines where that task will be run (:code:`@aggregator` or :code:`@collaborator`).
The OpenFL Workflow Interface adopts the convention set by Metaflow that every workflow begins with the start task and concludes with the end task. In the following example, the
aggregator begins the flow with the :code:`start` task and an optionally passed-in model and optimizer. The list of collaborators in the federation, :code:`self.collaborators`,
is automatically populated by the Runtime infrastructure. It serves as the list of participants that run the task listed in :code:`self.next`, :code:`aggregated_model_validation`.
A contributor commented on lines +33 to +34:

Looks like this adds a reserved keyword collaborators to the flow spec. I assume this is set statically at the start of the workflow execution? Is there any way to get an updated collaborator list from inside the workflow? Use case: for the FederatedRuntime, if additional envoys register with the director, how could they join the federation after experiment start (assume TLS is set up and handled)? With the current API (self.collaborators = self.runtime.collaborators), the flow could (in a future release) get access to this list by querying the runtime. This is a capability of the RANO fork of OpenFL used in the real world, where not all collaborators will necessarily be ready (or known) at the start of an experiment.

The model, optimizer, and anything that is not explicitly excluded from the :code:`next` function will be passed from the :code:`start` function on the aggregator to the
:code:`aggregated_model_validation` task on the collaborator.
Once the collaborators (defined in the runtime) complete the :code:`aggregated_model_validation` task, they
pass their current state on to the :code:`train` task, from :code:`train` to :code:`local_model_validation`, and then finally to :code:`join` at the aggregator.
It is in :code:`join` that an average is taken of the model weights, and the next round can begin.

.. code-block:: python

@@ -45,9 +56,9 @@ Let's start with the basics. A flow is intended to define the entirety of federa
@aggregator
def start(self):
print(f'Performing initialization for model')
self.collaborators = self.runtime.collaborators
self.private = 10
self.current_round = 0
print(f'Collaborators participating in federation: {self.collaborators}')
self.next(self.aggregated_model_validation,foreach='collaborators',exclude=['private'])

@collaborator
@@ -237,20 +248,19 @@ Some important points to remember while creating callback function and private a
- In the above example, multiple collaborators share the same callback function and private attributes. Depending on the federated learning requirements, the user can specify a unique callback function or private attributes for each participant
- *Private attributes* need to be set after instantiating the participant (a sketch of this setup follows below).

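The participant construction itself is collapsed in this view; the sketch below shows the general pattern, assuming the :code:`Aggregator` and :code:`Collaborator` participant classes and the :code:`LocalRuntime` constructor arguments used in the OpenFL workflow tutorials. The shard strings returned by the callback are placeholders for real data loaders.

.. code-block:: python

    from openfl.experimental.workflow.interface import Aggregator, Collaborator
    from openfl.experimental.workflow.runtime import LocalRuntime

    def collaborator_private_attrs(index):
        # Callback invoked after the participant is instantiated; extra keyword
        # arguments passed to Collaborator(...) are forwarded to this callback
        return {
            'train_loader': f'train_shard_{index}',  # placeholder
            'test_loader': f'test_shard_{index}',    # placeholder
        }

    aggregator = Aggregator()

    collaborators = [
        Collaborator(
            name=name,
            private_attributes_callable=collaborator_private_attrs,
            index=idx,
        )
        for idx, name in enumerate(['Portland', 'Seattle'])
    ]

    local_runtime = LocalRuntime(
        aggregator=aggregator,
        collaborators=collaborators,
        backend='single_process',
    )
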
Now let's see how the runtime for a flow is assigned, and the flow gets run:
To run the flow, simply pass the flow instance to the :code:`run()` method of the runtime:

.. code-block:: python

flow = FederatedFlow()
flow.runtime = local_runtime
flow.run()
local_runtime.run(flow)

And that's it! This will run an instance of the :code:`FederatedFlow` on a single node in a single process.

LocalRuntime Backends
---------------------

The Runtime defines where code will run, but the Runtime has a :code:`Backend`, which defines the underlying implementation of *how* the flow will be executed. :code:`single_process` is the default in the :code:`LocalRuntime`: it executes all code sequentially within a single Python process, and is well suited to both high-spec and low-spec hardware.
The Runtime defines where code will run, but the Runtime has a :code:`backend`, which defines the underlying implementation of *how* the flow will be executed. :code:`single_process` is the default in the :code:`LocalRuntime`: it executes all code sequentially within a single Python process, and is well suited to both high-spec and low-spec hardware.

For users with large servers or multiple GPUs they wish to take advantage of, we also provide a :code:`ray` `<https://github.com/ray-project/ray>` backend. The Ray backend enables parallel task execution for collaborators, and optionally allows users to request dedicated CPUs / GPUs for Participants by using the :code:`num_cpus` and :code:`num_gpus` arguments while instantiating the Participant, in the following manner:

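The tutorial snippet referenced above is collapsed in this view; as a hedged illustration of the pattern, reusing the callback from the earlier sketch (the keyword arguments below follow the OpenFL workflow tutorials and should be treated as assumptions):

.. code-block:: python

    # Each participant can request dedicated resources from Ray
    collaborator = Collaborator(
        name='Portland',
        num_cpus=4,        # dedicated CPU cores for this participant
        num_gpus=0.5,      # fraction of a GPU reserved for this participant
        private_attributes_callable=collaborator_private_attrs,
        index=0,
    )

    # The 'ray' backend enables parallel execution of collaborator tasks
    local_runtime = LocalRuntime(
        aggregator=aggregator,
        collaborators=[collaborator],
        backend='ray',
    )
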
@@ -428,13 +438,12 @@ Below is an example of how to set up and instantiate a :code:`FederatedRuntime`:
tls=False
)

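Only the tail of that constructor call is visible here; a hedged sketch of the full setup, with the argument names treated as assumptions drawn from the FederatedRuntime tutorials, looks roughly like this:

.. code-block:: python

    from openfl.experimental.workflow.runtime import FederatedRuntime

    # Assumed argument names; check them against the actual FederatedRuntime signature
    director_info = {
        'director_node_fqdn': 'localhost',
        'director_port': 50050,
    }

    federated_runtime = FederatedRuntime(
        collaborators=['envoy_one', 'envoy_two'],  # envoy names registered with the Director
        director=director_info,
        notebook_path='./my_experiment.ipynb',     # notebook exported as the workspace
        tls=False,
    )
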
To distribute the experiment on the Federation, we now need to assign the federated_runtime to the flow and execute it.
To distribute the experiment on the Federation, we simply pass the flow instance to the :code:`run()` method of the :code:`FederatedRuntime`:

.. code-block:: python

flow = FederatedFlow()
flow.runtime = federated_runtime
flow.run()
federated_runtime.run(flow)

This will export the Jupyter notebook to a workspace and deploy it to the federation. The Director receives the experiment, distributes it to the Envoys, and initiates the execution of the experiment.

15 changes: 9 additions & 6 deletions openfl-tutorials/experimental/workflow/101_MNIST.ipynb
@@ -224,7 +224,12 @@
"scrolled": true
},
"source": [
"Now we come to the flow definition. The OpenFL Workflow Interface adopts the conventions set by Metaflow, that every workflow begins with `start` and concludes with the `end` task. The aggregator begins with an optionally passed in model and optimizer. The aggregator begins the flow with the `start` task, where the list of collaborators is extracted from the runtime (`self.collaborators = self.runtime.collaborators`) and is then used as the list of participants to run the task listed in `self.next`, `aggregated_model_validation`. The model, optimizer, and anything that is not explicitly excluded from the next function will be passed from the `start` function on the aggregator to the `aggregated_model_validation` task on the collaborator. Where the tasks run is determined by the placement decorator that precedes each task definition (`@aggregator` or `@collaborator`). Once each of the collaborators (defined in the runtime) complete the `aggregated_model_validation` task, they pass their current state onto the `train` task, from `train` to `local_model_validation`, and then finally to `join` at the aggregator. It is in `join` that an average is taken of the model weights, and the next round can begin.\n",
"Now we come to the flow definition. The OpenFL Workflow Interface adopts the conventions set by Metaflow, that every workflow begins with `start` and concludes with the `end` task. Task placement (i.e. where the tasks run) is determined by the placement decorator that precedes each task definition (`@aggregator` or `@collaborator`)\n",
"\n",
"The aggregator begins the flow with `start` task and optionally passed in model and optimizer. The list of collaborators in federation (`self.collaborators`) is automatically populated by LocalRuntime infrastructure and is then used as the list of participants to run the task listed in `self.next`, `aggregated_model_validation`. The model, optimizer, and anything that is not explicitly excluded from the next function will be passed from the `start` function on the aggregator to the `aggregated_model_validation` task on the collaborator\n",
"\n",
"Once each of the collaborators (defined in the runtime) complete the `aggregated_model_validation` task, they pass their current state onto the `train` task, from `train` to `local_model_validation`, and then finally to `join` at the aggregator. It is in `join` that an average is taken of the model weights, and the next round can begin.\n",
"\n",
"\n",
"![image.png](attachment:image.png)"
]
@@ -252,9 +257,9 @@
" @aggregator\n",
" def start(self):\n",
" print(f'Performing initialization for model')\n",
" self.collaborators = self.runtime.collaborators\n",
" self.private = 10\n",
" self.current_round = 0\n",
" print(f'Collaborators participating in federation: {self.collaborators}')\n",
" self.next(self.aggregated_model_validation, foreach='collaborators', exclude=['private'])\n",
"\n",
" @collaborator\n",
@@ -382,8 +387,7 @@
"best_model = None\n",
"optimizer = None\n",
"flflow = FederatedFlow(model, optimizer, rounds=2, checkpoint=True)\n",
"flflow.runtime = local_runtime\n",
"flflow.run()"
"local_runtime.run(flflow)"
]
},
{
@@ -425,8 +429,7 @@
"outputs": [],
"source": [
"flflow2 = FederatedFlow(model=flflow.model, optimizer=flflow.optimizer, rounds=2, checkpoint=True)\n",
"flflow2.runtime = local_runtime\n",
"flflow2.run()"
"local_runtime.run(flflow2)"
]
},
{
Changes to an additional tutorial notebook (filename not shown in this view):
@@ -339,9 +339,8 @@
" \"\"\"\n",
" print(f\"Initializing Workflow .... \")\n",
"\n",
" self.collaborators = self.runtime.collaborators\n",
" self.current_round = 0\n",
"\n",
" print(f'Collaborators participating in federation: {self.collaborators}')\n",
" self.next(self.aggregated_model_validation, foreach=\"collaborators\")\n",
"\n",
" @collaborator\n",
@@ -521,8 +520,7 @@
"model = None\n",
"optimizer = None\n",
"flflow = FederatedFlow_TorchMNIST(model, optimizer, learning_rate, momentum, rounds=2, checkpoint=True)\n",
"flflow.runtime = local_runtime\n",
"flflow.run()"
"local_runtime.run(flflow)"
]
},
{
@@ -635,7 +633,7 @@
"id": "87c487cb",
"metadata": {},
"source": [
"Now that we have our distributed infrastructure ready, let us modify the flow runtime to `FederatedRuntime` instance and deploy the experiment. \n",
"Now that we have our distributed infrastructure ready, the experiment is deployed onto the federation by providing the same `flflow` instance to `FederatedRuntime`.\n",
"\n",
"Progress of the flow is available on \n",
"1. Jupyter notebook: if `checkpoint` attribute of the flow object is set to `True`\n",
@@ -650,8 +648,7 @@
"outputs": [],
"source": [
"flflow.results = [] # clear results from previous run\n",
"flflow.runtime = federated_runtime\n",
"flflow.run()"
"federated_runtime.run(flflow)"
]
},
{
Changes to an additional tutorial notebook (filename not shown in this view):
@@ -279,8 +279,7 @@
" This is the start of the Flow.\n",
" \"\"\"\n",
" print(\"<Agg>: Start of flow ... \")\n",
" self.collaborators = self.runtime.collaborators\n",
"\n",
" print(f'Collaborators participating in federation: {self.collaborators}')\n",
" self.next(self.watermark_pretrain)\n",
"\n",
" @aggregator\n",
@@ -558,8 +557,7 @@
" watermark_retrain_optimizer,\n",
" checkpoint=True,\n",
")\n",
"flflow.runtime = federated_runtime\n",
"flflow.run()"
"federated_runtime.run(flflow)"
]
}
],
14 changes: 2 additions & 12 deletions openfl/experimental/workflow/component/aggregator/aggregator.py
@@ -15,9 +15,7 @@
import dill

from openfl.experimental.workflow.interface import FLSpec
from openfl.experimental.workflow.runtime import FederatedRuntime
from openfl.experimental.workflow.utilities import aggregator_to_collaborator, checkpoint
from openfl.experimental.workflow.utilities.metaflow_utils import MetaflowInterface

logger = getLogger(__name__)

@@ -125,13 +123,7 @@ def __init__(

self.flow = flow
self.checkpoint = checkpoint
self.flow._foreach_methods = []
logger.info("MetaflowInterface creation.")
self.flow._metaflow_interface = MetaflowInterface(self.flow.__class__, "single_process")
self.flow._run_id = self.flow._metaflow_interface.create_run()
self.flow.runtime = FederatedRuntime()
self.name = "aggregator"
self.flow.runtime.collaborators = self.authorized_cols

self.__private_attrs_callable = private_attributes_callable
self.__private_attrs = private_attributes
@@ -200,10 +192,8 @@ async def run_flow(self) -> FLSpec:
"""
# Start function will be the first step of any flow
f_name = "start"
# Creating a clones from the flow object
FLSpec._reset_clones()
FLSpec._create_clones(self.flow, self.flow.runtime.collaborators)

# Initialize the flow state
self.flow.initialize_flow_state(self.authorized_cols)
logger.info(f"Starting round {self.current_round}...")
while True:
next_step = self.do_task(f_name)
Expand Down