Commit 43e1530
Showing 6 changed files with 299 additions and 191 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -38,3 +38,12 @@ about the new command, please refer to the [README](README.md).
- 🌟 Abstract classes for new models/dataloaders.
- 🌟 Allows Federated Learning with Personalization.
- Personalization allows you to leverage each client's local data to obtain models that are better adjusted to their own data distribution. You can run the `cv` task to try out this feature.


## [1.0.1] - 2023-07-29

🔋 This release removes the restriction on the minimum number of GPUs required by FLUTE,
allowing users to run experiments with a single-GPU worker by instantiating both the server
and the clients on the same device. For more documentation about how to run experiments
using a single GPU, please refer to the [README](README.md).
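
For reference, this is the single-GPU launch command introduced in the [README](README.md) by this release (quoted verbatim; see the README for the surrounding setup steps):

```
python -m torch.distributed.run --nproc_per_node=1 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend nccl
```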

12 changes: 10 additions & 2 deletions README.md
@@ -7,7 +7,7 @@ Welcome to FLUTE (Federated Learning Utilities for Testing and Experimentation),
FLUTE is a PyTorch-based orchestration environment enabling GPU- or CPU-based FL simulations. The primary goal of FLUTE is to enable researchers to rapidly prototype and validate their ideas. Features include:

- large scale simulation (millions of clients, sampling tens of thousands per round)
- multi-GPU and multi-node orchestration
- single/multi GPU and multi-node orchestration
- local or global differential privacy
- model quantization
- a variety of standard optimizers and aggregation methods
@@ -74,11 +74,19 @@ FLUTE uses torch.distributed API as its main communication backbone, supporting

After this initial setup, you can use the data created for the integration test for a first local run. Note that this data needs to be downloaded manually inside the `testing` folder; for more instructions, please look at [the README file inside `testing`](testing/README.md).

For single-GPU runs:

```
python -m torch.distributed.run --nproc_per_node=1 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend nccl
```

For multi-GPU runs (3 GPUs):

```
python -m torch.distributed.run --nproc_per_node=3 e2e_trainer.py -dataPath ./testing -outputPath scratch -config testing/hello_world_nlg_gru.yaml -task nlg_gru -backend nccl
```

This config uses 1 node with 3 workers (1 server, 2 clients). The config file `testing/hello_world_nlg_gru.yaml` has some comments explaining the major sections and some important details; essentially, it consists of a very short experiment in which a couple of iterations are run for just a few clients. A `scratch` folder will be created containing detailed logs.
The config file `testing/hello_world_nlg_gru.yaml` has some comments explaining the major sections and some important details; essentially, it consists of a very short experiment in which a couple of iterations are run for just a few clients. A `scratch` folder will be created containing detailed logs.
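
As a rough sketch of the kind of sections such a config contains (illustrative only; the key names below are assumptions based on typical FLUTE configs, and `testing/hello_world_nlg_gru.yaml` itself is the authoritative reference):

```
# Illustrative skeleton only -- see testing/hello_world_nlg_gru.yaml for the real keys and values.
model_config:     # which model to build and its hyperparameters
  ...
server_config:    # orchestration: number of rounds, clients sampled per round, aggregation
  max_iteration: 2               # "a couple of iterations"
  num_clients_per_iteration: 10  # "just a few clients"
  ...
client_config:    # local training settings (optimizer, data loading) for each client
  ...
```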

## Documentation

7 changes: 4 additions & 3 deletions core/evaluation.py
@@ -20,7 +20,7 @@

class Evaluation():

def __init__(self, config, model_path, process_testvalidate, idx_val_clients, idx_test_clients):
def __init__(self, config, model_path, process_testvalidate, idx_val_clients, idx_test_clients, single_worker):

self.config = config
self.model_path = model_path
@@ -29,6 +29,7 @@ def __init__(self, config, model_path, process_testvalidate, idx_val_clients, id
self.idx_val_clients = idx_val_clients
self.idx_test_clients = idx_test_clients
self.send_dicts = config['server_config'].get('send_dicts', False)
self.single_worker = single_worker
super().__init__()

def run(self, eval_list, req, metric_logger=None):
@@ -155,7 +156,7 @@ def run_distributed_evaluation(self, mode, clients, model):
total = 0
self.logits = {'predictions': [], 'probabilities': [], 'labels': []}
server_data = (0.0, model, 0)
for result in self.process_testvalidate(clients, server_data, mode):
for result in self.process_testvalidate(clients, server_data, mode, self.single_worker):
output, metrics, count = result
val_metrics = {key: {'value':0, 'higher_is_better': False} for key in metrics.keys()} if total == 0 else val_metrics

@@ -190,7 +191,7 @@ def make_eval_clients(dataset, config):
'''

total = sum(dataset.num_samples)
clients = federated.size() - 1
clients = federated.size() - 1 if federated.size() > 1 else federated.size()
delta = total / clients + 1
threshold = delta
current_users_idxs = list()
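
As a simplified sketch of what the changed line in `make_eval_clients` does (assuming `federated.size()` reports the torch.distributed world size, i.e. the total number of workers):

```
# Illustrative sketch, not FLUTE code: the client-count logic added above.
def num_eval_workers(world_size: int) -> int:
    # With several workers, one is reserved for the server, so evaluation
    # clients are spread over world_size - 1 workers. With a single worker
    # (the new single-GPU mode), the same device hosts both the server and
    # the clients, so that one worker is used.
    return world_size - 1 if world_size > 1 else world_size

assert num_eval_workers(3) == 2   # 1 server + 2 client workers
assert num_eval_workers(1) == 1   # single-GPU: server and clients share one worker
```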
