Add multi-worker support for JAX training. #18654
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master   #18654      +/-   ##
==========================================
- Coverage   78.57%   78.51%   -0.06%
==========================================
  Files         335      335
  Lines       32979    33020      +41
  Branches     6455     6467      +12
==========================================
+ Hits        25913    25927      +14
- Misses       5510     5532      +22
- Partials     1556     1561       +5
Thanks for the PR!
@@ -7,7 +7,7 @@
 class TFDatasetAdapter(DataAdapter):
     """Adapter that handles `tf.data.Dataset`."""

-    def __init__(self, dataset, class_weight=None):
+    def __init__(self, dataset, class_weight=None, distribution=None):
Please add an `Args:` section to document the type of each argument (to avoid confusion with `tf.distribute`).
Done.
):
    raise ValueError(
        "Only `tf.data.Dataset` is supported for multi worker "
        f"distribution, received input types is {type(x)}"
When using multi-worker distribution, the data must be provided
as a `tf.data.Dataset` instance. Received: type(x)={type(x)}
Done.
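For illustration, the resolved check could be sketched as follows. This is a hedged, pure-Python sketch, not the PR's actual code: `DatasetStub` stands in for `tf.data.Dataset` so the snippet runs without TensorFlow, and the helper name `validate_multi_worker_input` is hypothetical.

```python
class DatasetStub:
    """Stand-in for `tf.data.Dataset`, so this sketch runs without TensorFlow."""


def validate_multi_worker_input(x, distribution=None):
    # Sketch of the check under review: when a multi-worker distribution
    # is active, only a dataset instance is accepted as training input.
    if distribution is not None and not isinstance(x, DatasetStub):
        raise ValueError(
            "When using multi-worker distribution, the data must be "
            "provided as a `tf.data.Dataset` instance. "
            f"Received: type(x)={type(x)}"
        )


# A plain list is rejected once a distribution is set; a dataset passes.
try:
    validate_multi_worker_input([1, 2, 3], distribution=object())
except ValueError as e:
    print(e)
validate_multi_worker_input(DatasetStub(), distribution=object())
```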
    inputs: `jax.Array` that is already sharded to a local process size.
    layout: `TensorLayout` for the distribution information, or a
        `jax.sharding.Sharding` instance.

Returns:
Add a blank line above `Returns:`.
Done.
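As a rough sketch of the "already sharded to a local process size" contract from the docstring above, assuming a simple contiguous per-process split (the helper name `local_slice` is hypothetical, and a plain list stands in for a `jax.Array`):

```python
def local_slice(global_batch, process_index, process_count):
    # Each process holds a contiguous, equally sized slice of the global
    # batch; `global_batch` stands in for data that would be a `jax.Array`.
    per_process = len(global_batch) // process_count
    start = process_index * per_process
    return global_batch[start:start + per_process]


# With 8 samples and 4 processes, process 1 holds samples 2-3:
print(local_slice(list(range(8)), 1, 4))  # [2, 3]
```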
).prefetch(tf.data.AUTOTUNE)
batch_size = tf_data_distribute.compute_batch_size(dataset)
if batch_size.numpy() < 0:
    raise ValueError(
In what cases does this happen? An unbatched dataset? The error message should make explicit what user action is required (e.g. calling `.batch()`).
Yes, most likely because the dataset is not batched. Updated the error message.
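To illustrate the requested change, the unbatched-dataset guard could look like the sketch below. `compute_batch_size_stub` is a hypothetical pure-Python stand-in for `tf_data_distribute.compute_batch_size`, which per the thread above reports a negative value when the dataset has no static outer batch dimension; the error message is an assumed example of spelling out the user action.

```python
def compute_batch_size_stub(is_batched, batch_size=None):
    # Stand-in for `tf_data_distribute.compute_batch_size`: returns a
    # negative value when `.batch()` was never called on the dataset.
    return batch_size if is_batched else -1


def resolve_batch_size(is_batched, batch_size=None):
    bs = compute_batch_size_stub(is_batched, batch_size)
    if bs < 0:
        # Per the review, the message spells out the required user action.
        raise ValueError(
            "The batch size of the input dataset could not be determined. "
            "Make sure the dataset is batched, e.g. by calling "
            "`dataset.batch(batch_size)` before passing it to `fit()`."
        )
    return bs


print(resolve_batch_size(True, 32))  # 32
```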
LGTM
I have tried to add some backend-agnostic tests to the dataset builder; the JAX-specific multi-worker test will be hosted internally.