[BUG] How to improve the training perfomance in MLX compare to pytorch and keras ? #1542

thegodone · 2024-10-30T07:06:11Z

Describe the bug
I have a major issue that I have seen in lot of the cases on other trial. The MLX training gives rarely a good performance while for torch and keras it is more stable and better. This is really a bottleneck to use MLX, as you need to train 10 to 20 time your model to get a good result while torch and keras are systematically in a good range (rmse : 0.50-0.55).
important: models (tf/keras, torch and mlx) have the same number of trainable parameters, and we use the same train, val and test split for the 3 methods).

To Reproduce

run several time the following code the best result is jumping out of the pytorch and tf/keras results
https://github.com/thegodone/apple_ai_model/blob/main/AttFP_mlx_faster.ipynb
https://github.com/thegodone/apple_ai_model/blob/main/AttFP_torch.ipynb
https://github.com/thegodone/apple_ai_model/blob/main/AttFP_tf.ipynb

Expected behavior
I don't know if it is weights initialization or optimizer that can cause this huge difference between the 3 packages.

Desktop (please complete the following information):
see #1531

thegodone · 2024-11-08T11:24:04Z

Can it be related to the #1153 (comment) comment ?

import math
from typing import Union, List, Callable
import mlx.core as mx

class Adam(Optimizer):
    def __init__(self, learning_rate: Union[float, Callable], betas: List[float] = [0.9, 0.999], eps: float = 1e-8):
        super().__init__()
        self._maybe_schedule("learning_rate", learning_rate)
        self.betas = betas
        self.eps = eps

    def init_single(self, parameter: mx.array, state: dict):
        state["m"] = mx.zeros_like(parameter)  # Initialize momentum
        state["v"] = mx.zeros_like(parameter)  # Initialize velocity

    def apply_single(self, gradient: mx.array, parameter: mx.array, state: dict):
        lr = self.learning_rate.astype(gradient.dtype)
        beta1, beta2 = self.betas
        eps = self.eps

        # Update biased first moment estimate
        state["m"] = beta1 * state["m"] + (1 - beta1) * gradient
        # Update biased second moment estimate
        state["v"] = beta2 * state["v"] + (1 - beta2) * mx.square(gradient)

        # Bias-corrected estimates (optional depending on application)
        m_hat = state["m"] / (1 - beta1)
        v_hat = state["v"] / (1 - beta2)

        # Parameter update
        return parameter - lr * m_hat / (mx.sqrt(v_hat) + eps)

class AdamW(Adam):
    def __init__(self, learning_rate: Union[float, Callable], betas: List[float] = [0.9, 0.999], eps: float = 1e-8, weight_decay: float = 0.01):
        super().__init__(learning_rate=learning_rate, betas=betas, eps=eps)
        self.weight_decay = weight_decay

    def apply_single(self, gradient: mx.array, parameter: mx.array, state: dict):
        lr = self.learning_rate.astype(gradient.dtype)
        # Apply weight decay before applying Adam update
        parameter = parameter * (1 - lr * self.weight_decay)
        # Call the parent Adam's apply_single() for the core Adam update
        return super().apply_single(gradient, parameter, state)

awni · 2024-11-09T04:23:49Z

Did you try the suggestion in Difference in training convergence between PyTorch & MLX #1153 to use a linear warmup?
Did you try the bias corrected Adam to see if it improves the convergence rate?

If you can try either of those and report back, that would be useful to know.

Is it possible to add a parameter for Adam optimizer to be strictly identical to pytorch / tensorflow ?

It's possible..would be good to know if it fixes your issue first though.

thegodone · 2024-11-09T07:12:13Z

here the differences between the 3 trainings I used now the bias corrected AdamW in mlx:

train_loss is also different I need to check that
LR are not identical that may explain the difference

one remark, now we have similar speed between torch and mlx

awni · 2024-11-09T14:10:28Z

The 0 training loss in MLX seems incorrect particularly given the training MSE seems reasonable. I would double check you are averaging the loss in MLX correctly.

Otherwise it mostly looks reasonable.. fine tuning learning rates, warmups, initializations etc could all help.

thegodone · 2024-11-09T17:20:20Z

I will have a synchronize LR scheduler tomorrow to be sure this is not the part that affect the model deviation. Yes I will look at the loss that is strange.

thegodone · 2024-11-10T07:21:06Z

ok I fix the loss and now I have this type of results interestingly pytorch is more efficient than mlx or tensorflow.

thegodone · 2024-11-10T09:01:15Z

I use this LR scheduler now in mlx, one potential issue is that in pytorch/tensorflow the LR scheduler is per epoch while in mlx it is per step, is it possible to have an epoch equivalent ?

def cosineannealingwarmrestartfactor_(initial_lr, restart, decay_steps, warmup_factor, Tmin):
    schedules = []
    boundaries = []  # Boundaries should be one less than schedules
    base_lr = initial_lr
    schedules.append(optim.cosine_decay(initial_lr, decay_steps, Tmin))
    for i in range(restart-1):        
        Tmin *= warmup_factor
        initial_lr*=warmup_factor

        schedules.append(optim.cosine_decay(initial_lr, decay_steps, Tmin))
        boundaries.append(decay_steps*(i+1))
        
    lr_schedule = optim.join_schedules(schedules, boundaries)
    return lr_schedule

thegodone · 2024-11-10T09:17:18Z

this is the mlx version with "official Adam without bias correction": clearly not performing as the bias correction used in previous posts.

thegodone · 2024-11-12T06:42:49Z

Is there a way to fix a seed for mlx similar to the torch.manual_seed(int) ?

awni · 2024-11-12T14:20:49Z

mx.random.seed

thegodone · 2024-11-12T16:30:13Z

Yes, was expected that was the case.The good accuracy on torch is due to a specific seed… so we can close that issue.Envoyé de mon iPhoneLe 12 nov. 2024 à 15:21, Awni Hannun ***@***.***> a écrit : mx.random.seed —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: ***@***.***>

thegodone mentioned this issue Oct 31, 2024

[FEATURE] benchmark, this code runs more than 3x slower compare to pytorch #1531

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] How to improve the training perfomance in MLX compare to pytorch and keras ? #1542

[BUG] How to improve the training perfomance in MLX compare to pytorch and keras ? #1542

thegodone commented Oct 30, 2024 •

edited

Loading

thegodone commented Nov 8, 2024 •

edited

Loading

awni commented Nov 9, 2024

thegodone commented Nov 9, 2024 •

edited

Loading

awni commented Nov 9, 2024 •

edited

Loading

thegodone commented Nov 9, 2024 •

edited

Loading

thegodone commented Nov 10, 2024 •

edited

Loading

thegodone commented Nov 10, 2024

thegodone commented Nov 10, 2024

thegodone commented Nov 12, 2024

awni commented Nov 12, 2024

thegodone commented Nov 12, 2024 via email

[BUG] How to improve the training perfomance in MLX compare to pytorch and keras ? #1542

[BUG] How to improve the training perfomance in MLX compare to pytorch and keras ? #1542

Comments

thegodone commented Oct 30, 2024 • edited Loading

thegodone commented Nov 8, 2024 • edited Loading

awni commented Nov 9, 2024

thegodone commented Nov 9, 2024 • edited Loading

awni commented Nov 9, 2024 • edited Loading

thegodone commented Nov 9, 2024 • edited Loading

thegodone commented Nov 10, 2024 • edited Loading

thegodone commented Nov 10, 2024

thegodone commented Nov 10, 2024

thegodone commented Nov 12, 2024

awni commented Nov 12, 2024

thegodone commented Nov 12, 2024 via email

thegodone commented Oct 30, 2024 •

edited

Loading

thegodone commented Nov 8, 2024 •

edited

Loading

thegodone commented Nov 9, 2024 •

edited

Loading

awni commented Nov 9, 2024 •

edited

Loading

thegodone commented Nov 9, 2024 •

edited

Loading

thegodone commented Nov 10, 2024 •

edited

Loading