Welcome to the official repository of GoLU
GoLU is a novel self-gated activation function that enhances neural network performance by using the Gompertz function to gate the input. The Gompertz function is an S-shaped curve similar to the Gaussian CDF or the Sigmoid, but with a key qualitative difference: it is subtly asymmetric, whereas the Gaussian CDF and the Sigmoid are symmetric about a central point. This is because the Gompertz function is the CDF of the standard Gumbel distribution, which is itself asymmetric. The asymmetry gives the Gompertz gate a rightward bias and a smaller slope at the origin, which effectively reduces noise and variance in the latent representations produced by GoLU.
GoLU is defined as GoLU(x) = x · Gompertz(x), where Gompertz(x) = exp(-exp(-x)) is the CDF of the standard Gumbel distribution. The implementation additionally exposes alpha, beta and gamma parameters that generalize the gate to alpha · exp(-beta · exp(-gamma · x)) (see update_golu_parameters below).
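For reference, here is a minimal pure-PyTorch sketch of this gating (the repository's GoLUCUDA module is the optimized implementation; the defaults alpha = beta = gamma = 1 assumed here recover the standard Gumbel CDF):

```python
import torch

def golu_reference(x: torch.Tensor, alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> torch.Tensor:
    """Reference (non-optimized) GoLU: the input gated by the Gompertz function.

    With alpha = beta = gamma = 1 the gate is exp(-exp(-x)),
    the CDF of the standard Gumbel distribution.
    """
    return x * alpha * torch.exp(-beta * torch.exp(-gamma * x))
```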
"GoLU can be seen as a new standard in activation functions, pushing deep learning performance beyond existing benchmarks!"
Unlike ReLU, GELU, Swish or Mish, GoLU offers:
✅ Reduced activation variance for better feature representation
GoLU exhibits a profile that remains close to the x-axis across much of the input range, indicating a lower-magnitude slope, especially near the origin. This property reduces sensitivity to input variations, minimizing noise and variance in latent representations. As a result, the activation produces smoother outputs, improving the model’s ability to distinguish between strong and weak features. Additionally, GoLU demonstrates a squeezing effect, compressing activation values into a smaller range and further reducing variance compared to other gated activations like GELU and Swish. This effect is clearly visible in the image generated via DALL-E 3, which is passed through a 3x3 2D convolution and 2D batch normalization and finally through the different activation functions. We then plot the distribution of the neuron values in the latent representation to see the effect of GoLU relative to the other activations.
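A rough sketch of this kind of comparison (not the exact script behind the figure; a random tensor stands in for the DALL-E 3 image, and a reference GoLU gate is used instead of the CUDA kernel):

```python
import torch
import torch.nn as nn

# Toy stand-in for the image: any RGB tensor of shape (1, 3, H, W) works.
x = torch.randn(1, 3, 64, 64)

# 3x3 2D convolution followed by 2D batch normalization, as described above.
stem = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16))

activations = {
    "ReLU": nn.ReLU(),
    "GELU": nn.GELU(),
    "Swish": nn.SiLU(),
    "GoLU (reference)": lambda t: t * torch.exp(-torch.exp(-t)),
}

with torch.no_grad():
    z = stem(x)
    for name, act in activations.items():
        out = act(z)
        # Plotting a histogram of out.flatten() reproduces the distribution view.
        print(f"{name:18s} latent variance: {out.var().item():.4f}")
```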
✅ Smoother loss landscape encouraging flatter minima
GoLU’s smaller gradients contribute to a smoother loss landscape, helping the optimizer avoid sharp variations in parameter space and converge to flatter minima. This property enhances robustness to small perturbations in model parameters, improving generalization. When noise is added to the learned model weights, ResNet-20 with GoLU exhibits a less spiked and more stable loss landscape compared to other activations, suggesting improved resilience to noise. In contrast, ReLU’s non-smooth nature results in a more erratic and highly spiked loss landscape, potentially leading to poorer generalization.
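The weight-perturbation experiment behind this observation can be sketched as follows (a simplified version only; model, loss_fn, inputs and targets are assumed to come from your own training setup):

```python
import copy
import torch

def loss_under_weight_noise(model, loss_fn, inputs, targets, sigma=0.01, n_samples=10):
    """Average loss after adding Gaussian noise of scale sigma to every parameter.

    A flatter minimum shows a smaller loss increase as sigma grows.
    """
    losses = []
    for _ in range(n_samples):
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(sigma * torch.randn_like(p))
            losses.append(loss_fn(noisy(inputs), targets).item())
    return sum(losses) / len(losses)
```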
✅ Spread weight distribution implying implicit regularization and improved generalization
GoLU leads to a broader distribution of learned weights, particularly around the peak, suggesting that networks trained with this activation capture more diverse transformations. This counterbalances the reduced variance in activation outputs, ensuring representational diversity. The broader spread of weights results from more uniform gradients, which encourage a wider distribution while avoiding reliance on extreme parameter values. These findings indicate that GoLU enhances feature differentiation while maintaining balanced weight distribution.
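A quick way to look at the learned weight spread of a trained checkpoint is a simple histogram (a hedged sketch; model is assumed to be any trained network, and the repository's visualization scripts produce the published plots):

```python
import torch
import matplotlib.pyplot as plt

# 'model' is a trained network, e.g. a ResNet trained with GoLU or with ReLU.
weights = torch.cat([p.detach().flatten().cpu() for p in model.parameters() if p.dim() > 1])
plt.hist(weights.numpy(), bins=200, density=True)
plt.xlabel("weight value")
plt.ylabel("density")
plt.title("Distribution of learned weights")
plt.show()
```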
✅ Strong performance across diverse Deep Learning tasks - outperforms ReLU, GELU, Swish and Mish
Below you can find our training results across a wide variety of tasks involving image classification, language modeling, semantic segmentation, object detection, instance segmentation and diffusion.
Note - no single activation is guaranteed to perform best across all possible architectures, datasets and hyperparameter settings. We generally observe that at optimal hyperparameter settings GoLU performs best. Some architectures like ResNet and Vision Transformer already ship with well-tuned, ready-to-use hyperparameter settings, and with those defaults GoLU performs better out of the box. In some cases, such as semantic segmentation with DeepLabV3 trained on MS-COCO and the DDPM model trained on CelebA, we find that the default pipelines are suboptimal; for example, simple learning rate ablations over these tasks and datasets already yield better performance.
Click to view results
Architecture | Dataset | Metric | ReLU | GELU | Swish | Mish | GoLU |
---|---|---|---|---|---|---|---|
ResNet-18 | ImageNet-1K | Top-1 Acc | 69.74±0.07 | 70.66±0.05 | 70.60±0.06 | 70.53±0.06 | 70.76±0.06 |
ResNet-34 | ImageNet-1K | Top-1 Acc | 73.26±0.01 | 73.44±0.04 | 72.74±0.05 | 72.73±0.07 | 73.71±0.04 |
ResNet-50 | ImageNet-1K | Top-1 Acc | 75.44±0.07 | 76.07±0.06 | 75.17±0.14 | 75.53±0.09 | 76.63±0.03 |
WideResNet-50-2 | ImageNet-1K | Top-1 Acc | 76.96±0.07 | 76.72±0.01 | 75.41±0.03 | 75.75±0.19 | 77.37±0.03 |
DenseNet-121 | ImageNet-1K | Top-1 Acc | 74.95±0.09 | 74.64±0.11 | 72.81±0.06 | 72.97±0.10 | 75.25±0.03 |
EfficientNet-B0 | ImageNet-1K | Top-1 Acc | 76.52±0.07 | 76.90±0.01 | 76.84±0.02 | 76.76±0.06 | 76.86±0.04 |
TinyViT | ImageNet-1K | Top-1 Acc | 82.91±0.02 | 83.05±0.03 | 82.92±0.06 | 83.01±0.02 | 83.21±0.02 |
ViT-B/32 | ImageNet-1K | Top-1 Acc | 74.51±0.04 | 75.48±0.05 | 72.31±2.15 | 75.16±0.07 | 75.74±0.09 |
ViT-B/16 | ImageNet-1K | Top-1 Acc | 80.06±0.05 | 79.39±0.99 | 79.19±0.94 | 77.97±1.95 | 80.72±0.04 |
babyGPT | TinyStories | Perplexity | 4.519±0.006 | 4.462±0.005 | 4.535±0.004 | 4.539±0.007 | 4.444±0.005 |
babyGPT | TinyStories | Token Acc | 61.243±0.030 | 61.465±0.034 | 61.178±0.032 | 61.135±0.036 | 61.545±0.029 |
GPT2-S | OpenWebText | Perplexity | 17.845±0.078 | 17.525±0.015 | 17.785±0.026 | 17.797±0.086 | 17.297±0.023 |
GPT2-S | OpenWebText | Token Acc | 44.059±0.079 | 44.262±0.042 | 44.155±0.025 | 44.104±0.081 | 44.413±0.023 |
DeepLabV3-RN50 (LR=0.01) | MS-COCO | mIoU | 65.11±0.326 | 65.59±0.162 | 64.14±0.135 | 64.40±0.144 | 65.98±0.124 |
Faster R-CNN-FPN-RN50 | MS-COCO | Box mAP | 37.44±0.146 | 38.16±0.044 | 37.28±0.078 | 37.71±0.087 | 38.31±0.058 |
RetinaNet-FPN-RN50 | MS-COCO | Box mAP | 39.90±0.063 | 40.68±0.090 | 40.27±0.087 | 40.45±0.093 | 40.77±0.065 |
Mask R-CNN-FPN-RN50 | MS-COCO | Box mAP | 38.33±0.001 | 39.00±0.001 | 38.19±0.002 | 38.76±0.000 | 38.96±0.001 |
Mask R-CNN-FPN-RN50 | MS-COCO | Mask mAP | 34.19±0.001 | 34.73±0.000 | 33.99±0.001 | 34.70±0.000 | 34.54±0.001 |
DDPM (LR=0.001) | CelebA | Loss | 0.01928±0.0004 | 0.01902±0.0004 | 0.01900±0.0004 | 0.01906±0.0004 | 0.01895±0.0004 |
These results can change with re-training; however, the relative performance ranking of the activations should not.
✅ CUDA-optimized Kernel for comparable training and inference speed to PyTorch activations
We build on the CUDA kernels provided in PyTorch to create our own CUDA-optimized kernel for GoLU. This enables an apples-to-apples comparison with the other activations that already ship with CUDA-optimized kernels. We aim to deliver not only a novel activation to the community, but also a well-engineered API whose training and inference speed is on par with existing activation functions.
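A rough micro-benchmark of the kernel against a built-in PyTorch activation might look as follows (timings depend heavily on the GPU and tensor shape; this is only a sketch, not the benchmark used for the table below):

```python
import time
import torch
import torch.nn as nn
from golu.golu_cuda_activation import GoLUCUDA

def time_activation(act, x, iters=100):
    # Warm up, then time forward passes with explicit CUDA synchronization.
    for _ in range(10):
        act(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        act(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

x = torch.randn(64, 1024, 1024, device="cuda")
print("GELU:", time_activation(nn.GELU().to("cuda"), x))
print("GoLU:", time_activation(GoLUCUDA().to("cuda"), x))
```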
Click to view kernel speed
Architecture | Dataset | Baseline Activation | Relative Training Time | Relative Inference Time |
---|---|---|---|---|
ResNet-18 | ImageNet-1k | ReLU | 1.00x | 1.00x |
ResNet-34 | ImageNet-1k | ReLU | 1.01x | 1.00x |
ResNet-50 | ImageNet-1k | ReLU | 1.01x | 1.01x |
WideResNet-50-2 | ImageNet-1k | ReLU | 1.03x | 1.02x |
DenseNet121 | ImageNet-1k | ReLU | 1.02x | 1.02x |
EfficientNet-B0 | ImageNet-1k | Swish | 1.00x | 1.00x |
TinyViT | ImageNet-1k | GELU | 0.99x | 0.98x |
ViT-B/32 | ImageNet-1k | GELU | 0.99x | 0.99x |
ViT-B/16 | ImageNet-1k | GELU | 0.98x | 0.98x |
babyGPT | TinyStories | GELU | 1.00x | 1.00x |
GPT2-S | OpenWebText | GELU | 1.01x | 1.01x |
DeepLabV3-RN50 | MS-COCO | ReLU | 1.14x | 1.04x |
Faster R-CNN-FPN-RN50 | MS-COCO | ReLU | 1.03x | 1.00x |
RetinaNet-FPN-RN50 | MS-COCO | ReLU | 1.00x | 1.00x |
Mask R-CNN-FPN-RN50 | MS-COCO | ReLU | 1.05x | 1.02x |
DDPM | CelebA | Swish | 0.97x | 0.97x |
Average | - | - | 1.01x | 1.00x |
Follow these steps to set up GoLU.
Ensure that Conda is installed on your system. If not, you can install it by running:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash ~/Miniconda3-latest-Linux-x86_64.sh
source ~/.bashrc
Next, clone the GoLU repository:
git clone <fill_this_later>
cd GoLU
Finally, install all the required packages:
conda create -n golu_env python=3.10 -y
conda activate golu_env
pip install -r requirements.txt
In case torch, torchvision and torchaudio don't install via the requirements.txt file, use the following command:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
If your Linux system already has the GCC compiler, you can skip this step. GCC is needed to compile the CUDA kernel before use. To install GCC, you can use the following command:
conda install -c conda-forge gcc=9 gxx=9
Set the following environment variables:
export CC=$(which gcc)
export CXX=$(which g++)
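Once the environment is set up, a quick sanity check from the repository root confirms that CUDA is visible and the kernel compiles (importing the module triggers the on-the-fly compilation described below):

```python
import torch
from golu.golu_cuda_activation import GoLUCUDA  # compiles the CUDA kernel on the fly

assert torch.cuda.is_available(), "A CUDA-capable GPU is required for the GoLU kernel"
x = torch.randn(4, device="cuda")
print(GoLUCUDA()(x))
```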
Using GoLU is simple! You can import and use it directly in your PyTorch model.
We simply compile the CUDA kernel on the fly while running any script that uses it.
from torch.utils.cpp_extension import load

# Exists in golu_cuda_activation.py
golu_extension = load(
    name="golu_extension",
    sources=["./golu/golu_cuda_kernel.cu", "./golu/golu_cuda.cpp"],
    extra_cflags=["-O3"],
    extra_cuda_cflags=["--use_fast_math", "--extended-lambda"],
    verbose=True
)
from golu.golu_cuda_activation import GoLUCUDA
activation = GoLUCUDA()
import torch
import torch.nn as nn
from golu.golu_cuda_activation import GoLUCUDA

class SampleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 128)
        self.activation = GoLUCUDA()

    def forward(self, x):
        x = self.fc(x)
        x = self.activation(x)
        return x

# Example usage
model = SampleModel()
model.to('cuda')
x = torch.randn(1, 128).to('cuda')
y = model(x)
print(y)
You can also fetch any activation function dynamically by passing its name as a string:
from golu.activation_utils import get_activation_function
# Check get_activation_function() for available activation functions
activation_function = get_activation_function(activation='GoLUCUDA')
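Assuming the helper returns a ready-to-use module instance (if it returns a class instead, instantiate it first), the fetched activation can be applied like any other torch.nn activation:

```python
import torch
from golu.activation_utils import get_activation_function

# Assumption: get_activation_function returns an instantiated activation module.
activation_function = get_activation_function(activation='GoLUCUDA')
x = torch.randn(4, 128, device='cuda')
print(activation_function(x))
```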
If you need to replace an existing activation in a model:
import torch.nn as nn
import torchvision
from golu.activation_utils import replace_activation_by_torch_module

# replace_activation_by_torch_module() recursively replaces any instance of nn.ReLU in the model with the activation of choice.
# 'activation' is the replacement activation, e.g. the GoLUCUDA() instance created above.
model = torchvision.models.get_model("resnet50", weights=None, num_classes=1000)
model = replace_activation_by_torch_module(model, nn.ReLU, activation)
model.to('cuda')
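A quick way to confirm that the replacement took effect is to count the GoLUCUDA modules in the resulting model:

```python
from golu.golu_cuda_activation import GoLUCUDA

num_golu = sum(isinstance(m, GoLUCUDA) for m in model.modules())
print(f"GoLUCUDA modules in the model: {num_golu}")
```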
You can update the GoLU activation parameters dynamically:
from golu.activation_utils import update_golu_parameters
model = update_golu_parameters(model, new_alpha=0.8, new_beta=1.2, new_gamma=0.9)
However, please do not set these parameters to negative values, as that can cause the Gompertz function to lose its characteristic S-shape. Also, this update only works when the model contains at least one GoLUCUDA instance.
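To double-check an update, you can inspect the GoLU modules afterwards. The attribute names below (alpha, beta, gamma) are assumed from the update_golu_parameters signature and may differ in the actual implementation:

```python
from golu.golu_cuda_activation import GoLUCUDA

for module in model.modules():
    if isinstance(module, GoLUCUDA):
        # Assumed attribute names - adjust if the implementation stores them differently.
        print(module.alpha, module.beta, module.gamma)
```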
Below you can find the commands for training and plotting. Note - the commands are mostly simple and easy to alter. Therefore, anybody willing to contribute better pipelines for each of these tasks is free to do so. You can play around with the command line arguments and do ablations of your choice that could lead to improved results.
Click to view
Note - for all our trainings we store results in a "./results" folder, and many of our scripts have this path hard-coded, i.e. $RESULTS_PATH equals "./results". If you get path errors, either change the hard-coded path in the code to the one you've used, or simply stick to "./results".
Image Classification
ImageNet-1K
Setting up the dataset
To set up the dataset, you first need to register here. After registering successfully, download the following dataset files.
Go to this link and download the Training images (Task 1 & 2, 138 GB), Validation images (all tasks, 6.3 GB) and Development kit (Task 1 & 2, 2.5 MB) files. Do not unzip them manually; the ImageNet class from TorchVision does that for you. When unzipping, launch only one task with multiple CPU workers to expedite the extraction. You can use the following command to prepare the dataset for training.
Remember to change $NUM_WORKERS and $DATA_PATH. The zip/tar files must be present in the "$DATA_PATH/imagenet_1k" directory for the script to unzip the data.
srun --ntasks=1 --cpus-per-task=$NUM_WORKERS python -m tasks.image_classification.setup_dataset --dataset_path $DATA_PATH
ResNets 18, 34 and 50
Change --nproc_per_node=4 depending on the number of available GPUs. Further, change $NUM_WORKERS, $RESULTS_PATH, $DATA_PATH, $ACTIVATION and $SEED accordingly. $MODEL_NAME can be either "resnet18", "resnet34" or "resnet50". Here, --batch-size is the overall batch size; the code splits it across GPUs (world_size) and the --gradient-accumulation-steps passed in the arguments.
torchrun --standalone --nproc_per_node=4 -m tasks.image_classification.train --workers $NUM_WORKERS --output-dir $RESULTS_PATH --data-path "$DATA_PATH/imagenet_1k" --model $MODEL_NAME --activation $ACTIVATION --seed $SEED --sync-bn
WideResNet-50-2 and DenseNet-121
Change --nproc_per_node=4 depending on the number of available GPUs. Further, change $NUM_WORKERS, $RESULTS_PATH, $DATA_PATH, $ACTIVATION and $SEED accordingly. $MODEL_NAME can be either "wide_resnet50_2" or "densenet121". Here, --batch-size is the overall batch size; the code splits it across GPUs (world_size) and the --gradient-accumulation-steps passed in the arguments.
torchrun --standalone --nproc_per_node=4 -m tasks.image_classification.train --workers $NUM_WORKERS --output-dir $RESULTS_PATH --data-path "$DATA_PATH/imagenet_1k" --model $MODEL_NAME --activation $ACTIVATION --seed $SEED --opt "sgd_nesterov" --sync-bn
EfficientNet-B0
Change --nproc_per_node=4 depending on the number of available GPUs. Further, change $NUM_WORKERS, $RESULTS_PATH, $DATA_PATH, $ACTIVATION and $SEED accordingly. Here, --batch-size is the per-GPU batch size.
torchrun --standalone --nproc_per_node=4 -m tasks.image_classification.timm_training_script --output $RESULTS_PATH --dataset "torch/imagenet" --data-dir "$DATA_PATH/imagenet_1k" --seed $SEED --activation $ACTIVATION --model "efficientnet_b0" --batch-size 384 --sched "step" --epochs 450 --decay-epochs 2.4 --decay-rate 0.97 --opt "rmsproptf" --opt-eps 0.001 --workers $NUM_WORKERS --warmup-lr 1e-6 --weight-decay 1e-5 --drop 0.2 --drop-connect 0.2 --model-ema --model-ema-decay 0.9999 --aa "rand-m9-mstd0.5" --remode "pixel" --reprob 0.2 --amp --lr 0.048
Tiny-ViT
Change --nproc_per_node=4 depending on the number of available GPUs. Further, change $RESULTS_PATH, $DATA_PATH, $ACTIVATION and $SEED accordingly. Here, --batch-size is the per-GPU batch size.
torchrun --standalone --nproc_per_node=4 -m tasks.image_classification.tiny_vit.main --cfg "./tasks/image_classification/tiny_vit/configs/1k/tiny_vit_21m.yaml" --data-path "$DATA_PATH/imagenet_1k" --batch-size 256 --seed $SEED --run-name $ACTIVATION --act $ACTIVATION --output "$RESULTS_PATH/$ACTIVATION/$SEED"
ViT-B/32 and ViT-B/16
Change --nproc_per_node=4 depending on the number of available GPUs; this also changes the required --gradient-accumulation-steps in the command. For an overall batch size of 4096 with --gradient-accumulation-steps=8 and world_size=4, the per-GPU batch size is 128. Further, change $NUM_WORKERS, $RESULTS_PATH, $DATA_PATH, $ACTIVATION and $SEED accordingly. $MODEL_NAME can be either "vit_b_32" or "vit_b_16".
torchrun --standalone --nproc_per_node=4 -m tasks.image_classification.train --workers $NUM_WORKERS --output-dir $RESULTS_PATH --data-path "$DATA_PATH/imagenet_1k" --model $MODEL_NAME --activation $ACTIVATION --seed $SEED --epochs 300 --batch-size 4096 --gradient-accumulation-steps 8 --opt "adamw" --lr 0.003 --wd 0.3 --lr-scheduler "cosineannealinglr" --lr-warmup-method "linear" --lr-warmup-epochs 30 --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --cutmix-alpha 1.0 --auto-augment "imagenet" --clip-grad-norm 1.0 --ra-sampler --model-ema
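The per-GPU batch size described above works out as follows:

```python
# Overall batch size = per-GPU batch size * world_size * gradient accumulation steps
overall_batch_size = 4096
world_size = 4                    # --nproc_per_node
gradient_accumulation_steps = 8   # --gradient-accumulation-steps

per_gpu_batch_size = overall_batch_size // (world_size * gradient_accumulation_steps)
print(per_gpu_batch_size)         # 128
```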
CIFAR-10
ResNets 20, 32, 44, 56, 110, WideResNet-28-2, DenseNet-40
Change $MODEL_NAME, $RESULTS_PATH, $ACTIVATION and $SEED accordingly. $MODEL_NAME can be "resnet20", "resnet32", "resnet44", "resnet56", "resnet110", "wideresnet_28_2" or "densenet_40".
python -m tasks.small_poc.main --model $MODEL_NAME --activation $ACTIVATION --seed $SEED --results_path $RESULTS_PATH
ViT-Ti/16-224
python -m tasks.small_poc.train_vit --model "vit_tiny_patch16_224" --dataset "torch/CIFAR10" --workers 1 --data-dir "./data" --dataset-download --drop 0.0 --drop-path 0.1 --warmup-epochs 20 --weight-decay 0.05 --lr-base 5e-4 --warmup-lr 5e-7 --min-lr 5e-6 --clip-grad 5 --layer-decay 1.0 --decay-epochs 30 --opt "adamw" --opt-eps 1e-8 --opt-betas 0.9 0.999 --aa "rand-m9-mstd0.5-inc1" --reprob 0.25 --mixup 0.8 --cutmix 1.0 --batch-size 32 --epochs 300 --seed $SEED --ac $ACTIVATION --results_folder $RESULTS_PATH
Small Ablations - CIFAR-10 and MNIST
For MNIST:
python -m tasks.small_poc.run_small_ablation --dataset mnist
For CIFAR-10:
python -m tasks.small_poc.run_small_ablation --dataset cifar10
To generate the loss contours, --dataset could be "mnist" or "cifar10":
python -m tasks.small_poc.merge_and_plot_contours --dataset cifar10
To generate the learning rate heatmaps, --dataset could be "mnist" or "cifar10":
python -m tasks.small_poc.lr_heatmaps --dataset cifar10
Language Modeling
babyGPT - TinyStories
First prepare the Tiny Stories dataset - change $DATA_DIR and $NUM_WORKERS accordingly.
python -m tasks.gpt.prepare_tiny_stories --cache_dir "$DATA_DIR/tiny_stories" --num_workers $NUM_WORKERS
Change $DATA_PATH, $SEED, $ACTIVATION, $RESULTS_PATH and $INIT_FROM accordingly.
When training for the first time, $INIT_FROM should be "scratch"; when resuming training, it should be "resume". Also, the code appends the dataset name to $DATA_PATH, i.e. if $DATA_PATH="./data" and dataset="tiny_stories", the path automatically becomes "./data/tiny_stories". You do not need to explicitly append tiny_stories when exporting DATA_PATH.
Also, if you change --nproc_per_node=4 based on the number of available GPUs, check the ./tasks/gpt/config/train_baby_gpt_tiny_stories.py file for any settings that may need to be adjusted.
torchrun --standalone --nproc_per_node=4 -m tasks.gpt.train --model_name "baby_gpt" --dataset "tiny_stories" --dataset_path $DATA_PATH --seed $SEED --activation $ACTIVATION --results_path $RESULTS_PATH --init_from $INIT_FROM --max_iters 10000 --lr_decay_iters 10000
GPT2-S - OpenWebText
First prepare the OpenWebText dataset - change $DATA_DIR and $NUM_WORKERS accordingly.
python -m tasks.gpt.prepare_owt --cache_dir "$DATA_DIR/open_web_text" --num_workers $NUM_WORKERS
Change $DATA_PATH, $SEED, $ACTIVATION, $RESULTS_PATH and $INIT_FROM accordingly.
When training for the first time, $INIT_FROM should be "scratch"; when resuming training, it should be "resume". Also, the code appends the dataset name to $DATA_PATH, i.e. if $DATA_PATH="./data" and dataset="open_web_text", the path automatically becomes "./data/open_web_text". You do not need to explicitly append open_web_text when exporting DATA_PATH.
Also, if you change --nproc_per_node=4 based on the number of available GPUs, check the ./tasks/gpt/config/train_gpt2s_open_web_text.py file for any settings that may need to be adjusted.
torchrun --standalone --nproc_per_node=4 -m tasks.gpt.train --model_name "gpt2s" --dataset "open_web_text" --dataset_path $DATA_PATH --seed $SEED --activation $ACTIVATION --results_path $RESULTS_PATH --init_from $INIT_FROM --max_iters 600000 --lr_decay_iters 600000
Semantic Segmentation
DeepLabV3 ResNet-50 - MS-COCO
Download the dataset before training the model. Follow the commands below:
Training Dataset
wget http://images.cocodataset.org/zips/train2017.zip
Validation Dataset
wget http://images.cocodataset.org/zips/val2017.zip
Annotations
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
Unzip the files
unzip train2017.zip
unzip val2017.zip
unzip annotations_trainval2017.zip
Training command - change --nproc_per_node=4 according to the number of available GPUs. --batch-size here is per GPU. --backbone-checkpoint-path points to the ResNet-50 checkpoints pre-trained on ImageNet-1K; the script selects the checkpoint matching the activation and seed combination.
Set the $NUM_WORKERS, $DATA_PATH, $RESULTS_PATH, $SEED and $ACTIVATION accordingly. You can simply change --lr to do a learning rate ablation.
torchrun --standalone --nproc_per_node=4 -m tasks.semantic_segmentation.train --model "deeplabv3_resnet50" --output-dir $RESULTS_PATH --workers $NUM_WORKERS --seed $SEED --activation $ACTIVATION --dataset "coco" --data-path $DATA_PATH --backbone-checkpoint-path "$RESULTS_PATH/imagenet_1k/resnet50" --lr 0.02 --batch-size 8 --aux-loss "True" --print-freq 100
Object Detection
Faster R-CNN-FPN ResNet-50 - MS-COCO
Follow the process to download the data as per DeepLabV3 ResNet-50 - MS-COCO.
Training command - change --nproc_per_node=4 according to the number of available GPUs. --batch-size here is per GPU. --backbone-checkpoint-path points to the ResNet-50 checkpoints pre-trained on ImageNet-1K; the script selects the checkpoint matching the activation and seed combination.
Set the $NUM_WORKERS, $DATA_PATH, $RESULTS_PATH, $SEED and $ACTIVATION accordingly. You can simply change --lr to do a learning rate ablation.
torchrun --standalone --nproc_per_node=4 -m tasks.object_detection.train --model "fasterrcnn_resnet50_fpn" --output-dir $RESULTS_PATH --workers $NUM_WORKERS --seed $SEED --activation $ACTIVATION --dataset "coco" --data-path $DATA_PATH --batch-size 4 --lr 0.02 --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3 --print-freq 100 --backbone-checkpoint-path "$RESULTS_PATH/imagenet_1k/resnet50"
RetinaNet-FPN ResNet-50 - MS-COCO
Follow the process to download the data as per DeepLabV3 ResNet-50 - MS-COCO.
Training command - change --nproc_per_node=4 according to the number of available GPUs. --batch-size here is per GPU. --backbone-checkpoint-path points to the ResNet-50 checkpoints pre-trained on ImageNet-1K; the script selects the checkpoint matching the activation and seed combination.
Set the $NUM_WORKERS, $DATA_PATH, $RESULTS_PATH, $SEED and $ACTIVATION accordingly. You can simply change --lr to do a learning rate ablation.
torchrun --standalone --nproc_per_node=4 -m tasks.object_detection.train --model "retinanet_resnet50_fpn_v2" --output-dir $RESULTS_PATH --workers $NUM_WORKERS --seed $SEED --activation $ACTIVATION --dataset "coco" --data-path $DATA_PATH --batch-size 4 --lr 0.0001 --weight-decay 0.05 --norm-weight-decay 0.0 --opt adamw --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3 --data-augmentation "multiscale" --print-freq 100 --backbone-checkpoint-path "$RESULTS_PATH/imagenet_1k/resnet50"
Instance Segmentation
Mask R-CNN-FPN ResNet-50 - MS-COCO
Follow the process to download the data as per DeepLabV3 ResNet-50 - MS-COCO.
Training command - change --nproc_per_node=4 according to the number of available GPUs. --batch-size here is per GPU. --backbone-checkpoint-path points to the ResNet-50 checkpoints pre-trained on ImageNet-1K; the script selects the checkpoint matching the activation and seed combination. You can simply change --lr to do a learning rate ablation.
Set the $NUM_WORKERS, $DATA_PATH, $RESULTS_PATH, $SEED and $ACTIVATION accordingly.
torchrun --standalone --nproc_per_node=4 -m tasks.instance_segmentation.train --model "maskrcnn_resnet50_fpn" --output-dir $RESULTS_PATH --workers $NUM_WORKERS --seed $SEED --activation $ACTIVATION --dataset "coco" --data-path $DATA_PATH --batch-size 4 --lr 0.02 --epochs 26 --lr-steps 16 22 --aspect-ratio-group-factor 3 --print-freq 100 --backbone-checkpoint-path "$RESULTS_PATH/imagenet_1k/resnet50"
Diffusion
Denoising Diffusion Probabilistic Model - CelebA
Download the dataset before training the model - set $DATA_PATH accordingly. Downloading during training may corrupt the data, because all DDP processes start downloading the same files simultaneously. Hence, it's better to download the dataset once beforehand, outside of Distributed Data Parallel.
import os
from torchvision.datasets import CelebA

# Use the same $DATA_PATH that you pass to the training script.
data_path = os.environ["DATA_PATH"]
train_dataset = CelebA(root=data_path, split="train", download=True)
valid_dataset = CelebA(root=data_path, split="valid", download=True)
Set the $NUM_WORKERS, $RESULTS_PATH, $DATA_PATH, $SEED and $ACTIVATION accordingly. Also change --nproc_per_node=4 according to the number of available GPUs. You can simply change --lr to do a learning rate ablation.
torchrun --standalone --nproc_per_node=4 -m tasks.diffusion.train_ddp --num_workers $NUM_WORKERS --save_dir $RESULTS_PATH --data_dir $DATA_PATH --seed $SEED --activation $ACTIVATION
Visualizations
For any visualization in the ./visualizations folder, you can simply run a command of the following form:
python -m visualizations.<folder_name>.<file_name>
A simple example:
python -m visualizations.activations.activation_visualizations
Similarly, you can run the rest of the visualizations.
🚀 Now you're ready to integrate GoLU into your deep learning models!
Please cite GoLU in case you use it in your work 🙌
@article{TBD,
title={TBD},
author={TBD},
year={TBD}
}