ci: run tests only on 1.10 for now #975

Merged
merged 2 commits into from
Oct 8, 2024

avik-pal
Member

@avik-pal avik-pal commented Oct 8, 2024

Until EnzymeAD/Enzyme.jl#1358 gets resolved, let's run CI only on 1.10.

@avik-pal avik-pal changed the title ci: run tests only on 1.10 for now ci: run tests only on 1.10 for now Oct 8, 2024
Contributor

github-actions bot commented Oct 8, 2024

Benchmark Results (ASV)

| | main | 67d259c... | main/67d259c6d1c364... |
|---|---|---|---|
| basics/overhead | 0.0557 ± 0.0015 μs | 0.0566 ± 0.0018 μs | 0.984 |
| time_to_load | 1.33 ± 0.025 s | 1.32 ± 0.016 s | 1 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).
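
The last column of the ASV table looks like the `main` timing divided by the timing for the PR commit (its header reads `main/67d259c6d1c364...`). A minimal Python sketch of that arithmetic, under that assumption; the function name `ratio` is illustrative and not part of any benchmarking tool:

```python
def ratio(base: float, candidate: float) -> float:
    """Return base / candidate, which is what the last column appears to report."""
    if candidate == 0:
        raise ValueError("candidate timing must be non-zero")
    return base / candidate


# Timings copied from the ASV table in this comment.
overhead = ratio(0.0557, 0.0566)   # both in microseconds
load_time = ratio(1.33, 1.32)      # both in seconds

print(f"basics/overhead: {overhead:.3f}")
print(f"time_to_load:    {load_time:.3f}")
```

Because the table shows rounded timings, a ratio recomputed from them can differ in the last digit from the one the bot printed.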

@avik-pal avik-pal merged commit 04deedf into main Oct 8, 2024
36 of 48 checks passed
@avik-pal avik-pal deleted the ap/ci branch October 8, 2024 19:19
Contributor

@github-actions github-actions bot left a comment

Lux Benchmarks
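
Each row of the table that follows gives a benchmark name, the current and previous timings in nanoseconds, and a ratio that matches current divided by previous (for example, 322937.5 / 243167 ≈ 1.33). The sketch below parses a few of those rows and lists the ones that slowed down by more than a chosen factor. The names `parse_row` and `slowdowns` and the threshold of 1.5 are assumptions made for illustration, not settings taken from the benchmark bot:

```python
import re
from typing import Iterable, List, NamedTuple

# One table row: "<name> <current> ns <previous> ns <ratio>"
ROW = re.compile(
    r"^(?P<name>.+?) (?P<current>[\d.]+) ns (?P<previous>[\d.]+) ns (?P<ratio>[\d.]+)$"
)


class Row(NamedTuple):
    name: str
    current_ns: float
    previous_ns: float

    @property
    def ratio(self) -> float:
        # Values above 1 mean the current commit is slower than the previous one.
        return self.current_ns / self.previous_ns


def parse_row(line: str) -> Row:
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"not a benchmark row: {line!r}")
    return Row(match["name"], float(match["current"]), float(match["previous"]))


def slowdowns(lines: Iterable[str], threshold: float = 1.5) -> List[Row]:
    """Return rows whose current/previous ratio exceeds the (hypothetical) threshold."""
    return [row for row in map(parse_row, lines) if row.ratio > threshold]


# A few rows copied verbatim from the table.
SAMPLE = [
    "Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 411645.5 ns 415291 ns 0.99",
    "Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 322937.5 ns 243167 ns 1.33",
    "Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 592583.5 ns 1279354.5 ns 0.46",
    "Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 2412000 ns 1221916 ns 1.97",
]

for row in slowdowns(SAMPLE):
    print(f"{row.name}: {row.ratio:.2f}x slower")
```

With these four rows only the `zygote/CPU/4 thread(s)` entry crosses the 1.5 threshold. Large swings on the multi-threaded CPU rows may also reflect noise on the benchmark runner, so a single ratio is best read as a hint rather than a verdict.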

Benchmark suite Current: 67d259c Previous: d230834 Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 411645.5 ns 415291 ns 0.99
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 322937.5 ns 243167 ns 1.33
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 322041 ns 244625 ns 1.32
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 739541 ns 740667 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43450 ns 44725 ns 0.97
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 592583.5 ns 1279354.5 ns 0.46
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 2412000 ns 1221916 ns 1.97
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 13927209 ns 16280791 ns 0.86
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2248999.5 ns 2240458 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 188422 ns 203277 ns 0.93
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 734083 ns 1383187.5 ns 0.53
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 2603292 ns 1309667 ns 1.99
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 14021833.5 ns 16210875 ns 0.86
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2236271.5 ns 2235875 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1631250 ns 1666375 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1080125 ns 1104041.5 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1555750 ns 1509958 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2947354 ns 2989666 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 209946.5 ns 213111 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12284250 ns 12146875 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 8808604.5 ns 8841167 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9229438 ns 9243875 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18579958 ns 18585666.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1913542 ns 1936768 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17282645.5 ns 17311083.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 13958458 ns 13983375 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14540625.5 ns 14496187.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21833708 ns 21837875 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121488833 ns 250126228.5 ns 0.49
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148322083 ns 148997875 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 116063000 ns 116519479.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 446423041 ns 446906458 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5496616 ns 5468434 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 596935833.5 ns 1223788875 ns 0.49
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 931042833 ns 933142709 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 829999521 ns 832839417 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1630204291 ns 1630170292 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35312892 ns 31512911.5 ns 1.12
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 711570895.5 ns 1149549375 ns 0.62
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1015965291.5 ns 997374541.5 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1327562479.5 ns 1308662646 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1729988979 ns 1731062979.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 866979.5 ns 1122500 ns 0.77
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 1620750 ns 1658708 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 3533895.5 ns 3605667 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 788625 ns 782708.5 ns 1.01
lenet(28, 28, 1, 32)/forward/GPU/CUDA 271924 ns 284470.5 ns 0.96
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2708167 ns 2990375 ns 0.91
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 4121417 ns 4122208 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 9736417 ns 10934125 ns 0.89
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3152021 ns 3140208 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1079439 ns 1127614 ns 0.96
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2240062.5 ns 2349749.5 ns 0.95
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1473583 ns 1366187.5 ns 1.08
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1700333 ns 1585125 ns 1.07
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4359917 ns 4341687 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 211937.5 ns 211956.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 20399166 ns 20292146 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 16953541.5 ns 16982750 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 18292375 ns 18160625 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 26726958 ns 26736042 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1984165 ns 2009275 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 45070167 ns 44384292 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 41013229.5 ns 41010166.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 41270708.5 ns 41252542 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 47745792 ns 47742354 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4313167 ns 4667229 ns 0.92
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2855708 ns 2627145.5 ns 1.09
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 3013645.5 ns 2754166 ns 1.09
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8659709 ns 8646833.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 516999 ns 471691 ns 1.10
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 39932125 ns 40759792 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 33968416 ns 34074937.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 34069417 ns 34004708 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 53668500 ns 53724708 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3241641 ns 3235352 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 90463791 ns 110050750 ns 0.82
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 136157500 ns 137101500 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 254064979 ns 251499542 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 95920292 ns 96734833 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 142447792 ns 270582500 ns 0.53
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 160728542 ns 157462229 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 128320500 ns 124550542 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 489398208 ns 489233625 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7076007 ns 7003527 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 878198000 ns 1494868312.5 ns 0.59
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 1159512542 ns 1205204209 ns 0.96
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 1091950937.5 ns 1091914979 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2031129145.5 ns 2033756875 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33948505 ns 34486848.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1675439708 ns 2031846083.5 ns 0.82
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1856136458 ns 1856502416 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 2152323750 ns 2218211729 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2557724375 ns 2563679583 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1489375 ns 2093250 ns 0.71
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 3016271 ns 3113375 ns 0.97
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 7413417 ns 9724750 ns 0.76
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2315437.5 ns 2446083.5 ns 0.95
lenet(28, 28, 1, 128)/forward/GPU/CUDA 269701.5 ns 275113 ns 0.98
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7569375 ns 9682833 ns 0.78
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 12022729 ns 12076166 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 25932458 ns 24267792 ns 1.07
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11703000 ns 11496500 ns 1.02
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1118633.5 ns 1185320 ns 0.94
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 184907416.5 ns 380917104.5 ns 0.49
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 284546083.5 ns 315455208 ns 0.90
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 241426750 ns 265045166.5 ns 0.91
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 452991354 ns 453577208.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 5112454 ns 4825872 ns 1.06
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 645024208 ns 1157170792 ns 0.56
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 997483375 ns 976146875 ns 1.02
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 925202750 ns 1071077458 ns 0.86
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1397551666 ns 1399279583 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 17373860 ns 18526493 ns 0.94
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1087083 ns 1057416 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 2091125 ns 1660750 ns 1.26
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 5668875 ns 5839187.5 ns 0.97
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1359020.5 ns 1297896 ns 1.05
lenet(28, 28, 1, 64)/forward/GPU/CUDA 272985 ns 270186.5 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6008312.5 ns 6497437.5 ns 0.92
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 12422354 ns 13095667 ns 0.95
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 19220333 ns 19774958 ns 0.97
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 6086791 ns 6060250 ns 1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1148718.5 ns 1207468 ns 0.95
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23312791.5 ns 70439459 ns 0.33
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43424291.5 ns 43880645.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39621249.5 ns 39802542 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132704688 ns 132617229.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1838219.5 ns 1928198.5 ns 0.95
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 183969250 ns 354773521 ns 0.52
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 270639125 ns 271527854 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 254150458 ns 253115833 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 534737354 ns 534735167 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16495205 ns 13227623 ns 1.25
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 296709063 ns 395827000 ns 0.75
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 408346313 ns 373039667 ns 1.09
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 675605375 ns 703091167 ns 0.96
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 713172208 ns 714378250 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 658745500 ns 1187937250 ns 0.55
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 691800875 ns 839767834 ns 0.82
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 627903750 ns 640628833 ns 0.98
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1771393041.5 ns 1772779750.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12470963 ns 12386874 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1896686208.5 ns 3628821667 ns 0.52
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 2821462667 ns 2842192167 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 2708179792 ns 2716722458 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 5031662417 ns 5042550875 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49800691 ns 49688646 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3056042 ns 3430062.5 ns 0.89
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2073917 ns 2069021 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2534959 ns 2518417 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6039604 ns 6032959 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 581352.5 ns 573246 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25564458 ns 26098500 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19122958 ns 19045208 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19457084 ns 19561125 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39399583.5 ns 39345062.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3193363 ns 3186388 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 35357833.5 ns 55895354 ns 0.63
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 82465979.5 ns 83953562.5 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 173616833.5 ns 177984916 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45586854 ns 45586542 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1658104.5 ns 1786000.5 ns 0.93
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1103708 ns 1108812.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1584042 ns 1583271 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3040583 ns 3031458.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215913 ns 216476 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12742166 ns 12561896 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9206125 ns 9222083 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9678249.5 ns 9681604.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 19021750 ns 18991354 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1940651.5 ns 1983529 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17683791.5 ns 17661854.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14325396.5 ns 14350708 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14622000 ns 14571666 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22174333 ns 22207958 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23780042 ns 70523437 ns 0.34
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 43554208.5 ns 43757146 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39504333 ns 39692875 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132670229.5 ns 132543875 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1845256.5 ns 1868597 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 191180395.5 ns 358019500 ns 0.53
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 347504021 ns 348616458.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 303868146 ns 304684062.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 726867708 ns 726741083 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13917534 ns 14313431.5 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 301366583.5 ns 420910145.5 ns 0.72
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 420040000 ns 427953667 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 725917916.5 ns 711470292 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 718330750 ns 718110625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1914000 ns 1783333.5 ns 1.07
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1577999.5 ns 1377417 ns 1.15
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1563229 ns 1380791 ns 1.13
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2670709 ns 2616709 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 575822 ns 569443 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 6147833 ns 9249354 ns 0.66
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 13077125 ns 15832708.5 ns 0.83
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 31966833 ns 32885020.5 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 10221083.5 ns 10214250 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1411574.5 ns 1406558.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18784479.5 ns 22309667 ns 0.84
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 23789541 ns 28394500 ns 0.84
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 50722250 ns 56878750 ns 0.89
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 18859041 ns 18878041 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 72708 ns 690833.5 ns 0.11
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 645208.5 ns 613625 ns 1.05
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 1037333 ns 1078916 ns 0.96
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 723792 ns 724417 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 48478 ns 47653 ns 1.02
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 321229 ns 1550500 ns 0.21
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 1073041.5 ns 1006604.5 ns 1.07
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 1391333.5 ns 1431333.5 ns 0.97
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2244937.5 ns 2290167 ns 0.98
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 216787 ns 227007.5 ns 0.95
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 432104 ns 1559479 ns 0.28
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 1120771 ns 1065562.5 ns 1.05
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 1403125 ns 1941250 ns 0.72
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2259020.5 ns 2187500 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3047209 ns 3412458 ns 0.89
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2067438 ns 2060333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2509271 ns 2504750 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6022208 ns 6004208.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 582721 ns 571869.5 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23636354 ns 24064937.5 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17233750 ns 17186562.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17161270.5 ns 17163520.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37565083.5 ns 37576333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3117456 ns 3169039 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33371978.5 ns 53946459 ns 0.62
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 87222979.5 ns 83764604.5 ns 1.04
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 169899604.5 ns 175113292 ns 0.97
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44498458.5 ns 44468375 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 122773083 ns 250717708 ns 0.49
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148369750 ns 148723729 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 115599833 ns 116337041.5 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447883812 ns 447560562.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5473384 ns 5458848 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 472111084 ns 1101190667 ns 0.43
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 855488604.5 ns 856965729.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 827233750 ns 828981916.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1751025791 ns 1751973959 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32267631 ns 29300703 ns 1.10
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 641085708 ns 1020791479.5 ns 0.63
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 974367167 ns 981034709 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1299244125 ns 1298484958 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1727066979 ns 1724676458.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1281875 ns 1192334 ns 1.08
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 926208 ns 722208.5 ns 1.28
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 925000 ns 802271 ns 1.15
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2055687 ns 2055959 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 572649.5 ns 554738 ns 1.03
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2956041 ns 5970125 ns 0.50
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 6528541.5 ns 9028833 ns 0.72
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 25140500 ns 27064125 ns 0.93
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7117667 ns 7113729 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1376682 ns 1360766 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6629417 ns 9717625 ns 0.68
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 13121959 ns 16161979 ns 0.81
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 31633541.5 ns 34006416.5 ns 0.93
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 7727083 ns 7613041 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 39000 ns 386625 ns 0.10
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 369125 ns 466375 ns 0.79
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 2060917 ns 2797833 ns 0.74
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 91500 ns 91041.5 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 28479 ns 28215 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 175271 ns 410729 ns 0.43
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 456875 ns 458375 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 4747958 ns 4385625 ns 1.08
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 278792 ns 273062.5 ns 1.02
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 222945.5 ns 212092.5 ns 1.05
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 442208 ns 682000 ns 0.65
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 729125 ns 731083.5 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 4992042 ns 4635250 ns 1.08
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 510708.5 ns 510917 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 13875 ns 329062.5 ns 0.04
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 307479 ns 405333 ns 0.76
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 742729 ns 775209 ns 0.96
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 54562.5 ns 53250 ns 1.02
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 28484 ns 27988 ns 1.02
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 25708 ns 358354 ns 0.07
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 338916.5 ns 340937.5 ns 0.99
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 863958 ns 667854 ns 1.29
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 151875 ns 151583 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 212558.5 ns 199391.5 ns 1.07
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 45958 ns 372916.5 ns 0.12
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 353000 ns 354896 ns 0.99
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 707124.5 ns 585667 ns 1.21
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 151208 ns 151375 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 319447750 ns 600844791 ns 0.53
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 426610458.5 ns 434479500 ns 0.98
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 375378312 ns 395023625 ns 0.95
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 871258750 ns 872456875 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7669123 ns 7629063.5 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1100131354 ns 1996796291.5 ns 0.55
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1622570062 ns 1637741500 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 1650938062.5 ns 1582333333.5 ns 1.04
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2689218667 ns 2658961958 ns 1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 27018808 ns 26619843 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 192917 ns 532479 ns 0.36
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 438750 ns 405208 ns 1.08
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 1662083 ns 2880604.5 ns 0.58
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 872375 ns 877791.5 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 48456 ns 47573 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1206375 ns 1905250 ns 0.63
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 2525667 ns 1799584 ns 1.40
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 14712021 ns 16464375 ns 0.89
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2780833 ns 2818750 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 229439 ns 239346.5 ns 0.96
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 2304041.5 ns 2932000 ns 0.79
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 5195271 ns 4975687.5 ns 1.04
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 15008833 ns 16759417 ns 0.90
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 3672875 ns 3748708 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1605250.5 ns 1367812.5 ns 1.17
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1257417 ns 930041 ns 1.35
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1184583 ns 1056709 ns 1.12
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2332750 ns 2313729.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 564450.5 ns 567030 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3206145.5 ns 5196958 ns 0.62
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 4660438 ns 8601584 ns 0.54
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 25391500 ns 26184083.5 ns 0.97
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7348916 ns 7337728.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1322967.5 ns 1330201.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8828437.5 ns 11580375 ns 0.76
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 14136583 ns 18587958.5 ns 0.76
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 34282271 ns 37621062.5 ns 0.91
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 9545875 ns 9557791 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2417 ns 3041 ns 0.79
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2583 ns 2792 ns 0.93
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 3000 ns 3375 ns 0.89
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 2875 ns 2854 ns 1.01
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 24442 ns 25102 ns 0.97
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7500 ns 7125 ns 1.05
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 6917 ns 6958 ns 0.99
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7333 ns 7875 ns 0.93
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7500 ns 7083 ns 1.06
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 186841 ns 201877.5 ns 0.93
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8666 ns 8292 ns 1.05
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8208 ns 8333 ns 0.98
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8500 ns 8542 ns 1.00
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 6000 ns 5958 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 10708.5 ns 10813 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 15917 ns 13916 ns 1.14
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 10666.5 ns 11312.5 ns 0.94
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 8583 ns 7709 ns 1.11
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 24811 ns 25316 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 21875 ns 21583 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 21375 ns 21625 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 21834 ns 21708 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 21625 ns 21500 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 196501.5 ns 219161 ns 0.90
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 56667 ns 53500 ns 1.06
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 53417 ns 53458 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 53583 ns 53542 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 51125 ns 51166.5 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 28584 ns 28292 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 28833 ns 28792 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 29166.5 ns 28375 ns 1.03
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 46083 ns 46125 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 25667 ns 26235 ns 0.98
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 44000 ns 229583 ns 0.19
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 272416 ns 277792 ns 0.98
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 4134937.5 ns 4446854.5 ns 0.93
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 150250 ns 145500 ns 1.03
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 167133.5 ns 197661 ns 0.85
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 68375 ns 246666.5 ns 0.28
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 290000 ns 296000 ns 0.98
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 4134250 ns 4144084 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 145834 ns 145750 ns 1.00
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 1708 ns 1834 ns 0.93
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 1875 ns 1750 ns 1.07
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2208 ns 2500 ns 0.88
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 2000 ns 3750 ns 0.53
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 23056 ns 23319.5 ns 0.99
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5542 ns 5292 ns 1.05
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5125 ns 5000 ns 1.02
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5459 ns 5416 ns 1.01
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5250 ns 5000 ns 1.05
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 171241.5 ns 226307 ns 0.76
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 8209 ns 7459 ns 1.10
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 7333 ns 7375 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 7708 ns 7792 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 5208 ns 5042 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 34121125 ns 81067334 ns 0.42
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 49829583 ns 48673125 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 45610000 ns 43747500 ns 1.04
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 153620250 ns 153700375 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2656287 ns 2718893 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 398908813 ns 621060459 ns 0.64
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 429050459 ns 430659541 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 411940459 ns 409758041.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 700408042 ns 699041292 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15233782 ns 15621337.5 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 746953208 ns 875541666.5 ns 0.85
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 839728187.5 ns 845831187.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 1151171375 ns 1160340833.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 1175552896 ns 1177842604 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
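
For readers checking the numbers, the Ratio column appears to be the current time divided by the previous time, rounded to two decimals, so values below 1 mean the current commit ran faster. The following is a minimal sketch under that assumption. The `ratio` helper and the `REGRESSION_THRESHOLD` value are hypothetical, chosen for illustration, and are not taken from github-action-benchmark's configuration.

```python
# Hypothetical cutoff for flagging a slowdown; not taken from the workflow.
REGRESSION_THRESHOLD = 1.30


def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time over previous time, rounded to two decimals like the table."""
    return round(current_ns / previous_ns, 2)


# (current ns, previous ns) pairs copied from rows of the table above.
rows = {
    "Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s)": (44000, 229583),
    "mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s)": (1257417, 930041),
}

ratios = {name: ratio(cur, prev) for name, (cur, prev) in rows.items()}
flagged = [name for name, r in ratios.items() if r >= REGRESSION_THRESHOLD]

for name, r in ratios.items():
    status = "possible regression" if name in flagged else "ok"
    print(f"{r:.2f}  {status}  {name}")
```

Both recomputed values match the Ratio entries in the table (0.19 and 1.35). Single-run timings on shared CI hardware are noisy, so a flagged row is a prompt to re-run the benchmark rather than proof of a regression.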
