We're looking into superoptimizers, specifically adding a luminal_gpu crate with a gpu-agnostic superoptimizer pass, which will then feed into downstream gpu compilers in luminal_cuda and luminal_metal.
So far, I'm specifically taking ideas from Mirage (https://www.cs.cmu.edu/~zhihaoj2/papers/mirage.pdf), EinNet (https://www.usenix.org/system/files/osdi23-zheng.pdf) and Tensat (https://www.cl.cam.ac.uk/~ey204/teaching/ACS/R244_2022_2023/papers/YANG_MLSYS_2021.pdf). The goal here is to define tensor operations in terms of a linear algebra language, ideally even a scalar-based language (see EinNet) which codifies the loop structure explicitly, and then use equality saturation to apply a set of simple rewrite rules simultaneously to discover exponentially many equivalent expressions in linear space. Since these rewrite rules are symbolically guaranteed to generate semantically equivalent expressions, we don't need to worry about testing for correctness like Mirage does.
My current thinking for an approach:
Language
We turn linear algebra into a symbolic language, similar to how EinNet does theirs, where Matmul can be represented like this:
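(A minimal sketch only: the operator names and the use of the egg crate below are illustrative assumptions, not anything that exists in luminal today.)

```rust
use egg::*;

// A tiny scalar tensor language in the spirit of EinNet: every tensor op is
// written element-wise as loads, arithmetic, and explicit reduction loops.
define_language! {
    enum TensorLang {
        // (sum k body): reduce `body` over the loop index `k`
        "sum" = Sum([Id; 2]),
        "*" = Mul([Id; 2]),
        "+" = Add([Id; 2]),
        // (load A i j): scalar load from 2-D tensor A at indices (i, j)
        "load" = Load([Id; 3]),
        Symbol(Symbol),
    }
}

fn main() {
    // C[m, n] = sum over k of A[m, k] * B[k, n]
    let matmul: RecExpr<TensorLang> =
        "(sum k (* (load A m k) (load B k n)))".parse().unwrap();
    println!("{matmul}");
}
```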
Equality Saturation
We use e-graphs and simple rewrite rules to transform the expressions in this language into various semantically identical forms. Since loops are codified in this language, we can apply rewrites to do loop transformations as well, thereby unifying schedule and algorithm optimizations. We will build the e-graph until it's saturated (or perhaps use a beam width to limit the search space).
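As a rough sketch of what that could look like with the egg crate, building on the hypothetical TensorLang above (the specific rules and limits are placeholders):

```rust
use egg::{rewrite as rw, *};

// Rewrite rules over the hypothetical TensorLang from the previous snippet.
fn rules() -> Vec<Rewrite<TensorLang, ()>> {
    vec![
        // Plain algebraic identities on scalars.
        rw!("mul-comm";   "(* ?a ?b)" => "(* ?b ?a)"),
        rw!("mul-assoc";  "(* (* ?a ?b) ?c)" => "(* ?a (* ?b ?c))"),
        rw!("distribute"; "(* ?a (+ ?b ?c))" => "(+ (* ?a ?b) (* ?a ?c))"),
        // Because loops are explicit, loop transformations are rewrites too,
        // e.g. interchanging two nested reductions (modulo float reassociation).
        rw!("sum-swap";   "(sum ?i (sum ?j ?e))" => "(sum ?j (sum ?i ?e))"),
    ]
}

fn main() {
    let expr: RecExpr<TensorLang> =
        "(sum k (* (load A m k) (load B k n)))".parse().unwrap();

    // Saturate (or stop early): the node/iteration limits play the role of a
    // beam width by bounding how far the search is allowed to grow.
    let runner = Runner::default()
        .with_iter_limit(10)
        .with_node_limit(100_000)
        .with_expr(&expr)
        .run(&rules());

    println!("stopped: {:?}", runner.stop_reason);
}
```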
Extraction
Given a graph, how can the cost / runtime of that graph be determined? One simple way is to generate a few representative inputs randomly, run them through the graph, and measure runtime. This is what tinygrad does (and I think torch.compile). The reason I don't like that approach much is that I want a device-agnostic search stage, so it won't have access to the actual device. Also, running a graph with a representative workload takes a long time, which severely limits the number of candidates you can search.
It should be possible to work out all data movement in a graph from global to shared to register memory, and count how many bytes are moving. Weighted appropriately, I wonder if this would be an adequate proxy for cost?
This memory-movement-minimization idea is the main thing that needs to be proven out. Some issues still remain: it doesn't capture occupancy or coalesced memory accesses, which matter a lot. I don't know of any other symbolic cost function out there that doesn't involve learning a cost model.
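For what it's worth, here is roughly how a symbolic bytes-moved cost would plug into e-graph extraction, again using egg and the TensorLang sketch above. The weights, and the assumption that every load hits global memory, are placeholders until the scheduling story is worked out:

```rust
use egg::*;

// Toy stand-in for the bytes-moved proxy: charge each scalar load the size of
// an element at the memory level it is assumed to come from, and charge pure
// arithmetic nothing. A real version would need to know which loads land in
// global vs. shared vs. register memory, and weight each level differently.
struct BytesMoved;

impl CostFunction<TensorLang> for BytesMoved {
    type Cost = f64;

    fn cost<C>(&mut self, enode: &TensorLang, mut costs: C) -> Self::Cost
    where
        C: FnMut(Id) -> Self::Cost,
    {
        let op_cost = match enode {
            TensorLang::Load(_) => 4.0, // assume 4-byte elements from global memory
            _ => 0.0,
        };
        // Sum the node's own cost with the costs of its children.
        enode.fold(op_cost, |sum, id| sum + costs(id))
    }
}

// Pick the representative with the least estimated data movement.
fn extract_cheapest(runner: &Runner<TensorLang, ()>) -> (f64, RecExpr<TensorLang>) {
    let extractor = Extractor::new(&runner.egraph, BytesMoved);
    extractor.find_best(runner.roots[0])
}
```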
If anyone has any ideas or papers they think can add value, please throw them here!