SplitCompute - github link
For the last week or so, I had been working on a project I called "SplitCompute", with the goal of sharding neural networks between the cloud and the edge. This could potentially bring cloud GPU costs for inference on such models down to little or nothing.
The idea behind sharding the model, rather than just transferring it wholesale, was that no one wants a 5-10 GB model downloaded into their browser and run there. ML on the edge is definitely getting better, but it's still not very practical given the transfer and compute requirements.
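To make the split concrete, here is a minimal sketch of what the edge half could look like, under my own assumptions - the `/api/partial-forward` endpoint, the payload shape, and `runRemainingLayers` are hypothetical placeholders, not the project's actual API. The cloud runs the heavy lower layers and returns the intermediate activations; the browser finishes the forward pass with the few layers it actually downloaded.

```ts
// Hedged sketch of split inference: cloud runs layers 0..k, browser runs the rest.
type Activations = { shape: number[]; data: Float32Array };

async function fetchIntermediateActivations(text: string): Promise<Activations> {
  // Hypothetical endpoint: the server runs the first k layers and returns
  // the activations at the split point.
  const res = await fetch("/api/partial-forward", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  const { shape, data } = await res.json();
  return { shape, data: new Float32Array(data) };
}

// Stand-in for the WebGPU-backed tail of the model (layers k+1..n) that was
// shipped to the browser - a few MB of weights instead of the full model.
function runRemainingLayers(acts: Activations): Float32Array {
  return acts.data; // placeholder: no real compute here
}

async function embed(text: string): Promise<Float32Array> {
  const acts = await fetchIntermediateActivations(text);
  return runRemainingLayers(acts);
}
```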
For the most part, the project turned out to be a success - for arbitrary embedding-based models, you can use the components and get them to work. The issue was that the problem I set out to solve wasn't really a problem :(
While looking for ways to run neural networks in the browser, I stumbled upon some of Karpathy's work - of course he had tried to do this 11 years ago. It was pure JS and still quite performant, but I was looking for something lower-level.
ONNX Runtime is another popular choice for running models in the browser, but I wasn't sure how to shard networks with it. Moreover, from what I had read online, its performance was dubious.
Eventually I stumbled upon an unfinished library with a torch-like API written in TS - webgpu-torch. It was full of bugs and incomplete code, but it felt like a good exercise to fix and extend it, so I went along with it.
When it did work, it was really fast - fast enough not to impact UX.
Webgpu-torch represents tensors as lazy buffers, each associated with a computation graph. The graph is compiled into kernels, which are really just WGSL strings containing the bindings and ops.
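Here's a stripped-down sketch of that lazy-buffer idea - my own illustration, not webgpu-torch's actual internals. Ops only record themselves into a graph, and a WGSL kernel string is emitted when a result is actually demanded:

```ts
// Minimal lazy-tensor sketch: building a graph, then lowering one node to WGSL.
type Op =
  | { kind: "input" }
  | { kind: "add" | "mul"; lhs: LazyTensor; rhs: LazyTensor };

class LazyTensor {
  constructor(readonly op: Op, readonly size: number) {}

  add(other: LazyTensor): LazyTensor {
    return new LazyTensor({ kind: "add", lhs: this, rhs: other }, this.size);
  }
  mul(other: LazyTensor): LazyTensor {
    return new LazyTensor({ kind: "mul", lhs: this, rhs: other }, this.size);
  }
}

// Lower a single op (whose operands are input buffers) into a WGSL compute
// shader string with one binding per input plus one for the output.
function compileToWGSL(t: LazyTensor): string {
  if (t.op.kind === "input") throw new Error("nothing to compute");
  const expr = t.op.kind === "add" ? "a[i] + b[i]" : "a[i] * b[i]";
  return `
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> out: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  if (i < ${t.size}u) {
    out[i] = ${expr};
  }
}`;
}
```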
The library does no kernel fusion at the moment, though I'm planning to add that when I get the time. Tinygrad has already done most of this.
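For context, this is roughly what fusion buys. An unfused `relu(x * w + b)` launches three elementwise kernels and bounces intermediates through GPU memory; a fused version is a single kernel where the whole chain collapses into one expression (again my own illustration, not library code):

```ts
// Illustrative fused WGSL kernel for relu(x * w + b): one launch, and the
// intermediate products/sums never leave registers.
const fusedKernel = /* wgsl */ `
@group(0) @binding(0) var<storage, read> x: array<f32>;
@group(0) @binding(1) var<storage, read> w: array<f32>;
@group(0) @binding(2) var<storage, read> b: array<f32>;
@group(0) @binding(3) var<storage, read_write> out: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  if (i < arrayLength(&out)) {
    out[i] = max(x[i] * w[i] + b[i], 0.0);
  }
}`;
```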
It was always going to be a benefit-versus-convenience tradeoff, but the benefit turned out to be far smaller than I expected. GPU costs for embedding models aren't really a big concern for most commercial users, and the added complexity far outweighs the pros.
Overall, it was a fun week or two. It was also nice to dive into various tensor library internals and find a LOT of absolute slop (tensorflow).