SplitCompute and working with Tensors

SplitCompute - GitHub link

So for the last week or so I have been working on a project I call "SplitCompute", with the intention of sharding neural networks between the cloud and the edge. Offloading part of inference to the client could potentially cut cloud GPU costs for such models to little or nothing.

The idea behind sharding, instead of just transferring the whole model, is that no one wants 5-10GB of weights downloaded into their browser and then coaxed into running. ML on the edge is definitely getting better, but the transfer and compute requirements still make full on-device inference impractical.

For the most part, the project turned out successful: for arbitrary embedding-based models, you can use the components and get it working. The real problem turned out to be that the problem I set out to solve wasn't really one :(
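To make the split concrete, here is a minimal sketch of what inference looks like when the first few layers run in the cloud and the rest run in the browser. The endpoint and function names are hypothetical, made up for illustration; they are not SplitCompute's actual API.

```typescript
// Hypothetical split for an embedding model: the server runs layers 0..k and
// ships the intermediate activations; the browser finishes the forward pass
// on WebGPU, so the server never pays for the bulk of the compute.
async function splitEmbed(text: string): Promise<Float32Array> {
  // Cloud half: cheap partial forward pass, returns hidden states as raw bytes.
  // "/api/partial-forward" and "splitLayer" are made-up names for this sketch.
  const res = await fetch("/api/partial-forward", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, splitLayer: 4 }),
  });
  const hidden = new Float32Array(await res.arrayBuffer());

  // Edge half: the remaining layers run locally on the GPU.
  return runRemainingLayersOnWebGPU(hidden);
}

// Placeholder for the on-device half; in the real project this would drive
// the webgpu-torch components described below. Identity stand-in here so the
// sketch type-checks.
async function runRemainingLayersOnWebGPU(hidden: Float32Array): Promise<Float32Array> {
  return hidden;
}
```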

The Core

Why webgpu-torch

While looking for ways to run neural networks in the browser, I stumbled upon some of Karpathy's work - obviously he had already tried this, 11 years ago. It was pure JS, and still quite performant, but I was looking for something lower level.

ONNX Runtime is another popular choice for running models in the browser, but I was unsure how to shard NNs with it. Its performance also seemed dubious, judging from what I read on online forums.

Eventually I stumbled upon an unfinished library with a torch-like API written in TS - webgpu-torch. It was full of bugs and incomplete code, but fixing and extending it felt like a good exercise, so I went along with it.
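To give a feel for the API, here is roughly what using it looks like. The method names are as I recall them from the library's README, so treat this as a sketch rather than exact API documentation.

```typescript
import * as torch from "webgpu-torch";

// Initialize WebGPU first; not every browser supports it yet.
if (!(await torch.initWebGPUAsync())) {
  throw new Error("WebGPU is not supported in this browser");
}

const a = torch.tensor([[1, 2, 3], [4, 5, 6]]);
const b = torch.tensor([[7, 8, 9], [10, 11, 12]]);
const c = a.add(b);                    // lazy: builds a graph, no GPU work yet
const result = await c.toArrayAsync(); // compiles kernels, dispatches, reads back
console.log(result);
```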

When it did work, it was really fast - fast enough to not impact UX.

How it works

webgpu-torch uses lazy buffers to handle tensors: each tensor is associated with a computation graph rather than eagerly computed values. The graph is compiled to kernels, which are really just WGSL strings containing bindings and ops.
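Here is a minimal sketch of that idea. This is illustrative, not webgpu-torch's actual internals: each tensor node records the op that produced it, and each op node compiles to its own elementwise WGSL kernel, with one storage binding per input and one for the output.

```typescript
type Op = "add" | "mul";

// A lazy tensor is just a node in a computation graph: leaves are input
// buffers, interior nodes record which op produced them.
class LazyTensor {
  constructor(
    public readonly op: Op | null, // null for leaf/input tensors
    public readonly inputs: LazyTensor[] = []
  ) {}
  add(other: LazyTensor) { return new LazyTensor("add", [this, other]); }
  mul(other: LazyTensor) { return new LazyTensor("mul", [this, other]); }
}

// Compile one op node into a standalone elementwise WGSL kernel string.
function kernelFor(op: Op): string {
  const expr = op === "add" ? "a[i] + b[i]" : "a[i] * b[i]";
  return `
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> out: array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  if (i < arrayLength(&out)) { out[i] = ${expr}; }
}`;
}

// Post-order walk: one kernel (and thus one GPU dispatch) per op node.
function compileGraph(t: LazyTensor, out: string[] = []): string[] {
  t.inputs.forEach((input) => compileGraph(input, out));
  if (t.op) { out.push(kernelFor(t.op)); }
  return out;
}

const x = new LazyTensor(null);
const y = new LazyTensor(null);
const z = x.mul(y).add(x);           // nothing runs yet; z is just a graph
console.log(compileGraph(z).length); // 2 kernels: one mul, one add
```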

The library does no kernel fusion at the moment, though I'm planning to add that when I get the time - Tinygrad has already worked through most of these ideas.
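To show what fusion would buy in the sketch above: instead of dispatching the mul and add kernels separately, with the intermediate result written to and re-read from a storage buffer, a fused kernel inlines the whole expression into one pass.

```typescript
// Fused version of (x * y) + x from the sketch above: one dispatch, one
// kernel, and no intermediate buffer traffic between the two ops.
const fusedKernel = `
@group(0) @binding(0) var<storage, read> x: array<f32>;
@group(0) @binding(1) var<storage, read> y: array<f32>;
@group(0) @binding(2) var<storage, read_write> out: array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  if (i < arrayLength(&out)) { out[i] = x[i] * y[i] + x[i]; }
}`;
```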

The problem with the problem

It was always a benefit-convenience tradeoff, but the benefit turned out far smaller than I expected. GPU costs for serving embedding models just aren't that big of a concern for most commercial users, and the added complexity far outweighs the pros.

Overall it was a fun week or two. It was also nice to dive into various tensor library internals and find a LOT of absolute slop (TensorFlow).