mmap() and the confusion with LLaMa's sparsity

A recent article on HN blew up claiming that switching to mmap() had slashed memory usage for LLaMa (the 30B-parameter variant) inference: it was now supposedly possible with 6GB of RAM instead of the $\ge$ 30GB needed to load the weights and activations. Even quantized to 4-bit integers, the model still requires $\ge$ 17GB to load fully, but maybe loading it fully isn't necessary? (it actually is)

mmap()

mmap() creates a mapping in the virtual address space of the calling process. The mapping can back a file or a device, and it can be far larger than the physical memory available, since only virtual address space is reserved up front. The OS services accesses on demand, faulting pages in from disk and relying on the page cache (and swap) to manage physical memory.
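
As a rough illustration, here is a minimal C sketch that maps a model file read-only (the file name is hypothetical). The call succeeds even when the file is much larger than physical RAM, because nothing is read from disk until the mapped pages are actually touched.

```c
/* Minimal sketch: map a (hypothetical) weights file read-only.
 * The whole file is reserved in virtual address space up front;
 * physical pages are only faulted in when the bytes are touched. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const char *path = "ggml-model-q4_0.bin";   /* hypothetical file name */

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the entire file; this succeeds even if the file is larger
     * than physical RAM, because only address space is reserved. */
    void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }

    printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);

    /* Reading through `weights` now triggers page faults: the kernel
     * pulls the needed pages from disk into the page cache on demand. */

    munmap(weights, st.st_size);
    close(fd);
    return 0;
}
```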

Loading LLaMa 30B through mmap() made htop report a memory consumption of only ~6GB, a drop that was largely attributed to sparsity within the weights.

Matrix multiplication

In reality, htop simply was not reporting the filesystem (page) cache, through which the entire LLaMa model was still being read via OS calls.
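
One way to see what htop hides is to ask the kernel directly which pages of the mapping are resident. The sketch below (Linux-specific, with a hypothetical file name) uses mincore() to count how much of an mmap()ed model file is actually sitting in memory.

```c
/* Sketch: count how many pages of an mmap()ed file are resident in the
 * page cache. A process-level view can miss file-backed pages the kernel
 * is caching on the process's behalf; mincore() makes them visible.
 * Assumes Linux/glibc; the file path is hypothetical. */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("ggml-model-q4_0.bin", O_RDONLY);  /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + page - 1) / page;
    unsigned char *vec = malloc(npages);
    if (!vec) { perror("malloc"); return 1; }

    if (mincore(map, st.st_size, vec) < 0) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;          /* bit 0 = page is in memory */

    printf("%zu of %zu pages resident (%.1f%%)\n",
           resident, npages, 100.0 * resident / npages);

    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```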

  1. It is not possible to exploit sparsity through mmap(). Transformers are a stack of matrix multiplications, and you cannot fully multiply a matrix while only touching parts of it: a dense matmul reads every weight, so every page backing those weights gets faulted in anyway (see the sketch after this list).

  2. The sparsity claim itself is unsubstantiated: LLaMa's weight matrices are dense, and nothing in the loading path prunes or skips them.
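
To see why the weights cannot be skipped, consider what a single dense matrix-vector product (the building block of every transformer layer) does. The toy dimensions below are made up purely for illustration.

```c
/* Sketch: a dense matrix-vector product like the ones inside a transformer
 * layer. Every weight w[i][j] is read exactly once per token, so even if
 * many weights were near zero, every page backing `w` would still be
 * faulted in from disk. Dimensions are toy values for illustration. */
#include <stdio.h>

#define ROWS 4
#define COLS 4

void matvec(const float w[ROWS][COLS], const float x[COLS], float y[ROWS]) {
    for (int i = 0; i < ROWS; i++) {
        y[i] = 0.0f;
        for (int j = 0; j < COLS; j++) {
            /* The access w[i][j] happens unconditionally: a dense kernel
             * cannot skip "sparse" entries without an explicit sparse
             * format (e.g. CSR) that stores only the non-zeros. */
            y[i] += w[i][j] * x[j];
        }
    }
}

int main(void) {
    const float w[ROWS][COLS] = {
        {0.0f, 1.0f, 0.0f, 0.0f},
        {2.0f, 0.0f, 0.0f, 0.0f},
        {0.0f, 0.0f, 0.0f, 3.0f},
        {0.0f, 0.0f, 4.0f, 0.0f},
    };
    const float x[COLS] = {1.0f, 1.0f, 1.0f, 1.0f};
    float y[ROWS];

    matvec(w, x, y);
    for (int i = 0; i < ROWS; i++)
        printf("y[%d] = %.1f\n", i, y[i]);
    return 0;
}
```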

Is it beneficial at all, then, to use mmap()? It turns out yes, since the model is no longer copied into the process's own memory. Model loading is faster, and subsequent runs benefit from the OS page cache. However, when the weights don't fit in RAM, inference time takes a massive hit from the constant disk reads.
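
A rough way to observe both effects is to walk an mmap()ed file twice and time each pass: the cold pass pays for the page faults and disk reads, while the warm pass is served from the page cache. The sketch below assumes Linux and a hypothetical model file name.

```c
/* Sketch: touch every page of an mmap()ed file twice and time both passes.
 * The first pass pays for disk reads via page faults (roughly what
 * inference feels like when the model is not yet cached or does not fit
 * in RAM); the second pass is served from the OS page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

static double touch_all(const volatile unsigned char *p, size_t n) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    unsigned long sum = 0;
    long page = sysconf(_SC_PAGESIZE);
    for (size_t i = 0; i < n; i += page)   /* one read per page is enough */
        sum += p[i];
    (void)sum;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    int fd = open("ggml-model-q4_0.bin", O_RDONLY);  /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    const unsigned char *map =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    printf("cold pass: %.2f s\n", touch_all(map, st.st_size));
    printf("warm pass: %.2f s\n", touch_all(map, st.st_size));

    munmap((void *)map, st.st_size);
    close(fd);
    return 0;
}
```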