Models are trained by data science teams, but putting them into real-world production software, especially edge applications, is hard.
Software teams have to manage energy consumption, latency, disk size limits, RAM limits, and parallelism when shipping machine learning.
I’m going to summarize my research on these topics to save you some time.
Right now, it’s written to be helpful for software engineers interested in deploying ML models, and it’s not especially focused on engaging the research community (no research works are cited).
In many cases, especially with LLMs, you can convert your weights from floating-point to int8 types. This reduces the disk space required to store your model (a float32 takes 32 bits; an int8 takes 8 bits).
Quantization also helps with RAM / VRAM (VRAM means Video RAM, the GPU equivalent of RAM). Conceptually, both refer to the amount of memory your model needs to make a computation. In particular, since models are built around massive matrix multiplications, reducing the size of the matrices being multiplied reduces the RAM / VRAM required to operate on them.
After you’ve quantized your model, be sure to evaluate its performance, as this approach is lossy.
You can use the (beta) PyTorch quantization docs to help you with this if you’re working with a PyTorch model (or to learn more generally).
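To make that concrete, here’s a minimal sketch of post-training dynamic quantization with PyTorch’s (beta) quantization API. The toy two-layer model is a placeholder for your own, and the exact module path (`torch.ao.quantization`) may vary slightly by PyTorch version.

```python
import torch
import torch.nn as nn

# Toy float32 model; substitute your own trained module.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Dynamically quantize Linear layers: weights are stored as int8 instead of float32.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Compare serialized sizes to see the disk savings.
torch.save(model.state_dict(), "fp32.pt")
torch.save(quantized_model.state_dict(), "int8.pt")
```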
You can also do this in TensorFlow.
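In TensorFlow, the usual route is post-training quantization through the TFLite converter. A rough sketch, with a hypothetical toy Keras model standing in for yours:

```python
import tensorflow as tf

# Hypothetical toy model; use your trained Keras model instead.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(512,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# The default optimization quantizes weights (typically to int8) where it can.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```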
There’s another concept here called Quantization-Aware Training. This is what it sounds like: the model simulates quantization during training, so it loses little accuracy when it is actually quantized afterward. Talk to your data science team to hear whether they’re doing that. See PyTorch on Quantization-Aware Training.
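For a feel of what that looks like, here’s a rough sketch of eager-mode QAT in PyTorch. The `SmallNet` architecture and the commented-out training loop are hypothetical placeholders; the full workflow is in the PyTorch QAT docs linked above.

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc1 = nn.Linear(512, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)  # fake quantization is applied here during training
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = SmallNet()
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
torch.ao.quantization.prepare_qat(model, inplace=True)

# ... train as usual; the model learns to tolerate quantization noise ...
# train_one_epoch(model, data_loader)  # hypothetical training loop

model.eval()
quantized = torch.ao.quantization.convert(model)
```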
I don’t currently know much about model pruning, but this seems to be a helpful guide: Model Pruning Medium Article.
Generally, my understanding is that by discarding weights that are nearly equal to zero, you can convert dense weight matrices into sparse ones while keeping roughly the same performance.
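As an illustration, PyTorch ships a pruning utility that does exactly this kind of magnitude-based zeroing. A small sketch (the toy model is a placeholder):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model; substitute your own.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% of weights with the smallest absolute value.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# The weight matrices are now sparse; re-evaluate to confirm accuracy holds.
sparsity = (model[0].weight == 0).float().mean().item()
print(f"Layer 0 sparsity: {sparsity:.0%}")
```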
You can train a smaller version of your model architecture to achieve performance close to your larger model’s: use the larger model as the teacher and the smaller model as the student, and train the student to match the teacher’s outputs. This should be seen as a last resort, after you’ve compiled and quantized your model, as it’s much more involved. Here’s a survey of the state of the art of knowledge distillation for LLMs: Knowledge Distillation Survey Paper.
Here’s another blog post on the subject: Knowledge Distillation.
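To show the core mechanic, here’s a hedged sketch of a single distillation training step: the student is pushed toward the teacher’s softened output distribution as well as the ground-truth labels. `teacher`, `student`, `inputs`, `labels`, and the temperature/weighting values are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def train_step(student, teacher, inputs, labels, optimizer):
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```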
See Qualcomm’s research team’s Aimet on GitHub, which walks you through their implementation of some of these methods and discusses other options (including Singular Value Decomposition of weight matrices).
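I won’t reproduce Aimet’s API here, but to illustrate the SVD idea it mentions: you can approximate a large weight matrix with two low-rank factors, replacing one big linear layer with two smaller ones. A generic sketch (not Aimet’s implementation):

```python
import torch
import torch.nn as nn

def svd_compress_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    W = layer.weight.data                 # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]          # (out_features, rank)
    V_r = Vh[:rank, :]                    # (rank, in_features)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

# e.g. a 512x512 layer (~262k weights) at rank 64 becomes ~65k weights.
compressed = svd_compress_linear(nn.Linear(512, 512), rank=64)
```

As with quantization and pruning, this is lossy: the lower the rank, the more accuracy you trade for size, so evaluate after compressing.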
In the ideal case of lossless compression, you achieve greater compression with no loss in model performance. Sometimes, though, you’re OK with a loss in accuracy (or another performance metric) in exchange for gains in model size or speed. In those cases, you lose fidelity (i.e. lossy compression).
Problem: For real-time applications, getting model latency right is crucial. For applications like self-driving cars, latency problems mean the project will fail to reach production; for lower-stakes products, they will diminish the user experience.
Problem-Specific Solutions:
As with C, you can compile your PyTorch code. According to the PyTorch docs, torch.compile
“effectively inlines [your model layers], producing a large [computation] graph to compile”.
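A minimal sketch of what that looks like in PyTorch 2.x (the toy model and input shape are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# torch.compile traces the model into a graph and generates optimized kernels.
# The first call pays the compilation cost; later calls should run with lower latency.
compiled_model = torch.compile(model)

x = torch.randn(8, 512)
with torch.no_grad():
    out = compiled_model(x)
```

Benchmark on your target hardware; the latency win depends heavily on the model and backend.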
I’m Sam Randall. I’ve worked as an iOS engineer embedding ML applications, and I’ve built data-driven tools that help data scientists improve their models’ performance. I try to keep abreast of the latest trends in software engineering, machine learning, mathematics, and their intersections. Model compression in edge software is very high on that list.
If you want to talk about how to compress your models, let’s talk. I’m eager to take on consulting work.