1. Implementing model pruning techniques
Let's explore another technique to reduce model size with minimal accuracy loss: model pruning.
2. When to use pruning?
Pruning is ideal for low-resource settings where memory and speed matter most—like mobile, embedded, or IoT devices.
It pairs well with dynamic quantization, further reducing model size while maintaining acceptable performance.
3. What is model pruning?
Pruning removes weights that contribute least to predictions, resulting in smaller and more efficient models. One common approach is L1 unstructured pruning, which zeros out weights with the smallest absolute values.
4. What is model pruning?
In this example, we prune 40% of the weights in model.fc using prune.l1_unstructured(). It ranks weights by absolute value and zeros out the smallest ones, creating a sparse layer that can use less memory and speed up inference.
As you can see here, before pruning, all weights had non-zero values. After applying 40% pruning, some of the smallest weights — like 0.05 and -0.02 — have been replaced with exact zeros. This sparsity reduces memory use and can boost speed.
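Here is a minimal sketch of that step. The TinyNet class and its layer sizes are placeholders standing in for the lesson's model and its model.fc layer; only the prune.l1_unstructured() call itself is the technique being shown.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical stand-in for the lesson's model with a fully connected layer "fc"
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128, 10)

model = TinyNet()

# L1 unstructured pruning: zero out the 40% of weights with the smallest absolute values
prune.l1_unstructured(model.fc, name="weight", amount=0.4)

print(model.fc.weight)  # roughly 40% of the entries are now exactly zero
```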
5. Understanding pruning masks
When we prune a model in PyTorch, the original weights aren't deleted—instead, a mask is applied. This binary mask determines which weights are active and which are ignored during inference. A value of 1 in the mask keeps the weight, while 0 effectively zeroes it out at runtime.
These masks make pruning reversible: until the mask is removed, the original weights are still stored alongside it.
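Continuing the sketch above, we can inspect what PyTorch actually stores on the pruned layer: the original weights live on as weight_orig, and the binary mask is registered as the weight_mask buffer.

```python
import torch

# After pruning, model.fc keeps the original weights plus a binary mask
print([name for name, _ in model.fc.named_parameters()])  # includes 'weight_orig'
print([name for name, _ in model.fc.named_buffers()])     # includes 'weight_mask'

# The effective weight is weight_orig * weight_mask, recomputed by a forward pre-hook
print(torch.equal(model.fc.weight, model.fc.weight_orig * model.fc.weight_mask))  # True
```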
6. Making pruning permanent
After pruning, the model still carries the pruning mask, the binary tensor of 0s and 1s that overlays the weight tensor and tracks which weights were zeroed and should be ignored during inference. To make pruning permanent, and to simplify deployment, we must remove this mask.
The prune.remove function finalizes the pruning by eliminating the mask and embedding the pruned weights directly into the model's parameters. This step is crucial before saving or exporting the model.
The example shows that the pruning-specific reparametrization (the weight_orig parameter and the weight_mask buffer) is removed, and the layer's weight becomes an ordinary parameter again, with the pruned weights now permanently zeroed.
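A short sketch of that finalization step, again continuing the example above:

```python
import torch.nn.utils.prune as prune

# Remove the reparametrization: weight_orig and weight_mask disappear,
# and the pruned (zeroed) values are written into a regular weight parameter
prune.remove(model.fc, "weight")

print([name for name, _ in model.fc.named_buffers()])  # 'weight_mask' is gone
print(model.fc.weight)  # standard parameter with zeros baked in, ready to save or export
```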
7. Evaluating pruning impact
As with quantization, it's essential to evaluate the trade-offs introduced by pruning. After pruning and finalizing the model, we assess its accuracy and inference speed.
In practice, we may see slight accuracy degradation, but significant reductions in model size and memory footprint. The key is to strike a balance between model compactness and predictive performance.
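As a rough illustration, an evaluation helper along these lines could be used to compare accuracy and inference time before and after pruning. The test_loader, the accuracy metric, and the timing approach are assumptions for this sketch, not the course's exact evaluation code.

```python
import time
import torch

def evaluate(model, test_loader):
    """Return (accuracy, inference_seconds) on a labeled test set."""
    model.eval()
    correct, total = 0, 0
    start = time.time()
    with torch.no_grad():
        for inputs, labels in test_loader:
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total, time.time() - start

# Hypothetical usage, assuming a DataLoader named test_loader exists:
# accuracy, seconds = evaluate(model, test_loader)

# Fraction of weights that are exactly zero in the pruned layer
sparsity = (model.fc.weight == 0).float().mean().item()
print(f"Sparsity of model.fc: {sparsity:.2f}")
```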
8. Let's practice!
In the upcoming exercises, we'll apply pruning to a trained model, evaluate its performance, and explore how pruning can be used in combination with other optimization techniques like quantization. Let's get hands-on!