Why Pruning?
As deep learning’s popularity continues to explode, an increasing number of companies have begun to rely on cutting-edge deep learning algorithms to deliver core product functionality to users. However, developing and deploying large neural networks is an extremely resource-intensive process that quickly becomes a core business cost for these companies. Furthermore, many applications of deep learning, ranging from mobile phone applications to robotic control systems, require rapid model inference. This is very difficult to accomplish with large and unwieldy models.

Ongoing research in neural network pruning aims to address many of these issues by exploring techniques for removing extraneous parameters from neural networks without compromising their accuracy. Smaller networks naturally require less space and are faster and cheaper to both train and deploy, making this line of work particularly important for those in industry actively seeking to reduce the costs surrounding model development and deployment.

Deep Learning 101
Before we dive into things, let’s quickly cover some important terms and concepts that will be referred to in subsequent sections. Deep learning models, also known as neural networks, are a class of machine learning algorithms that have found immense success in recent years. For simplicity’s sake, we will focus on supervised learning tasks, where we are provided with a set of datapoints and a corresponding set of labels describing these datapoints. Supervised learning with neural networks has been most successful in the domains of vision (image data) and language (text data). The goal is to learn a function, \(f\), that maps a given datapoint (such as an image or piece of text) to its corresponding label. \(f\) is parameterized by a neural network, which is defined by composing many matrix multiplications and subsequent nonlinearities on top of each other. For example, we can define \(f(x)\) for an input \(x\) and label \(y\) as follows, where \(w_1\) and \(w_2\) are coefficient matrices known as weights and \(h_1\) and \(h_2\) are nonlinear functions that are applied elementwise to their inputs: \(f(x) = h_2(w_2(h_1(w_1(x)))\). This can also be viewed as passing an input through consecutive layers of interconnected neurons where the connections between neurons are quantified using the weight matrices (\(w_1\) and \(w_2\)). Training a neural network refers to the process of iteratively updating the network’s weights so that its predictions on the given datapoints closely match the provided data labels. Neural networks can be trained using a variety of different optimization algorithms that provide instructions on how to update the weights in order to bestmatch the network outputs to the labels. Model inference refers to the process of obtaining a trained network’s predictions on new data in production.

The Lottery Ticket Hypothesis
Most of the recent research in model pruning was kickstarted by Jonathan Frankle and Michael Carbin’s 2018 paper titled The Lottery Ticket Hypothesis: Finding Sparse, Trainable, Neural Networks. The titular Lottery Ticket Hypothesis proposes that we view large neural networks as ensembles of many smaller subnetworks, a few of which will perform extremely well on the task at hand. However, there is no way to identify these winning subnetworks apriori, so we use really large networks to get more shots on goal and increase the likelihood that one of the component subnetworks is a winner. If we can find these winning subnetworks, or “lottery tickets”, then we can discard, or “prune” the rest of the network without sacrificing the accuracy. In practice, this is done using an iterative process with the following steps:

  1. Initialize network weights and train to convergence on the given task
  2. Identify smallest weights and prune them by setting them to zero
  3. Repeat steps 1 and 2 with the remaining weights until the desired proportion of weights have been pruned

While this approach does succeed in identifying winning subnetworks that amazingly manage to achieve similar performance to the original network, it does not offer any true cost savings for a few reasons. First of all, the iterative pruning process is actually more time-intensive than the standard training process for a neural network because it requires repeated re-initialization of the network weights. This is equivalent to re-training the network from scratch many times until the desired subnetwork size is reached. The hope of course is that the more expensive training leads to large savings during inference time. However, this turns out not to be the case. Frankle’s pruning method does not actively remove layers or weights from a network. Instead, it prunes weights by simply setting them to zero. While this does produce some space savings due to the storage of zeros instead of large floating point values, it does not lead to any speed improvement during inference. From a hardware standpoint, zero weights and non-zero weights look exactly the same. All the matrix multiplications involving the zero weights still have to be executed by the hardware because it has no way of determining which weights are zero and which are not ahead of time. Thus, networks pruned using this method do not offer any tangible reduction in the time and GPU compute required to perform inference.

A few follow up works from Frankle et al. (1, 2) along with another paper by Zhou et al. determine that while it is not necessary to re-initialize the weights completely during pruning, it is necessary to reset them their state in one of the earliest iterations of training. Setting all the pruned weights to zero appears to be crucial as well. While these papers do not offer any concrete speed improvements for either training or inference, they inform a variety of future research directions that are extremely promising. For example, research by Morcos et al. has found that lottery tickets learned for one dataset using a specific optimization algorithm can also transfer to other datasets and optimization algorithms. This suggests that these subnetworks could be encoding specific structure and connectivity characteristics that could be invariant to the particulars of a dataset or optimization procedure. Meanwhile, Yu et al. find that Frankle’s work extends beyond vision tasks and is effective on language and reinforcement learning tasks as well. The real hope here is that continued study of lottery tickets and their properties will eventually give us a way to identify a few golden ticket subnetworks whose structure and initial weight values work extremely well across a wide array of datasets. We can then directly initialize these subnetworks and realize significant savings during both training and inference time. Though we are not yet at this point, papers that have been published over the past few months seem to be moving in the right direction.

One particularly promising recent work is Tanaka et al.’s November 2020 paper, which details a method for pruning without using any training data at all. This iterative process repeatedly prunes weights that exhibit a low “synaptic saliency”, a measure that quantifies how much each weight is contributing to the overall network output. The lack of dependence on training data allows this pruning method to execute very quickly, and Tanaka’s experiments indicate that it does outperform other methods that attempt pruning at the initialization stage prior to training. However, it still does not beat methods such as Frankle’s that utilize extensive training throughout the pruning process, and since it also prunes by setting weights to zero, it fails to produce tangible savings at training or inference time.

(Note: Robert Lange’s blog post on the lottery ticket hypothesis covers each of these papers in much more technical detail.)

Going from Unstructured to Structured Pruning
All of the pruning methods discussed thus far have been unstructured, since they prune individual network weights by zeroing them without any regard for where in the network each weight is located. On the other hand, structured pruning methods seek to systematically eliminate entire network layers or replace layers with compressed versions of themselves. Since these methods actually reduce the total number of weights in the network, they offer the potential for real speedups during inference time. One older structured approach proposed by Zhang et al. is based on a common dimensionality reduction technique known as principle components analysis, or PCA. As we mentioned previously, each layer of a network applies a matrix multiplication and subsequent elementwise nonlinearity to the input. This process of performing a matrix multiplication with the input is known as a linear transformation because it takes some linear combination of all the inputs and adds them together to obtain an output. Now, consider a 100-neuron layer. We can think of this layer as performing a linear transformation on its inputs that then results in a 100-dimensional output. PCA offers a way to capture most of the information present in this output using far fewer dimensions. Let us assume we can represent the 100-dimensional output using just 25 dimensions. We can then replace our 100-neuron layer with a much smaller 25-neuron layer which directly outputs this compressed 25-dimensional output. Zhang et al. use this technique to prune trained networks when provided with a small subset of training data to compute the layerwise outputs described above.

This method offers inference speedups of up to 4x while maintaining a high degree of accuracy. From a practical standpoint, this is also far simpler to integrate into existing deep learning development and deployment pipelines compared to a method like Frankle’s which would require a complete re-implementation of model training code. Companies can implement this algorithm as a post-training step without modifying their training or inference code at all. It is important to note that while the paper suggests that this pruning method offers a very favorable accuracy vs. speed tradeoff, it has only been tested on a specific image classification task with one particular network architecture. Those seeking to use this in practice will need to make a deployment-time decision on whether to use the pruned model or the original model based on the accuracy vs. speed tradeoff that it offers on their particular task. There is a good chance that it will be infeasible in settings that require extremely high accuracy on more complex tasks such as image segmentation or object detection due to the pruned models exhibiting a higher accuracy dropoff.

Another interesting approach for structured pruning, proposed by Louizos et al., makes use of a Bayesian perspective on neural networks. Instead of viewing each network weight as a fixed value, we view it as a probability distribution. Over the course of training, each weight’s distribution is updated based on the data it sees. When we want to compute the network’s output for a new training example, we sample each weight’s distribution and use the sampled value to perform the computation. Let us quickly consider the case of a weight with a high variance distribution after training. If we pass the same datapoint through the network multiple times, the sampled value of thisweight will vary widely. Despite this noisiness in the weight’s exact value, the network is able to perform its task effectively, suggesting that this weight minimally influences the network’s output. Louizos et al. extend this idea to learn distributions over entire layers instead of individual weights. Layers with high variance distributions after training are considered unimportant to the network’s performance and are pruned.

Louizos’s experiments show that this method yields an 8x improvement on inference time and a 3x reduction in GPU compute costs during inference. Much like Zhang et al.’s work however, it is only tested on a single image classification task with a few different network architectures. Using this approach in an industry setting would first require extensive research and development in order to confirm that it maintains high accuracy for the task at hand. Additionally, Bayesian neural network training is far more time-intensive than traditional training, offsetting much of the inference speedup that this method offers.

The final structured pruning approach that we will cover makes use of the theory that the early stages of training are focused on learning connectivity patterns between layers which then get fixed in the later stages. You et al.’s 2020 paper views the output of convolutional neural networks as a weighted combination of all the filters present in the network (while the details of convolutional networks are not covered here for the sake of brevity, a comprehensive introduction can be found here). This view is more or less analogous to trying to learn a coefficient for each neuron in a neural network that quantifies its impact on the network’s final output. You et al. add an extra term to their training objective that penalizes the network for having high coefficients for too many neurons. As a result, the network quickly assigns zero coefficients to many of the neurons during training. These neurons can then be pruned. You’s iterative pruning strategy is quite similar to the one proposed by Frankle, but instead of training a network to convergence during every iteration, they only need to train around 20% of the way to convergence and can also do so using low-precision computations.

The experiments conducted in this study suggest that the proposed pruning method offers a 6-10x energy savings over Frankle’s strategy. Despite using a structured pruning approach, the paper does not evaluate the pruned models’ performance during inference. There is no reason to believe that inference speedups comparable to Zhang and Louizos’s studies cannot be achieved, but this must be empirically verified. Of course, further work must also be done to test this method’s efficacy with larger models and more complex tasks outside the scope of the experiments detailed in the paper. Overall, this work appears to be the closest to offering true cost benefits during both training and inference time.

Developments in model pruning research over the past few years have resulted in numerous promising avenues for reducing the immense costs associated with training and deploying neural networks. Work around unstructured pruning, while yet to offer tangible cost improvements, has provided numerous insights into the properties and training dynamics of neural networks. Continued work in this area will hopefully lead to algorithms for the systematic creation of small networks with architectures and initial weight values that perform extremely well on a given task or dataset. Meanwhile, research in structured pruning already offers tangible speedups, primarily in the model inference stage. Industry teams seeking to use these methods must however invest some time into research & development in order to ensure that they maintain high levels of accuracy on the specific dataset and task at hand.

Many companies that use deep learning, particularly AI-first companies who continually develop and deploy large neural networks, have already begun to invest resources into working on this problem due to the large costs they currently incur from both training and inference. Model pruning will continue to be an area of emphasis in the near future because effective pruning schemes will ultimately go a long way towards enabling these companies to achieve favorable margins at scale with deep learning-based products.