
Microsoft Research Lab – Asia

Enabling linear acceleration and lossless performance for large-scale deep learning training: a BMUF-based Adam optimizer parallelization practice


As an adaptive learning rate stochastic optimization method, Adam has been widely used in the field of deep learning since it was first proposed in 2014. In order to improve its training efficiency when applied to large-scale tasks, Adam is often combined with the synchronous stochastic gradient (SSG) technique to run in parallel on multiple workers. In this article, we refer to this method as “Sync-Adam.”

Essentially, Sync-Adam speeds up training by distributing the gradient computation for the samples within a minibatch across multiple workers, so communication occurs very frequently. As the number of parallel workers increases, the number of samples within a minibatch grows proportionally, which often hurts final model performance. To address the poor scalability of SSG-based Adam, we plugged Adam into the Blockwise Model-Update Filtering (BMUF) framework.

BMUF is a communication-efficient distributed optimization framework proposed by researchers from MSRA’s speech group in 2016. The framework periodically synchronizes model-update information among parallel workers and combines it with historical update information to improve the global model. Compared with SSG-based methods, BMUF achieves low communication frequency, near-linear speedup, and very little model performance degradation. Given these advantages, BMUF has been widely used in industry to train large-scale deep learning models.

We used BMUF to parallelize Adam and evaluated it on Microsoft’s large-scale OCR and speech product tasks. Experimental results show that BMUF-Adam delivers almost linear speedup without performance degradation with up to 64 workers on a large-scale OCR task, and similar results are achieved on a large vocabulary continuous speech recognition (LVCSR) task with up to 32 workers.

In the rest of this article, we describe how we empower Adam with BMUF to achieve these results.

Review of BMUF

In the BMUF-based training framework, we have $N$ parallel workers; a worker can be a GPU card, a compute node, etc. A block of training data is partitioned into $N$ splits, each containing $\tau$ mini-batches. Starting from a common initial model $\theta_{t-\tau}^{(\mathrm{init})}$, all workers update their local models on their respective data splits for $\tau$ steps in parallel, yielding $N$ local models $\{\theta_{t,1},\theta_{t,2},\dots,\theta_{t,N}\}$. This procedure is called “intra-block parallel optimization” (IBPO). Instead of treating the average of the local models, $\bar{\theta}_t$, as the updated global model, BMUF introduces a block momentum to leverage historical update information and obtain a better global model as follows:
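(reconstructed here in this article’s notation from the BMUF formulation in [2], where $n$ indexes data blocks and $t$ indexes mini-batches)

$$\bar{\theta}_t=\frac{1}{N}\sum_{i=1}^{N}\theta_{t,i},\qquad \Delta_n=\eta\,\Delta_{n-1}+\zeta\,\big(\bar{\theta}_t-\theta_{t-\tau}^{(\mathrm{init})}\big),\qquad \theta_t=\theta_{t-\tau}^{(\mathrm{init})}+\Delta_n,$$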

where $\eta$ is the block momentum and $\zeta$ is the block learning rate; we set $\zeta=1$ in this article, which is common practice for BMUF. As discussed in [2], block momentum compensates for each mini-batch’s otherwise inadequate contribution to the final model update caused by the averaging operation, which improves model performance. We define a new variable $\rho_n$ to represent the number of equivalent mini-batches required to obtain $\Delta_n$.
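Assuming the averaged intra-block update $\bar{\theta}_t-\theta_{t-\tau}^{(\mathrm{init})}$ corresponds to $\tau$ equivalent mini-batches (a reconstruction consistent with the limit stated next), $\rho_n$ satisfies

$$\rho_n=\eta\,\rho_{n-1}+\zeta\,\tau,\qquad \rho_0=0,$$

and therefore

$$\rho_n=\zeta\,\tau\,\frac{1-\eta^{\,n}}{1-\eta}\ \longrightarrow\ \frac{\zeta\,\tau}{1-\eta}\quad(n\to\infty).$$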

If $\eta=1-1/N$, as suggested in [2], then $\lim_{n\to\infty}\rho_n=N\tau$, which equals the number of mini-batches in a data block. This deduction shows that, in the limit, $\Delta_n$ can simulate the model update resulting from serially processing a data block of $N\tau$ mini-batches. This conclusion relies on the assumption that $\bar{\theta}_t-\theta_{t-\tau}^{(\mathrm{init})}$ is stationary.

In this article, we use the Nesterov block momentum technique [2] to obtain the initial model for the next IBPO operation, as sketched below.
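Since $\theta_t=\theta_{t-\tau}^{(\mathrm{init})}+\Delta_n$ and Nesterov block momentum applies the block momentum one block ahead (a rendering of the scheme in [2] in this article’s notation), the initial model for the next block is

$$\theta_t^{(\mathrm{init})}=\theta_t+\eta\,\Delta_n=\theta_{t-\tau}^{(\mathrm{init})}+(1+\eta)\,\Delta_n.$$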

Review of Adam

Adam is an adaptive-learning-rate stochastic optimizer whose learning rates are determined by the first and second moments of the stochastic gradient as follows:
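(the standard update from [1], in bias-corrected form)

$$\theta_t=\theta_{t-1}-\alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon},\qquad \hat{m}_t=\frac{m_t}{1-\beta_1^{\,t}},\quad \hat{v}_t=\frac{v_t}{1-\beta_2^{\,t}},$$

with $m_t$ and $v_t$ estimating $E[g_t]$ and $E[g_t\odot g_t]$, respectively,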

where $\odot$ represents element-wise multiplication and $g_t$ is the stochastic gradient of the $t$-th mini-batch. Based on the assumption that $E[g_t]$ and $E[g_t\odot g_t]$ are stationary, Adam uses exponential moving averages to estimate these two moments as follows:
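(the standard moment estimates from [1])

$$m_t=\beta_1\,m_{t-1}+(1-\beta_1)\,g_t,\qquad v_t=\beta_2\,v_{t-1}+(1-\beta_2)\,g_t\odot g_t,$$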


where $\beta_1,\beta_2\in[0,1)$ are the exponential decay rates of the two moving averages, $\alpha$ is the step size, and $\epsilon$ is a small constant for numerical stability.
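As a concrete illustration, a single Adam update can be sketched in a few lines of Python. This is a minimal reference sketch, not the production implementation; the function name `adam_step` and its default hyperparameters are ours:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters `theta` given the minibatch gradient `grad`.

    `m` and `v` are the running 1st/2nd-moment estimates and `t` is the 1-based
    step index used for bias correction. Returns the updated (theta, m, v).
    """
    m = beta1 * m + (1.0 - beta1) * grad           # EMA of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad    # EMA of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                 # bias-corrected 1st moment
    v_hat = v / (1.0 - beta2 ** t)                 # bias-corrected 2nd moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```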

BMUF-Adam

Since BMUF is a general distributed optimization framework, we can plug Adam into it as a local optimizer. However, according to our review of BMUF and Adam above, if we plug Adam into BMUF directly without any change, Adam’s 1st- and 2nd-moment estimates will be mismatched with the model parameters. Assume we set:
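(a reading of the naive averaging strategy discussed below: the local optimizer states are simply averaged along with the model)

$$m_t^{(\mathrm{init})}=\bar{m}_t=\frac{1}{N}\sum_{i=1}^{N}m_{t,i},\qquad v_t^{(\mathrm{init})}=\bar{v}_t=\frac{1}{N}\sum_{i=1}^{N}v_{t,i}.$$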

Obviously, due to the existence of $\eta\Delta_n$, the averaged 1st and 2nd moments become stale with respect to the initial model parameters of the next IBPO operation. Since the number of equivalent minibatches corresponding to $\eta\Delta_n$ is $\eta\rho_n$, we believe that an $m_t^{(\mathrm{init})}$ and $v_t^{(\mathrm{init})}$ compatible with $\theta_t^{(\mathrm{init})}$ can be obtained by processing $\eta\rho_n$ minibatches starting from $\bar{m}_t$ and $\bar{v}_t$.

According to the 1st-moment update formula, we have:
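(reconstructed by unrolling the exponential moving average above over $k$ additional mini-batches)

$$m_{t+k}=\beta_1^{\,k}\,m_t+(1-\beta_1)\sum_{i=1}^{k}\beta_1^{\,k-i}\,g_{t+i}.$$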

Combined with the stationarity assumption on $E[g_t]$, we have:
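(taking expectations of the unrolled update; the geometric sum collapses to $1-\beta_1^{\,k}$)

$$E[m_{t+k}]=\beta_1^{\,k}\,m_t+\big(1-\beta_1^{\,k}\big)\,E\big[g^{(n)}\big],$$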

where $E[g^{(n)}]$ represents the expectation of the minibatch stochastic gradient in the $n$-th IBPO. Therefore,
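(applying the relation above with $k=\eta\rho_n$ equivalent mini-batches, starting from the averaged moment $\bar{m}_t$; a reconstruction of the elided step)

$$m_t^{(\mathrm{init})}=\beta_1^{\,\eta\rho_n}\,\bar{m}_t+\big(1-\beta_1^{\,\eta\rho_n}\big)\,E\big[g^{(n)}\big],$$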

because, under the same stationarity assumption, $E[g^{(n)}]$ can be estimated from quantities already available on the workers, namely $\bar{m}_t$ and $m_{t-\tau}^{(\mathrm{init})}$. Substituting this estimate gives the final expression for $m_t^{(\mathrm{init})}$; see [3] for the complete derivation.

According to the formulations of $\bar{m}_t$ and $m_t^{(\mathrm{init})}$, it is easy to see that when $\tau$ is small and $N$ is large, $m_{t-\tau}^{(\mathrm{init})}$ contributes too much to $\bar{m}_t$, which results in a more serious inconsistency between $\theta_t^{(\mathrm{init})}$ and $\bar{m}_t$; this phenomenon is verified in our experiments. With the introduction of the equivalent-minibatch correction, the contribution of the stale $m_{t-\tau}^{(\mathrm{init})}$ to $m_t^{(\mathrm{init})}$ can be reduced significantly, which improves model performance. Meanwhile, using a small $\beta_1$ can also reduce the contribution of $m_{t-\tau}^{(\mathrm{init})}$. The same analysis applies to the 2nd moment. Finally, we obtain BMUF-Adam as follows:
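The exact listing is given in [3]; the following minimal Python sketch only illustrates the overall structure described above. Function and variable names are ours, the moment correction is simplified to the $\beta^{\eta\rho_n}$ discounting and omits the gradient-expectation term, and communication/all-reduce details are left out. It reuses the `adam_step` sketch from the previous section:

```python
import numpy as np

def bmuf_adam_train(theta, data_blocks, num_workers, tau,
                    block_lr=1.0, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Structural sketch of BMUF-Adam: local Adam steps (IBPO), model/state
    averaging, blockwise model-update filtering, and Nesterov initialization.

    `data_blocks` yields blocks; each block is a list of `num_workers` splits,
    and each split is a list of `tau` callables mapping parameters to a
    minibatch gradient.
    """
    eta = 1.0 - 1.0 / num_workers      # block momentum, as suggested in [2]
    delta = np.zeros_like(theta)       # filtered block-level model update
    m_init = np.zeros_like(theta)      # global 1st-moment state
    v_init = np.zeros_like(theta)      # global 2nd-moment state
    rho = 0.0                          # equivalent number of mini-batches
    t = 0                              # global step counter for bias correction
    theta_global = theta.copy()        # current global model

    for block in data_blocks:
        local_states = []
        for split in block:            # in practice this loop runs in parallel
            th, m, v = theta.copy(), m_init.copy(), v_init.copy()
            for k, grad_fn in enumerate(split, start=1):
                th, m, v = adam_step(th, m, v, grad_fn(th), t + k,
                                     alpha, beta1, beta2, eps)
            local_states.append((th, m, v))
        t += tau

        # Average local models and optimizer states across workers.
        theta_bar = np.mean([s[0] for s in local_states], axis=0)
        m_bar = np.mean([s[1] for s in local_states], axis=0)
        v_bar = np.mean([s[2] for s in local_states], axis=0)

        # Blockwise model-update filtering with Nesterov block momentum.
        delta = eta * delta + block_lr * (theta_bar - theta)
        theta_global = theta + delta
        theta = theta_global + eta * delta        # initial model for next block

        # Simplified moment correction: discount the averaged moments as if
        # eta * rho equivalent mini-batches had been processed (full formula
        # in [3]).
        rho = eta * rho + block_lr * tau
        m_init = (beta1 ** (eta * rho)) * m_bar
        v_init = (beta2 ** (eta * rho)) * v_bar
    return theta_global
```

In a real multi-worker setup, the inner loop over splits runs on separate workers and the averaging is an all-reduce; the sketch runs everything in a single process for clarity.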

Experimental results

We first verified our method on Microsoft’s large-scale English OCR task, comparing different moment-estimation strategies when Adam is plugged into BMUF. The experimental results are listed in Table 1 ($\tau=8$ in these experiments):


Table 1: Comparison of moment-estimation strategies on the large-scale English OCR task ($\tau=8$)

We find that when the averaging strategy is used, model performance degradation becomes increasingly serious as the number of parallel workers grows. The proposed strategy alleviates the degradation significantly, and combined with a small $\beta_1$, model performance is further improved. Next, we compared Sync-Adam with BMUF-Adam on this task. Experimental results are listed in Table 2:


Table 2: a comparison of Sync-Adam and BMUF-Adam

As the number of parallel workers increases, Sync-Adam suffers increasingly serious performance degradation and its speedup ratio worsens. In contrast, BMUF-Adam achieves almost linear speedup with little performance degradation.

We also verified BMUF-Adam on the LVCSR task with 6000 hours of Microsoft product data, and the results are listed in Table 3:


Table 3: experimental results on the speech recognition task

Again, BMUF-Adam can achieve almost linear speedup with little performance degradation and shows excellent scalability.

Conclusion

In this article, we used BMUF to parallelize the widely used Adam algorithm. Experimental results show that, compared with the traditional SSG-based method, BMUF-Adam achieves faster training, better model performance, and better scalability. This algorithm has been applied to several Microsoft products, and we welcome everyone to give it a try.

More details:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9052983

References:
[1] D. P. Kingma, J. Ba, “Adam: a method for stochastic optimization,” Proc. ICLR-2015, arXiv:1412.6980.
[2] K. Chen, Q. Huo, “Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering,” Proc. ICASSP-2016, pp.5880-5884.
[3] K. Chen, H. Ding, Q. Huo, “Parallelizing Adam optimizer with blockwise model-update filtering,” Proc. ICASSP-2020, pp.3027-3031.