Loss Scaling
Overview
Loss scaling is used to solve the gradient underflow problem caused by the small representation range of float16. The loss calculated in the forward pass is multiplied by a loss scale S, so the gradients computed in the backward pass are amplified by the same factor; the gradients are divided by S again before the weights are updated, so the update itself is unchanged. For some networks, mixed precision training does not converge unless loss scaling is enabled.
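Conceptually, the mechanism can be sketched in plain TensorFlow 1.x as follows. This is an illustrative sketch only, independent of the NPU APIs described below; the fixed scale of 2**12, the toy model, and the MomentumOptimizer are assumptions made for the example.
import tensorflow as tf

# Illustrative toy "model" with a single trainable variable.
w = tf.Variable(1.0, dtype=tf.float32)
loss = tf.square(w - 3.0)

loss_scale = 2.0 ** 12  # fixed loss scale S (illustrative value)
opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)

# Forward pass: multiply the loss by S so that small gradients do not underflow in float16.
scaled_loss = loss * loss_scale

# Backward pass: the gradients come out amplified by S ...
grads_and_vars = opt.compute_gradients(scaled_loss)

# ... so divide them by S again before the update, leaving the result mathematically unchanged.
unscaled = [(g / loss_scale if g is not None else None, v) for g, v in grads_and_vars]
train_op = opt.apply_gradients(unscaled)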
Using Loss Scaling
If the loss scaling function is used in the original network, you need to migrate LossScaleOptimizer to NPULossScaleOptimizer or NPUOptimizer. The following uses NPULossScaleOptimizer as an example.
- Static loss scaling: a fixed loss scaling factor is used throughout mixed precision training.
When using static loss scaling, instantiate a FixedLossScaleManager object to specify the loss scaling factor before creating NPULossScaleOptimizer.
- Dynamic loss scaling: the loss scaling factor is adjusted automatically based on the floating-point overflow status detected during mixed precision training.
When using dynamic loss scaling, instantiate an ExponentialUpdateLossScaleManager object to control how the loss scaling factor is adjusted before creating NPULossScaleOptimizer.
In addition, when NPULossScaleOptimizer is used in a distributed training scenario, set is_distributed to True so that loss scaling works with distributed training. The two manager configurations are sketched below.
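The following is a minimal sketch of the two configurations in isolation. The constructor arguments mirror those in the migration example below; the fixed scale of 2**15 and the choice of MomentumOptimizer are illustrative assumptions only.
import tensorflow as tf
from npu_bridge.estimator.npu.npu_loss_scale_optimizer import NPULossScaleOptimizer
from npu_bridge.estimator.npu.npu_loss_scale_manager import FixedLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_manager import ExponentialUpdateLossScaleManager

opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)  # the original optimizer (illustrative)

# Static loss scaling: a fixed factor chosen in advance (2**15 is an illustrative value).
fixed_manager = FixedLossScaleManager(loss_scale=2**15)

# Dynamic loss scaling: the factor is increased periodically and reduced when NaN/Inf is detected.
dynamic_manager = ExponentialUpdateLossScaleManager(init_loss_scale=2**32,
                                                    incr_every_n_steps=1000,
                                                    decr_every_n_nan_or_inf=2,
                                                    decr_ratio=0.5)

# Wrap the original optimizer with either manager; set is_distributed=True for distributed training.
opt = NPULossScaleOptimizer(opt, dynamic_manager)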
Original TensorFlow code:
if FLAGS.use_fp16 and (FLAGS.bert_loss_scale not in [None, -1]):
    opt_tmp = opt
    if FLAGS.bert_loss_scale == 0:
        loss_scale_manager = tf.contrib.mixed_precision.ExponentialUpdateLossScaleManager(
            init_loss_scale=2**32,
            incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2,
            decr_ratio=0.5)
    elif FLAGS.bert_loss_scale >= 1:
        loss_scale_manager = tf.contrib.mixed_precision.FixedLossScaleManager(
            loss_scale=FLAGS.bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % FLAGS.bert_loss_scale)
    opt = tf.contrib.mixed_precision.LossScaleOptimizer(opt_tmp, loss_scale_manager)
Code after migration:
from npu_bridge.estimator.npu.npu_loss_scale_optimizer import NPULossScaleOptimizer
from npu_bridge.estimator.npu.npu_loss_scale_manager import FixedLossScaleManager
from npu_bridge.estimator.npu.npu_loss_scale_manager import ExponentialUpdateLossScaleManager
from npu_bridge.estimator.npu.npu_optimizer import NPUDistributedOptimizer

if FLAGS.use_fp16 and (FLAGS.bert_loss_scale not in [None, -1]):
    opt_tmp = opt
    if FLAGS.bert_loss_scale == 0:
        loss_scale_manager = ExponentialUpdateLossScaleManager(
            init_loss_scale=2**32,
            incr_every_n_steps=1000,
            decr_every_n_nan_or_inf=2,
            decr_ratio=0.5)
    elif FLAGS.bert_loss_scale >= 1:
        loss_scale_manager = FixedLossScaleManager(loss_scale=FLAGS.bert_loss_scale)
    else:
        raise ValueError("Invalid loss scale: %d" % FLAGS.bert_loss_scale)
    # Check whether the number of devices is greater than 1. If yes, perform distributed training.
    if ops_adapter.size() > 1:
        opt_tmp = NPUDistributedOptimizer(opt_tmp)
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager, is_distributed=True)
    else:
        opt = NPULossScaleOptimizer(opt_tmp, loss_scale_manager)
Updating the Global Step
After the loss scaling function is enabled, a training step in which a loss scaling overflow occurs must be discarded, that is, the global step must not be incremented for that step. Whether the training script needs to be modified depends on where the optimizer updates the global step.
- In most cases the global step is updated in apply_gradients, for example, by the tf.train.MomentumOptimizer used on the ResNet-50HC network. Because apply_gradients is not called when an overflow occurs, the global step is not incremented for that step and the script does not need to be modified.
- For the BERT network, however, the global step update, together with the overflow judgment logic, is implemented in create_optimizer. In this case, the global step update needs to be moved into the optimizer's apply_gradients function. The following is a migration example:
In the original TensorFlow code, the global step is updated in create_optimizer, including the judgment logic.
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None, manual_fp16=False,
                     use_fp16=False, num_accumulation_steps=1, optimizer_type="adam",
                     allreduce_post_accumulation=False):
    ...
    if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
        new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
    else:
        new_global_step = global_step + 1
    new_global_step = tf.identity(new_global_step, name='step_update')
    train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op
During the migration to the Ascend platform, move the global step update into the optimizer as follows:
- Comment out the global step update logic implemented in create_optimizer in the script.
def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, hvd=None, manual_fp16=False,
                     use_fp16=False, num_accumulation_steps=1, optimizer_type="adam",
                     allreduce_post_accumulation=False):
    ...
    #if tf.flags.FLAGS.npu_bert_clip_by_global_norm:
    #    new_global_step = tf.cond(all_are_finite, lambda: global_step + 1, lambda: global_step)
    #else:
    #    new_global_step = global_step + 1
    #new_global_step = tf.identity(new_global_step, name='step_update')
    #train_op = tf.group(train_op, [global_step.assign(new_global_step)])
    return train_op
- In the AdamWeightDecayOptimizer and LAMBOptimizer classes, add the logic for updating the global step before the last return statement of the apply_gradients function. The apply_gradients function is called only when no overflow is found in the status check during loss scaling, so the global step is incremented only in steps without overflow.
def apply_gradients(self, grads_and_vars, global_step=None, name=None, manual_fp16=False):
    assignments = []
    for (grad, param) in grads_and_vars:
        ...
    new_global_step = global_step + 1
    new_global_step = tf.identity(new_global_step, name='step_update')
    assignments.extend([global_step.assign(new_global_step)])
    return tf.group(*assignments, name=name)