Iteration Offloading
Overview
iterations_per_loop is the number of training iterations performed on the device side per sess.run() call. Training runs on the device for the specified number of iterations (iterations_per_loop), and only then is the result returned to the host. This parameter reduces unnecessary interaction between the host and the device and thereby shortens training time. Note the following:
- The default value of iterations_per_loop is 1, and the total number of training iterations must be an integer multiple of iterations_per_loop.
- If iterations_per_loop is greater than 1, save_checkpoints_steps must be a positive integer multiple of iterations_per_loop; otherwise, checkpoints are not saved at the interval defined by save_checkpoints_steps. In addition, when iterations_per_loop is greater than 1, data is not saved at the intervals defined by save_summary_steps and log_step_count_steps. For details, see Log and Summary Operators. (A configuration sketch illustrating these constraints follows this list.)
- In mixed computing mode (mix_compile_mode is set to True), this parameter must be set to 1.
- The getNext operator can be scheduled to the device side only when enable_data_pre_proc is enabled and the tf.data.make_initializable_iterator() iterator is used. Only in this case does setting iterations_per_loop to a value greater than 1 take effect.
When enable_data_pre_proc is disabled, or another data preprocessing iterator such as tf.data.make_one_shot_iterator() is used, the getNext operator is not scheduled to the device side, and a value of iterations_per_loop greater than 1 does not take effect.
In Estimator mode, if input_fn returns a dataset, tf.data.make_initializable_iterator() is called implicitly during the internal processing of Estimator.
- During network commissioning, you are advised to set iterations_per_loop to 1 so that logs are printed for every iteration. After the network runs correctly, increase iterations_per_loop to shorten the training time.
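The following minimal sketch (the values are illustrative and not taken from the original example) shows a configuration that satisfies the constraints above: save_checkpoints_steps is a positive integer multiple of iterations_per_loop, and the total number of training steps is also a multiple of it.

from npu_bridge.estimator.npu.npu_config import NPURunConfig

# Illustrative values only.
iterations_per_loop = 100
config = NPURunConfig(
    iterations_per_loop=iterations_per_loop,
    # save_checkpoints_steps must be a positive integer multiple of iterations_per_loop.
    save_checkpoints_steps=10 * iterations_per_loop)

# The total number of training steps must also be an integer multiple of iterations_per_loop.
train_steps = 100 * iterations_per_loop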
Setting iterations_per_loop with Estimator
In Estimator mode, configure this parameter by setting iterations_per_loop in NPURunConfig as follows.
import tensorflow as tf
from npu_bridge.estimator.npu.npu_config import NPURunConfig
from npu_bridge.estimator import npu_ops

session_config = tf.ConfigProto()
config = NPURunConfig(session_config=session_config, iterations_per_loop=10)
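The resulting NPURunConfig is then used like an ordinary RunConfig. The sketch below is illustrative: model_fn and input_fn are assumed to be defined elsewhere, and max_steps is chosen as an integer multiple of iterations_per_loop.

from npu_bridge.estimator.npu.npu_estimator import NPUEstimator

# model_fn and input_fn are assumed to be defined elsewhere (hypothetical here).
estimator = NPUEstimator(model_fn=model_fn, config=config, model_dir="/tmp/model")
# max_steps (1000) is an integer multiple of iterations_per_loop (10).
estimator.train(input_fn=input_fn, max_steps=1000)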
Setting iterations_per_loop with sess.run
In sess.run mode, configure iterations_per_loop by calling set_iteration_per_loop, and reduce the number of sess.run() calls to the original number of calls divided by the value of iterations_per_loop. The following example shows how to configure iterations_per_loop.
from __future__ import print_function
import input_data
from npu_bridge.estimator.npu import util
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

mnist = input_data.read_data_sets("/test/", one_hot=True)

import tensorflow as tf

# Set the model.
# Set the learning rate.
learning_rate = 0.01
# Set the number of training epochs.
training_epochs = 10
# Set the batch size.
batch_size = 100
# Set the number of iterations after which the loss is displayed once.
display_step = 1

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

# Set the model parameters.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Build the model.
pred = tf.nn.softmax(tf.matmul(x, W) + b)

# Define the loss function: cross entropy.
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), reduction_indices=1))

# Optimize with gradient descent.
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initialize all variables.
init = tf.global_variables_initializer()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # Perform training on the Ascend AI Processor.
custom_op.parameter_map["mix_compile_mode"].b = False  # Disable mixed computing (disabled by default).
custom_op.parameter_map["iterations_per_loop"].i = 10  # Enable iteration offloading. Must be equal to the value passed to set_iteration_per_loop.
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # Disable remapping.

# Train the model.
with tf.Session(config=config) as sess:
    sess.run(init)
    # Set the number of iterations per loop to 10 in sess.run mode.
    train_op = util.set_iteration_per_loop(sess, optimizer, 10)

    for epoch in range(training_epochs):
        avg_cost = 0
        total_batch = int(mnist.train.num_examples / batch_size)
        # Each sess.run() call now performs 10 iterations on the device,
        # so the number of calls is divided by iterations_per_loop.
        for i in range(int(total_batch / 10)):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            _, c = sess.run([train_op, cost], feed_dict={x: batch_xs, y: batch_ys})
            avg_cost += c / total_batch
The preceding API modifies the graph. If the graph cannot be modified (for example, because it is frozen or the session is created through tf.train.Supervisor), set_iteration_per_loop cannot be used to set the number of loops and iterations per loop. In this case, use the create_iteration_per_loop_var and load_iteration_per_loop_var APIs to set the number of iterations per loop. The following is an example.
from __future__ import print_function
import input_data
from npu_bridge.estimator.npu import util
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

mnist = input_data.read_data_sets("/test/", one_hot=True)

import tensorflow as tf

# Set the model.
# Set the learning rate.
learning_rate = 0.01
# Set the number of training epochs.
training_epochs = 10
# Set the batch size.
batch_size = 100
# Set the number of iterations after which the loss is displayed once.
display_step = 1

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])

# Set the model parameters.
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# Build the model.
pred = tf.nn.softmax(tf.matmul(x, W) + b)

# Define the loss function: cross entropy.
cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(pred), reduction_indices=1))

# Optimize with gradient descent.
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Initialize all variables.
init = tf.global_variables_initializer()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # Perform training on the Ascend AI Processor.
custom_op.parameter_map["mix_compile_mode"].b = False  # Disable mixed computing (disabled by default).
custom_op.parameter_map["iterations_per_loop"].i = 10  # Enables iteration offloading. Must be equal to the value passed to load_iteration_per_loop_var.
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # Disable remapping.

# Train the model.
with tf.Session(config=config) as sess:
    sess.run(init)
    # Set the number of iterations per loop to 10 in sess.run mode.
    iteration = util.IterationPerLoop()
    train_op = iteration.create_iteration_per_loop_var(optimizer)  # Modify the graph.
    tf.train.Supervisor(logdir="/home/xxxx", init_op=init)  # Freeze the graph.
    iteration.load_iteration_per_loop_var(sess, 10)  # Set the number of iterations per loop.

    for epoch in range(training_epochs):
        avg_cost = 0
        total_batch = int(mnist.train.num_examples / batch_size)
        # Each sess.run() call now performs 10 iterations on the device,
        # so the number of calls is divided by iterations_per_loop.
        for i in range(int(total_batch / 10)):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)
            _, c = sess.run([train_op, cost], feed_dict={x: batch_xs, y: batch_ys})
            avg_cost += c / total_batch
Checking Whether iterations_per_loop Takes Effect
Search "Insert op success" in the host log file to check whether iterations_per_loop takes effect, as shown in Figure 5-1.
Figure 5-2 shows the related log information when the value of iterations_per_loop is set to 1 (the default value).
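The check can also be scripted. The following sketch is only an illustration and uses a placeholder log directory; replace it with the actual host log path in your environment.

import glob

log_dir = "/path/to/host/logs"  # placeholder: replace with the actual host log directory
for log_file in glob.glob(log_dir + "/**/*.log", recursive=True):
    with open(log_file, errors="ignore") as f:
        for line in f:
            if "Insert op success" in line:
                print(log_file + ": " + line.strip())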