Data Preprocessing Performance Improvement
Balancing the Schedule of Data Preprocessing Operators
When the data volume is large, data preprocessing on the host may fail to keep pace with the compute throughput of the downstream layers in the graph and become a training bottleneck. In this case, you can balance the data preprocessing operators between the host and the device to improve the training performance.
Whether a data preprocessing operator is scheduled to the device side is determined as follows: traverse the pipeline from the last operator upward, offloading every operator that can run on the device, and stop at the first operator that cannot. That operator and all operators upstream of it are executed on the host.
Currently, the following data preprocessing operators can run on the device side: map, batch, and map_and_batch. Other preprocessing operators run only on the host.
The following is a scheduling example.
Original TensorFlow code:
train_dataset = tf.contrib.data.TFRecordDataset("./train_new.tfrecords")
train_dataset = train_dataset.shuffle(1000)
train_dataset = train_dataset.map(parse_tf)
train_dataset = train_dataset.batch(batch_size)

In this pipeline, TFRecordDataset and shuffle cannot run on the device side. Therefore, only the map and batch operators are scheduled to the device side.
Code after migration:
train_dataset = tf.contrib.data.TFRecordDataset("./train_new.tfrecords")
train_dataset = train_dataset.shuffle(1000)
train_dataset = train_dataset.map(parse_tf)
train_dataset = train_dataset.prefetch(buffer_size=buffer_size)
train_dataset = train_dataset.batch(batch_size)

A prefetch operator is inserted between the map and batch operators. Since prefetch cannot run on the device side, the prefetch operator and all operators upstream of it (map, shuffle, and TFRecordDataset) are executed on the host, and only the batch operator is offloaded to the device. This keeps part of the preprocessing work on the host and balances the load between the host and the device.
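The snippet above relies on a user-defined parse function (parse_tf) and on batch_size and buffer_size values defined elsewhere. For reference, a minimal self-contained sketch of such a pipeline in TensorFlow 1.x is shown below; the TFRecord layout, image shape, and the use of tf.data.TFRecordDataset (the non-contrib equivalent) are illustrative assumptions, not part of the original example.

import tensorflow as tf

batch_size = 32
buffer_size = 64  # host-side prefetch depth; tune to the host CPU/memory budget

def parse_tf(record):
    # Hypothetical TFRecord layout: a serialized image tensor and an int64 label.
    features = tf.parse_single_example(
        record,
        features={
            "image": tf.FixedLenFeature([], tf.string),
            "label": tf.FixedLenFeature([], tf.int64),
        })
    image = tf.decode_raw(features["image"], tf.uint8)
    image = tf.reshape(image, [224, 224, 3])
    image = tf.cast(image, tf.float32) / 255.0
    label = tf.cast(features["label"], tf.int32)
    return image, label

train_dataset = tf.data.TFRecordDataset("./train_new.tfrecords")
train_dataset = train_dataset.shuffle(1000)
train_dataset = train_dataset.map(parse_tf)
# prefetch placed before batch: map and everything above it stay on the host,
# so only batch remains eligible for offloading to the device.
train_dataset = train_dataset.prefetch(buffer_size=buffer_size)
train_dataset = train_dataset.batch(batch_size)

iterator = train_dataset.make_one_shot_iterator()
images, labels = iterator.get_next()

Placing prefetch before batch keeps the decode and augmentation work in map on the host while the batch operator can still be offloaded, which is the balancing effect described above.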
Binding Training Processes to CPU Cores
In a multi-device scenario, you can bind each training process to a dedicated set of host CPU cores so that the host CPU is evenly scheduled across processes, further improving the training performance. The following uses an 8-device scenario as an example.
- Query the total number of host CPU cores, for example, 96.
- Calculate the number (n) of host CPU cores allocated to each training process:
  n = total CPU cores / number of devices = 96 / 8 = 12
- Modify the training process startup script. Before starting the training script, use taskset -c to bind each process to its assigned host CPU cores, for example (see the launcher sketch after this list):
  Device 0:
  taskset -c 0-11 python3.7 /home/test/xxx.py
  Device 7:
  taskset -c 84-95 python3.7 /home/test/xxx.py
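If you prefer to generate the core ranges and launch commands programmatically rather than hard-coding them per device, the following Python launcher sketch illustrates one way to do it. It assumes 8 devices, the example script path /home/test/xxx.py from above, a host core count that divides evenly by the device count, and that each training process reads its device from a DEVICE_ID environment variable (an assumption about the training script, not something taskset requires).

import os
import subprocess

NUM_DEVICES = 8
TRAIN_SCRIPT = "/home/test/xxx.py"   # path from the example above

total_cores = os.cpu_count()                  # e.g. 96 on the host in this example
cores_per_proc = total_cores // NUM_DEVICES   # e.g. 96 / 8 = 12

procs = []
for device_id in range(NUM_DEVICES):
    start = device_id * cores_per_proc
    end = start + cores_per_proc - 1
    core_range = "%d-%d" % (start, end)       # "0-11" for device 0, "84-95" for device 7
    env = dict(os.environ, DEVICE_ID=str(device_id))  # assumption: script reads DEVICE_ID
    cmd = ["taskset", "-c", core_range, "python3.7", TRAIN_SCRIPT]
    procs.append(subprocess.Popen(cmd, env=env))

# Wait for all bound training processes to finish.
for p in procs:
    p.wait()

Child processes started by each training script inherit the affinity mask set by taskset, so any data loader workers it spawns are also confined to the assigned cores.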