Overview
In a large-scale AI training cluster, training is usually performed in a data-parallel manner. In data parallelism, every device holds a replica of the same model but processes different training samples; the gradients computed by each device must then be aggregated before the model parameters are updated.
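The following is a minimal sketch of this idea, simulated with NumPy on a single host rather than on real devices; the model, data, and variable names are illustrative assumptions, not part of the Ascend software stack. Each "device" computes a gradient on its own shard of the batch, and the averaged gradient is used for one identical parameter update on every replica.

```python
# Data parallelism, simulated with NumPy: "devices" are loop iterations here;
# in a real cluster each would be a separate accelerator running the same model replica.
import numpy as np

np.random.seed(0)
num_devices = 4
w = np.zeros(3)          # shared model parameters (replicated on every device)
lr = 0.1

# One global batch, split evenly so each device sees different samples.
x = np.random.randn(num_devices * 8, 3)
y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(num_devices * 8)
shards = np.array_split(np.arange(len(x)), num_devices)

# Each device computes a local gradient of the MSE loss on its own shard.
local_grads = []
for idx in shards:
    pred = x[idx] @ w
    grad = 2.0 * x[idx].T @ (pred - y[idx]) / len(idx)
    local_grads.append(grad)

# Aggregation step: average the per-device gradients, then apply one
# identical update everywhere so all replicas stay in sync.
global_grad = np.mean(local_grads, axis=0)
w -= lr * global_grad
print("updated parameters:", w)
```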
Classified by how gradients are aggregated, the mainstream implementations of data parallelism are the Parameter Server-Worker (PS-Worker) architecture and AllReduce collective communication. The Ascend platform supports both. For details, see Distributed Training Based on the AllReduce Architecture and Distributed Training Based on the PS-Worker Architecture.
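The sketch below contrasts the two aggregation patterns, again simulated in NumPy on one host; the function and variable names are illustrative assumptions and not Ascend or framework APIs. In the PS-Worker pattern, workers push gradients to a central server that reduces them and sends the result back; in the AllReduce pattern, workers exchange gradients peer to peer so that every worker ends up with the same reduced value without a central node.

```python
# Two gradient-aggregation patterns, simulated in NumPy (illustrative only).
import numpy as np

grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]),
         np.array([5.0, 6.0]), np.array([7.0, 8.0])]   # one gradient per worker

def ps_worker_aggregate(worker_grads):
    """PS-Worker: workers push gradients to a central parameter server,
    which reduces them and broadcasts the result back to every worker."""
    server_sum = np.sum(worker_grads, axis=0)          # push + reduce on the server
    return [server_sum.copy() for _ in worker_grads]   # pull: broadcast back

def allreduce_aggregate(worker_grads):
    """AllReduce: a naive ring-style pass in which each worker accumulates
    the gradient of one additional peer per step, so all workers converge
    to the same sum without a central node."""
    result = [g.copy() for g in worker_grads]
    n = len(result)
    for step in range(n - 1):
        for i in range(n):
            result[i] = result[i] + worker_grads[(i - step - 1) % n]
    return result

print(ps_worker_aggregate(grads))   # every worker holds [16., 20.]
print(allreduce_aggregate(grads))   # every worker holds [16., 20.]
```

Both patterns produce the same aggregated gradient; they differ in communication topology, which affects scalability and bandwidth usage in a real cluster.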