Atlas 300T training card (model: 9000)
Currently, the training cards support two training scenarios: single-server single-device training and multi-server multi-device distributed training. Each training card is equipped with one Ascend AI Processor.
Typical Networking
For distributed training with multiple servers, the 100G ETH ports provided by each training card are used for communication between servers, and the Ring + Halving-doubling algorithm is used to implement collective communication.
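The sketch below illustrates, in plain Python, the distance-doubling exchange pattern that underlies halving-doubling AllReduce. It is a simplified model only: the actual HCCL implementation combines it with the Ring algorithm, halves the exchanged data during the reduce-scatter phase, and runs over the training cards' NICs. The function name and the power-of-two rank count are assumptions made for this illustration.

```python
# Simplified, framework-free model of the distance-doubling pattern behind
# halving-doubling AllReduce. Illustration only; not the HCCL implementation.
# Assumes the number of ranks is a power of two.

def allreduce_recursive_doubling(rank_data):
    """Simulate AllReduce (sum) across len(rank_data) ranks.

    rank_data: list where rank_data[r] is the local vector of rank r.
    Returns per-rank results; every rank ends with the global sum.
    """
    num_ranks = len(rank_data)
    data = [list(v) for v in rank_data]          # local copy per rank
    distance = 1
    while distance < num_ranks:
        snapshot = [list(v) for v in data]       # state before this step
        for rank in range(num_ranks):
            peer = rank ^ distance               # partner at the current distance
            data[rank] = [a + b for a, b in zip(snapshot[rank], snapshot[peer])]
        distance *= 2                            # distance doubles each step
    return data

if __name__ == "__main__":
    # 4 simulated ranks, each contributing a 3-element vector.
    result = allreduce_recursive_doubling([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
    print(result[0])  # every rank holds [22, 26, 30]
```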
Restrictions
- The number of training cards must be the same in every server.
- Across the entire network, the NIC IP addresses of all training cards must be in the same network segment.
- Currently, only the AllReduce, AllGather, Broadcast, and ReduceScatter collective operations are supported (their semantics are sketched after this list).
- Before training, use the HCCL_INTRA_PCIE_ENABLE and HCCL_INTRA_ROCE_ENABLE environment variables to set the communication path between devices. The PCIe path is used by default; the RoCE path is recommended (see the configuration example after this list). For details, see Configuring Environment Variables.
- The number of Ascend AI Processors participating in training specified in the ranktable configuration file cannot exceed the number of Ascend AI Processors available on the server (a sanity-check sketch follows this list). Template 1 must be used. For details, see Configuring Processor Resources.
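For readers unfamiliar with the four supported collectives, the following framework-free sketch shows what each operation produces, with plain Python lists standing in for per-device tensors. It illustrates only the semantics; HCCL performs the actual exchanges between Ascend AI Processors.

```python
# Semantics of the four supported collectives, illustrated with plain lists.

def all_reduce(per_rank):
    """Every rank receives the element-wise sum of all ranks' inputs."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def all_gather(per_rank):
    """Every rank receives the concatenation of all ranks' inputs."""
    gathered = [x for chunk in per_rank for x in chunk]
    return [list(gathered) for _ in per_rank]

def broadcast(per_rank, root=0):
    """Every rank receives the root rank's input."""
    return [list(per_rank[root]) for _ in per_rank]

def reduce_scatter(per_rank):
    """The summed vector is split evenly; rank r keeps the r-th chunk.

    Assumes the vector length is divisible by the number of ranks.
    """
    total = [sum(vals) for vals in zip(*per_rank)]
    chunk = len(total) // len(per_rank)
    return [total[r * chunk:(r + 1) * chunk] for r in range(len(per_rank))]

if __name__ == "__main__":
    data = [[1, 2], [3, 4]]          # 2 ranks, 2 elements each
    print(all_reduce(data))          # [[4, 6], [4, 6]]
    print(all_gather(data))          # [[1, 2, 3, 4], [1, 2, 3, 4]]
    print(broadcast(data))           # [[1, 2], [1, 2]]
    print(reduce_scatter(data))      # [[4], [6]]
```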
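A minimal sketch of selecting the RoCE path before training starts is shown below. The variable names come from this section; the values used ("1" to enable, "0" to disable) and the call site are assumptions, so confirm them against Configuring Environment Variables for your software version.

```python
# Sketch: select the inter-device communication path before HCCL is initialized.
# The values below (1 = enable, 0 = disable) are assumptions; verify them in
# "Configuring Environment Variables" for your installed version.
import os

def select_roce_path():
    """Prefer the recommended RoCE path over the default PCIe path."""
    os.environ["HCCL_INTRA_PCIE_ENABLE"] = "0"   # assumed: disable PCIe path
    os.environ["HCCL_INTRA_ROCE_ENABLE"] = "1"   # assumed: enable RoCE path

select_roce_path()
# Import the training framework and initialize collective communication after
# this point so that the variables are visible when HCCL is set up.
```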
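The sketch below is a hypothetical sanity check for the last restriction: it counts the devices that the ranktable assigns to a given server and compares that count with the number of processors actually available. The JSON field names (server_list, server_id, device) reflect the template 1 layout as commonly documented, but treat them as assumptions and adjust them to your actual ranktable file.

```python
# Hypothetical check: the ranktable must not assign more devices to this server
# than the server actually has. Field names are assumptions based on the
# template 1 layout; adjust to your actual ranktable schema.
import json

def check_ranktable(ranktable_path, server_ip, available_devices):
    """Raise if the ranktable assigns more devices to server_ip than exist."""
    with open(ranktable_path, "r") as f:
        ranktable = json.load(f)

    for server in ranktable.get("server_list", []):       # assumed field
        if server.get("server_id") == server_ip:          # assumed field
            requested = len(server.get("device", []))     # assumed field
            if requested > available_devices:
                raise ValueError(
                    f"ranktable requests {requested} devices on {server_ip}, "
                    f"but only {available_devices} are available"
                )
            return requested
    raise ValueError(f"server {server_ip} not found in {ranktable_path}")

# Example (hypothetical path and IP): a server with two Atlas 300T cards has
# two Ascend AI Processors available.
# check_ranktable("hccl_ranktable.json", "192.168.1.10", available_devices=2)
```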