Introduction
Figure 4-1 shows the overall architecture of the Atlas data center training solution. Table 4-1 describes the components in the solution.
| Layer | Component | Component Description |
|---|---|---|
| Application enablement | MindX DL | MindX DL is a deep learning component reference design that integrates Atlas 800 AI training servers, Atlas 800 AI inference servers, and GPU-based servers. It provides basic functions such as Ascend AI Processor resource management and monitoring, Ascend AI Processor scheduling optimization, and generation of collective communication configurations for distributed training, enabling partners to quickly develop deep learning systems. Source code and reference documents are available in the Ascend developer zone after login. |
|  | ModelZoo | ModelZoo provides common network models, including ResNet50 and VGG16, based on mainstream deep learning frameworks such as MindSpore, TensorFlow, and Caffe. After obtaining a model, you can perform customized training or develop an application based on the model and Ascend AI Processors. Log in to ModelZoo to obtain the models. |
| Data center platform | Container engine plug-in (Ascend Docker) | Ascend Docker is a basic component of the Atlas data center solution. It mounts the devices and drivers of Ascend AI Processors into containers so that AI jobs can run on Ascend devices as Docker containers. |
|  | Kubernetes device plugin | Kubernetes provides a device plugin framework for third-party devices. The Ascend device plugin uses this framework to discover Ascend AI Processors and report them to kubelet as resources. It also makes Ascend devices available to Kubernetes pods and containers and periodically checks the health status of the devices. |
|  | System tool | CANN-based tools that are easy to deploy and maintain |
|  | CANN | AscendCL, GE, operator/acceleration library, HCCL, Runtime, and Driver |
|  | Auxiliary development tool (CLI) | Auxiliary tools for model or service development, including the precision comparison tool, the operator development tool, and the profiling tool for performance debugging |
| Full-pipeline development toolchain | MindStudio | MindStudio provides one-stop development and simplified deployment capabilities for operators, models, and applications. For details, see the MindStudio Documentation. |
| O&M management tool | SmartKit | SmartKit deploys, maintains, and upgrades training hardware devices in batches, which simplifies operations and improves work efficiency. For details about SmartKit, see the FusionServer Tools 2.0 SmartKit User Guide. |
|  | FusionDirector | FusionDirector is the management software for intelligent O&M throughout the server lifecycle. It provides intelligent version management, deployment, asset management, energy efficiency management, and fault management through a visualized management GUI. For details about FusionDirector, see related documents. |
| Hardware | Atlas 300T AI training card | For details, see Table 2-1. |
|  | Atlas 800 AI training server | For details, see Table 2-1. |
|  | Atlas 900 AI cluster | For details, see Table 2-1. |
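
The Kubernetes device plugin row above describes how Ascend AI Processors are reported to kubelet as schedulable resources. As a minimal sketch of how a training job might then request those resources, the following Python snippet builds a Pod manifest. The resource name `huawei.com/Ascend910`, the pod name, and the image path are assumptions made for illustration and are not taken from this document; check the resource name that the device plugin actually advertises on your nodes.

```python
import json

# Assumption: the extended resource name advertised by the Ascend device plugin.
# Verify the real name on a node before using it.
ASCEND_RESOURCE = "huawei.com/Ascend910"


def build_training_pod(name: str, image: str, num_devices: int) -> dict:
    """Return a Pod manifest that requests `num_devices` Ascend AI Processors."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    # Extended resources are requested in whole units under `limits`;
    # Kubernetes sets the request equal to the limit when only limits is given.
    devices = {ASCEND_RESOURCE: str(num_devices)}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",
                    "image": image,  # hypothetical training image
                    "resources": {"limits": devices},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image, used only to show the manifest shape.
    pod = build_training_pod("resnet50-train", "example.com/train:latest", 8)
    print(json.dumps(pod, indent=2))
```

Because Kubernetes accepts manifests in JSON as well as YAML, the printed output could be piped to `kubectl apply -f -` on a cluster where the device plugin is running.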