Built-in Operator GEMM Encapsulated into an ACL API
Basic Principles
The GEMM operator, which covers matrix-vector multiplication and matrix-matrix multiplication, is currently encapsulated into an ACL API. For details, see CBLAS Interfaces (blas).
The basic procedure of executing a single operator is as follows:
- Initialize resources: initialize the ACL, set the loading directory of the single-operator model file, specify the device for computation, and allocate memory on the device and host to store operator data.
  - Call acl.init to initialize the ACL.
  - Build the .json single-operator definition file into an offline model (.om file) adapted to Ascend AI Processors in advance. For details, see ATC Tool Instructions.
  - A single-operator model file can be loaded using the following APIs:
    - Call acl.op.set_model_dir to set the directory from which the model file is loaded. The single-operator model file (.om file) is stored in this directory.
    - Call acl.op.load to load the single-operator model data from memory. The memory is managed by the user. Single-operator model data is the data loaded into memory from an .om file that was built from a single operator.
  - Call acl.rt.set_device to specify the device for computation.
  - Call acl.rt.create_context to explicitly create a context, and call acl.rt.create_stream to explicitly create a stream.
    If no stream is created explicitly, the default stream is used. The default stream is created implicitly by the acl.rt.set_device call. To pass the default stream to any API call, pass NULL directly.
- Copy the input data of the operator from the host to the device.
  - Call acl.rt.memcpy for a synchronous memory copy.
  - Call acl.rt.memcpy_async for an asynchronous memory copy.
- Call the ACL API to execute the operator. This section uses the acl.blas.gemm_ex API as an example.
- Copy the output data of the operator from the device to the host.
  - Call acl.rt.memcpy for a synchronous memory copy.
  - Call acl.rt.memcpy_async for an asynchronous memory copy.
- Destroy streams, contexts, and devices in sequence.
  - Call acl.rt.destroy_stream to destroy streams.
    If no stream is created explicitly and the default stream is used, acl.rt.destroy_stream does not need to be called.
  - Call acl.rt.destroy_context to destroy contexts.
    If no context is created explicitly and the default context is used, acl.rt.destroy_context does not need to be called.
  - Call acl.rt.reset_device to reset devices.
- Call acl.finalize to deinitialize the ACL.
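The release order in the procedure above (streams, then contexts, then devices, then acl.finalize) is the reverse of the order in which the resources were acquired. One way to keep that order correct even when an intermediate step fails is to register each release step right after its matching acquire step. The following is only a sketch of that idea: it does not import acl, the function name run_with_cleanup is hypothetical, and each recorded string merely stands in for the pyACL call of the same name.

```python
from contextlib import ExitStack


def run_with_cleanup(calls):
    """Record acquire and release steps in the order they would run.

    Each string appended to `calls` stands in for the pyACL call of the
    same name; nothing here touches a real device.
    """
    with ExitStack() as stack:
        calls.append("acl.init")
        stack.callback(calls.append, "acl.finalize")

        calls.append("acl.rt.set_device")
        stack.callback(calls.append, "acl.rt.reset_device")

        calls.append("acl.rt.create_context")
        stack.callback(calls.append, "acl.rt.destroy_context")

        calls.append("acl.rt.create_stream")
        stack.callback(calls.append, "acl.rt.destroy_stream")

        # Stand-in for the operator execution step.
        calls.append("acl.blas.gemm_ex")
    # ExitStack runs the callbacks in last-in, first-out order on exit.
    return calls


if __name__ == "__main__":
    for name in run_with_cleanup([]):
        print(name)
```

Because ExitStack unwinds callbacks in reverse registration order, the stream is released first and acl.finalize runs last, matching the sequence described above.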
Initialize Resources
After each API call, add an exception-handling branch and log the result at the appropriate level (for example, ERROR_LOG or INFO_LOG).
import acl

# m_, n_, k_ and the ACL_* constants are assumed to be defined elsewhere in the sample.

# 1. Initialize the ACL.
# This path is relative to the directory of the executable file.
ret = acl.init("test_data/config/acl.json")

# 2. Set the directory of the single-operator model file.
# This directory is relative to the directory of the executable file. For example, if the executable file is stored in the run/out directory, the directory is run/out/op_models.
ret = acl.op.set_model_dir("op_models")

# 3. Specify the device for computation.
deviceId = 0
ret = acl.rt.set_device(deviceId)

# 4. Allocate memory on the device to store the input and output data of the operator.
# For this matrix-matrix multiplication sample, m_ is the number of rows of matrix A and matrix C, n_ is the number of columns of matrix B and matrix C, and k_ is the number of columns of matrix A and the number of rows of matrix B.
# sizeA_, sizeB_, and sizeC_ are the sizes in bytes of matrix A (m_ x k_), matrix B (k_ x n_), and matrix C (m_ x n_).
in_dtype, out_dtype = 1, 1
sizeA_ = m_ * k_ * acl.data_type_size(in_dtype)
sizeB_ = k_ * n_ * acl.data_type_size(in_dtype)
sizeC_ = m_ * n_ * acl.data_type_size(out_dtype)
devMatrixA_, ret = acl.rt.malloc(sizeA_, ACL_MEM_MALLOC_NORMAL_ONLY)
devMatrixB_, ret = acl.rt.malloc(sizeB_, ACL_MEM_MALLOC_NORMAL_ONLY)
devMatrixC_, ret = acl.rt.malloc(sizeC_, ACL_MEM_MALLOC_NORMAL_ONLY)

# 5. Allocate memory on the host to store the input data and the returned result of the operator.
hostMatrixA_, ret = acl.rt.malloc_host(sizeA_)
hostMatrixB_, ret = acl.rt.malloc_host(sizeB_)
hostMatrixC_, ret = acl.rt.malloc_host(sizeC_)
# ......
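The buffer sizes follow directly from the matrix shapes: A is m x k, B is k x n, and C is m x n. The sketch below works through that arithmetic without a device. The function name gemm_buffer_sizes is hypothetical, and the element sizes are passed in as plain integers here, whereas the sample obtains them from acl.data_type_size.

```python
def gemm_buffer_sizes(m, n, k, in_elem_size, out_elem_size):
    """Return (size_a, size_b, size_c) in bytes for A (m x k), B (k x n), C (m x n)."""
    if min(m, n, k) <= 0:
        raise ValueError("m, n and k must be positive")
    size_a = m * k * in_elem_size   # A has m rows and k columns
    size_b = k * n * in_elem_size   # B has k rows and n columns
    size_c = m * n * out_elem_size  # C has m rows and n columns
    return size_a, size_b, size_c


if __name__ == "__main__":
    # Example: 2-byte elements (such as float16) for inputs and output.
    print(gemm_buffer_sizes(16, 32, 8, 2, 2))
```

Note that only matrix A has m x k elements; sizing all three buffers with the same product would under-allocate or over-allocate B and C whenever m, n, and k differ.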
Copying Data to Device
After each API call, add an exception-handling branch and log the result at the appropriate level (for example, ERROR_LOG or INFO_LOG).
import acl
# ......
# For this matrix-matrix multiplication sample, copy the data of matrix A and matrix B from the host to the device.
ret = acl.rt.memcpy(devMatrixA_, sizeA_, hostMatrixA_, sizeA_, ACL_MEMCPY_HOST_TO_DEVICE)
ret = acl.rt.memcpy(devMatrixB_, sizeB_, hostMatrixB_, sizeB_, ACL_MEMCPY_HOST_TO_DEVICE)
# ......
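The samples in this section store each return code in ret but leave the exception-handling branch to the reader. A minimal sketch of such a branch is shown below. It assumes that a return code of 0 indicates success; check_ret, AclError, and the logger name are hypothetical helpers rather than part of pyACL, and Python's standard logging module stands in for the ERROR_LOG and INFO_LOG levels mentioned above.

```python
import logging

logger = logging.getLogger("gemm_sample")  # hypothetical logger name

ACL_SUCCESS = 0  # assumption: a return code of 0 means the call succeeded


class AclError(RuntimeError):
    """Raised when a checked call returns a nonzero code (hypothetical type)."""


def check_ret(api_name, ret):
    """Log the outcome of an API call and raise AclError on failure."""
    if ret != ACL_SUCCESS:
        logger.error("%s failed, ret=%d", api_name, ret)
        raise AclError(f"{api_name} failed with ret={ret}")
    logger.info("%s succeeded", api_name)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    check_ret("acl.rt.memcpy", 0)  # logs at INFO level and returns
```

With such a helper, each call in the samples could be followed by a line such as check_ret("acl.rt.memcpy", ret), so that a failed copy stops the run before the operator executes on incomplete data.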
Executing a Single Operator and Returning Result to the Host
After each API call, add an exception-handling branch and log the result at the appropriate level (for example, ERROR_LOG or INFO_LOG).
import acl
# ......
# Explicitly create a stream.
stream, ret = acl.rt.create_stream()

# In this example, acl.blas.gemm_ex (asynchronous mode) is called to implement matrix-matrix multiplication.
ret = acl.blas.gemm_ex(ACL_TRANS_N, ACL_TRANS_N, ACL_TRANS_N, m_, n_, k_,
                       devAlpha_, devMatrixA_, k_, inputType_,
                       devMatrixB_, n_, inputType_,
                       devBeta_, devMatrixC_, n_, outputType_,
                       ACL_COMPUTE_HIGH_PRECISION, stream)

# Call acl.rt.synchronize_stream to block the host until all tasks in the specified stream are completed.
ret = acl.rt.synchronize_stream(stream)

# Copy the output data of the operator from the device to the host.
ret = acl.rt.memcpy(hostMatrixC_, sizeC_, devMatrixC_, sizeC_, ACL_MEMCPY_DEVICE_TO_HOST)

# TODO: Display the operator output data on the terminal screen and write the operator output data to a file.
# ......
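To check the result copied back to the host, it can help to have a host-side reference for what a GEMM call computes. The sketch below implements the standard GEMM definition, C = alpha * A * B + beta * C, in pure Python on flat lists. It assumes no transposition and row-major storage, which is what the leading dimensions k_, n_, and n_ passed in the call above suggest, but that layout is an assumption here and not something this section states. The function name gemm_reference is hypothetical.

```python
def gemm_reference(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Compute C = alpha * A * B + beta * C in place on flat row-major lists.

    A is m x k with leading dimension lda, B is k x n with leading
    dimension ldb, and C is m x n with leading dimension ldc.
    """
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # Element (i, p) of A and element (p, j) of B.
                acc += a[i * lda + p] * b[p * ldb + j]
            c[i * ldc + j] = alpha * acc + beta * c[i * ldc + j]
    return c


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6]        # 2 x 3
    b = [7, 8, 9, 10, 11, 12]     # 3 x 2
    c = [0.0, 0.0, 0.0, 0.0]      # 2 x 2
    print(gemm_reference(2, 2, 3, 1.0, a, 3, b, 2, 0.0, c, 2))
    # prints [58.0, 64.0, 139.0, 154.0]
```

A reference like this is only practical for small matrices, and a float16 device result should be compared against it with a tolerance rather than for exact equality.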
Releasing Runtime Resources and Deinitializing ACL
After each API call, add an exception-handling branch and log the result at the appropriate level (for example, ERROR_LOG or INFO_LOG).
import acl
# ......
# Destroy the explicitly created stream.
ret = acl.rt.destroy_stream(stream)
# Reset the device.
ret = acl.rt.reset_device(deviceId)
# Deinitialize the ACL.
ret = acl.finalize()
# ......