Custom Operator
Loading and Executing a Fixed Shape Operator
- For details about developing custom operators, see "Operator Development Flow" in TBE Custom Operator Development Guide.
- For details about the calling process of a developed single operator, see Built-in Operator Not Encapsulated into an ACL API.
Loading and Executing a Dynamic-Shape Operator
Before loading and executing a dynamic-shape operator, you need to develop the custom operator and generate the corresponding binary file by referring to the TBE Custom Operator Development Guide.
Basic Principles
The procedure of loading and executing a dynamic-shape operator is as follows:
- Initialize resources, including initializing the ACL, setting the loading directory of the single-operator model file and specifying the device for computation.
- Call acl.init to initialize the ACL.
- Call the ACL APIs to register the custom operator to be built.
- Call acl.op.register_compile_func to register the operator selector (that is, the function for selecting the tiling policy). Different tiling policies are adopted for different shapes when the operator is executed.
The operator selector needs to be defined and implemented by the user in advance.
- Prototype:
```python
def op_selector(in_num, in_desc, out_num, out_desc, op_attr, op_kernel_desc):
    """Operator selector: the function and parameter names can be customized,
    but the number and types of parameters must match.

    :param in_num: number of input tensor descriptions
    :param in_desc: list of input tensor descriptions
    :param out_num: number of output tensor descriptions
    :param out_desc: list of output tensor descriptions
    :param op_attr: address object of operator attributes, used to set operator attributes
    :param op_kernel_desc: address object of the operator kernel description, used to set
        the workspace parameters of an operator in dynamic-shape scenarios
    :return:
    """
```
- Function implementation
Write code logic that selects a tiling policy and generates the tiling parameters, then call acl.op.set_kernel_args to set the tiling arguments and the number of blocks for concurrent execution.
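As a sketch of what such a selector might look like: the tiling policy below (one block per row, capped at 8) and the kernel name are illustrative assumptions, not values mandated by the API, and the acl.op.set_kernel_args argument order follows the C API aclopSetKernelArgs (kernel ID, block dim, args, arg size) — verify it against your pyACL version.

```python
import struct

def choose_tiling(shape):
    """Pick illustrative tiling parameters for a 2-D shape.

    Returns (packed_args, block_dim): the tiling arguments packed as
    little-endian uint32 values, and the number of blocks to launch.
    The policy here is a demonstration assumption only.
    """
    rows, cols = shape
    # Hypothetical policy: one block per row, capped at 8 blocks.
    block_dim = min(rows, 8)
    # Elements each block must process, rounded up.
    tile_len = (rows * cols + block_dim - 1) // block_dim
    packed_args = struct.pack("<3I", rows, cols, tile_len)
    return packed_args, block_dim

def op_selector(in_num, in_desc, out_num, out_desc, op_attr, op_kernel_desc):
    """Operator selector to be registered via acl.op.register_compile_func."""
    import acl  # imported lazily: the acl module is only available on an Ascend host
    ndims = acl.get_tensor_desc_num_dims(in_desc[0])
    shape = [acl.get_tensor_desc_dim(in_desc[0], i) for i in range(ndims)]
    args, block_dim = choose_tiling(shape)
    # The kernel ID must match the name passed to acl.op.create_kernel.
    return acl.op.set_kernel_args(op_kernel_desc,
                                  "cce_add_11_33_float16_11_33_float16__kernel0",
                                  block_dim, args, len(args))
```

The pure-Python choose_tiling helper keeps the shape-to-tiling decision separate from the acl calls, so it can be unit-tested off-device.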
- Call acl.op.create_kernel to register the operator's code implementation with the system for use when the operator is executed.
- Call acl.rt.set_device to specify the device for computation.
- Call acl.rt.create_context to explicitly create a context, and call acl.rt.create_stream to explicitly create a stream.
The default stream is used if no stream is created explicitly; it is implicitly created when acl.rt.set_device is called. To pass the default stream to an API, pass NULL directly.
- Construct the operator description (such as the input and output tensor description and operator attributes) and allocate memory for storing the input and output data of the operator.
- Copy the operator input data from the host to the device.
- Call acl.rt.memcpy to perform a synchronous memory copy. Free the memory promptly after use.
- Call acl.rt.memcpy_async to perform an asynchronous memory copy. Free the memory promptly after use.
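The byte counts passed to the copy APIs must describe the buffer size in bytes, not the element count. With NumPy host buffers, the bookkeeping can be sketched without any device calls (the shape and dtype below match the sample's float16 [2, 1] tensors):

```python
import numpy as np

# Host-side input, matching the float16 [2, 1] tensor description in the sample.
a = np.random.rand(2, 1).astype(np.float16)

# Size in bytes = element count * bytes per element.
size_a = a.size * a.itemsize

# nbytes is the idiomatic shortcut for the same value.
assert size_a == a.nbytes
```

For the copy to be correct, this host-side size must agree with the size reported by acl.get_tensor_desc_size for the matching tensor description.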
- Build a single operator.
Call acl.op.update_params to build the operator and trigger the calling logic of the operator selector.
- Execute the single operator.
Call acl.op.execute to load and execute the operator.
- Copy the output data of the operator from the device to the host (memory on the host needs to be allocated in advance).
- Call acl.rt.memcpy to perform a synchronous memory copy. Free the memory promptly after use.
- Call acl.rt.memcpy_async to perform an asynchronous memory copy. Free the memory promptly after use.
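After the device-to-host copy, the raw host buffer must be reinterpreted with the operator's output dtype and shape. As an off-device illustration of that reshaping step (np.frombuffer on plain bytes stands in for acl.util.ptr_to_numpy on the host pointer):

```python
import numpy as np

# Pretend these bytes arrived from the device: a float16 [2, 1] result.
result = np.array([[1.5], [2.5]], dtype=np.float16)
raw = result.tobytes()  # stand-in for the host buffer filled by acl.rt.memcpy

# Reinterpret the byte buffer with the output tensor's dtype and shape.
out_np = np.frombuffer(raw, dtype=np.float16).reshape(2, 1)
```

Using the wrong dtype here silently produces garbage values rather than an error, so the dtype must match the output tensor description exactly.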
- Destroy streams, contexts and devices in sequence.
- Call acl.rt.destroy_stream to destroy streams.
If no stream is created explicitly and the default stream is used, acl.rt.destroy_stream does not need to be called.
- Call acl.rt.destroy_context to destroy contexts.
If no context is created explicitly and the default context is used, acl.rt.destroy_context does not need to be called.
- Call acl.rt.reset_device to reset devices.
- Call acl.finalize to deinitialize the ACL.
Sample Code
A sample code snippet is provided as follows. You can view the complete sample code in the <ACLlib installation path>/acllib/sample/acl_execute_op/acl_execute_batchnorm/src directory.
```python
import acl
import numpy as np
# ......
# 1. Initialize resources.
# This path is relative to the directory of the executable file.
ret = acl.init()
ret = acl.op.register_compile_func("add", op_select)
# Build the *.o file of the operator kernel in advance, call NumPy to load the .o file,
# and convert it into an address object. op_data_size_0 indicates the memory size
# occupied by the first .o file.
# If there are .o files for multiple operator kernels, call this API once per file.
ret = acl.op.create_kernel("add",
                           "cce_add_11_33_float16_11_33_float16__kernel0",
                           "cce_add_11_33_float16_11_33_float16__kernel0",
                           np_op_0_ptr, op_data_size_0, ACL_ENGINE, 0)

# 2. Construct the input and output tensor descriptions of the add operator,
#    and allocate memory for storing the input and output data of the operator.
a = np.random.rand(2, 1).astype(np.float16)
b = np.random.rand(2, 1).astype(np.float16)
# ......
a_ptr = acl.util.numpy_to_ptr(a)
b_ptr = acl.util.numpy_to_ptr(b)
input_desc_list = [acl.create_tensor_desc(ACL_FLOAT16, [2, 1], ACL_FORMAT_ND),
                   acl.create_tensor_desc(ACL_FLOAT16, [2, 1], ACL_FORMAT_ND)]
output_desc_list = [acl.create_tensor_desc(ACL_FLOAT16, [2, 1], ACL_FORMAT_ND)]
# Allocate the device memory.
size_a = acl.get_tensor_desc_size(input_desc_list[0])
size_b = acl.get_tensor_desc_size(input_desc_list[1])
size_c = acl.get_tensor_desc_size(output_desc_list[0])
dev_a, ret = acl.rt.malloc(size_a, ACL_MEM_MALLOC_NORMAL_ONLY)
dev_b, ret = acl.rt.malloc(size_b, ACL_MEM_MALLOC_NORMAL_ONLY)
dev_c, ret = acl.rt.malloc(size_c, ACL_MEM_MALLOC_NORMAL_ONLY)

# 3. Copy the operator input data from the host to the device.
ret = acl.rt.memcpy(dev_a, size_a, a_ptr, size_a, ACL_MEMCPY_HOST_TO_DEVICE)
ret = acl.rt.memcpy(dev_b, size_b, b_ptr, size_b, ACL_MEMCPY_HOST_TO_DEVICE)

# 4. Call acl.op.update_params to build the operator.
op_attr = acl.op.create_attr()
ret = acl.op.update_params("add", input_desc_list, output_desc_list, op_attr)
# ......

# 5. Call acl.op.execute to load and execute the operator.
in_data_list = [acl.create_data_buffer(dev_a, size_a),
                acl.create_data_buffer(dev_b, size_b)]
out_data_list = [acl.create_data_buffer(dev_c, size_c)]
acl.op.execute("add", input_desc_list, in_data_list,
               output_desc_list, out_data_list, op_attr, stream)
# ......

# 6. Copy the output data of the operator from the device to the host
#    (memory on the host needs to be allocated in advance).
host_ptr, ret = acl.rt.malloc_host(size_c)
ret = acl.rt.memcpy(host_ptr, size_c, dev_c, size_c, ACL_MEMCPY_DEVICE_TO_HOST)
out_np = acl.util.ptr_to_numpy(host_ptr, (size_c,), 1)
# ......

# 7. Free the resources in sequence.
# 7.1 Free the input and output tensor descriptions.
# 7.2 Free the memory on the host.
# 7.3 Free the memory on the device.
# 7.4 Release device management resources.
# ......
acl.rt.destroy_stream(stream)
acl.rt.destroy_context(context)
acl.rt.reset_device(deviceId)
acl.finalize()
```