Permute Operator (TIK)
Operator Analysis
Before developing the Permute operator with the TIK APIs, we need to determine the operator's function, inputs and outputs, the development approach, the operator type (OpType), and the name of the operator implementation function.
- Clarify the operator's function.
  The Permute operator permutes the order of the index axes.
- Clarify the inputs and outputs.
  - The Permute operator has one input x, one output y, and one attribute order (the axis order after the transformation; this sample supports only [0, 2, 3, 1], that is, NCHW -> NHWC).
  - The data type of the operator input is float16, and the data type of the operator output is float16.
  - The operator input supports all shapes; the output shape is the same as the input shape.
  - The supported input format is NCHW.
- Determine the operator development approach and the compute APIs to use.
  Because the Permute operator operates on different elements across different dimensions of the tensor at the same time, none of the TBE DSL APIs can meet the computation requirements, so the operator is implemented with TIK.
  The core computation flow of the operator is as follows:
  - Read the data into the Unified Buffer.
  - Use the vec_trans_scatter() interface to perform the NCHW -> NHWC transformation.
  - Move the data from the Unified Buffer to Global Memory.
- Determine the operator implementation file name, the implementation function name, and the operator type (OpType).
  - The operator type must be named in upper camel case, that is, uppercase letters are used to separate different semantic parts.
  - The operator file name and operator function name can follow either of the following naming rules:
    - User defined. In this case, opFile.value and opInterface.value must be configured in the operator information definition.
    - If opFile.value and opInterface.value are not configured in the operator information definition, FE converts the OpType as follows and then matches the operator file name and function name (see the illustrative sketch after this list). The conversion rules are:
      - An uppercase first character is converted to lowercase.
        For example: Abc -> abc
      - An uppercase character that follows a lowercase character is converted to an underscore plus its lowercase form.
        For example: AbcDef -> abc_def
      - Uppercase characters that immediately follow a digit or another uppercase character are treated as one semantic string. If there is a lowercase character after this string, the uppercase character right before that lowercase character is converted to an underscore plus its lowercase form, and the remaining uppercase characters are converted to lowercase. If there is no lowercase character after the string, all of its uppercase characters are simply converted to lowercase.
        For example: ABCDef -> abc_def; Abc2DEf -> abc2d_ef; Abc2DEF -> abc2def; ABC2dEF -> abc2d_ef.
  In this sample, the operator type is defined as PermuteTik, and the operator implementation file name and implementation function name are defined as permute_tik.
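The conversion rules above can be illustrated with a small helper. This is only a sketch that reproduces the documented examples, not FE's actual implementation; the function name op_type_to_snake is made up for illustration:

def op_type_to_snake(op_type):
    """Illustrative sketch of the documented OpType -> file/function name rules."""
    result = []
    for i, ch in enumerate(op_type):
        if not ch.isupper():
            result.append(ch)
            continue
        prev = op_type[i - 1] if i > 0 else ""
        nxt = op_type[i + 1] if i + 1 < len(op_type) else ""
        if prev.islower():
            # an uppercase letter after a lowercase letter starts a new word
            result.append("_" + ch.lower())
        elif (prev.isupper() or prev.isdigit()) and nxt.islower():
            # last uppercase letter of an acronym-like run followed by a lowercase letter
            result.append("_" + ch.lower())
        else:
            result.append(ch.lower())
    return "".join(result)

# Reproduces the documented examples:
# op_type_to_snake("PermuteTik") -> "permute_tik"
# op_type_to_snake("ABCDef")     -> "abc_def"
# op_type_to_snake("Abc2DEf")    -> "abc2d_ef"
# op_type_to_snake("Abc2DEF")    -> "abc2def"
# op_type_to_snake("ABC2dEF")    -> "abc2d_ef"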
Based on the above analysis, the design specification of the PermuteTik operator is as follows:
Table 14-7 PermuteTik operator design specification
- Operator type (OpType): PermuteTik
- Operator input:
  - name: x; shape: all; data type: float16; format: NCHW; value: -
  - name: order (attribute); shape: -; data type: listInt; format: -; value: [0, 2, 3, 1]
- Operator output:
  - name: y; shape: all; data type: float16; format: NCHW; value: -
- Main TIK APIs used in the implementation: data_move(), vec_trans_scatter()
- Operator implementation file / implementation function name: permute_tik
Operator Implementation
This section describes the key points of the operator implementation.
Operator Code Implementation
- The PermuteTik operator in this sample accepts the "float16" data type. The entry function first validates the data type and the other parameters, then sets up the input parameters and calls the operator compute function.
def permute_tik(x, y, order=(0,), kernel_name="permute_tik"):
    """
    only support nchw->nhwc

    Parameters
    ----------
    x : dict
        shape and dtype of input
    y : dict
        shape and dtype of output, should be same shape and type as input
    order: tuple, list
        axis transformation order
    kernel_name : str
        kernel name, default value is "permute_tik"

    Returns
    -------
    None
    """
    shape = x.get("shape")
    dtype = y.get("dtype")
    input_dtype = dtype.lower()
    supported_dtype = ["float16"]
    input_format = x.get("format")
    check_pass = False
    if input_format == 'NCHW':
        if len(order) == 4 and order[0] == 0 \
                and order[1] == 2 and order[2] == 3 and order[3] == 1:
            check_pass = True
    if not check_pass:
        raise RuntimeError("only support nchw->nhwc")
    util.check_dtype_rule(input_dtype, supported_dtype)
    util.check_dtype_rule(dtype, supported_dtype)
    util.check_shape_rule(shape)
    util.check_tensor_shape_size(shape)
    util.check_kernel_name(kernel_name)
    input_dict = {
        "x": x,
        "y": y,
        "order": order
    }
    permute_process = Permute(input_dict)
    permute_process.permute_compute()
    permute_process.instance.BuildCCE(kernel_name=kernel_name,
                                      inputs=permute_process.x_gm,
                                      outputs=permute_process.y_gm)
    return permute_process.instance
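For reference, a host-side call to this entry function might look as follows. The shapes, tensor dicts, and kernel name are made up for illustration; only the keys that permute_tik actually reads ("shape", "dtype", "format") are filled in:

x = {"shape": (8, 3, 224, 224), "dtype": "float16", "format": "NCHW"}
y = {"shape": (8, 224, 224, 3), "dtype": "float16", "format": "NCHW"}
tik_instance = permute_tik(x, y, order=(0, 2, 3, 1), kernel_name="permute_tik_sample")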
- The implementation logic of the operator compute function is as follows.
  - Define the Permute class and, in its initialization function, initialize the parameters used in the subsequent computation. The core of the initialization is to compute the size of each input shape and to request Global Memory of the corresponding size. The actual physical size of the Unified Buffer (UB) is obtained through the tbe_platform.cce_conf.get_soc_spec(tbe_platform.cce_conf.UB_SIZE) interface. In later steps, these values are used to compute the parameters of interfaces such as data_move and vec_trans_scatter. A separate tiling module is defined and decoupled from the operator compute logic, which makes the operator generalize well across shapes: for a different shape, only the tiling parameters need to change to optimize the number of data moves and computations, achieving both generality and high performance.
class Permute:
    """
    Function: store permute parameters and compute permute
    """

    def __init__(self, input_dict):
        """
        init the permute parameters
        """
        self.instance = tik.Tik(tik.Dprofile())
        self.dtype = input_dict.get("x").get("dtype").lower()
        self.dsize = 2
        size = get_shape_size(input_dict.get("x").get("shape"))
        self.x_gm = self.instance.Tensor(self.dtype, (size,), name="x_gm",
                                         scope=tik.scope_gm)
        self.y_gm = self.instance.Tensor(self.dtype, (size,), name="y_gm",
                                         scope=tik.scope_gm)
        ub_size = (UB_SIZE_B - 1024) // 4 // self.dsize // 256 * 256
        self.ub_size = ub_size
        self.input_dict = input_dict

    def get_shape_info(self):
        """
        determine whether to convert the shape based on the input shape
        """
        shape = self.input_dict.get("x").get("shape")
        if shape[1] == 1 or shape[2] * shape[3] == 1:
            shape_size = get_shape_size(shape)
            shape_new = [shape_size]
            order_new = [0]
            shape_out_new = [shape_size]
        else:
            n_i = shape[0]
            col_len = shape[1]
            row_len = shape[2] * shape[3]
            shape_new = [n_i, col_len, row_len]
            order_new = [0, 2, 1]
            shape_out_new = []
            for i in order_new:
                shape_out_new.append(shape_new[i])
        return shape_new, order_new, shape_out_new

    def move_without_transform(self, shape):
        """
        when C = 1 or H*W = 1, directly move data in and out
        """
        ub_size = (UB_SIZE_B - 1024) // 2 // self.dsize // 16 * 16
        if shape[0] <= 16:
            block_num = 1
        else:
            all_block_num = shape[0] // 16
            block_num = AICORE_NUM
            if all_block_num < AICORE_NUM:
                block_num = all_block_num
        each_len = shape[0] // block_num
        each_mod = shape[0] % block_num
        thread_num = 1
        if each_len // ub_size > 1:
            thread_num = 2
        with self.instance.for_range(0, block_num, block_num=block_num) \
                as block_id:
            each_size = self.instance.Scalar("int32")
            each_size.set_as(each_len)
            with self.instance.if_scope(block_id == block_num - 1):
                each_size.set_as(each_len + each_mod)
            ub_loop = each_size // ub_size
            ub_mod = each_size % ub_size
            with self.instance.for_range(0, ub_loop,
                                         thread_num=thread_num) as loop_id:
                src_ub = self.instance.Tensor(self.dtype, (ub_size,),
                                              name="src_ub",
                                              scope=tik.scope_ubuf)
                burst_len = ub_size // 16
                self.instance.data_move(
                    src_ub,
                    self.x_gm[each_len * block_id + loop_id * ub_size],
                    0, 1, burst_len, 0, 0)
                self.instance.data_move(
                    self.y_gm[each_len * block_id + loop_id * ub_size],
                    src_ub, 0, 1, burst_len, 0, 0)
            with self.instance.if_scope(ub_mod > 0):
                src_ub = self.instance.Tensor(self.dtype, (ub_size,),
                                              name="src_ub",
                                              scope=tik.scope_ubuf)
                with self.instance.if_scope(
                        tik.all(block_num > 1, ub_mod % 16 != 0)):
                    src_ub_1 = self.instance.Tensor(self.dtype, (16,),
                                                    name="src_ub_1",
                                                    scope=tik.scope_ubuf)
                    index = each_len * block_id + ub_loop * ub_size
                    with self.instance.if_scope(ub_mod >= 16):
                        burst_len = ub_mod // 16
                        self.instance.data_move(src_ub, self.x_gm[index],
                                                0, 1, burst_len, 0, 0)
                        self.instance.data_move(self.y_gm[index], src_ub,
                                                0, 1, burst_len, 0, 0)
                        offset = index + burst_len * 16 - 16 + ub_mod % 16
                        self.instance.data_move(src_ub_1, self.x_gm[offset],
                                                0, 1, 1, 0, 0)
                        self.instance.data_move(self.y_gm[offset], src_ub_1,
                                                0, 1, 1, 0, 0)
                    with self.instance.else_scope():
                        offset = index - 16 + ub_mod % 16
                        self.instance.data_move(src_ub_1, self.x_gm[offset],
                                                0, 1, 1, 0, 0)
                        self.instance.data_move(self.y_gm[offset], src_ub_1,
                                                0, 1, 1, 0, 0)
                with self.instance.else_scope():
                    burst_len = (ub_mod + 15) // 16
                    self.instance.data_move(
                        src_ub,
                        self.x_gm[each_len * block_id + ub_loop * ub_size],
                        0, 1, burst_len, 0, 0)
                    self.instance.data_move(
                        self.y_gm[each_len * block_id + ub_loop * ub_size],
                        src_ub, 0, 1, burst_len, 0, 0)

    def trans_scatter(self, col_len_ub, row_len_ub, src_ub, dst_ub):
        """
        transposes the data
        """
        c_zu = col_len_ub // 16
        r_zu = row_len_ub // 16
        with self.instance.for_range(0, r_zu) as num_r:
            repeat = c_zu
            src_stride = 0
            dst_stride = 0
            if repeat != 1:
                src_stride = 16 * r_zu
                dst_stride = 1
            dst_list = [dst_ub[16 * col_len_ub * num_r + 16 * c_zu * i]
                        for i in range(16)]
            src_list = [src_ub[16 * num_r + 16 * r_zu * j]
                        for j in range(16)]
            self.instance.vec_trans_scatter(False, False, dst_list, src_list,
                                            repeat, dst_stride, src_stride)

    def move_gm_to_ub(self, row_len, col_len_ub, row_len_ub, src_ub, index):
        """
        move data from gm to ub
        """
        stride = (row_len - row_len_ub) // 16
        row_len_ub_align = (row_len_ub + 15) // 16 * 16
        if row_len % 16 == 0 and stride < 65535:
            n_burst = col_len_ub
            burst_len = row_len_ub_align // 16
            self.instance.data_move(src_ub, self.x_gm[index], 0, n_burst,
                                    burst_len, stride, 0)
        else:
            with self.instance.for_range(0, col_len_ub) as c_i:
                burst_len = row_len_ub_align // 16
                self.instance.data_move(
                    src_ub[c_i * row_len_ub_align],
                    self.x_gm[index + c_i * row_len],
                    0, 1, burst_len, 0, 0)

    def move_ub_to_gm(self, col_len, col_len_ub, row_len_ub, index, dst_ub):
        """
        move data from ub to gm when c >= 16
        """
        stride = (col_len - col_len_ub) // 16
        if col_len % 16 == 0 and stride < 65535:
            n_burst = row_len_ub
            burst_len = col_len_ub // 16
            self.instance.data_move(self.y_gm[index], dst_ub, 0, n_burst,
                                    burst_len, 0, stride)
        else:
            with self.instance.for_range(0, row_len_ub) as r_i:
                burst_len = col_len_ub // 16
                self.instance.data_move(
                    self.y_gm[index + r_i * col_len],
                    dst_ub[r_i * col_len_ub],
                    0, 1, burst_len, 0, 0)

    def move_ub_to_gm_with_tail(self, input_dict):
        """
        move data from ub to gm when c < 16
        """
        shape = input_dict.get("shape")
        dst_ub = input_dict.get("dst_ub")
        ub_tail = input_dict.get("ub_tail")
        tail_offset = input_dict.get("tail_offset")
        tail_num = input_dict.get("tail_num")
        block_num = input_dict.get("block_num")
        row_index = input_dict.get("row_index")
        out_index = input_dict.get("out_index")
        tail_start = input_dict.get("tail_start")
        total_loop = input_dict.get("total_loop")
        r_i = input_dict.get("r_i")
        num = input_dict.get("num")
        _, col_len, row_len = shape
        col_len_align = (col_len + 15) // 16 * 16
        with self.instance.if_scope(
                tik.all(row_index >= num, block_num > 1)):
            scalar = self.instance.Scalar(ub_tail.dtype)
            with self.instance.for_range(0, col_len) as time:
                scalar.set_as(dst_ub[r_i * col_len_align + time])
                ub_tail[tail_offset + time].set_as(scalar)
            tail_offset.set_as(tail_offset + col_len)
            with self.instance.if_scope(
                    row_index == total_loop * row_len - 1):
                each_burst_num = 32 // self.dsize
                n_burst = self.instance.Scalar("int32")
                n_burst.set_as((tail_num * self.dsize) // 32)
                mod = self.instance.Scalar("int32")
                mod.set_as((tail_num * self.dsize) % 32)
                # 32b alignment
                with self.instance.if_scope(mod == 0):
                    self.instance.data_move(self.y_gm[tail_start], ub_tail,
                                            0, 1, n_burst, 0, 0)
                # bigger than 32b
                with self.instance.else_scope():
                    self.instance.data_move(self.y_gm[tail_start], ub_tail,
                                            0, 1, n_burst, 0, 0)
                    offset = tail_num - each_burst_num
                    scalar = self.instance.Scalar(ub_tail.dtype)
                    with self.instance.for_range(0, each_burst_num) as time:
                        scalar.set_as(ub_tail[offset + time])
                        ub_tail[time].set_as(scalar)
                    self.instance.data_move(self.y_gm[tail_start + offset],
                                            ub_tail, 0, 1, 1, 0, 0)
        with self.instance.else_scope():
            burst_len = col_len_align // 16
            self.instance.data_move(
                self.y_gm[out_index],
                dst_ub[r_i * col_len_align],
                0, 1, burst_len, 0, 0)

    def compute_c_lt_16(self, input_dict):
        """
        processing the scenario where c is less than 16
        """
        n_id = input_dict.get("n_id")
        total_loop = input_dict.get("each_loop")
        tail_offset = input_dict.get("tail_offset")
        ub_tail = input_dict.get("ub_tail")
        shape = input_dict.get("shape")
        x_index = input_dict.get("x_index")
        block_num = input_dict.get("block_num")
        _, col_len, row_len = shape
        col_len_align = (col_len + 15) // 16 * 16
        row_len_ub = self.ub_size // col_len_align // 16 * 16
        row_loop = row_len // row_len_ub
        row_mod = row_len % row_len_ub
        last_num = (16 + col_len - 1) // col_len
        num = total_loop * row_len - last_num
        src_ub = self.instance.Tensor(self.dtype, (self.ub_size,),
                                      name="src_ub", scope=tik.scope_ubuf)
        dst_ub = self.instance.Tensor(self.dtype, (self.ub_size,),
                                      name="dst_ub", scope=tik.scope_ubuf)
        if row_loop > 0:
            with self.instance.for_range(0, row_loop) as r_loop:
                in_index = x_index + n_id * col_len * row_len + \
                           row_len_ub * r_loop
                self.move_gm_to_ub(row_len, col_len, row_len_ub,
                                   src_ub, in_index)
                self.trans_scatter(col_len_align, row_len_ub, src_ub, dst_ub)
                with self.instance.for_range(0, row_len_ub) as r_i:
                    row_index = n_id * row_len + row_len_ub * r_loop + r_i
                    out_index = x_index + n_id * col_len * row_len + \
                                col_len * row_len_ub * r_loop + r_i * col_len
                    tail_start = x_index + total_loop * row_len * col_len - \
                                 last_num * col_len
                    input_dict = {
                        "shape": shape,
                        "dst_ub": dst_ub,
                        "ub_tail": ub_tail,
                        "tail_offset": tail_offset,
                        "tail_num": col_len * last_num,
                        "block_num": block_num,
                        "row_index": row_index,
                        "out_index": out_index,
                        "tail_start": tail_start,
                        "total_loop": total_loop,
                        "r_i": r_i,
                        "num": num,
                    }
                    self.move_ub_to_gm_with_tail(input_dict)
        if row_mod > 0:
            in_index = x_index + n_id * col_len * row_len + \
                       row_len_ub * row_loop
            self.move_gm_to_ub(row_len, col_len, row_mod, src_ub, in_index)
            row_mod_align = (row_mod + 15) // 16 * 16
            self.trans_scatter(col_len_align, row_mod_align, src_ub, dst_ub)
            with self.instance.for_range(0, row_mod) as r_i:
                row_index = n_id * row_len + row_len_ub * row_loop + r_i
                out_index = x_index + n_id * col_len * row_len + \
                            col_len * row_len_ub * row_loop + r_i * col_len
                tail_start = x_index + total_loop * row_len * col_len - \
                             last_num * col_len
                input_dict = {
                    "shape": shape,
                    "dst_ub": dst_ub,
                    "ub_tail": ub_tail,
                    "tail_offset": tail_offset,
                    "tail_num": col_len * last_num,
                    "block_num": block_num,
                    "row_index": row_index,
                    "out_index": out_index,
                    "tail_start": tail_start,
                    "total_loop": total_loop,
                    "r_i": r_i,
                    "num": num,
                }
                self.move_ub_to_gm_with_tail(input_dict)

    def compute_c_ge_16(self, shape, x_index):
        """
        processing the scenario where the value of c is greater than
        or equal to 16
        """
        _, col_len, row_len = shape
        ub_div_16 = self.ub_size // 16
        col_div_16 = col_len // 16 * 16
        col_len_ub = ub_div_16 if ub_div_16 < col_div_16 else col_div_16
        ub_div_col = self.ub_size // col_len_ub // 16 * 16
        row_len_ub = ub_div_col if ub_div_col < row_len else row_len
        row_len_ub_align = (row_len_ub + 15) // 16 * 16
        col_loop = col_len // col_len_ub
        col_mod = col_len % col_len_ub
        row_loop = row_len // row_len_ub
        row_mod = row_len % row_len_ub
        src_ub = self.instance.Tensor(self.dtype, (self.ub_size,),
                                      name="src_ub", scope=tik.scope_ubuf)
        dst_ub = self.instance.Tensor(self.dtype, (self.ub_size,),
                                      name="dst_ub", scope=tik.scope_ubuf)
        if col_loop > 0:
            with self.instance.for_range(0, col_loop) as c_loop:
                with self.instance.for_range(0, row_loop) as r_loop:
                    in_index = x_index + c_loop * col_len_ub * row_len + \
                               row_len_ub * r_loop
                    self.move_gm_to_ub(row_len, col_len_ub, row_len_ub,
                                       src_ub, in_index)
                    self.trans_scatter(col_len_ub, row_len_ub_align,
                                       src_ub, dst_ub)
                    out_index = x_index + col_len * row_len_ub * r_loop + \
                                c_loop * col_len_ub
                    self.move_ub_to_gm(col_len, col_len_ub, row_len_ub,
                                       out_index, dst_ub)
                if row_mod > 0:
                    in_index = x_index + c_loop * col_len_ub * row_len + \
                               row_len_ub * row_loop
                    row_mod_align = (row_mod + 15) // 16 * 16
                    self.move_gm_to_ub(
                        row_len, col_len_ub, row_mod, src_ub, in_index)
                    self.trans_scatter(col_len_ub, row_mod_align,
                                       src_ub, dst_ub)
                    out_index = x_index + col_len * row_len_ub * row_loop + \
                                c_loop * col_len_ub
                    self.move_ub_to_gm(col_len, col_len_ub, row_mod,
                                       out_index, dst_ub)
        if col_mod > 0:
            col_mod_align = (col_mod + 15) // 16 * 16
            offset = col_mod_align - col_mod
            with self.instance.for_range(0, row_loop) as r_loop:
                in_index = x_index + (col_loop * col_len_ub - offset) * \
                           row_len + row_len_ub * r_loop
                self.move_gm_to_ub(
                    row_len, col_mod_align, row_len_ub, src_ub, in_index)
                self.trans_scatter(col_mod_align, row_len_ub_align,
                                   src_ub, dst_ub)
                out_index = x_index + col_len * row_len_ub * r_loop + \
                            col_loop * col_len_ub - offset
                self.move_ub_to_gm(col_len, col_mod_align, row_len_ub,
                                   out_index, dst_ub)
            if row_mod > 0:
                in_index = x_index + (col_loop * col_len_ub - offset) * \
                           row_len + row_len_ub * row_loop
                self.move_gm_to_ub(row_len, col_mod_align, row_mod,
                                   src_ub, in_index)
                self.trans_scatter(col_mod_align, row_mod_align,
                                   src_ub, dst_ub)
                out_index = x_index + col_len * row_len_ub * row_loop + \
                            col_loop * col_len_ub - offset
                self.move_ub_to_gm(col_len, col_mod_align, row_mod,
                                   out_index, dst_ub)

    def permute_compute(self):
        """
        compute permute
        """
        shape, order, _ = self.get_shape_info()
        if order != [0, 2, 1]:
            self.move_without_transform(shape)
        else:
            _, col_len, row_len = shape
            block_num, inner_loop, tail, thread_num = \
                get_block_num_and_loop_cycle(shape)
            element_num = col_len * row_len
            with self.instance.for_range(0, block_num, block_num=block_num) \
                    as block_id:
                each_loop = self.instance.Scalar("int32")
                each_loop.set_as(inner_loop)
                offset = self.instance.Scalar("int32")
                if tail > 0:
                    with self.instance.if_scope(block_id < tail):
                        each_loop.set_as(each_loop + 1)
                offset.set_as(block_id * each_loop)
                if tail > 0:
                    with self.instance.if_scope(block_id >= tail):
                        offset.set_as(block_id * each_loop + tail)
                x_index = self.instance.Scalar("int32")
                x_index.set_as(offset * element_num)
                ub_tail = self.instance.Tensor(self.dtype, (256,),
                                               name="ub_tail",
                                               scope=tik.scope_ubuf)
                tail_offset = self.instance.Scalar("int32")
                tail_offset.set_as(0)
                with self.instance.for_range(
                        0, each_loop, thread_num=thread_num) as n_id:
                    if col_len >= 16:
                        index = self.instance.Scalar("int32")
                        index.set_as(x_index + n_id * col_len * row_len)
                        self.compute_c_ge_16(shape, index)
                    else:
                        input_dict = {
                            "n_id": n_id,
                            "each_loop": each_loop,
                            "tail_offset": tail_offset,
                            "ub_tail": ub_tail,
                            "shape": shape,
                            "x_index": x_index,
                            "block_num": block_num
                        }
                        self.compute_c_lt_16(input_dict)
- Based on the shape information, the processing is divided into three scenarios.
  - Scenario 1: C = 1 or H*W = 1; no transformation is needed.
  - Scenario 2: among the remaining cases, C >= 16.
  - Scenario 3: among the remaining cases, C < 16.
def permute_compute(self):
    """
    compute permute
    """
    shape, order, _ = self.get_shape_info()
    if order != [0, 2, 1]:
        # Scenario 1: C = 1 or H*W = 1, no transformation is needed
        self.move_without_transform(shape)
    else:
        _, col_len, row_len = shape
        block_num, inner_loop, tail, thread_num = \
            get_block_num_and_loop_cycle(shape)
        element_num = col_len * row_len
        with self.instance.for_range(0, block_num, block_num=block_num) \
                as block_id:
            each_loop = self.instance.Scalar("int32")
            each_loop.set_as(inner_loop)
            offset = self.instance.Scalar("int32")
            if tail > 0:
                with self.instance.if_scope(block_id < tail):
                    each_loop.set_as(each_loop + 1)
            offset.set_as(block_id * each_loop)
            if tail > 0:
                with self.instance.if_scope(block_id >= tail):
                    offset.set_as(block_id * each_loop + tail)
            x_index = self.instance.Scalar("int32")
            x_index.set_as(offset * element_num)
            ub_tail = self.instance.Tensor(self.dtype, (256,),
                                           name="ub_tail",
                                           scope=tik.scope_ubuf)
            tail_offset = self.instance.Scalar("int32")
            tail_offset.set_as(0)
            with self.instance.for_range(
                    0, each_loop, thread_num=thread_num) as n_id:
                if col_len >= 16:
                    # Scenario 2: among the remaining cases, C >= 16
                    index = self.instance.Scalar("int32")
                    index.set_as(x_index + n_id * col_len * row_len)
                    self.compute_c_ge_16(shape, index)
                else:
                    # Scenario 3: among the remaining cases, C < 16
                    input_dict = {
                        "n_id": n_id,
                        "each_loop": each_loop,
                        "tail_offset": tail_offset,
                        "ub_tail": ub_tail,
                        "shape": shape,
                        "x_index": x_index,
                        "block_num": block_num
                    }
                    self.compute_c_lt_16(input_dict)
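The scenario split works because, for each batch, an NCHW -> NHWC permutation is just a 2-D transpose between the C axis and the merged H*W axis, which is exactly the (N, C, H*W) layout prepared by get_shape_info(). A small host-side sketch makes this explicit (numpy is used here purely for verification and is not part of the operator):

import numpy as np

n, c, h, w = 2, 3, 4, 5                       # arbitrary example shape
x = np.random.rand(n, c, h, w).astype(np.float16)

# Reference result of the operator: NCHW -> NHWC
ref = x.transpose(0, 2, 3, 1)

# What the kernel computes: merge H and W, then transpose (N, C, HW) -> (N, HW, C)
equiv = x.reshape(n, c, h * w).transpose(0, 2, 1).reshape(n, h, w, c)

assert np.array_equal(ref, equiv)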
- Scenario: C = 1 or H*W = 1; no transformation is needed. In this case, tiling is computed directly from the size of the input data. If multiple cores are to be used, the input data is split, and double buffering and multi-core execution are enabled through the for_range loop for efficient computation. Note that UB tensors must be defined inside the multi-core loop to avoid conflicts at build time.
def move_without_transform(self, shape):
    """
    when C = 1 or H*W = 1, directly move data in and out
    """
    ub_size = (UB_SIZE_B - 1024) // 2 // self.dsize // 16 * 16
    # Decide from the tiling result whether multiple cores can be used;
    # if so, a multi-core loop must be specified
    if shape[0] <= 16:
        # when the total size is <= 16, use a single core
        block_num = 1
    else:
        # when the total size is > 16, use multiple cores
        all_block_num = shape[0] // 16
        block_num = AICORE_NUM
        if all_block_num < AICORE_NUM:
            block_num = all_block_num
    each_len = shape[0] // block_num
    each_mod = shape[0] % block_num
    thread_num = 1
    if each_len // ub_size > 1:
        thread_num = 2
    # Split the input data and enable double buffering and multi-core
    # execution through the for_range loop for efficient computation
    with self.instance.for_range(0, block_num, block_num=block_num) \
            as block_id:
        each_size = self.instance.Scalar("int32")
        each_size.set_as(each_len)
        with self.instance.if_scope(block_id == block_num - 1):
            each_size.set_as(each_len + each_mod)
        ub_loop = each_size // ub_size
        ub_mod = each_size % ub_size
        with self.instance.for_range(0, ub_loop,
                                     thread_num=thread_num) as loop_id:
            # UB tensors must be defined inside the multi-core loop to
            # avoid conflicts at build time
            src_ub = self.instance.Tensor(self.dtype, (ub_size,),
                                          name="src_ub",
                                          scope=tik.scope_ubuf)
            burst_len = ub_size // 16
            self.instance.data_move(
                src_ub,
                self.x_gm[each_len * block_id + loop_id * ub_size],
                0, 1, burst_len, 0, 0)
            self.instance.data_move(
                self.y_gm[each_len * block_id + loop_id * ub_size],
                src_ub, 0, 1, burst_len, 0, 0)
        with self.instance.if_scope(ub_mod > 0):
            src_ub = self.instance.Tensor(self.dtype, (ub_size,),
                                          name="src_ub",
                                          scope=tik.scope_ubuf)
            with self.instance.if_scope(
                    tik.all(block_num > 1, ub_mod % 16 != 0)):
                src_ub_1 = self.instance.Tensor(self.dtype, (16,),
                                                name="src_ub_1",
                                                scope=tik.scope_ubuf)
                index = each_len * block_id + ub_loop * ub_size
                with self.instance.if_scope(ub_mod >= 16):
                    burst_len = ub_mod // 16
                    self.instance.data_move(src_ub, self.x_gm[index],
                                            0, 1, burst_len, 0, 0)
                    self.instance.data_move(self.y_gm[index], src_ub,
                                            0, 1, burst_len, 0, 0)
                    # Tail-block handling: read backwards over earlier data
                    # to pad to 16-element alignment, then move the block out
                    offset = index + burst_len * 16 - 16 + ub_mod % 16
                    self.instance.data_move(src_ub_1, self.x_gm[offset],
                                            0, 1, 1, 0, 0)
                    self.instance.data_move(self.y_gm[offset], src_ub_1,
                                            0, 1, 1, 0, 0)
                with self.instance.else_scope():
                    offset = index - 16 + ub_mod % 16
                    self.instance.data_move(src_ub_1, self.x_gm[offset],
                                            0, 1, 1, 0, 0)
                    self.instance.data_move(self.y_gm[offset], src_ub_1,
                                            0, 1, 1, 0, 0)
            with self.instance.else_scope():
                burst_len = (ub_mod + 15) // 16
                self.instance.data_move(
                    src_ub,
                    self.x_gm[each_len * block_id + ub_loop * ub_size],
                    0, 1, burst_len, 0, 0)
                self.instance.data_move(
                    self.y_gm[each_len * block_id + ub_loop * ub_size],
                    src_ub, 0, 1, burst_len, 0, 0)
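As a concrete illustration of this tiling, the following trace assumes UB_SIZE_B = 253952 bytes and AICORE_NUM = 2 (the real values are queried via get_soc_spec() and vary by chip) and an input of shape (1, 1, 224, 224), which get_shape_info() flattens to [50176]:

# Hypothetical trace of the scenario-1 tiling (assumed platform constants):
UB_SIZE_B, AICORE_NUM, dsize = 253952, 2, 2
shape = [1 * 1 * 224 * 224]                              # [50176] after get_shape_info()
ub_size = (UB_SIZE_B - 1024) // 2 // dsize // 16 * 16    # 63232 float16 elements per buffer
block_num = min(shape[0] // 16, AICORE_NUM)              # 2 cores (mirrors the shape[0] > 16 branch)
each_len = shape[0] // block_num                         # 25088 elements per core
each_mod = shape[0] % block_num                          # 0, no remainder for the last core
thread_num = 2 if each_len // ub_size > 1 else 1         # 1: each core's data fits one UB pass, no double buffering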
- The other two scenarios.
  - Scenario 2: among the remaining cases, C >= 16.
  - Scenario 3: among the remaining cases, C < 16.
Both scenarios use the same tiling strategy: the N axis is split across multiple cores. They differ in how tail blocks are handled during the transformation.
(1) get_block_num_and_loop_cycle() determines whether to enable multi-core execution and double buffering, and how many loop iterations each core processes.
def get_block_num_and_loop_cycle(shape):
    """
    get block dim and loop cycle

    Parameters
    ----------
    shape: input shape

    Returns
    -------
    block_num: the number of cores
    inner_loop: the number of cycles per core
    inner_loop_mod: the number of remaining cycles
    thread_num: whether to enable double buffer 1:false 2:true
    """
    batch, col_len, row_len = shape
    size = batch * col_len * row_len
    block_num = AICORE_NUM
    inner_loop = 1
    inner_loop_mod = 0
    thread_num = 1
    if size <= 16:
        # when the total size is no more than 16, use a single core
        block_num = 1
        return block_num, inner_loop, inner_loop_mod, thread_num
    all_block_num = shape[0]
    if col_len * row_len >= 16:
        # when C*H*W >= 16, split the N axis directly across cores
        if all_block_num < AICORE_NUM:
            block_num = all_block_num
    else:
        # when C*H*W < 16, first group the N axis so that num*C*H*W >= 16,
        # then split the remaining size across cores
        chw = col_len * row_len
        num = (16 + chw) // chw
        if batch // num < AICORE_NUM:
            block_num = batch // num
    inner_loop = all_block_num // block_num
    inner_loop_mod = all_block_num % block_num
    if inner_loop > 1:
        thread_num = 2
    return block_num, inner_loop, inner_loop_mod, thread_num
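For example, again assuming AICORE_NUM = 2, an input of shape (8, 3, 224, 224) is first merged by get_shape_info() into (8, 3, 50176), and the tiling then works out as follows:

# Hypothetical trace, assuming AICORE_NUM = 2:
shape = (8, 3, 50176)            # (N, C, H*W) after get_shape_info()
# C*H*W = 150528 >= 16, so the N axis is split directly across cores:
block_num = min(8, 2)            # 2 cores
inner_loop = 8 // block_num      # 4 batches per core
inner_loop_mod = 8 % block_num   # 0 leftover batches
thread_num = 2                   # inner_loop > 1, so double buffering is enabled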
(2) compute_c_ge_16() handles the C >= 16 scenario. H and W are treated as a single axis, and the C axis is tiled first to ensure that a sufficiently large block of the C axis fits into UB.
def compute_c_ge_16(self, shape, x_index):
    """
    processing the scenario where the value of c is greater than
    or equal to 16
    """
    _, col_len, row_len = shape
    ub_div_16 = self.ub_size // 16
    col_div_16 = col_len // 16 * 16
    col_len_ub = ub_div_16 if ub_div_16 < col_div_16 else col_div_16
    ub_div_col = self.ub_size // col_len_ub // 16 * 16
    row_len_ub = ub_div_col if ub_div_col < row_len else row_len
    row_len_ub_align = (row_len_ub + 15) // 16 * 16
    col_loop = col_len // col_len_ub
    col_mod = col_len % col_len_ub
    row_loop = row_len // row_len_ub
    row_mod = row_len % row_len_ub
    src_ub = self.instance.Tensor(self.dtype, (self.ub_size,),
                                  name="src_ub", scope=tik.scope_ubuf)
    dst_ub = self.instance.Tensor(self.dtype, (self.ub_size,),
                                  name="dst_ub", scope=tik.scope_ubuf)
    if col_loop > 0:
        with self.instance.for_range(0, col_loop) as c_loop:
            with self.instance.for_range(0, row_loop) as r_loop:
                in_index = x_index + c_loop * col_len_ub * row_len + \
                           row_len_ub * r_loop
                # move data from GM to UB
                self.move_gm_to_ub(row_len, col_len_ub, row_len_ub,
                                   src_ub, in_index)
                # CHW -> HWC can be viewed as C,H*W -> H*W,C,
                # treating H*W as one dimension
                self.trans_scatter(col_len_ub, row_len_ub_align,
                                   src_ub, dst_ub)
                out_index = x_index + col_len * row_len_ub * r_loop + \
                            c_loop * col_len_ub
                # move data from UB to GM
                self.move_ub_to_gm(col_len, col_len_ub, row_len_ub,
                                   out_index, dst_ub)
            if row_mod > 0:
                in_index = x_index + c_loop * col_len_ub * row_len + \
                           row_len_ub * row_loop
                row_mod_align = (row_mod + 15) // 16 * 16
                self.move_gm_to_ub(
                    row_len, col_len_ub, row_mod, src_ub, in_index)
                self.trans_scatter(col_len_ub, row_mod_align,
                                   src_ub, dst_ub)
                out_index = x_index + col_len * row_len_ub * row_loop + \
                            c_loop * col_len_ub
                self.move_ub_to_gm(col_len, col_len_ub, row_mod,
                                   out_index, dst_ub)
    if col_mod > 0:
        col_mod_align = (col_mod + 15) // 16 * 16
        offset = col_mod_align - col_mod
        # C-axis tail handling: take a few extra rows from above to make up
        # 16 rows before processing
        with self.instance.for_range(0, row_loop) as r_loop:
            in_index = x_index + (col_loop * col_len_ub - offset) * \
                       row_len + row_len_ub * r_loop
            self.move_gm_to_ub(
                row_len, col_mod_align, row_len_ub, src_ub, in_index)
            self.trans_scatter(col_mod_align, row_len_ub_align,
                               src_ub, dst_ub)
            out_index = x_index + col_len * row_len_ub * r_loop + \
                        col_loop * col_len_ub - offset
            self.move_ub_to_gm(col_len, col_mod_align, row_len_ub,
                               out_index, dst_ub)
        if row_mod > 0:
            # H*W-axis tail handling: read some extra data beyond the tail to
            # pad to a multiple of 16; after the transformation, only the
            # actual data is written out
            in_index = x_index + (col_loop * col_len_ub - offset) * \
                       row_len + row_len_ub * row_loop
            self.move_gm_to_ub(row_len, col_mod_align, row_mod,
                               src_ub, in_index)
            self.trans_scatter(col_mod_align, row_mod_align,
                               src_ub, dst_ub)
            out_index = x_index + col_len * row_len_ub * row_loop + \
                        col_loop * col_len_ub - offset
            self.move_ub_to_gm(col_len, col_mod_align, row_mod,
                               out_index, dst_ub)
(3) compute_c_lt_16() handles the C < 16 scenario. H and W are treated as a single axis; when H*W does not fit into UB at once, the H*W data must be tiled.
def compute_c_lt_16(self, input_dict):
    """
    processing the scenario where c is less than 16
    """
    n_id = input_dict.get("n_id")
    total_loop = input_dict.get("each_loop")
    tail_offset = input_dict.get("tail_offset")
    ub_tail = input_dict.get("ub_tail")
    shape = input_dict.get("shape")
    x_index = input_dict.get("x_index")
    block_num = input_dict.get("block_num")
    _, col_len, row_len = shape
    col_len_align = (col_len + 15) // 16 * 16
    row_len_ub = self.ub_size // col_len_align // 16 * 16
    row_loop = row_len // row_len_ub
    row_mod = row_len % row_len_ub
    last_num = (16 + col_len - 1) // col_len
    num = total_loop * row_len - last_num
    src_ub = self.instance.Tensor(self.dtype, (self.ub_size,),
                                  name="src_ub", scope=tik.scope_ubuf)
    dst_ub = self.instance.Tensor(self.dtype, (self.ub_size,),
                                  name="dst_ub", scope=tik.scope_ubuf)
    if row_loop > 0:
        with self.instance.for_range(0, row_loop) as r_loop:
            in_index = x_index + n_id * col_len * row_len + \
                       row_len_ub * r_loop
            # move data from GM to UB
            self.move_gm_to_ub(row_len, col_len, row_len_ub,
                               src_ub, in_index)
            # CHW -> HWC can be viewed as C,H*W -> H*W,C,
            # treating H*W as one dimension
            self.trans_scatter(col_len_align, row_len_ub, src_ub, dst_ub)
            # Because C < 16, the C axis is padded during the transformation,
            # and the padding must be removed when the data is moved out.
            # The rows are processed one by one with for_range; in the
            # multi-core case, the last >= 32B of data needs special handling
            # to prevent cores from overwriting each other.
            with self.instance.for_range(0, row_len_ub) as r_i:
                row_index = n_id * row_len + row_len_ub * r_loop + r_i
                out_index = x_index + n_id * col_len * row_len + \
                            col_len * row_len_ub * r_loop + r_i * col_len
                tail_start = x_index + total_loop * row_len * col_len - \
                             last_num * col_len
                input_dict = {
                    "shape": shape,
                    "dst_ub": dst_ub,
                    "ub_tail": ub_tail,
                    "tail_offset": tail_offset,
                    "tail_num": col_len * last_num,
                    "block_num": block_num,
                    "row_index": row_index,
                    "out_index": out_index,
                    "tail_start": tail_start,
                    "total_loop": total_loop,
                    "r_i": r_i,
                    "num": num,
                }
                self.move_ub_to_gm_with_tail(input_dict)
    if row_mod > 0:
        in_index = x_index + n_id * col_len * row_len + \
                   row_len_ub * row_loop
        self.move_gm_to_ub(row_len, col_len, row_mod, src_ub, in_index)
        row_mod_align = (row_mod + 15) // 16 * 16
        self.trans_scatter(col_len_align, row_mod_align, src_ub, dst_ub)
        with self.instance.for_range(0, row_mod) as r_i:
            row_index = n_id * row_len + row_len_ub * row_loop + r_i
            out_index = x_index + n_id * col_len * row_len + \
                        col_len * row_len_ub * row_loop + r_i * col_len
            tail_start = x_index + total_loop * row_len * col_len - \
                         last_num * col_len
            input_dict = {
                "shape": shape,
                "dst_ub": dst_ub,
                "ub_tail": ub_tail,
                "tail_offset": tail_offset,
                "tail_num": col_len * last_num,
                "block_num": block_num,
                "row_index": row_index,
                "out_index": out_index,
                "tail_start": tail_start,
                "total_loop": total_loop,
                "r_i": r_i,
                "num": num,
            }
            self.move_ub_to_gm_with_tail(input_dict)
- Call BuildCCE() to build the kernel.
permute_process.instance.BuildCCE(kernel_name=kernel_name,
                                  inputs=permute_process.x_gm,
                                  outputs=permute_process.y_gm)
Operator Plugin Implementation
You need to implement the ParseParamsPermute function to map the attributes of the original Caffe Permute operator to those of the PermuteTik operator adapted to the Ascend AI Processor.
The ParseParamsPermute function is implemented as follows:
Status ParseParamsPermute(const ge::Operator& op_src, ge::Operator& op_dest)
{
    vector<int64_t> orders;
    if (ge::GRAPH_SUCCESS == op_src.GetAttr(ATTR_ORDER, orders)) {
        op_dest.SetAttr(ATTR_ORDER, orders);
    }
    return SUCCESS;
}
Operator Prototype Definition
The prototype of the PermuteTik operator is defined in permute_tik.h:
REG_OP(PermuteTik)
    .INPUT(x, TensorType({DT_FLOAT16, DT_FLOAT}))
    .OUTPUT(y, TensorType({DT_FLOAT16, DT_FLOAT}))
    .ATTR(order, ListInt, {0})
    .OP_END_FACTORY_REG(PermuteTik)
The key point of the prototype definition is inferring the shape and dtype of the output tensor, as shown below:
static graphStatus TransposeCommonInferShape(const std::vector<int64_t>& order_list,
                                             Operator& op)
{
    Shape shape = op.GetInputDesc("x").GetShape();
    size_t dim_num = shape.GetDimNum();
    if (order_list.empty() || (order_list.size() != dim_num)) {
        return GRAPH_FAILED;
    }
    for (size_t i = 0; i < dim_num; ++i) {
        if ((size_t)order_list[i] >= dim_num || (size_t)order_list[i] < 0) {
            return GRAPH_FAILED;
        }
    }
    vector<int64_t> out_vec;
    for (size_t i = 0; i < dim_num; ++i) {
        out_vec.push_back(shape.GetDim(order_list[i]));
    }
    Shape out_shape(out_vec);
    TensorDesc tensordesc_output = op.GetOutputDesc("y");
    tensordesc_output.SetShape(out_shape);
    tensordesc_output.SetDataType(op.GetInputDesc("x").GetDataType());
    (void)op.UpdateOutputDesc("y", tensordesc_output);
    return GRAPH_SUCCESS;
}

IMPLEMT_COMMON_INFERFUNC(PermuteTikInferShape)
{
    auto input_shape = op.GetInputDesc("x").GetShape();
    std::vector<int64_t> input_shape_dims = input_shape.GetDims();
    std::vector<int64_t> perm_list;
    if (ge::GRAPH_SUCCESS != op.GetAttr("order", perm_list)) {
        return GRAPH_FAILED;
    }
    for (size_t i = 0; i < input_shape_dims.size(); ++i) {
        if (std::find(perm_list.begin(), perm_list.end(), i) == perm_list.end()) {
            perm_list.push_back((int64_t)i);
        }
    }
    op.SetAttr("order", perm_list);
    return TransposeCommonInferShape(perm_list, op);
}

COMMON_INFER_FUNC_REG(PermuteTik, PermuteTikInferShape);
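The inference logic simply permutes the input dimensions according to order. The following host-side sketch mirrors the same computation (for illustration only; the authoritative logic is the C++ code above):

def infer_permuted_shape(input_shape, order):
    # Mirrors TransposeCommonInferShape: output dim i takes input dim order[i]
    return [input_shape[i] for i in order]

# infer_permuted_shape([8, 3, 224, 224], [0, 2, 3, 1]) -> [8, 224, 224, 3]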
Operator Information Definition
For the information definition file of the PermuteTik operator, see "tbe/op_info_cfg/ai_core/<soc_version>/permute_tik.ini".
Note: Currently, the information definition file for the PermuteTik operator is provided only for the Ascend 310 AI Processor. To run this operator on other Ascend AI Processors, implement the corresponding operator information definition file and place it in the matching <soc_version> directory.
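For reference, the content of such an information definition file generally looks like the sketch below. The field values here are inferred from the design specification above and are not copied from the shipped file; since permute_tik matches the default OpType-to-file-name conversion, opFile.value and opInterface.value could also be omitted. Consult the permute_tik.ini delivered with the sample for the authoritative content.

[PermuteTik]
input0.name=x
input0.dtype=float16
input0.format=NCHW
input0.paramType=required
output0.name=y
output0.dtype=float16
output0.format=NCHW
output0.paramType=required
opFile.value=permute_tik
opInterface.value=permute_tik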