Deep-learning models rely centrally on algebraic operations involving tensors — higher-dimensional analogues of matrices — that might be repeated tens of thousands of times. Efficient learning requires optimizing frequently repeated tensor operations.
But operations involving tensors of different shapes — 32x32, 64x64, 128x128, etc. — have to be optimized individually. Auto-schedulers are programs that learn optimizations for shapes whose implementations may be suboptimal in current tensor operation libraries.
Existing auto-schedulers struggle, however, with workloads whose shapes vary. Many natural-language-processing applications, for instance, take inputs of arbitrary length, which means tensors of arbitrary shape.
At this year’s Conference on Machine Learning and Systems (MLSys), we and our colleagues presented a new auto-scheduler called DietCode, which handles dynamic-shape workloads much more efficiently than its predecessors. Where existing auto-schedulers have to optimize each possible shape individually, DietCode constructs a shape-generic search space that enables it to optimize all possible shapes simultaneously.
We tested our approach on a natural-language-processing (NLP) task that could take inputs ranging in size from 1 to 128 tokens. When we used a random sampling of input sizes that reflects a plausible real-world distribution, we sped up the optimization process almost sixfold relative to the best prior auto-scheduler. That speedup increased to more than 94-fold when we considered all possible shapes.
Despite being much faster, DietCode also improves the performance of the resulting code, by up to 70% relative to prior auto-schedulers and up to 19% relative to hand-optimized code in existing tensor operation libraries. It thus promises to speed up our customers’ dynamic-shaped machine learning workloads.
Dynamic workloads
NLP models that handle text strings of arbitrary length are examples of dynamic-by-design models, which allow variably sized inputs. But other applications also call for dynamic workloads.
Neural-architecture search, for instance, tries out different deep-learning architectures by building them up from different-shaped components, which requires operations on different-shaped tensors. And some models, such as the BERT language model, apply the same operation at different layers of a network, which have different numbers of nodes.
Microkernels
Auto-schedulers typically rely on computational kernels, program templates that greatly speed up the evaluation of different candidate optimizations. Odd-shaped workloads, however, may not fit the kernels precisely. For instance, if a tensor has 513 elements along one of its dimensions, but the kernel capacity is only 512, then two kernels must be tiled together in order to accommodate the tensor.
The tiled kernels, however, have a combined capacity of 1,024 along the relevant dimension, compared with only 513 for the input tensor. The input tensor thus has to be padded out in order to fill the kernels. This padding can slow the computation down dramatically, as it leads to unnecessary calculations whose results then have to be pruned out of the output.
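To make that arithmetic concrete, here is a minimal sketch of the tiling-and-padding calculation; the function name is ours, purely for illustration, and not part of any auto-scheduler's API:

```python
import math

def tiling_cost(dim_size: int, kernel_size: int) -> tuple[int, int]:
    """Number of kernel tiles needed to cover one dimension, and the padding that entails."""
    num_tiles = math.ceil(dim_size / kernel_size)
    padding = num_tiles * kernel_size - dim_size  # zero-filled elements that are computed, then discarded
    return num_tiles, padding

# A 513-element dimension covered by 512-wide kernels needs two tiles and 511 padded
# elements, so nearly half of the work along that dimension is wasted.
print(tiling_cost(513, 512))  # (2, 511)
```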
DietCode uses microkernels that are sized according to the available hardware, not the input shape, which aids in optimization for that hardware. For a given hardware configuration, DietCode can also generate a range of different microkernel shapes and sizes, which can be used in combination.
The microkernels are small enough that they can usually be tiled across an input, to fit its shape more precisely. This may still require some padding at the edges, but much less than larger kernels require.
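Continuing the same hypothetical 513-element example, the waste shrinks quickly as the kernel width does (the specific widths below are illustrative, not taken from DietCode):

```python
import math

# Fraction of tiled work that is padding for a 513-element dimension,
# under several hypothetical kernel widths.
for kernel_width in (512, 128, 64, 16):
    tiles = math.ceil(513 / kernel_width)
    padding = tiles * kernel_width - 513
    waste = padding / (tiles * kernel_width)
    print(f"width {kernel_width:>3}: {tiles:>2} tiles, {padding:>3} padded elements, {waste:.0%} wasted")

# width 512:  2 tiles, 511 padded elements, 50% wasted
# width 128:  5 tiles, 127 padded elements, 20% wasted
# width  64:  9 tiles,  63 padded elements, 11% wasted
# width  16: 33 tiles,  15 padded elements,  3% wasted
```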
The real advantage of microkernels, however, is that they enable DietCode to optimize operators for multiple shapes at once. A standard auto-scheduler will take a workload shape, pad it as necessary to fit its tiled kernels, and then estimate the efficiency of different implementations using a cost model that extracts program features such as loop structures and memory access patterns. Then it will repeat that process for the next shape.
DietCode, by contrast, breaks operators up across microkernels. The cost model has two components: one that evaluates features of the partial operation assigned to each microkernel and one that evaluates the cost of stitching those partial operations together to form a complete operator.
Here is where we realize our greatest gains in efficiency, because each partial operation is a component of operators for multiple workload shapes. Compared with the computational cost of evaluating the partial operations, a machine learning process that involves real hardware measurements, the cost of stitching partial operations together is low.
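To make the structure of that two-component cost model concrete, here is a heavily simplified sketch under our own assumptions; the class, function names, and cost formula are illustrative and do not reproduce DietCode's actual model:

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class Microkernel:
    """A hypothetical microkernel covering a tile_m x tile_n block of the output."""
    tile_m: int
    tile_n: int
    predicted_cost: float  # expensive component: learned from real hardware measurements

def stitching_penalty(shape: tuple[int, int], mk: Microkernel) -> float:
    """Cheap, analytical component: the fraction of the tiled region that is padding
    when microkernel tiles are stitched together to cover the workload shape."""
    m, n = shape
    covered = ceil(m / mk.tile_m) * mk.tile_m * ceil(n / mk.tile_n) * mk.tile_n
    return (covered - m * n) / covered

def workload_cost(shape: tuple[int, int], mk: Microkernel) -> float:
    """The per-microkernel term is evaluated once and reused for every shape;
    only the cheap stitching term has to be recomputed per shape."""
    m, n = shape
    num_tiles = ceil(m / mk.tile_m) * ceil(n / mk.tile_n)
    return num_tiles * mk.predicted_cost * (1 + stitching_penalty(shape, mk))

# One set of hardware measurements for a 64x64 microkernel serves every sequence length.
mk = Microkernel(tile_m=64, tile_n=64, predicted_cost=1.0)
for seq_len in (17, 64, 100, 128):
    print(seq_len, round(workload_cost((seq_len, 768), mk), 2))
```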
With our optimized microkernels in hand, we train an efficient decision tree model to map workload shapes to microkernels. That decision tree is incorporated into the compiled binary that executes the tensor operations, so that inputs of arbitrary shape are routed to the proper microkernels for processing.
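As a toy illustration of the routing step, here is our own sketch using a scikit-learn decision tree; DietCode itself embeds the learned tree in the compiled module rather than calling out to Python, and the training data below is invented:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical tuning results: for each observed sequence length, the index of the
# microkernel whose generated kernel performed best on that shape.
seq_lengths = [[8], [16], [32], [48], [64], [96], [128]]
best_microkernel = [0, 0, 0, 1, 1, 2, 2]

router = DecisionTreeClassifier(max_depth=3).fit(seq_lengths, best_microkernel)

# At run time, an input of length 50 is routed to the kernel built around microkernel 1.
print(router.predict([[50]]))  # [1]
```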
For experimental results and more details, please refer to our paper.
Acknowledgements: Cody Yu, Yizhi Liu, Gennady Pekhimenko