FlashFuser: Boosting Deep Learning Efficiency for Energy Innovations

Researchers from the National University of Defense Technology in China have developed a new compiler framework called FlashFuser, designed to optimize the performance of deep learning workloads on modern GPUs. The team, led by Ziyu Huang and Yangjie Zhou, presents their findings in a paper published in the Proceedings of the ACM on Programming Languages.

Deep learning workloads are often constrained by memory bandwidth, as computation throughput continues to improve faster than memory access speeds. Kernel fusion is a technique that alleviates this issue by combining multiple operations into a single kernel, reducing the amount of data that must be moved between off-chip memory and the processor. However, existing compilers and frameworks are limited to the local scratchpad memory of a single core when fusing kernels, which is often too small to hold larger intermediate results.
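As a toy illustration of why fusion helps (not code from the paper), the sketch below tallies the array-sized memory transfers of an unfused operator chain against a fused one; plain Python stands in for GPU kernels:

```python
import math

def unfused(xs):
    # Three separate "kernels": each reads its input array from memory
    # and writes an intermediate back, so total memory traffic is
    # roughly 6 * len(xs) elements.
    a = [math.exp(v) for v in xs]    # read xs, write a
    b = [v * 2.0 for v in a]         # read a,  write b
    return [v + 1.0 for v in b]      # read b,  write result

def fused(xs):
    # One fused pass: read xs once, keep intermediates "on chip"
    # (here, in local variables), write the result once --
    # roughly 2 * len(xs) elements of traffic.
    return [math.exp(v) * 2.0 + 1.0 for v in xs]

xs = [0.1 * i for i in range(8)]
assert fused(xs) == unfused(xs)  # same result, a third of the traffic
```

In a real fused GPU kernel the intermediates live in registers or scratchpad; the limitation the article describes is that scratchpad capacity caps how large those intermediates can be.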

FlashFuser addresses this limitation by utilizing the inter-core connection mechanism known as Distributed Shared Memory (DSM) found in modern GPUs like the NVIDIA H100. DSM provides a larger, high-bandwidth, and low-latency on-chip memory pool that can be shared among cores. FlashFuser extends established fusion techniques to the DSM domain through three main contributions.

First, the researchers propose a powerful DSM-based communication abstraction that formalizes complex cluster-based data exchange patterns, such as reduce, shuffle, and multiply. This abstraction allows for more efficient data movement and communication between cores.
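The paper's abstraction is not reproduced here, but a toy model can illustrate the kind of patterns it formalizes. Below, each "core" owns one slot in a shared pool, and reduce and shuffle become reads of peers' slots rather than round-trips through global memory (all names are illustrative):

```python
def cluster_reduce(slots):
    # reduce: every core sums all peers' partial results
    # (e.g. combining split partial products of a matmul).
    total = [sum(col) for col in zip(*slots)]
    return [total for _ in slots]

def cluster_shuffle(slots, perm):
    # shuffle: core i reads the slot owned by core perm[i]
    # (e.g. re-tiling an intermediate between two fused operators).
    return [slots[p] for p in perm]

slots = [[1, 2], [3, 4], [5, 6], [7, 8]]
assert cluster_reduce(slots)[0] == [16, 20]
assert cluster_shuffle(slots, [1, 2, 3, 0]) == [[3, 4], [5, 6], [7, 8], [1, 2]]
```

On an H100, the shared pool corresponds to the distributed shared memory visible across a thread block cluster, so these exchanges stay on chip instead of spilling to HBM.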

Second, FlashFuser introduces a dataflow analyzer that generalizes loop scheduling, resource mapping, and tile selection to the distributed memory hierarchy. The analyzer determines the optimal execution order and tile sizes by quantifying data movement across memory levels, ensuring that data is moved as efficiently as possible.
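A minimal sketch of the tile-selection idea, with an invented cost model rather than FlashFuser's actual analysis: for a tiled matrix multiply, model the global-memory traffic of each candidate tile, discard tiles whose working set overflows scratchpad, and keep the cheapest (all sizes and limits here are illustrative assumptions):

```python
def traffic_bytes(M, N, K, tm, tn, dtype_bytes=2):
    # Tiled matmul C[M,N] = A[M,K] @ B[K,N]: each (tm x tn) output tile
    # streams a tm x K strip of A and a K x tn strip of B, then writes
    # the tile once. Larger tiles reuse each strip across more outputs,
    # so total traffic shrinks.
    tiles = (M // tm) * (N // tn)
    return tiles * (tm * K + K * tn + tm * tn) * dtype_bytes

def best_tile(M, N, K, candidates, tk=64, smem_bytes=192 * 1024, dtype_bytes=2):
    # Discard tiles whose working set (one K-slice of A and B plus the
    # accumulator) overflows scratchpad, then pick the feasible tile
    # with the least modeled global-memory traffic.
    def fits(tm, tn):
        return (tm * tk + tk * tn + tm * tn) * dtype_bytes <= smem_bytes
    feasible = [(tm, tn) for tm, tn in candidates if fits(tm, tn)]
    return min(feasible, key=lambda t: traffic_bytes(M, N, K, *t))

tm, tn = best_tile(4096, 4096, 4096, [(64, 64), (128, 128), (256, 128)])
```

FlashFuser's analyzer generalizes this accounting across the full hierarchy, including the DSM level between scratchpad and global memory.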

Finally, FlashFuser integrates these components into a unified search engine that employs analytical cost modeling and DSM-aware pruning strategies to efficiently discover the optimal execution plan. This search engine automatically optimizes the kernel fusion process, reducing the need for manual tuning.
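The sketch below shows the general shape of such a search, with hypothetical cost and lower-bound functions standing in for FlashFuser's analytical model: any candidate whose optimistic bound cannot beat the best plan found so far is pruned without being fully evaluated.

```python
def search_plan(tiles, orders, cost, lower_bound):
    # Enumerate (tile, loop order) plans, pruning whole tile candidates
    # whose optimistic lower bound already exceeds the best full cost
    # seen, so most combinations are never scored.
    best, best_cost = None, float("inf")
    for t in tiles:
        if lower_bound(t) >= best_cost:
            continue  # pruned: cannot beat the current best plan
        for o in orders:
            c = cost(t, o)
            if c < best_cost:
                best, best_cost = (t, o), c
    return best, best_cost

# Toy stand-ins for an analytical cost model and its lower bound.
plan, c = search_plan([3, 1, 2], [5, 2],
                      cost=lambda t, o: t * 10 + o,
                      lower_bound=lambda t: t * 10)
```

With these toy inputs the tile `2` is skipped entirely, since its bound of 20 cannot improve on the cost of 12 already found for tile `1`.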

The researchers evaluated FlashFuser on an NVIDIA H100 GPU and found that it reduced memory access by 58% and delivered kernel speedups of 3.3x against highly-tuned libraries and 4.1x against state-of-the-art compilers. This resulted in a 1.24x end-to-end speedup for deep learning workloads.

For the energy sector, this research could help optimize the machine learning models used for energy forecasting, grid management, and related tasks. By cutting memory traffic and improving computational efficiency, FlashFuser could also lower the energy consumption and carbon footprint of the data centers and high-performance computing systems the industry relies on.

This article is based on research available on arXiv.
