High Performance AMD Matrix Core Codegen

Nov 25 2024 Technical 28 minutes read (About 4221 words)

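In a post not long ago, I introduced the asynchronous copy instructions on AMD's CDNA architecture (MI210, MI250, MI300); the corresponding implementation can be found in BitBLAS. In practice, however, the requirements of AMD's async copy instruction turned out to be stricter than that post described: the threads in each warp must access contiguous data, or else the per-thread offset has to be controlled through the M0 register.

Normally we expect such an instruction to simply load a small block of data at a given pointer, but the two restrictions above make that hard to achieve. After a very tedious round of micro-benchmarks, I finally arrived at a form that lets each thread load a given block of data, as follows: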

template <bool pre_nop = false>
CK_TILE_DEVICE void async_buffer_load_dword_v(void* smem, int32x4_t rsrc, index_t voffset) {
  // Broadcast the LDS destination address from the first active lane so it lives in an SGPR.
  auto const lds_ptr_sgpr = __builtin_amdgcn_readfirstlane((reinterpret_cast<uintptr_t>(smem)));
  asm volatile(
      // Write the wave-uniform LDS address into M0, then issue the LDS-targeted buffer load.
      "s_mov_b32 m0, %0; \n\t"
      "buffer_load_dword %1, %2, 0 offen lds;\n\t" ::"s"(lds_ptr_sgpr),
      "v"(voffset), "s"(rsrc)
      : "memory");
}

// Call site: each thread loads its own block of N bytes.
if constexpr(N == 4) {
  async_buffer_load_dword_v(lds_base_ptr, make_wave_buffer_resource(((int32_t *)global_base_ptr) - threadIdx.x), threadIdx.x * N /*assume 4 bytes*/);
}

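In this post I fill in the AMD Matrix Core gap and describe the high-performance Matrix Core support that BitBLAS has gained for AMD over the past month. I introduce MFMA (AMD's counterpart to MMA), show how to analyze the performance of and profile an AMD kernel, and finally cover several optimizations that took a great deal of head-scratching and that make full use of the hardware bandwidth (every memory access is a 128-bit instruction, with no memory bank conflicts).

The operators covered in this post are matrix multiplication and Flash Attention. The implementation lives in BitBLAS; codegen and layout transformations such as swizzling rely on TVM, which lets us manipulate a data layout explicitly and is more flexible and easier to inspect than Triton. Although the documentation AMD provides is very limited, over this month I consulted many implementations written by AMD developers, such as Composable Kernel and Triton for ROCm, and learned a great deal from them.

This post assumes the reader has some familiarity with Nvidia GPU programming, the basic tile-based optimization methods, and the basic concepts of Tensor Cores.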

AMD Async Copy

Nov 12 2024 Technical 14 minutes read (About 2047 words)

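I recently added an AMD backend to BitBLAS and found that AMD's asynchronous copy differs substantially from Nvidia's (yet FA3 on MI300 needs this feature). There is no official documentation at all, only the Instruction Set reference, so here I write down my own understanding and interpretation; most of the content is based on that Instruction Set reference.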

Extending TVM with CMake Include Dependencies

Oct 11 2024 Technical 10 minutes read (About 1454 words)

As discussed in Phasing out Legacy Components, third-party developers often modify TVM in place rather than contributing their changes upstream, for several reasons. First, TVM’s codebase is complex, and understanding or modifying it requires significant effort. Developers frequently face scenarios where TVM’s existing capabilities cannot meet their specific optimization needs, such as adding custom schedules, transformation passes, or backends for certain hardware architectures. These custom modifications are often too specific or “hacky” to meet the high code quality and design standards required by the TVM community, which makes them difficult to merge upstream. Furthermore, contributing upstream can be cumbersome and time-consuming, requiring rigorous testing and CI checks, and the cost may outweigh the benefit for an individual project. Additionally, developers often pin their forks to specific versions of TVM to stabilize their custom modifications, which makes it harder to keep up with upstream updates. As a result, it is easier and faster for developers to maintain their own fork than to go through the lengthy and complex process of merging code upstream. Finally, because TVM-based projects are so diverse, different forks often contain highly specialized code that is not always applicable to the broader community, which further reduces the motivation to merge changes back into TVM’s mainline codebase.

Extending TVM through Include Dependencies

Oct 11 2024 Technical 28 minutes read (About 4268 words)

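As I remarked in an earlier post: for a thousand TVM-based projects there are a thousand heavily hacked TVMs. That was my complaint about the current state of TVM-based development. Understanding the TVM codebase is already hard for developers, let alone modifying and extending it when they hit a scenario that current TVM cannot handle. TVM-based projects usually fork TVM and modify the fork: to add a new optimization pass, for example, one creates a new pass file under src/tir/transformation and binds it to the Python side through FFI. Other needs, such as registering a new syntax tree node or adding a new code generator, are handled the same way. My own fork on GitHub, LeiWang1999/tvm, has more than a dozen branches: a bitblas branch that extends TVM for BitBLAS (introducing new Nodes and Schedules for optimization), a ladder branch that adds optimization passes for high-performance operator fusion in Ladder/Welder, and an amd_hip branch for code generation on AMD. The relationships among these branches have become very tangled. Using BitBLAS as an example, I discuss why this development style causes difficulties and offer a solution (borrowed from MLC-LLM) for discussion; the code is at LeiWang1999/TVM.CMakeExtend.

Analysis of the Shared Memory Reuse Pass in TVM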

Sep 14 2024 Technical 42 minutes read (About 6309 words)

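I have recently been reproducing, on top of TVM (actually bitblas.tl), the PPoPP 2023 paper Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU. In short, the method splits the k axis evenly across the SMs, which reduces the SM waves wasted on small shapes (BitBLAS did hit this problem in scenarios such as Contiguous Batching, and reproducing the paper's method was the only way to optimize that part). This post, however, covers neither the algorithm and implementation details of Stream-K nor BitBLAS. Instead it analyzes TVM's MergeSharedMemoryAllocations pass. The reason is that an efficient Stream-K implementation needs a large amount of shared memory, and this pass, which performs liveness analysis to merge shared memory allocations, has a bug in complex scenarios: shared memory reuse falls short of expectations, which prevents us from exploring larger tile sizes. I therefore had to improve the pass. This post records my analysis of and changes to it and, since I believe most TVM users tear their hair out when hacking on TVM code, it also weaves in some experience with TVM's design and debugging.

Why Does Padding Resolve Bank Conflicts?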
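To make the "split the work evenly across SMs" idea concrete, here is a minimal host-side sketch. It assumes the simplest reading of Stream-K: all k-iterations of all output tiles are flattened into one range, and each CTA (one per SM) takes a contiguous, near-equal slice. The names `IterRange` and `stream_k_range` are hypothetical and are not taken from BitBLAS or the paper's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Half-open range [begin, end) of flattened MAC-loop iterations owned by one CTA.
struct IterRange {
  int64_t begin;
  int64_t end;
};

// Split total_iters (num_tiles * k_iters_per_tile) across num_ctas so that
// slice sizes differ by at most one iteration. The first (total_iters % num_ctas)
// CTAs each receive one extra iteration.
IterRange stream_k_range(int64_t total_iters, int num_ctas, int cta) {
  assert(num_ctas > 0 && cta >= 0 && cta < num_ctas);
  const int64_t base = total_iters / num_ctas;
  const int64_t rem = total_iters % num_ctas;
  const int64_t begin = cta * base + std::min<int64_t>(cta, rem);
  const int64_t end = begin + base + (cta < rem ? 1 : 0);
  return {begin, end};
}
```

A CTA whose slice crosses a tile boundary only produces a partial sum for that tile, so a real kernel also needs a cross-CTA reduction step, which this sketch leaves out.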

Feb 2 2023 Technical 9 minutes read (About 1281 words)

When answering a question on Zhihu a while ago, I briefly described why adding padding resolves bank conflicts:

https://www.zhihu.com/question/565420155

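At the time I drew this figure:

Some readers still did not understand why this approach removes bank conflicts, and when I searched I could not find anyone who had explained it clearly. This post uses a Tensor Core matrix multiplication as the example and analyzes in some detail how the conflicts are resolved. As before, we pick one of the most typical CUTLASS tiles, a 128x256x32 float16 tile, to illustrate the problem. At the end I provide code to reproduce the results, implemented in Tensor IR, which makes it easy to implement various tiles (although I do not think padding alone performs well enough to reach SOTA).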
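The core of the padding argument can be checked with a few lines of host code. The sketch below assumes the common model of 32 shared memory banks, each 4 bytes wide, and the simplest access pattern: thread t reads the first word of row t. This is the classic column-access case, not the Tensor Core load pattern analyzed in the post, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

constexpr int kNumBanks = 32;  // assumed: 32 banks, one 4-byte word per bank

// Thread t accesses word index t * row_stride_words. Returns the largest number
// of threads that land in the same bank, i.e. the degree of the bank conflict
// (1 means conflict-free).
int max_bank_conflict_degree(int row_stride_words, int num_threads = 32) {
  std::array<int, kNumBanks> hits{};
  for (int t = 0; t < num_threads; ++t) {
    const int word_index = t * row_stride_words;
    ++hits[word_index % kNumBanks];
  }
  return *std::max_element(hits.begin(), hits.end());
}
```

With a row stride of 32 words every thread maps to bank 0, a 32-way conflict. Padding each row by one word gives a stride of 33, which is coprime with 32, so the 32 threads hit 32 different banks. A stride that merely shares a smaller factor with 32, such as 20, reduces the conflict to 4-way without removing it.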
