As discussed in Phasing out Legacy Components, third-party developers often modify TVM in place rather than contribute their changes upstream, for several reasons. First, TVM's codebase is complex, and understanding or modifying it takes significant effort. Developers frequently hit scenarios that TVM's existing capabilities cannot cover, such as adding custom schedules, transformation passes, or backends for particular hardware architectures, and the resulting modifications are often too specialized or "hacky" to meet the code-quality and design standards of the TVM community, making them difficult to merge upstream. Furthermore, the upstreaming process itself is cumbersome and time-consuming, requiring rigorous testing and CI checks whose cost can outweigh the benefit for an individual project. Developers also tend to pin their forks to a specific TVM version to stabilize their custom modifications, which makes tracking upstream updates even harder. As a result, maintaining a private fork is easier and faster than the lengthy, complex process of merging code upstream. Finally, TVM-based projects are diverse: different forks carry highly specialized code that is rarely applicable to the broader community, further reducing the motivation to contribute changes back to TVM's mainline codebase.

As @tianqi mentioned in the discussion, developers are encouraged to avoid directly modifying TVM's core code in their individual projects. Instead, the goal is for all downstream projects (such as ProjectA and ProjectB) to rely on a shared, unmodified core TVM library. This approach prevents the compatibility issues that arise when each project forks and customizes TVM independently. To achieve this, @tianqi suggests taking inspiration from projects like MLC-LLM, a perfect example of adding custom passes and optimizations as separate, modular extensions rather than as inline modifications to TVM itself.

In this thread, let me summarize the key points of this approach, share a workflow that I have developed to extend TVM without altering its core code (LeiWang1999/TVM.CMakeExtend), and note some important considerations that we should be aware of.

Solution Overview

This project demonstrates how to:

  • Keep TVM as an Independent Module: Treat TVM as an external dependency, either as a submodule or by linking to a prebuilt version (see the submodule sketch after this list).
  • Use CMake for Modular Builds: Utilize CMake to build your custom code separately, linking against the TVM libraries without integrating your code into TVM’s source tree.
  • Avoid Code Duplication and Conflicts: By not modifying TVM directly, you avoid merge conflicts and can benefit from the latest updates in TVM without additional overhead.
  • Facilitate Collaboration: Other developers can contribute to your project without needing to navigate a custom version of TVM.
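
For the submodule route, the setup is the standard git flow (a sketch; the path matches the repository layout shown below):

```bash
# Pin upstream TVM as a submodule so your project never patches it in place
git submodule add https://github.com/apache/tvm 3rdparty/tvm
git submodule update --init --recursive
```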

Repository Structure

```
TVM.CMakeExtend/
├── 3rdparty/
│   └── tvm/            # Submodule pointing to TVM
├── build/              # Build directory
├── include/            # Custom header files
├── src/                # Custom source files (passes, codegens, etc.)
├── python/
│   └── your_project/   # Python bindings and extensions
├── CMakeLists.txt      # Main CMake configuration
└── README.md           # Project README
```

CMake Modular Build

The key to this approach is the CMake configuration that allows you to build your project separately while linking against TVM.

Using Prebuilt TVM Libraries

```cmake
if(DEFINED TVM_PREBUILD_PATH)
  message(STATUS "Using prebuilt TVM from ${TVM_PREBUILD_PATH}")
  add_library(tvm SHARED IMPORTED)
  set_target_properties(tvm PROPERTIES
    IMPORTED_LOCATION "${TVM_PREBUILD_PATH}/libtvm.so"
    INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
  )
  add_library(tvm_runtime SHARED IMPORTED)
  set_target_properties(tvm_runtime PROPERTIES
    IMPORTED_LOCATION "${TVM_PREBUILD_PATH}/libtvm_runtime.so"
    INTERFACE_INCLUDE_DIRECTORIES "${TVM_PREBUILD_PATH}/../include"
  )
else()
  message(STATUS "Building TVM from source")
  add_subdirectory(${TVM_SOURCE_DIR} tvm EXCLUDE_FROM_ALL)
endif()
```

This configuration checks if TVM_PREBUILD_PATH is defined:

  • If it is, it treats TVM as a prebuilt library and links against it.
  • If not, it adds TVM as a subdirectory to build it from source.
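
The snippet assumes TVM_SOURCE_DIR has been defined earlier; a natural default (my sketch, not necessarily the repository's exact code) is to fall back to the bundled submodule:

```cmake
# Fall back to the 3rdparty submodule when no external TVM source is given
if(NOT DEFINED TVM_SOURCE_DIR)
  set(TVM_SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/3rdparty/tvm)
endif()
```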

Building Your Custom Extensions

```cmake
file(GLOB_RECURSE CUSTOM_SRCS src/*.cc)
add_library(custom_objs OBJECT ${CUSTOM_SRCS})

set(CUSTOM_INCLUDES
  ${TVM_SOURCE_DIR}/include
  ${TVM_SOURCE_DIR}/src
  ${TVM_SOURCE_DIR}/3rdparty/dlpack/include
  ${TVM_SOURCE_DIR}/3rdparty/dmlc-core/include
)
target_include_directories(custom_objs PRIVATE ${CUSTOM_INCLUDES})
```

This sets up your custom source files and includes TVM’s headers for compilation.
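
The objects then get linked into a loadable extension library against whichever TVM target was selected above. A minimal sketch of that step (the `your_project` target name is a placeholder):

```cmake
# Produce the shared library that Python will dlopen at import time
add_library(your_project SHARED $<TARGET_OBJECTS:custom_objs>)
target_link_libraries(your_project PRIVATE tvm)  # tvm_runtime suffices for runtime-only extensions
```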

You have two options:

  1. Use a Prebuilt TVM

    If you already have TVM installed or built elsewhere (e.g., via pip install apache-tvm), you can link against it.

    ```bash
    mkdir build && cd build
    cmake .. -DTVM_PREBUILD_PATH=/path/to/tvm/build
    make -j$(nproc)
    ```

    Replace /path/to/tvm/build with the actual path to your TVM build directory containing libtvm.so and libtvm_runtime.so.

  2. Build TVM from Source

    If you prefer to build TVM from source along with your project:

    ```bash
    mkdir build && cd build
    cp ../3rdparty/tvm/cmake/config.cmake .
    # Edit config.cmake to enable desired features
    echo "set(USE_LLVM ON)" >> config.cmake
    echo "set(USE_CUDA ON)" >> config.cmake
    cmake ..
    make -j$(nproc)
    ```

    This will build both TVM and your custom extensions.

Handling Python Bindings

To ensure that your custom extensions are properly registered with TVM’s Python API:

  • Load Custom Libraries Before Importing TVM: Load your custom shared libraries before importing tvm in Python to ensure that the global functions are registered.

  • Modify __init__.py: In your Python package’s __init__.py, handle environment variables and library loading:

    ```python
    import os
    import ctypes

    # Set up environment variables
    os.environ['TVM_LIBRARY_PATH'] = '/path/to/your/libs'

    # Load the custom shared libraries BEFORE importing tvm, so the global
    # functions they register are visible when TVM populates its function
    # table. RTLD_GLOBAL makes the symbols available to TVM's registry.
    from .libinfo import find_lib_path
    _LIBS = [ctypes.CDLL(lib, ctypes.RTLD_GLOBAL) for lib in find_lib_path()]

    import tvm
    ```

Custom Library Loader (libinfo.py)

Implement a custom library finder that locates your shared libraries at runtime.

```python
import os

def find_lib_path():
    curr_path = os.path.dirname(os.path.abspath(__file__))
    lib_path = []
    for lib in ['libyour_project.so', 'libyour_project.dylib', 'your_project.dll']:
        full_path = os.path.join(curr_path, lib)
        if os.path.exists(full_path):
            lib_path.append(full_path)
    if not lib_path:
        raise RuntimeError("Cannot find your_project library")
    return lib_path
```

Note: The library must be loaded before importing TVM because global functions are registered into TVM's global function table during TVM's import. If your library is not already loaded at that point, any custom global functions you have written will not appear in the table.

For example, directly running import tvm:

```python
import tvm
from tvm._ffi.registry import list_global_func_names

print(list_global_func_names())
'''
['tir.analysis.find_anchor_block', ..., 'relay._transform.ToBasicBlockNormalForm', 'relay._transform.to_cps', 'relay._transform.ToCPS', 'topi.bitwise_or', 'relay._transform.InferTypeLocal', 'relay.backend.aot.CreateExecutorMetadata', 'relay.build_module._AOTExecutorCodegen', 'relay.build_module._BuildModule', 'tvm_callback_cuda_compile', 'relay.backend.CreateExecutor']
'''
```

Loading the library before importing TVM:

```python
import tilelang
from tilelang import tvm as tvm
from tvm._ffi.registry import list_global_func_names

print(list_global_func_names())
'''
# tl.* functions are registered by our own passes
['tl.transform.LowerHopperIntrin', ..., 'tir.analysis.find_anchor_block', ..., 'relay._transform.InferTypeLocal', 'relay.backend.aot.CreateExecutorMetadata', 'relay.build_module._AOTExecutorCodegen', 'relay.build_module._BuildModule', 'tvm_callback_cuda_compile', 'relay.backend.CreateExecutor']
'''
```
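
Once the extension library is loaded this way, its registered globals can be retrieved like any built-in TVM function. A minimal check, reusing the tilelang example above:

```python
import tilelang                  # loads the extension library first
from tilelang import tvm as tvm

# The extension's pass is now in TVM's global function table
f = tvm.get_global_func("tl.transform.LowerHopperIntrin")
print(f)  # a tvm.runtime.PackedFunc handle
```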

Examples

Adding a Custom Pass

C++ Implementation (src/my_pass.cc):

```cpp
#include <tvm/runtime/registry.h>
#include <tvm/tir/transform.h>

namespace tvm {
namespace tir {
namespace transform {

tvm::transform::Pass MyCustomPass() {
  auto pass_func = [](PrimFunc f, IRModule m, PassContext ctx) {
    // Implement your pass logic here
    return f;
  };
  return tvm::transform::CreatePrimFuncPass(pass_func, 0, "MyCustomPass", {});
}

// Register under the "your_project" namespace so the Python package can find it
TVM_REGISTER_GLOBAL("your_project.transform.MyCustomPass")
    .set_body_typed(MyCustomPass);

}  // namespace transform
}  // namespace tir
}  // namespace tvm
```

Python Usage:

```python
import tvm
import your_project.transform

mod = ...  # your IRModule
mod = your_project.transform.MyCustomPass()(mod)
```
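
For `your_project.transform.MyCustomPass` to exist on the Python side, the package must expose the registered global. A minimal sketch of that glue, assuming the classic `tvm._ffi` layout and the registration name used above:

```python
# python/your_project/transform.py
import tvm._ffi

# Expose every global function registered under the
# "your_project.transform" prefix as attributes of this module
tvm._ffi._init_api("your_project.transform", __name__)
```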

I think the final blueprint of this approach is: when installing a project based on TVM, there is no need to compile a separate version of TVM. You simply install upstream TVM via pip install apache-tvm and then install your custom extension module with pip install xxx. Even when building from source, there is no need to compile the entire TVM codebase: download the latest TVM release with pip install apache-tvm, and your extension module's libraries will automatically link against the installed TVM package. If everyone adopts this approach, the overall experience of working with TVM should improve significantly.
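
Concretely, the envisioned install flow collapses to two commands (the extension package name is a placeholder for illustration):

```bash
pip install apache-tvm      # unmodified upstream TVM
pip install your-project    # your extension, which links against the installed TVM
```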

In an earlier article I quipped: a thousand TVM-based projects means a thousand hacked-up versions of TVM; that is my complaint about the current state of TVM-based development. Understanding TVM's code is already hard enough for developers, let alone extending it when they hit a scenario that TVM cannot currently handle. Typically, a TVM-based project forks the TVM codebase and modifies it directly: to add a new optimization pass, you create a new pass file under src/tir/transforms and bind it to the Python side via the FFI; other needs, such as registering a new AST node or adding a new code generator, are handled the same way. My own fork, LeiWang1999/tvm, contains over a dozen branches: a bitblas branch for the BitBLAS extensions (introducing new Nodes and Schedules for optimization), a ladder branch that adds optimization passes for high-performance operator fusion in Ladder/Welder, and an amd_hip branch for code generation on AMD. The relationships among these branches have become thoroughly tangled. Taking BitBLAS as an example, I discuss why this development style leads to trouble and offer a solution (inspired by MLC-LLM) for discussion; the code is at LeiWang1999/TVM.CMakeExtend.

