LeiWang1999

Jun 7 2025 9 minutes read (About 1331 words)

About

Introduction

He is currently a second year postgraduate student in University of Chinese Academy of Sciences UCAS, Institute of Computing Technology ICT, advised by Yinhe Han and Haobo Xu. He is also a full-time research intern in System Research Group of Microsoft Research Asia(MSRA), advised by Dr. Jilong Xue and Dr. Lingxiao Ma.

And he is now focusing on :

He also enjoys writing technical posts and contributes to various open-source communities, including Microsoft NNFusion, Apache TVM, and Tengine etc.

LeiWang1999's Github Chart

Education

University of Chinese Academy of Science
Institute of Computing Technology
Master in Computer Science(Aug. 2021 - Present)
Nanjing Tech University
Bachor in Electronic Engineering(Aug. 2017 - Jun. 2021)
Overall GPA: 3.95/4.00
Ranking: 1/59

Awards & Honors

2018 Chinese National Scholarship(Top 0.3%)
2021 Excellent New Student Award of Chinese Academy of Science
Njtech Person of Year 2020

First Price of 2019 NUEDC (Top 0.5%) (全国大学生电子设计竞赛)
First Price of 2018 Electronic Design Competition of Province
Third Price of Integrated Circuit Innovation and Entrepreneurship Competition (FPGA hardware Accelerator for digital recognition)
Third prize of National FPGA Competition (FPGA based FOSDA Alogrithom Implementation)

Experience

Netease Intelligent Hardware R&D Department
Bei Jing, China
NPU Development intern. (Sep. 2021 - Oct. 2021)
Microsoft Research Asia
Bei Jing, China
Systems Research intern. (April. 2022 - now)

Publications

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang
Proceedings of the Nineteenth European Conference on Computer Systems., EuroSys 2025

Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation

Lei Wang, Lingxiao Ma, Shijie Cao, Quanlu Zhang, Jilong Xue, Yining Shi, Ningxin Zheng, Ziming Miao, Fan Yang, Ting Cao, Yuqing Yang, Mao Yang
18th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2024

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, Furu Wei
Arxiv, 2024. HG Daily Paper TOP1!

PrimPar: Efficient Spatial-temporal Tensor Partition for Large Transformer Model Training
Haoran Wang, Lei Wang, Haobo Xu, Ying Wang, Yinhe Han
ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS, 2024

LUT TENSOR CORE: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration
Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang
Arxiv, 2024

ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores
Yuetao Chen, Kun Li, Yuhao Wang, Donglin Bai, Lei Wang, Lingxiao Ma, Liang Yuan, Yunquan Zhang, Ting Cao, Mao Yang
Symposium on Principles and Practice of Parallel Programming, PPoPP, 2024. Best Paper Award!

Efficient Tensor Compilation on Customized Data Format
Lei Wang, Lingxiao Ma, Shijie Cao, Ningxin Zheng, Quanlu Zhang
17th USENIX Symposium on Operating Systems Design and Implementation (Poster), OSDI, 2023

PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators
Sun Xiaotian, Wang Xinyu, Li Wanqian, Wang Lei, Han Yinhe, Chen Xiaoming
60th Design Automation Conference, DAC, 2023

PIMSYN: Synthesizing Processing-in-memory CNN Accelerators
Wanqian Li, Xiaotian Sun, Xinyu Wang, Lei Wang, Yinhe Han, Xiaoming Chen
Design, Automation and Test in Europe Conference, DATE, 2024

Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning

Lin Bin*, Zheng Ningxin*, Wang Lei*, Cao Shijie, Ma Lingxiao, Zhang Quanlu, Zhu Yi, Cao Ting, Xue Jilong, Yang Yuqing, et al. (* represents co-first author)
Proceedings of Machine Learning and Systems, MLSYS, 2023

Projects

Microsoft BitBLAS, 2024 Lead!

BitBLAS is a library to support mixed-precision BLAS operations on GPUs, for example, the W_wdtypeA_adtype mixed-precision matrix multiplication where C_cdtype[M, N] = A_adtype[M, K] × W_wdtype[N, K]. BitBLAS aims to support efficient mixed-precision DNN model deployment, especially the W_wdtypeA_adtype quantization in large language models (LLMs), for example:
FPGA Accelerator for Digital Recognition, 2020

Utilizing FPGA technology, this project aims to provide accelerated digital analysis during the time when convolutional neural networks began to gain prominence. This enhancement allowed for faster and more efficient digital recognition processes.
[ Watch the Video ]
FPGA Accelerator for Beam Forming, 2020

With the aim of identifying sound location, this FPGA accelerator leverages a tetragonal microphone array to enhance sounds from specific points, we named the project FOSDA.
[ Watch the Video ]
Full Stack FPGA Implementation of NVDLA, 2021

This project involved a full-stack FPGA implementation of the open-source Deep Learning Accelerator Framework, NVDLA. To enhance the utility of this accelerator, he designed a new compiler and runtime framework. This allowed networks to do transition between CPU fallbacks and hardware acceleration, ensuring optimal performance and usability.
[ Read the Post: DLA Deploy ] [ Read the Post: Compiler Design ] [ View Github ]
Opensource Contributions [Github]
Familar with Microsoft NNFusion, Apache TVM, Tengine, etc.

[Invited] Talks

[10/26/24] [GPU MODE] BitBLAS/Ladder: Enabling Efficient Low-Precision Deep Learning Computing
[Slides] [Record] [Tutorials]

[09/12/24] [MSRA] BitBLAS/Ladder: Enabling Efficient Low-Precision Deep Learning Computing
[Slides] [Record]

[09/12/24] [Huawei Noah Lab] BitBLAS: Enabling Efficient Low-Precision Deep Learning Computing
[Slides]

[09/23/21] [Tengine Community] Tengine后端之OpenDLA概述
[Slides] [Record]

Selected Media Reports

[08/22/24] [WeChat Official Account] [微软亚洲研究院] 微软亚洲研究院多项创新技术，弥合大模型低比特量化与终端部署间鸿沟

[08/17/24] [Zhihu] [微软亚洲研究院] 顶尖高校优秀学子齐聚微软亚洲研究院新星科技节，论道科研！

[08/09/24] [WeChat Official Account] [量子位] 手机跑大模型提速4-5倍！微软亚研院开源新技术，有CPU就行

[02/29/24] [WeChat Official Account] [机器之心] BitNet b1.58：开启1-bit大语言模型时代

Introduction

LeiWang1999's Github Chart

Education

Awards & Honors

Experience

Publications

Projects

Microsoft BitBLAS, 2024 Lead!

FPGA Accelerator for Digital Recognition, 2020

FPGA Accelerator for Beam Forming, 2020

Full Stack FPGA Implementation of NVDLA, 2021

Opensource Contributions [Github]

[Invited] Talks

Selected Media Reports

Comments

Your browser is out-of-date!