He is currently a second-year graduate student at the Institute of Computing Technology (ICT), University of Chinese Academy of Sciences (UCAS), advised by Yinhe Han and Haobo Xu. He is also a full-time research intern in the System Research Group of Microsoft Research Asia (MSRA), advised by Dr. Jilong Xue and Dr. Lingxiao Ma.

He is now focusing on:

He also enjoys writing technical posts and contributes to various open-source communities, including Microsoft NNFusion, Apache TVM, and Tengine.


LeiWang1999's GitHub contribution chart


Education

  • University of Chinese Academy of Sciences
    Institute of Computing Technology
    Master in Computer Science (Aug. 2021 - Present)
  • Nanjing Tech University
    Bachelor in Electronic Engineering (Aug. 2017 - Jun. 2021)
    Overall GPA: 3.95/4.00
    Ranking: 1/59

Awards & Honors

  • 2018 Chinese National Scholarship (Top 0.3%)
  • 2021 Excellent New Student Award of Chinese Academy of Sciences
  • Njtech Person of the Year 2020



Publications

  • Lin Bin*; Zheng Ningxin*; Wang Lei*; Cao Shijie; Ma Lingxiao; Zhang Quanlu; Zhu Yi; Cao Ting; Xue Jilong; Yang Yuqing; et al. Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning. Proceedings of Machine Learning and Systems, MLSys, 2023. (* represents co-first author) [paper]
  • Sun Xiaotian; Wang Xinyu; Li Wanqian; Wang Lei; Han Yinhe; Chen Xiaoming. PIMCOMP: A Universal Compilation Framework for Crossbar-based PIM DNN Accelerators. 60th Design Automation Conference, DAC, 2023. [paper]
  • Lei Wang; Lingxiao Ma; Shijie Cao; Ningxin Zheng; Quanlu Zhang. Efficient Tensor Compilation on Customized Data Format. 17th USENIX Symposium on Operating Systems Design and Implementation (Poster), OSDI, 2023.
  • Yuetao Chen; Kun Li; Yuhao Wang; Donglin Bai; Lei Wang; Lingxiao Ma; Liang Yuan; Yunquan Zhang; Ting Cao; Mao Yang; ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores. Symposium on Principles and Practice of Parallel Programming, PPoPP, 2024. [paper] Best Paper Award!
  • Wanqian Li; Xiaotian Sun; Xinyu Wang; Lei Wang; Yinhe Han; Xiaoming Chen; PIMSYN: Synthesizing Processing-in-memory CNN Accelerators. Design, Automation and Test in Europe Conference, DATE, 2024. [paper]
  • Haoran Wang; Lei Wang; Haobo Xu; Ying Wang; Yinhe Han; PrimPar: Efficient Spatial-temporal Tensor Partition for Large Transformer Model Training. ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS, 2024.
  • Shuming Ma; Hongyu Wang; Lingxiao Ma; Lei Wang; Wenhui Wang; Shaohan Huang; Li Dong; Ruiping Wang; Jilong Xue; Furu Wei. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits. Preprint, 2024. [paper] Hugging Face Daily Papers Top 1!
  • Lei Wang; Lingxiao Ma; Shijie Cao; Quanlu Zhang; Jilong Xue; Yining Shi; Ningxin Zheng; Ziming Miao; Fan Yang; Ting Cao; Yuqing Yang; Mao Yang. Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation. 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2024.


Projects

  • Microsoft BitBLAS, 2024

    BitBLAS is a library supporting mixed-precision BLAS operations on GPUs, for example the W_{wdtype}A_{adtype} mixed-precision matrix multiplication where C_{cdtype}[M, N] = A_{adtype}[M, K] × W_{wdtype}[N, K]. BitBLAS aims to enable efficient mixed-precision DNN model deployment, especially W_{wdtype}A_{adtype} quantization in large language models (LLMs).

    [ View Github ]
  • FPGA Accelerator for Digital Recognition, 2020

    Utilizing FPGA technology, this project provides hardware-accelerated digit recognition, built at a time when convolutional neural networks were beginning to gain prominence. The acceleration allows for faster and more efficient recognition.
    [ Watch the Video ]

  • FPGA Accelerator for Beam Forming, 2020

    Aimed at identifying sound location, this FPGA accelerator leverages a tetragonal microphone array to enhance sounds from specific points. The project is named FOSDA.
    [ Watch the Video ]

  • Full Stack FPGA Implementation of NVDLA, 2021

    This project involved a full-stack FPGA implementation of the open-source deep learning accelerator framework NVDLA. To enhance the utility of the accelerator, he designed a new compiler and runtime framework, which allows networks to transition between CPU fallback and hardware acceleration, ensuring optimal performance and usability.
    [ Read the Post: DLA Deploy ] [ Read the Post: Compiler Design ] [ View Github ]

  • Open-source Contributions [Github]

    Familiar with Microsoft NNFusion, Apache TVM, Tengine, etc.
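The mixed-precision GEMM that BitBLAS (above) accelerates, C_{cdtype}[M, N] = A_{adtype}[M, K] × W_{wdtype}[N, K], can be sketched in plain NumPy. This is an illustrative reference of the dequantize-then-multiply semantics only, not the BitBLAS API; the function name, the int4-in-int8 storage, and the per-channel scale are assumptions for the sketch.

```python
import numpy as np

def mixed_precision_matmul(A_fp16, W_int4, scale):
    """Reference semantics of a W_int4 * A_fp16 mixed-precision GEMM.

    A_fp16: [M, K] fp16 activations.
    W_int4: [N, K] weights with int4-range values (stored here in int8).
    scale:  [N, 1] per-output-channel dequantization scale (fp16).
    Returns C[M, N] = A[M, K] x W[N, K]^T after dequantizing W to fp16.
    """
    W_fp16 = W_int4.astype(np.float16) * scale  # dequantize weights
    return A_fp16 @ W_fp16.T                    # C[M, N] in fp16

M, N, K = 4, 8, 16
A = np.random.rand(M, K).astype(np.float16)
W = np.random.randint(-8, 8, size=(N, K)).astype(np.int8)  # int4 value range
s = np.full((N, 1), 0.1, dtype=np.float16)
C = mixed_precision_matmul(A, W, s)
assert C.shape == (M, N)
```

A fused GPU kernel avoids ever materializing W in fp16, dequantizing tiles in registers instead; this sketch only pins down the result the kernel must reproduce.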

