PT-BitNet: 1-Bit Large Language Model with Post-Training Quantization

10 Pages, Posted: 14 Oct 2024

Yufei Guo (Contact Author)

affiliation not provided to SSRN

Zecheng Hao

Peking University

Jiahang Shao

Peking University

Jie Zhou

affiliation not provided to SSRN

Xiaode Liu

affiliation not provided to SSRN

Xin Tong

affiliation not provided to SSRN

Yuhan Zhang

affiliation not provided to SSRN

Yuanpei Chen

affiliation not provided to SSRN

Weihang Peng

affiliation not provided to SSRN

Zhe Ma

affiliation not provided to SSRN

Abstract

The deployment of Large Language Models (LLMs) has been constrained by their substantial hardware requirements and associated costs. Quantization techniques have emerged as a promising solution to these challenges. Recently, BitNet (Wang et al., 2023) proposed using ternary values (+1, 0, -1) for weight quantization, which shows particular promise in eliminating multiplication operations and thereby further reducing latency and energy consumption. However, BitNet's requirement to train models from scratch limits its scalability to models larger than 3 billion parameters. This paper introduces PT-BitNet, a novel post-training quantization method that extends the benefits of BitNet's ternary quantization to large-scale language models with up to 70B parameters. To effectively quantize the model parameters down to {+1, 0, -1}, we propose a two-stage algorithm: in the first stage, we transform the weight distribution into a quantization-friendly one, and in the second stage, we optimize the weight elements in a block-wise manner. We demonstrate the effectiveness of PT-BitNet through comprehensive experiments across model sizes and downstream tasks. Our results show that PT-BitNet achieves substantial reductions in model size and inference time with minimal impact on task performance. For example, PT-BitNet scales to a 70B-parameter LLM with 61% average downstream accuracy, significantly outperforming BitNet b1.58 at 51.2% average accuracy.
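
The two-stage procedure is only summarized in this abstract. As a rough illustration of the kind of ternary post-training quantization involved, the sketch below applies BitNet-style absmean rounding to {-1, 0, +1} with one scale per block of weights. The function names, block size, and rounding rule are illustrative assumptions for this sketch, not the paper's actual algorithm.

```python
import numpy as np

def ternary_quantize_blockwise(W, block_size=128):
    """Quantize a weight matrix to {-1, 0, +1} with one scale per block.

    Illustrative sketch of absmean ternary quantization applied
    post-training; block size and rounding rule are assumptions,
    not PT-BitNet's exact two-stage procedure.
    """
    Wq = np.zeros_like(W, dtype=np.int8)
    flat, q_flat = W.reshape(-1), Wq.reshape(-1)
    scales = []
    for start in range(0, flat.size, block_size):
        block = flat[start:start + block_size]
        # Absmean scale for this block (as in BitNet b1.58).
        gamma = np.abs(block).mean() + 1e-8
        # Round-to-nearest onto the ternary grid {-1, 0, +1}.
        q_flat[start:start + block_size] = np.clip(
            np.round(block / gamma), -1, 1
        ).astype(np.int8)
        scales.append(gamma)
    return Wq, np.array(scales, dtype=np.float32)

def dequantize_blockwise(Wq, scales, block_size=128):
    """Reconstruct an approximate float matrix: W ~ gamma * Wq per block."""
    flat = Wq.reshape(-1).astype(np.float32)
    for i, gamma in enumerate(scales):
        flat[i * block_size:(i + 1) * block_size] *= gamma
    return flat.reshape(Wq.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 256)).astype(np.float32)
    Wq, s = ternary_quantize_blockwise(W)
    print("mean abs error:", np.abs(W - dequantize_blockwise(Wq, s)).mean())
```

Because the quantized weights take only three values, the matrix multiply at inference time reduces to additions, subtractions, and a per-block rescale, which is the source of the latency and energy savings discussed above.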

Keywords: Large Language Models, Ternary Quantization, Post-Training Quantization, Efficient Inference

Suggested Citation

Guo, Yufei and Hao, Zecheng and Shao, Jiahang and Zhou, Jie and Liu, Xiaode and Tong, Xin and Zhang, Yuhan and Chen, Yuanpei and Peng, Weihang and Ma, Zhe, PT-BitNet: 1-Bit Large Language Model with Post-Training Quantization. Available at SSRN: https://ssrn.com/abstract=4987078 or http://dx.doi.org/10.2139/ssrn.4987078
