PT-BitNet: 1-Bit Large Language Model with Post-Training Quantization
10 Pages · Posted: 14 Oct 2024
Abstract
The deployment of Large Language Models (LLMs) has been constrained by their substantial hardware requirements and associated costs. Quantization techniques have emerged as a promising solution to these challenges. Recently, BitNet (Wang et al., 2023) proposed using ternary values (+1, 0, -1) for weight quantization, which eliminates multiplication operations and thereby further reduces latency and energy consumption. However, BitNet's requirement of training models from scratch limits its scalability to models larger than 3 billion parameters. This paper introduces PT-BitNet, a novel post-training quantization method that extends the benefits of BitNet's ternary quantization to large-scale language models of up to 70B parameters. To effectively quantize the model parameters down to {+1, 0, -1}, we propose a two-stage algorithm: in the first stage, we transform the weight distribution into a quantization-friendly one, and in the second stage, we optimize the weight elements in a block-wise manner. We demonstrate the effectiveness of PT-BitNet through comprehensive experiments across model sizes and downstream tasks. Our results show that PT-BitNet achieves substantial reductions in model size and inference time with minimal impact on task performance. For example, PT-BitNet scales to a 70B-parameter LLM with 61% average downstream accuracy, significantly outperforming BitNet b1.58 at 51.2% average accuracy.
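For intuition about the kind of quantizer involved, the NumPy sketch below rounds a weight matrix to {+1, 0, -1} using the absmean scaling scheme known from BitNet b1.58, with one scale per block of rows. This is an illustrative assumption, not the paper's algorithm: the abstract does not specify the distribution transform or the block-wise optimization objective, and the function names and 64-row block size are hypothetical.

import numpy as np

def ternary_quantize_blockwise(W: np.ndarray, block_size: int = 64):
    """Round W to {-1, 0, +1} with one absmean scale per block of rows.

    Hypothetical sketch: absmean scaling follows the BitNet b1.58 recipe;
    the per-block scale only gestures at PT-BitNet's block-wise stage,
    whose actual optimization the abstract does not spell out.
    """
    W_t = np.empty_like(W, dtype=np.int8)
    scales = []
    for start in range(0, W.shape[0], block_size):
        blk = W[start:start + block_size]
        gamma = np.abs(blk).mean() + 1e-8  # absmean scale for this block
        scales.append(gamma)
        # Nearest ternary value: scale, round, clip to {-1, 0, +1}.
        W_t[start:start + block_size] = np.clip(np.round(blk / gamma), -1, 1)
    return W_t, np.array(scales, dtype=np.float32)

def dequantize(W_t: np.ndarray, scales: np.ndarray, block_size: int = 64) -> np.ndarray:
    """Reconstruct an approximate full-precision matrix from ternary weights."""
    W_hat = W_t.astype(np.float32)
    for i, start in enumerate(range(0, W_t.shape[0], block_size)):
        W_hat[start:start + block_size] *= scales[i]
    return W_hat

# Quantization error on a random Gaussian weight matrix.
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)
W_t, scales = ternary_quantize_blockwise(W)
rel_err = np.linalg.norm(W - dequantize(W_t, scales)) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.3f}")

Because the stored weights take only the values -1, 0, and +1, a matrix-vector product against them reduces to additions and subtractions scaled once per block, which is the source of the latency and energy savings claimed above.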
Keywords: Large Language Models, Ternary Quantization, Post-Training Quantization, Efficient Inference