Mitigating Quantization Errors Due to Activation Spikes in Gated Linear Unit-Based Large Language Models
Modern large language models (LLMs) achieve state-of-the-art performance through architectural advancements, but their inference is computationally expensive. Post-training quantization is a widely adopted way to reduce this cost by quantizing weights and activations to lower precision, such as INT8. However, we identify a critical challenge in activation quantization for GLU (Gated Linear Unit) variants, which are commonly used in the feed-forward networks (FFNs) of modern LLMs such as the LLaMA family. Specifically, excessively large activation magnitudes, which we refer to as activation spikes, cause severe local quantization errors and significantly degrade model performance. Our analysis reveals a systematic pattern: the spikes occur predominantly in the FFNs of the early and late layers of the model, and they are concentrated on a small subset of tokens rather than being spread uniformly across a token sequence. To mitigate this issue, we propose two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), which isolate activation spikes during quantization. Extensive experiments demonstrate that our methods effectively improve activation quantization, particularly in coarse-grained quantization schemes, enhancing the performance of LLMs with GLU variants and addressing limitations of existing quantization techniques. The code for implementing our methods and reproducing the experiments is publicly available in our GitHub repository.
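To make the failure mode concrete, the following is a minimal NumPy sketch of why a single activation spike can hurt coarse-grained (per-tensor) INT8 quantization. It is an illustration under stated assumptions, not the paper's implementation: the tensor shape, the spike value of 1000, the symmetric fake-quantization scheme, and all function names are hypothetical choices made for this example, and it does not implement QFeM or QFeP.

```python
import numpy as np

def fake_quant_per_tensor(x):
    """Symmetric INT8 fake quantization with one scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def fake_quant_per_token(x):
    """Symmetric INT8 fake quantization with one scale per token (row)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def mse_on_ordinary_tokens(x, x_q):
    """Mean squared error over every token except token 0 (the spiked one)."""
    return float(np.mean((x[1:] - x_q[1:]) ** 2))

# Assumed toy activations: 16 tokens, 64 hidden dimensions, unit variance.
rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=(16, 64))

# Assumed synthetic spike: one very large value on a single token.
spiked = acts.copy()
spiked[0, 3] = 1000.0

err_clean = mse_on_ordinary_tokens(acts, fake_quant_per_tensor(acts))
err_spiked_tensor = mse_on_ordinary_tokens(spiked, fake_quant_per_tensor(spiked))
err_spiked_token = mse_on_ordinary_tokens(spiked, fake_quant_per_token(spiked))

print(f"per-tensor, no spike:   {err_clean:.6f}")
print(f"per-tensor, with spike: {err_spiked_tensor:.6f}")
print(f"per-token,  with spike: {err_spiked_token:.6f}")
```

In the per-tensor case the spike stretches the shared scale so far that the ordinary tokens collapse onto very few integer levels, whereas a per-token scale confines the damage to the spiked token. This is consistent with the observation above that spikes sit on a small subset of tokens and that coarse-grained schemes are affected most.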

