
Mitigating Quantization Errors Due to Activation Spikes in Gated Linear Unit-Based Large Language Models

Modern large language models (LLMs) achieve state-of-the-art performance through architectural advancements but incur high computational costs for inference. Post-training quantization is a widely adopted approach to reducing these costs by quantizing weights and activations to lower precision, such as INT8. However, we identify a critical challenge in activation quantization for GLU (Gated Linear Unit) variants, which are commonly used in the feed-forward networks (FFNs) of modern LLMs such as the LLaMA family. Specifically, excessively large activation magnitudes, which we refer to as activation spikes, cause severe local quantization errors and significantly degrade model performance. Our analysis reveals a systematic pattern in these spikes: they occur predominantly in the FFNs of the early and late layers of the model, and they are concentrated on a small subset of tokens rather than being uniformly distributed across a token sequence. To mitigate this issue, we propose two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), which isolate activation spikes during quantization. Extensive experiments demonstrate that our methods effectively improve activation quantization, particularly in coarse-grained quantization schemes, enhancing the performance of LLMs with GLU variants and addressing the limitations of existing quantization techniques. The code for implementing our methods and reproducing the experiments is publicly available in our GitHub repository.
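To make the abstract's central claim concrete, the following is a minimal sketch of how a single large activation can dominate a coarse-grained quantization scale. It is not the authors' QFeM or QFeP implementation. The tensor shape, the spike magnitude of 1000.0, the symmetric max-based scale, and all function names are assumptions chosen for illustration only.

```python
import random

def fake_quant_int8(values, scale):
    """Quantize to the symmetric INT8 range with the given scale, then dequantize."""
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def scale_for(values):
    """Symmetric scale that maps the largest magnitude to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127 if max_abs > 0 else 1.0

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
# Synthetic activations: 8 tokens x 64 channels drawn from N(0, 1).
tokens = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(8)]
# Hypothetical activation spike on one channel of the first token only.
tokens[0][3] = 1000.0

# Tokens that do not contain the spike.
ordinary = [v for row in tokens[1:] for v in row]

# Per-tensor (coarse-grained): one scale shared by every token,
# so the spike sets the scale for all ordinary values as well.
tensor_scale = scale_for([v for row in tokens for v in row])
err_per_tensor = mean_abs_error(ordinary, fake_quant_int8(ordinary, tensor_scale))

# Per-token (fine-grained): each token computes its own scale,
# so the spike only affects the token it appears in.
dequantized = [v for row in tokens[1:] for v in fake_quant_int8(row, scale_for(row))]
err_per_token = mean_abs_error(ordinary, dequantized)

print(f"per-tensor error on ordinary tokens: {err_per_tensor:.4f}")
print(f"per-token  error on ordinary tokens: {err_per_token:.4f}")
```

With a shared scale, the spike stretches the quantization step so far that most ordinary values round to zero, whereas per-token scales confine the damage to the token that holds the spike. This is consistent with the abstract's observation that the problem is most pronounced in coarse-grained schemes.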

MDPI AG

Jaewoo Yang Hayun Kim Junyung Ji Younghoon Kim

Future Internet

2025


