A Unified Lightweight Compute Engine for Softmax, GELU, and SiLU Based on a Shared Exponential Unit
The self-attention mechanism and feed-forward networks of the Transformer architecture rely heavily on three non-linear functions: Softmax, GELU, and SiLU. Traditional hardware acceleration schemes typically deploy these three as independent modules, each equipped with dedicated exponential arithmetic units, look-up tables, and normalization logic, which leads to significant area redundancy and power overhead. To address this issue, this paper proposes a unified lightweight hardware compute engine for the three operators. Unlike existing works, which universally adopt the tanh approximation, this paper introduces a sigmoid approximation form of GELU. This mathematical transformation reveals the intrinsic structural identity between GELU and SiLU: both can be equivalently expressed as the product of an input operand and a two-element softmax operator, and they differ only in a scalar pre-scaling factor. Based on this architectural unification, Softmax, GELU, and SiLU can share the same set of core hardware resources for exponentiation, logarithm, and normalization. Switching between modes requires only the minor logic overhead of a single multiplier and a two-way selector. Building on this, the design integrates Canonical Signed Digit (CSD) encoded shift-add arithmetic, a Mitchell approximate logarithm with first-order error correction, a binary reduction adder tree, and mode-aware data gating techniques, while offering an optional Time-Division Multiplexing (TDM) mechanism to systematically optimize Power-Performance-Area (PPA) metrics.
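The structural identity described above can be illustrated with a minimal floating-point sketch. It is not the paper's fixed-point datapath: the function names are hypothetical, and the pre-scaling constant 1.702 is an assumption taken from the commonly used sigmoid approximation of GELU, since the abstract does not state the factor the authors use. The sketch relies on the fact that the first output of a two-element softmax over [s·x, 0] equals sigmoid(s·x).

```python
import math

# Assumed constant: the widely used sigmoid-approximation factor for GELU.
# The abstract does not give the paper's exact pre-scaling value.
GELU_SCALE = 1.702


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gated_activation(x, scale):
    """Compute x * softmax([scale * x, 0])[0], which equals x * sigmoid(scale * x)."""
    return x * softmax([scale * x, 0.0])[0]


def silu(x):
    """SiLU: the shared form with a pre-scaling factor of 1."""
    return gated_activation(x, 1.0)


def gelu_sigmoid(x):
    """Sigmoid approximation of GELU: the same form with a different pre-scaling factor."""
    return gated_activation(x, GELU_SCALE)


def gelu_exact(x):
    """Reference GELU based on the Gaussian CDF, used only for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

In this sketch the two activations differ only in the `scale` argument, which mirrors the single multiplier and two-way selector mentioned in the abstract; all three operators pass through the same `softmax` routine.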
The SystemVerilog hardware implementation, based on 16-bit fixed-point inputs and 32-bit internal arithmetic formats, provides two design-space configurations: the full-throughput configuration maintains the high efficiency of processing one vector per cycle while reducing the area by 34.9%; the TDM configuration accepts a halved throughput, further reducing the area by 50.2% and dynamic power by 40.6%. Synthesis evaluations based on the SMIC 55 nm CMOS process demonstrate that both configurations, while adding the SiLU functionality absent in baseline designs, achieve more than an order-of-magnitude improvement in computational precision (Softmax MSE improved by 24.4×, GELU MSE improved by 17.8×).
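As background for the Mitchell approximate logarithm named in the abstract, the following is a minimal sketch of the uncorrected approximation only: for value = 2^k · (1 + m) with 0 ≤ m < 1, log2(value) is approximated by k + m. The abstract does not specify the paper's first-order error correction, so it is not reproduced here. Floating-point numbers stand in for the fixed-point hardware, and the function names are hypothetical.

```python
import math


def mitchell_log2(value):
    """Approximate log2(value) for value > 0 as k + m, where value = 2**k * (1 + m)."""
    if value <= 0.0:
        raise ValueError("value must be positive")
    # math.frexp returns (mantissa, exponent) with value = mantissa * 2**exponent
    # and 0.5 <= mantissa < 1 for positive inputs.
    mantissa, exponent = math.frexp(value)
    k = exponent - 1          # integer part: position of the leading one
    m = 2.0 * mantissa - 1.0  # fractional part in [0, 1)
    return k + m


def worst_case_error(samples=4096):
    """Largest gap between exact log2 and the uncorrected approximation over [1, 2)."""
    return max(
        math.log2(1.0 + i / samples) - mitchell_log2(1.0 + i / samples)
        for i in range(samples)
    )
```

The approximation is exact at powers of two and underestimates everywhere else, with a worst-case gap of roughly 0.086 near m ≈ 0.44. An error-correction term, such as the one the abstract mentions, is aimed at reducing this residual.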

