
A Compute-In-Memory Architecture and System-Technology Codesign Simulator Based on 3D NAND Flash

The rapid advancement of large language models (LLMs) such as ChatGPT has imposed unprecedented demands on hardware in terms of computational power, memory capacity, and energy efficiency. Compute-in-memory (CIM) technology, which integrates computation directly into memory arrays, has emerged as a promising way to overcome the data-movement bottleneck of traditional von Neumann architectures, significantly reducing power consumption and enabling large-scale parallel processing. Among non-volatile memory candidates, 3D NAND flash stands out for its mature manufacturing process, ultrahigh density, and cost-effectiveness, making it a strong contender for commercial CIM deployment and local inference of large models.

Despite these advantages, most existing research on 3D NAND-based CIM remains academic, focusing on theoretical designs or small-scale prototypes, with little attention to system-level architecture design and functional validation on product-grade 3D NAND chips for LLM applications. To address this gap, we propose a novel CIM architecture based on 3D NAND flash that uses a source-line (SL) slicing technique to partition the array for parallel computation at minimal manufacturing cost. The architecture is complemented by an efficient mapping algorithm and a pipelined dataflow, enabling system-level simulation and rapid industrial iteration.

We develop a PyTorch-based behavioral simulator for LLM inference on the proposed hardware and evaluate the impact of cell-current distribution and quantization on system performance. The design supports INT4/INT8 quantization and employs dynamic weight-storage logic to minimize voltage-switching overhead, further optimized through hierarchical pipelining to maximize throughput under hardware constraints.

Simulation results show that the simulated 3D NAND compute-in-memory chip reaches a generation speed of 20 tokens/s with an energy efficiency of 5.93 TOPS/W on GPT-2-124M, and 8.5 tokens/s with 7.17 TOPS/W on GPT-2-355M, while maintaining system-level reliability for open-state current distributions with σ < 2.5 nA; in INT8 mode, quantization error is the dominant accuracy bottleneck.

Compared with previous CIM solutions, the architecture supports larger models, higher computational precision, and significantly lower power consumption, as shown by comprehensive benchmarking. The SL slicing technique keeps array wastage below 3%, while hybrid wafer bonding integrates high-density ADC/TIA circuits to improve hardware resource utilization.

This work presents the first system-level simulation of LLM inference on product-grade 3D NAND CIM hardware, providing a standardized and scalable reference for future commercialization. The complete simulation framework is released on GitHub to facilitate further research and development. Future work will focus on device-level optimization of 3D NAND and iterative improvements to the simulator algorithms.
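To illustrate the kind of behavioral modeling such a simulator performs, the sketch below implements one CIM matrix-vector multiply in PyTorch: weights are quantized to signed INT8 and the analog read is perturbed by Gaussian cell-current noise with standard deviation σ (in nA). The function name, the 100 nA nominal full-scale cell current, and the noise-injection scheme are illustrative assumptions, not details taken from the paper.

```python
import torch

def cim_matvec(x, w, sigma_nA=2.5, i_cell_nA=100.0, n_bits=8):
    """Behavioral sketch of one compute-in-memory matrix-vector multiply.

    Weights are quantized to signed INT n_bits (symmetric, per-tensor), and
    the analog read is modeled by adding Gaussian noise to each stored level,
    with sigma_nA expressed relative to an assumed nominal full-scale cell
    current i_cell_nA. These parameters are illustrative, not from the paper.
    """
    # Symmetric per-tensor weight quantization.
    qmax = 2 ** (n_bits - 1) - 1                 # 127 for INT8
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax)

    # Cell-current variation: convert sigma from nA to quantization steps
    # (assuming qmax maps to the nominal full-scale current) and perturb
    # each stored weight level independently.
    noise_lsb = sigma_nA / i_cell_nA * qmax
    w_noisy = w_q + torch.randn_like(w_q) * noise_lsb

    # Bit-line accumulation with an ideal ADC readout, then dequantize.
    return (x @ w_noisy.T) * scale

torch.manual_seed(0)
x = torch.randn(1, 64)       # activations
w = torch.randn(32, 64)      # weight matrix stored in the array
ideal = x @ w.T
noisy = cim_matvec(x, w)
rel_err = float((noisy - ideal).norm() / ideal.norm())
```

Comparing `noisy` against `ideal` over many random draws is how such a simulator separates the quantization-error floor from the σ-dependent analog noise contribution.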
Acta Physica Sinica, Chinese Physical Society and Institute of Physics, Chinese Academy of Sciences

Related Results

Integration of Deep-Learning-Based Flash Calculation Model to Reservoir Simulator
Abstract Flash calculation is an essential step in compositional reservoir simulation. However, it consumes a significant part of the simulation process, leading to ...
Perspectives on AI Architectures and Codesign for Earth System Predictability
Abstract Recently, the U.S. Department of Energy (DOE), Office of Science, Biological and Environmental Research (BER), and Advanced Scientific Computing Research (ASCR) programs o...
Flash Radiation Therapy: Current Insights and Future Prospects
FLASH radiotherapy (RT) is an innovative approach used in cancer treatment. The FLASH effect is observed at ultra-high dose rates (UHDR) of approximately 40 Gy/s or higher. This tr...
Simulator approaches in otorhinolaryngology ‐ Looking beyond the ophthalmological horizon
Purpose: To improve medical students' learning of handheld otoscopy technique and findings based on a standardized simulator‐based procedure.Methods: A group of 120 medical student...
CN tower lightning flash characteristics based on high-speed imaging
The thesis emphasizes the analysis of fifty-eight flashes that struck the CN Tower during the last five years (2013-2017), based on video records of Phantom v5.0 digital high-speed...
Physics and biology of ultrahigh dose-rate (FLASH) radiotherapy: a topical review
Abstract Ultrahigh dose-rate radiotherapy (RT), or ‘FLASH’ therapy, has gained significant momentum following various in vivo studies published since 2014 which have...
A Detailed Analysis of Issues with Solid-State Devices
Non-volatile memory technologies, such as NAND flash memory, have improved storage system performance, reliability, durability, and cost. Due to their speed and density, solid-stat...
BPM: A Bad Page Management Strategy for the Lifetime Extension of Flash Memory
The lifetime of NAND flash is highly restricted by bit error rate (BER) which would exponentially increase with the number of program/erase cycles. While the error correcting codes...
