Locality-Aware CTA Clustering for Modern GPUs
Caches are designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the shared L2 cache, which has a long access latency, while in-core locality, which is crucial for performance, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is challenging because its hardware feasibility is unclear, the underlying CTA scheduler is unknown and inaccessible, and the in-core cache capacity is small. To address these issues, we first conduct a thorough empirical exploration on various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, on the L1 or L1/Tex unified cache. Through further quantification, we establish the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. Leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques, which reshape the default CTA scheduling so that CTAs with potential reuse are grouped together on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures.
The results show that our proposed techniques significantly improve cache performance by reducing L2 cache transactions by 55%, 65%, 29%, 28% on average for Fermi, Kepler, Maxwell and Pascal, respectively, leading to an average of 1.46x, 1.48x, 1.45x, 1.41x (up to 3.8x, 3.6x, 3.1x, 3.3x) performance speedups for applications with algorithm-related inter-CTA reuse.
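The abstract does not spell out how CTAs are regrouped, so the following is only a minimal sketch of the general idea of index remapping, not the authors' method. It assumes, purely for illustration, that the hardware dispatches CTA `b` to SM `b % num_sms`; the abstract itself notes that the real scheduler is unknown and inaccessible, so this assumption may not hold on actual GPUs. The function names `clustered_cta_id` and `clusters_by_sm` are hypothetical. Under that assumption, each CTA swaps its hardware index for a logical index so that consecutive logical CTAs, which are the ones likely to share data, end up on the same SM.

```python
def clustered_cta_id(hw_id, num_ctas, num_sms):
    """Map a hardware CTA id to a logical CTA id (hypothetical helper).

    ASSUMPTION: the scheduler places hardware CTA `hw_id` on SM
    `hw_id % num_sms` (round-robin). Real GPU schedulers may differ.
    """
    sm = hw_id % num_sms        # SM this CTA is assumed to land on
    slot = hw_id // num_sms     # position of this CTA among those on that SM
    base, extra = divmod(num_ctas, num_sms)
    # The first `extra` SMs receive base + 1 CTAs, the rest receive `base`,
    # so the logical range owned by `sm` starts at this offset.
    offset = sm * base + min(sm, extra)
    return offset + slot


def clusters_by_sm(num_ctas, num_sms):
    """Group logical CTA ids by the SM they are assumed to run on."""
    groups = {}
    for hw_id in range(num_ctas):
        sm = hw_id % num_sms    # same round-robin assumption as above
        groups.setdefault(sm, []).append(
            clustered_cta_id(hw_id, num_ctas, num_sms)
        )
    return groups


if __name__ == "__main__":
    # Show which contiguous block of logical CTAs each SM would execute.
    for sm, ids in sorted(clusters_by_sm(10, 4).items()):
        print(f"SM {sm}: logical CTAs {ids}")
```

In a kernel, the logical id would replace the hardware block index when computing which data a CTA works on; the mapping is a bijection, so every logical CTA is still executed exactly once.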
Association for Computing Machinery (ACM)

