
A model implementation of a scalable data store for scientific computing with DataLad

Introduction

In recent years, awareness of the role of sample size in producing replicable results has grown in neuroscience (Button et al., 2013; Turner et al., 2018), and publicly available datasets such as the Human Connectome Project and the UK Biobank provide sample sizes orders of magnitude larger than previous datasets (Bzdok & Yeo, 2017). In most cases, however, neither the computational infrastructure nor researchers’ analysis workflows scale to these dataset sizes. The need for simultaneous data access by multiple users across multiple systems, the storage and computing demands, and the multiplicative increase in dataset size when users create copies or derivatives of the original data pose challenges for workstations and HPC/HTC infrastructure alike. At these dataset sizes, “simply topping up” existing infrastructure is not only economically wasteful but infeasible. Instead, these challenges require state-of-the-art data management solutions. Here, we detail a model implementation of a distributed, scalable data store as one potential solution.

Methods

The storage implementation builds upon DataLad (Halchenko et al., 2019). The distributed version control capabilities of DataLad and its underlying software make it possible to build a remote data store that is separate from the infrastructure used for computing (Figure 1). This data store hosts pristine data as DataLad datasets and centers scientific workflows around DataLad for version control, collaboration, provenance capture, and data archival, with the potential for disk-space aware computing workflows. To allow data access, the store is configured as a git-annex RIA-remote (Poldrack & Hanke, 2019). Internally, the store is a tree of the datasets’ bare Git repositories, each with an annex keystore that can optionally be packed into a 7z archive. Each repository is identified by its dataset UUID (see Figure 2).
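To make the store layout more concrete, the following minimal Python sketch derives where a dataset would live inside such a store from its dataset UUID. It rests on assumptions that the abstract does not state: the split of the UUID into a three-character directory plus the remainder, and the locations `annex/objects` and `archives/archive.7z`, follow the RIA store layout as described in later DataLad documentation and may differ from the version cited here. The function names, the store root `/data/store`, and the example UUID are purely illustrative.

```python
import uuid
from pathlib import PurePosixPath


def ria_dataset_dir(store_root: str, dataset_id: str) -> PurePosixPath:
    """Return the directory of a dataset's bare Git repository in the store.

    Assumption (not stated in the abstract): the store splits the dataset
    UUID into its first three characters and the remainder.
    """
    # uuid.UUID validates the identifier and normalizes it to lowercase,
    # hyphenated form; malformed input raises ValueError.
    canonical = str(uuid.UUID(dataset_id))
    return PurePosixPath(store_root) / canonical[:3] / canonical[3:]


def ria_annex_locations(dataset_dir: PurePosixPath) -> dict:
    """Return the assumed locations of annexed content for one dataset.

    Either the loose keystore or the optional 7z archive (or both) may hold
    file content; both paths are assumptions about the layout.
    """
    return {
        "keystore": dataset_dir / "annex" / "objects",
        "archive": dataset_dir / "archives" / "archive.7z",
    }


if __name__ == "__main__":
    # Hypothetical store root and dataset UUID, for illustration only.
    example_id = "946e8cac-432b-11ea-aac8-f0d5bf7b5561"
    dataset_dir = ria_dataset_dir("/data/store", example_id)
    print(dataset_dir)
    for name, path in ria_annex_locations(dataset_dir).items():
        print(name, path)
```

Because a location depends only on the UUID, a dataset can be found in the store without any knowledge of its contents or scientific domain, which is in line with the domain-agnostic representation described in the text.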
Beyond flexible access to a remote storage location, the RIA remote thus allows for compression gains and can mitigate inode limitations while still providing read access to the (compressed) 7z archives. Its completely domain-agnostic data representation means that the remote data store can be backed up and maintained by experienced system administrators and data curators without a neuroscience background.

From a system-administrative perspective, computational best practices are incentivized. The separation of $DATA and $COMPUTE allows compute resources to be used for calculations rather than for data storage. Because the storage capacity of users’ $HOME locations (e.g., home directories, laptops, workstations) is limited, these locations are suited to data exploration and code development, whereas larger analyses must be staged on compute nodes. Inputs are fetched from $DATA, and analysis results are published back to the data store in the same way. This separation helps to prevent uncontrolled clutter and unmanaged dataset copies on shared infrastructure.

From a user’s perspective, technical overhead is kept minimal. Necessary configurations are distributed via system-wide DataLad procedures, and interactions with $DATA and $COMPUTE require no knowledge of the underlying implementation. When a new analysis dataset is created, sibling projects on the institute’s GitLab instance and in the data store are created automatically to allow data retrieval, data archival, result backup, and collaboration.

Results

Changes in computational infrastructure and work routines around DataLad enable disk-space aware computational routines for datasets of any size, simplified version control, automatic linkage between analyses and data, streamlined data publication routines and result backups, and automatic provenance capture.

Conclusions

We present a scalable, domain-agnostic data store that can be distributed over multiple machines. In conjunction with simplified data management workflows centered around DataLad datasets, researchers can employ reproducible, version-controlled, and FAIR scientific workflows on large-scale datasets with minimal technical overhead.

References

Button, K., Ioannidis, J., Mokrysz, C. et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci, 14, 365–376. doi:10.1038/nrn3475

Bzdok, D. & Yeo, B. T. T. (2017). Inference in the age of big data: Future perspectives on neuroscience. NeuroImage, 155, 549–564. https://doi.org/10.1016/j.neuroimage.2017.04.061

Halchenko, Y. O., Hanke, M., Poldrack, B., Meyer, K., Solanky, D. S., Alteva, G., ... Markiewicz, C. J. (2019, October 20). datalad/datalad 0.12.0rc6 (Version 0.12.0rc6). Zenodo. http://doi.org/10.5281/zenodo.3512712

Poldrack, B. & Hanke, M. ria-remote 0.6.1. https://github.com/datalad/git-annex-ria-remote

Turner, B. O., Paul, E. J., Miller, M. B. et al. (2018). Small sample sizes reduce the replicability of task-based fMRI studies. Commun Biol, 1, 62. doi:10.1038/s42003-018-0073-z

F1000 Research Ltd

Benjamin Poldrack Adina Wagner Alex Waite Laura Waite Michael Hanke

2025

