Search engine for discovering works of Art, research articles, and books related to Art and Culture
ShareThis
Javascript must be enabled to continue!

Compressive structural bioinformatics

View through CrossRef
We are developing compressed 3D molecular data representations and workflows (“Compressive Structural Bioinformatics”) to speed up mining and visualization of 3D structural data by one or more orders of magnitude. Our data representations allow scanning and analyzing the entire PDB archive in minutes or visualizing structures with millions of atoms in a web browser on a smart phone.   Compact and self-contained data representation – Existing text-based file formats for macromolecular data are slow to parse, are not easily extensible, and do not contain certain key data (e.g., all bonding information). For these reasons we have developed the Macromolecular Transmission Format (MMTF) ( http://mmtf.rcsb.org/ ). MMTF has three core benefits over existing file formats. First, through custom compression methods, the entire Protein Data Bank (PDB) archive can be stored in 7GB. This enables fast network transfer for visualization and in-memory processing of the entire PDB. Second, MMTF data are serialized into MessagePack ( http://msgpack.org ), a compact, extensible and efficient format, similar to JSON, but binary for faster parsing. Third, MMTF is user friendly, extensible and contains information not found in current formats. In this work we show that MMTF enables high-performance visualization and scalable structural analysis of the PDB archive. High-performance web-based visualization – The MMTF files are served directly from RAM using a RESTful service. This low latency service, combined with the reduced individual file size and the increased parsing speed of the binary format facilitates high performance web-based visualizations. Specifically we have seen a greater than 20x speedup over mmCIF in loading of PDB entries from sites across the USA, Europe, and Asia. Using the MMTF JavaScript API and NGL, a highly memory-efficient WebGL-based viewer ( https://github.com/arose/ngl ), even the largest structures in PDB can be visualized on a smart phone. High-performance distributed parallel workflows – The order of magnitude increase in parsing speed enables scalable Big Data analysis of 3D macromolecular structures. A Hadoop sequence file (binary flat file of key value pairs, optimized for parallel sequential access) of MMTF data is released and updated weekly for the entire PDB archive. Distributed, parallel processing is then possible from this file, using Big Data frameworks such as Apache Spark ( http://spark.apache.org/ ). As an example, we have used this file for ligand extraction. We extracted all ligands from the PDB using the MMTF Hadoop file with Apache Spark in about 3 minutes. In contrast, using mmCIF files as input, the same task took several hours. The MMTF file format enables a paradigm change for structural bioinformatics applications. It is now possible to store the entire PDB in memory to eliminate I/O bottlenecks, to rapidly visualize large structures over the web, and to trivially perform distributed parallel processing on laptops, desktops, and compute clusters. This project was supported by the National Cancer Institute of the NIH’s Big Data to Knowledge initiative (BD2K) under award number U01 CA198942.
Title: Compressive structural bioinformatics
Description:
We are developing compressed 3D molecular data representations and workflows (“Compressive Structural Bioinformatics”) to speed up mining and visualization of 3D structural data by one or more orders of magnitude.
Our data representations allow scanning and analyzing the entire PDB archive in minutes or visualizing structures with millions of atoms in a web browser on a smart phone.
  Compact and self-contained data representation – Existing text-based file formats for macromolecular data are slow to parse, are not easily extensible, and do not contain certain key data (e.
g.
, all bonding information).
For these reasons we have developed the Macromolecular Transmission Format (MMTF) ( http://mmtf.
rcsb.
org/ ).
MMTF has three core benefits over existing file formats.
First, through custom compression methods, the entire Protein Data Bank (PDB) archive can be stored in 7GB.
This enables fast network transfer for visualization and in-memory processing of the entire PDB.
Second, MMTF data are serialized into MessagePack ( http://msgpack.
org ), a compact, extensible and efficient format, similar to JSON, but binary for faster parsing.
Third, MMTF is user friendly, extensible and contains information not found in current formats.
In this work we show that MMTF enables high-performance visualization and scalable structural analysis of the PDB archive.
High-performance web-based visualization – The MMTF files are served directly from RAM using a RESTful service.
This low latency service, combined with the reduced individual file size and the increased parsing speed of the binary format facilitates high performance web-based visualizations.
Specifically we have seen a greater than 20x speedup over mmCIF in loading of PDB entries from sites across the USA, Europe, and Asia.
Using the MMTF JavaScript API and NGL, a highly memory-efficient WebGL-based viewer ( https://github.
com/arose/ngl ), even the largest structures in PDB can be visualized on a smart phone.
High-performance distributed parallel workflows – The order of magnitude increase in parsing speed enables scalable Big Data analysis of 3D macromolecular structures.
A Hadoop sequence file (binary flat file of key value pairs, optimized for parallel sequential access) of MMTF data is released and updated weekly for the entire PDB archive.
Distributed, parallel processing is then possible from this file, using Big Data frameworks such as Apache Spark ( http://spark.
apache.
org/ ).
As an example, we have used this file for ligand extraction.
We extracted all ligands from the PDB using the MMTF Hadoop file with Apache Spark in about 3 minutes.
In contrast, using mmCIF files as input, the same task took several hours.
The MMTF file format enables a paradigm change for structural bioinformatics applications.
It is now possible to store the entire PDB in memory to eliminate I/O bottlenecks, to rapidly visualize large structures over the web, and to trivially perform distributed parallel processing on laptops, desktops, and compute clusters.
This project was supported by the National Cancer Institute of the NIH’s Big Data to Knowledge initiative (BD2K) under award number U01 CA198942.

Related Results

Advancements in Biomedical and Bioinformatics Engineering
Advancements in Biomedical and Bioinformatics Engineering
Abstract: The field of biomedical and bioinformatics engineering is witnessing rapid advancements that are revolutionizing healthcare and medical research. This chapter provides a...
A large-scale analysis of bioinformatics code on GitHub
A large-scale analysis of bioinformatics code on GitHub
AbstractIn recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of sof...
Bioinformatics tool and web server development focusing on structural bioinformatics applications
Bioinformatics tool and web server development focusing on structural bioinformatics applications
This thesis is divided into two main sections: Part 1 describes the design, and evaluation of the accuracy of a new web server – PRotein Interactive MOdeling (PRIMO-Complexes) for ...
Design and Analysis of Three-Dimensional Printing of A Porous Titanium Scaffold
Design and Analysis of Three-Dimensional Printing of A Porous Titanium Scaffold
Abstract Objective To develop suitable structural designs for the three-dimensional (3-D) printing of a porous titanium scaffold to fill bone defects in knee joints. Pore d...
The Thermodynamic Impact of Compressive Fluctuations on the Solar Wind in the Inner Heliosphere
The Thermodynamic Impact of Compressive Fluctuations on the Solar Wind in the Inner Heliosphere
The solar wind plasma is observed to fluctuate over a broad range of space and time scales, extending from scales above the magnetic field correlation scale to below those associat...
The Role and Progress of Bioinformatics in Genomics Research
The Role and Progress of Bioinformatics in Genomics Research
With the rapid development of high-throughput sequencing technology, genomics research has entered the era of big data. Bioinformatics, as a bridge connecting biology, computer sci...

Back to Top