
Exhaustive Molecular String Enumeration for Data Augmentation and Structure Exploration

Molecular strings such as SMILES and SELFIES are linear representations of molecular graphs. Due to the inherent nonlinear structure of molecular graphs, a priori, there is no unique string representation for a given molecular graph but rather many alternative strings correspond to the same graph. This allows for the randomization of molecular strings to be used both as a means of data augmentation for training sequence-based deep learning models and, when applied to SELFIES strings, for the systematic exploration of chemical space through random string mutations and interpolations. However, established algorithms for the randomization of molecular strings only account for the randomization of the starting atom and the main and sub branches. They do not randomize the spanning trees of the molecular graphs. This results in a substantial number of overlooked molecular strings for structures with at least one ring. Herein, we present TYCHE, an algorithm for the exhaustive enumeration of all spanning trees of molecular graphs and, thus, for the generation of many more randomized molecular strings. TYCHE shows a systematic and robust performance increase when used for data augmentation, string mutation, and string interpolation. Several case studies showcase the large potential of TYCHE for impact in cheminformatics and molecular machine learning.
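To make the role of spanning trees concrete, the following is a minimal sketch and not the TYCHE algorithm itself: it enumerates spanning trees by brute force over bond subsets, which is only practical for small graphs. The function name `spanning_trees` and the atom numbering are assumptions chosen for illustration. It shows why acyclic structures have only one spanning tree, while each ring adds alternative choices of ring-closure bond that a string randomizer could use.

```python
from itertools import combinations


def spanning_trees(n_atoms, bonds):
    """Return every spanning tree of a connected graph as a tuple of bonds.

    Brute-force illustration only: tests each subset of n_atoms - 1 bonds.
    """
    trees = []
    # A spanning tree on n atoms has exactly n - 1 bonds and no cycle.
    for subset in combinations(bonds, n_atoms - 1):
        parent = list(range(n_atoms))  # union-find forest, one set per atom

        def find(i):
            # Follow parent links to the set representative (path halving).
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        acyclic = True
        for a, b in subset:
            ra, rb = find(a), find(b)
            if ra == rb:  # both ends already connected: this bond closes a cycle
                acyclic = False
                break
            parent[ra] = rb
        if acyclic:
            trees.append(subset)
    return trees


# Hypothetical heavy-atom skeletons, atoms numbered from 0.
chain = [(0, 1), (1, 2), (2, 3)]                       # acyclic, e.g. butane
ring6 = [(i, (i + 1) % 6) for i in range(6)]           # single six-membered ring
fused = ring6 + [(4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]  # two rings sharing bond 4-5

print(len(spanning_trees(4, chain)))   # 1
print(len(spanning_trees(6, ring6)))   # 6
print(len(spanning_trees(10, fused)))  # 35
```

Each spanning tree fixes which bonds are written as ring closures, so a randomizer that uses only one tree per structure cannot produce the strings that belong to the others.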

