SMILES-X: autonomous molecular compounds characterization for small datasets without descriptors

Lambard, Guillaume and Gracheva, Ekaterina (2020) SMILES-X: autonomous molecular compounds characterization for small datasets without descriptors. Machine Learning: Science and Technology, 1 (2). 025004. ISSN 2632-2153

Abstract

There is growing evidence that machine learning can be applied successfully in materials science and related fields. However, datasets in these fields are often small (from tens to several thousand samples). As a result, the most advanced machine learning techniques remain neglected, because they are considered applicable only to big data. Moreover, materials informatics methods often rely on human-engineered descriptors that must be carefully chosen, or even created, to fit the physicochemical property one intends to predict. In this article, we propose a new method that tackles both the issue of small datasets and the difficulty of developing task-specific descriptors. The SMILES-X is an autonomous pipeline for the characterisation of molecular compounds, based on an {Embed-Encode-Attend-Predict} neural architecture with data-specific Bayesian optimisation of hyper-parameters. The only input to the architecture, the SMILES strings, is de-canonicalised in order to augment the data efficiently. One of the key features of the architecture is the attention mechanism, which enables the interpretation of output predictions without extra computational cost. The SMILES-X achieves state-of-the-art results in the inference of aqueous solubility ($\overline{\mathrm{RMSE}}_{\mathrm{test}} \simeq 0.57 \pm 0.07$ mol/L), hydration free energy ($\overline{\mathrm{RMSE}}_{\mathrm{test}} \simeq 0.81 \pm 0.22$ kcal/mol, which is ∼24.5% better than molecular dynamics simulations), and octanol/water distribution coefficient ($\overline{\mathrm{RMSE}}_{\mathrm{test}} \simeq 0.59 \pm 0.02$ for LogD at pH 7.4) of molecular compounds. The SMILES-X is intended to become an important asset in the toolkit of materials scientists and chemists. The source code for the SMILES-X is available at github.com/GLambard/SMILES-X.
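
To illustrate how an attention step can make predictions interpretable at no extra cost, the following is a minimal sketch in plain Python, not the SMILES-X implementation. It assumes a simple additive scoring function; the names `attention_pool`, `w`, and `b`, and the toy per-token encodings, are hypothetical and chosen only for illustration. Each token of a SMILES string receives a weight, the weights sum to one, and the same weights that build the pooled vector can be read as per-token importance.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, w, b=0.0):
    """Collapse per-token vectors into one vector with additive attention.

    hidden_states: one encoded vector per SMILES token (list of lists).
    w, b: hypothetical learned scoring parameters (assumed, not from the paper).
    Returns (context_vector, attention_weights).
    """
    # One scalar score per token: tanh(w . h + b).
    scores = [
        math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b)
        for h in hidden_states
    ]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    # Weighted sum of token vectors; this is what a prediction head would consume.
    context = [
        sum(a * h[d] for a, h in zip(alphas, hidden_states))
        for d in range(dim)
    ]
    return context, alphas


# Toy input: the three tokens of the SMILES "CCO" with made-up 2-d encodings.
tokens = ["C", "C", "O"]
hidden = [[0.1, 0.2], [0.1, 0.2], [0.9, -0.4]]
w = [1.0, -1.0]

context, alphas = attention_pool(hidden, w)
for token, alpha in zip(tokens, alphas):
    print(f"{token}: {alpha:.3f}")
```

Because the weights are produced during the forward pass, inspecting them requires no additional model evaluation, which is the sense in which interpretation comes without extra computational cost.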

Item Type: Article
Subjects: Article Archives > Multidisciplinary
Depositing User: Unnamed user with email support@articlearchives.org
Date Deposited: 30 Jun 2023 05:14
Last Modified: 29 Feb 2024 04:36
URI: http://archive.paparesearch.co.in/id/eprint/1741
