Chemical topic modeling: Exploring molecular data sets using a common text-mining approach

Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl
J. Chem. Inf. Model., 2017, 57(8), 1816-1831
https://doi.org/10.1021/acs.jcim.7b00249

Abstract

Big data is one of the key transformative factors that increasingly influence all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance by technologies such as DNA-encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling these huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular datasets, we adopted a probabilistic framework called "topic modeling" from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to "chemical topics" and the relationships between those topics to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts by using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 dataset to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large dataset in which we could identify interesting topics like "proteins", "DNA", or "steroids".
Along with this publication, we provide our datasets and an open-source implementation of the new method (CheTo), which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
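
To make the idea of assigning molecules to "chemical topics" concrete, the following is a minimal sketch and not the CheTo implementation. It assumes latent Dirichlet allocation (a common topic model from text mining; the abstract does not name the specific model) fitted by collapsed Gibbs sampling, and it treats each molecule as a "document" whose "words" are substructure fragment labels. The molecule names, fragment labels, the function `fit_lda`, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical toy data: each molecule is a bag of made-up fragment labels.
# In practice the fragments would come from a cheminformatics toolkit.
molecules = {
    "mol_a": ["steroid_core", "hydroxyl", "ketone", "steroid_core"],
    "mol_b": ["steroid_core", "ketone", "methyl"],
    "mol_c": ["purine", "ribose", "phosphate", "phosphate"],
    "mol_d": ["purine", "ribose", "phosphate"],
}


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimized).

    Returns (theta, phi): per-molecule topic probabilities and
    per-topic fragment probabilities.
    """
    rng = random.Random(seed)
    names = list(docs)
    vocab = sorted({w for words in docs.values() for w in words})
    n_vocab = len(vocab)

    # Random initial topic for every fragment occurrence.
    assign = {d: [rng.randrange(n_topics) for _ in docs[d]] for d in names}
    doc_topic = {d: [0] * n_topics for d in names}
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d in names:
        for w, t in zip(docs[d], assign[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(n_iter):
        for d in names:
            for i, w in enumerate(docs[d]):
                # Remove the current assignment from the counts.
                t = assign[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * n_vocab)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    # Smoothed probability estimates from the final counts.
    theta = {
        d: [
            (doc_topic[d][k] + alpha) / (len(docs[d]) + n_topics * alpha)
            for k in range(n_topics)
        ]
        for d in names
    }
    phi = [
        {
            w: (topic_word[k][w] + beta) / (topic_total[k] + beta * n_vocab)
            for w in vocab
        }
        for k in range(n_topics)
    ]
    return theta, phi


theta, phi = fit_lda(molecules, n_topics=2)
for name, probs in theta.items():
    print(name, [round(p, 2) for p in probs])
```

Each row of `theta` is a probability distribution over topics for one molecule, and each entry of `phi` ranks fragments within a topic, which is what makes a topic interpretable in chemical terms. For datasets of realistic size, an established library implementation would be a more sensible choice than this hand-written sampler.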
