Blog #3: Molecules, Molecules, Molecules

I can’t believe it’s been a month and a half since I started this internship – time really flies by! Since the last blog post, I’ve mainly been working on converting the Monte Carlo Tree Search (MCTS) algorithm to work with molecules.

The way molecule generation works with MCTS is through repeated rewrites to existing molecules until a suitable one is found. We start with an arbitrary molecule as the root node, change one functional group with each child node, and keep iterating until some sequence of rewrites produces a molecule that suits our needs.
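As a rough sketch of the idea (all class and method names here are hypothetical, not the project's actual code), each tree node could pair a molecule with the usual MCTS statistics, with one child per applied rewrite:

```python
class MoleculeNode:
    """One node in the search tree: a molecule plus MCTS statistics.
    Illustrative only; the real project's classes differ."""

    def __init__(self, smiles, parent=None):
        self.smiles = smiles        # the molecule, e.g. as a SMILES string
        self.parent = parent
        self.children = []          # one child per applied rewrite
        self.visits = 0
        self.total_score = 0.0

    def expand(self, rewrites):
        """Create one child per chemically allowed rewrite of this molecule."""
        for new_smiles in rewrites(self.smiles):
            self.children.append(MoleculeNode(new_smiles, parent=self))

    def best_child(self):
        """Pick the child with the highest mean score (a stand-in for
        the full UCT selection rule)."""
        return max(self.children,
                   key=lambda c: c.total_score / max(c.visits, 1))
```

The search would then repeatedly select a promising node, expand it with allowed rewrites, score the new molecules, and back up the scores.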

To achieve this, we need to write a few more classes: a Rewrites class that dictates which rewrites (changes to or additions of functional groups) are chemically allowed, and a Gaussian Process Regression (GPR) class that tells us whether or not the newly created molecule is suitable for our needs.

For the Rewrites class, I used the Python RDKit library, which provides a way to represent molecules and lets us simulate all sorts of chemical reactions on them. Each rewrite is either an addition or a substitution: adding a new functional group to the existing molecule, or swapping one existing functional group for another. Using RDKit, I simulate the addition or substitution, determine whether it is chemically allowed, and if so, return the new molecule as a node in the MCTS tree.
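A minimal sketch of what a substitution rewrite can look like with RDKit (the function name and the phenol-to-amine example are my own illustration, not the project's code):

```python
from rdkit import Chem

def try_substitution(smiles, pattern_smarts, replacement_smiles):
    """Attempt to swap one functional group for another; return the new
    molecule's SMILES if the result is chemically valid, else None."""
    mol = Chem.MolFromSmiles(smiles)
    pattern = Chem.MolFromSmarts(pattern_smarts)
    replacement = Chem.MolFromSmiles(replacement_smiles)
    products = Chem.ReplaceSubstructs(mol, pattern, replacement)
    if not products:
        return None
    try:
        Chem.SanitizeMol(products[0])  # raises if chemically invalid
    except Exception:
        return None
    return Chem.MolToSmiles(products[0])

# e.g. replace phenol's hydroxyl group with an amine:
# try_substitution("Oc1ccccc1", "[OX2H]", "N")
```

Rewrites that fail sanitization are simply discarded, so only chemically plausible molecules become nodes in the tree.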

In creating the Rewrites class, I learned a lot about RDKit. I hadn't previously had much exposure to chemistry-related projects in Python, so it was really fun to explore a new branch of applications in the field. In the process, I even found a bug in RDKit and submitted my first GitHub issue!

For the GPR class, I load a pre-trained GPR model to predict the molecular properties of the newly created molecules. If those properties satisfy the constraints we set (for example, a HOMO/LUMO gap within a certain range), then we have found our new molecule.
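The acceptance check itself is simple; something like the sketch below, where `model` is any object with a scikit-learn-style `.predict()` and the bounds are placeholders, not the project's actual thresholds:

```python
def accept(model, features, low, high):
    """Predict one molecule's property with a pre-trained GPR model and
    check whether it falls inside the allowed window.
    `model`, `features`, and the bounds are all illustrative."""
    predicted = model.predict([features])[0]
    return low <= predicted <= high
```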

The GPR model uses a Marginalized Graph Kernel to compute similarities between different graphs, or, in this case, different molecules. However, computing the kernel matrix (the matrix that stores the pairwise similarities between molecules) every time we load the GPR model takes a while. Therefore, I worked on saving the kernel matrix after training, then loading it back into a precomputed kernel, which takes indices into the kernel matrix as input and returns the similarities between the indexed molecules. Since we simply load the kernel matrix instead of recalculating it every time, prediction becomes much faster.
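The save-and-reload trick can be sketched with NumPy as follows (function names are mine; the actual code works through the graph-kernel library's interfaces):

```python
import numpy as np

def save_kernel(K, path):
    """Persist the trained kernel matrix so it is computed only once."""
    np.save(path, K)

def load_precomputed_kernel(path):
    """Return a kernel function that just indexes into the stored matrix
    instead of re-evaluating the graph kernel from scratch."""
    K = np.load(path)

    def kernel(rows, cols):
        # similarities between the molecules at the given indices
        return K[np.ix_(rows, cols)]

    return kernel
```

Lookup into a stored matrix is essentially free, whereas evaluating a marginalized graph kernel involves solving a linear system per pair of graphs.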

However, the GPR models I had trained before left room for improvement. I had trained them without an optimizer, so the kernel parameters stayed fixed throughout training. With optimization, we search for a local minimum of the cost function, i.e., the combination of kernel parameters with the lowest cost in the surrounding region, which yields a more accurate model.
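The fixed-versus-optimized distinction is easy to see with scikit-learn's GPR (a stand-in here, not the module from this project): passing `optimizer=None` freezes the kernel hyperparameters, while the default runs L-BFGS-B on the log marginal likelihood.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 20).reshape(-1, 1)
y = np.sin(X).ravel()

# Fixed hyperparameters: the kernel's length scale never changes.
fixed = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                 optimizer=None, alpha=1e-6).fit(X, y)

# Default: L-BFGS-B maximizes the log marginal likelihood, so the
# fitted length scale moves away from its starting value.
tuned = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                 alpha=1e-6).fit(X, y)
```

The optimized model's log marginal likelihood is at least as good as the fixed one's, which is exactly the accuracy gain described above.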

The SciPy-based GPR module I had been working with had a convergence issue in its optimizer, resulting in incredibly slow training times: I would leave the GPR model to train overnight and wake up to find that barely any progress had been made. My research mentors provided me with a pre-release update of the Python GraphDot module, which includes its own GPR model. I'll be working on training my classifier with GraphDot's GPR model in the next few weeks!