Blog #4: Visualizations!

Woah, it’s been two months since the beginning of the internship! Since the last blog post, I’ve been working mostly on polishing the Monte Carlo Tree Search (MCTS) algorithm and on getting better results out of training the Gaussian Process Regression (GPR) model.

When I first designed the MCTS algorithm for molecules, I adapted it directly from my tic tac toe version, so a lot of the calculations were very similar, including the way the best node for the next iteration is chosen. In tic tac toe, I simply calculated the win rate over several simulations: the more often the game was won from a node, the better that node was. However, there are a few key differences in the molecular version.
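
A minimal sketch of that win-rate selection rule is below. The UCB1-style exploration term is the textbook MCTS addition, and the class and field names are illustrative rather than the actual project code.

```python
import math

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.wins = 0    # simulations won from this node
        self.visits = 0  # simulations run through this node

def ucb1_score(node, c=1.4):
    """Win rate plus an exploration bonus for rarely visited nodes."""
    if node.visits == 0:
        return float("inf")  # always try unvisited children first
    win_rate = node.wins / node.visits
    exploration = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return win_rate + exploration

def select_best_child(node):
    """Pick the child to descend into on the next iteration."""
    return max(node.children, key=ucb1_score)
```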

For one, there is no “win” or “lose” when we’re talking about designing molecules. Instead, there is a target property, the HOMO/LUMO (H/L) gap, and the closer a molecule’s gap is to the value we want, the better. On top of that, we can’t just look at the molecule at the next level of the tree in isolation; we also need to consider how good that node’s children are.

Therefore, we decided to model the H/L gaps of a node and all of its children as a normal distribution, with a mean H/L gap and a variance. The higher the PDF (probability density function) of that distribution at the target H/L value, the better the node. We also consider how many times that particular node has been simulated, in order to encourage exploration of different types of nodes.
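
Putting those two ideas together, the node score looks roughly like the sketch below. The exploration constant and the way the two terms are combined are my own illustration, not the project’s exact formula.

```python
import math

def gap_score(gaps, target_gap, visits, parent_visits, c=1.0):
    """Score a node by how well its subtree's H/L gaps cluster around the target.

    `gaps` holds the simulated H/L gaps of the node and its children.
    """
    if visits == 0:
        return float("inf")  # always try unvisited nodes first
    mean = sum(gaps) / len(gaps)
    var = max(sum((g - mean) ** 2 for g in gaps) / len(gaps), 1e-6)
    # Density of the target gap under N(mean, var): higher means the
    # subtree's gaps cluster around the value we want.
    pdf = math.exp(-(target_gap - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    # Exploration bonus favours nodes that have been simulated less often.
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return pdf + exploration
```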

Other than modifying the MCTS algorithm, I’ve also been testing different kernels for the GPR model. Now that I’m building my molecular graphs from RDKit molecules, there are a lot more graph properties from RDKit that I can take advantage of. For example, each atom (each node in the graph) now carries characteristics like its atomic number, formal charge, and adjacent hydrogen count. These new characteristics add information to the graph and will hopefully help the model train better.
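
For context, pulling those per-atom properties out of RDKit looks something like this. These are standard RDKit accessors; how the properties are packed into the graph that the kernel sees is project-specific and not shown here.

```python
from rdkit import Chem

def atom_features(smiles):
    """Extract per-atom properties from a molecule given as a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    return [
        {
            "atomic_num": atom.GetAtomicNum(),
            "charge": atom.GetFormalCharge(),
            "num_hs": atom.GetTotalNumHs(),  # adjacent (implicit + explicit) hydrogens
        }
        for atom in mol.GetAtoms()
    ]

print(atom_features("CCO"))  # ethanol: two carbons and one oxygen
```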

Furthermore, I learned to scale each property with a Constant kernel, so that through training the model can learn the importance of each characteristic. I also experimented with TensorProduct and Additive kernels (which multiply and add the per-property kernel values, respectively), as well as with normalizing the kernels before training. The TensorProduct kernels take an extremely long time to train because their gradients die out more easily, so reaching a local minimum takes much longer. Normalizing the kernels has the benefit of putting all the data on the same scale. Therefore, I ended up training with a normalized Additive kernel.
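
To show what the two compositions mean, here is a toy sketch with made-up per-atom similarity functions. The real model uses the graph kernel library’s microkernels, so this only illustrates the constant scaling, product-versus-sum composition, and normalization ideas.

```python
import math

# Stand-in per-property similarity functions between two atoms.
def k_element(a, b):
    return 1.0 if a["atomic_num"] == b["atomic_num"] else 0.3

def k_charge(a, b):
    return math.exp(-(a["charge"] - b["charge"]) ** 2)

def tensor_product_kernel(a, b, w):
    # Multiply the per-property kernels; a constant weight scales each one.
    return (w["element"] * k_element(a, b)) * (w["charge"] * k_charge(a, b))

def additive_kernel(a, b, w):
    # Add the per-property kernels instead of multiplying them.
    return w["element"] * k_element(a, b) + w["charge"] * k_charge(a, b)

def normalized(kernel, a, b, w):
    # k(a, b) / sqrt(k(a, a) * k(b, b)) puts every pair on the same scale.
    return kernel(a, b, w) / math.sqrt(kernel(a, a, w) * kernel(b, b, w))
```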

I also learned about t-distributed stochastic neighbor embedding (t-SNE), a method of embedding high-dimensional data into a lower-dimensional space, usually two or three dimensions, so that it can be visualized more clearly. The graph dataset is represented by the covariance matrix output by the Marginalized Graph Kernel. This covariance matrix has shape (dataset length, dataset length) and describes the similarity between every pair of molecules in the dataset. Passing that matrix into scikit-learn’s TSNE class, we get a set of coordinates that we can plot to visualize each datapoint in a two-dimensional graph.
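
A rough sketch of that pipeline is below. Converting the similarity matrix into a distance matrix first is one common choice (and an assumption on my part), not necessarily the exact preprocessing the project uses.

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_kernel_matrix(K, perplexity=50, learning_rate=50, seed=0):
    """Embed an (N, N) graph-kernel covariance matrix into 2D with t-SNE."""
    d = np.sqrt(np.diag(K))
    K_norm = K / np.outer(d, d)                           # cosine-like normalization
    dist = np.sqrt(np.maximum(2.0 - 2.0 * K_norm, 0.0))   # similarity -> distance
    tsne = TSNE(
        n_components=2,
        metric="precomputed",
        init="random",        # required by scikit-learn when the metric is precomputed
        perplexity=perplexity,
        learning_rate=learning_rate,
        random_state=seed,
    )
    return tsne.fit_transform(dist)                       # (N, 2) coordinates to plot
```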

However, there are a few hyperparameters to play around with in t-SNE. The “perplexity” parameter roughly controls how many nearest neighbours of each datapoint are considered in the calculation, and different values can produce drastically different embeddings. The learning rate is another hyperparameter to tune. According to the Scikit-learn documentation, “if the learning rate is too high, the data may look like a ‘ball’ with any point approximately equidistant from its nearest neighbours. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers.” Therefore, I experimented with different learning rates and perplexity values.
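
The sweep itself can be as simple as looping over a grid of values and plotting each embedding side by side, reusing embed_kernel_matrix() from the sketch above (the specific grid values here are just examples, and K is the kernel matrix from before).

```python
import itertools
import matplotlib.pyplot as plt

perplexities = [10, 30, 50]
learning_rates = [10, 50, 200]

fig, axes = plt.subplots(len(perplexities), len(learning_rates), figsize=(12, 12))
for (i, p), (j, lr) in itertools.product(enumerate(perplexities), enumerate(learning_rates)):
    coords = embed_kernel_matrix(K, perplexity=p, learning_rate=lr)
    axes[i, j].scatter(coords[:, 0], coords[:, 1], s=5)
    axes[i, j].set_title(f"perplexity={p}, lr={lr}")
plt.tight_layout()
plt.show()
```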

I’ve included a final visualization of the dataset below, calculated with both the learning rate and the perplexity set to 50.