Interactive filtering of predicted ligands

There are many techniques to generate a ranked list of molecules having some desirable property, such as activity at a protein target of interest. The problem here is that the number of predicted ligands could easily exceed the synthesis and assay budget. Further issues are that some proportion of the ligands:

The list could go on, but basically ranked predictions should be accompanied by some method of filtering. A large-scale example of this is the Ultra-large library docking for discovering new chemotypes. It usually involves some visual inspection by a medicinal chemist or pharmacologist as well as automated filtering.

This page demonstrates how to do this in an interactive way using RDKit and a Jupyter notebook.

What it looks like:

The interactive parts don’t render on this page, but you can see a gif of the results below. You can also download the notebook and example SMILES file in this repo.

example

The code:

It all boils down to filtering a pandas dataframe using ipywidgets, so if you don’t have that then install and enable first:

pip install ipywidgets
jupyter nbextension enable --py widgetsnbextension

For synthetic accessibility scores, you’ll also need sascorer.py and fpscores.pkl.gz which are available here

Setup:

#imports:
import pandas as pd
import numpy as np

import ipywidgets as widgets
from ipywidgets import interact, interact_manual

import rdkit
from rdkit import Chem
from rdkit.Chem import Draw, AllChem, rdFingerprintGenerator, PandasTools, Descriptors, FilterCatalog
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import DataStructs
from rdkit.Chem.FilterCatalog import *

import sascorer
from rdkit.Chem.Descriptors import qed

#for clustering (not essential:)
from sklearn.metrics import pairwise_distances
from sklearn.cluster import AgglomerativeClustering

Here I’m using a random selection of ligands as a stand-in for a real ranked list. I’m also using some randomly selected ligands for the known positives at an imaginary target-of-interest:

predicted = pd.read_csv('random_predicted.csv')
true_pos = pd.read_csv('random_known.csv')

##Adding a 'molecules' column using rdkit pandastools:
PandasTools.AddMoleculeColumnToFrame(predicted,'canonical_smiles','molecules')
PandasTools.AddMoleculeColumnToFrame(true_pos,'canonical_smiles','molecules')

The below adds a bunch of descriptors to the predicted dataframe, such as:

you can easily add your own like logp, molwt, fragments (i.e. carboxylic acids using Descriptors.fr_COO), tpsa, etc…

#generate fingerprints of predicted ligands and known ligands:
gen_mo = rdFingerprintGenerator.GetMorganGenerator(fpSize=2048, radius=2)
predicted_fps = [gen_mo.GetFingerprint(mol) for mol in predicted['molecules']]
true_fps = [gen_mo.GetFingerprint(mol) for mol in true_pos['molecules']]

#create a list holding the highest similarity to a known ligand.
similarities= list()
for count, mol in enumerate(predicted_fps):
    tanimoto_values = ([DataStructs.TanimotoSimilarity(mol, i) for i in true_fps])
    index_of_highest = np.argmax(tanimoto_values)
    similarities.append(tanimoto_values[index_of_highest])
    
#create a list holding the 'synthetic accessibility score'
#reference: https://doi.org/10.1186/1758-2946-1-8
#module code is in: https://github.com/rdkit/rdkit/tree/master/Contrib/SA_Score
sa_score = [sascorer.calculateScore(i) for i in list(predicted['molecules'])]


#create a list holding the QED drug-likeness score
#reference: https://doi.org/10.1038/nchem.1243
qeds = [qed(mol) for mol in predicted['molecules']]



#create a list holding logp:
logp = [Descriptors.MolLogP(m) for m in predicted['molecules']]


#filter catalog usage instructions are here: https://github.com/rdkit/rdkit/pull/536
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.BRENK)
catalog = FilterCatalog(params)
brenk = np.array([catalog.HasMatch(m) for m in predicted['molecules']])


#add these lists as columns to the 'predicted' pd.DataFrame
predicted['similarities']=similarities
predicted['sa_score']=sa_score
predicted['qeds']=qeds
predicted['logp']=logp



predicted.head()
instance_id canonical_smiles molecules similarities sa_score qeds logp
0 CHEMBL1384221 FC(F)(F)c1cccc(c1)C(=O)c2c[nH]c(c2)C(=O)NCCCN3CCOCC3 Mol 0.326923 2.398588 0.544857 2.71660
1 CHEMBL3319491 CC(Nc1cc(F)cc(Cl)c1)C2=CC(=CN3C(=O)C=C(N=C23)N4CCOCC4)C(=O)N(C)C Mol 0.305085 3.340177 0.612784 3.19850
2 CHEMBL183903 C[C@@H]1OC(=O)[C@H]2C[C@H]3CCCC[C@@H]3C(CCC4CCC(C)(C)N4C)[C@@H]12 Mol 0.203883 4.735916 0.696426 4.64340
3 CHEMBL3702250 CN(C)CC#CCNC(=O)c1sc2ncnc(Nc3ccc(F)cc3OC(CF)CF)c2c1C Mol 0.310078 3.218455 0.442811 3.86362
4 CHEMBL62778 COc1ccc2c(c1)nc(N3CCNCC3)n2Cc4ccccc4 Mol 0.326087 2.069884 0.801327 2.50280

Make it interactive

This boils down to building a mask based on your chosen level of the above parameters, which is then used to filter the predicted list and return the remaining ranked hits. The function returns the rdkit Draw.MolsToGridImage to visualize the remaining molecules.

Instructions:

After this, it’s up to the user to select their desired top x ligands.

@interact
def get_top_ligands(similarity=(0.2, 1, 0.01), sa_score=4, qeds=0.25, logp=6, num_mols=(1, 25, 1)):
    shortlist_mask = ((predicted['similarities']<similarity) & 
                      (predicted['sa_score']<sa_score) &
                      (predicted['qeds'] > qeds) &
                      (predicted['logp']<logp) &
                      (~brenk) )

    return Draw.MolsToGridImage(predicted[shortlist_mask]['molecules'].iloc[:num_mols],
                                maxMols=num_mols,
                               molsPerRow=6)
interactive(children=(FloatSlider(value=0.6000000000000001, description='similarity', max=1.0, min=0.2, step=0…

Bonus - cluster and return only the top-ranked scaffold per cluster

You don’t want the hit list to be full of a single scaffold with different R groups. Better to test multiple scaffolds and increase the probability of discovering a new chemotype. Clustering first, and returning only the top-ranked hit, removes closely-related analogs.

I use sklearn’s AgglomerativeClustering because it gives nice, single-scaffold clusters. It’s not so great on the randomly selected example becaus there aren’t as many analogs but it’s an OK demo.

Instructions:

pairwise_distances_all = pairwise_distances(np.array(predicted_fps), metric='dice')
/Users/lmar3213/miniconda3/envs/lew_conda/lib/python3.7/site-packages/sklearn/metrics/pairwise.py:1575: DataConversionWarning: Data was converted to boolean for metric dice
  warnings.warn(msg, DataConversionWarning)
##This creates a new column with the cluster labels each time you reload the function below. 
def get_cluster_ids(clusterer, shortlist_mask):
    cluster_labels = list()
    count = 0
    for idx in range(len(shortlist_mask)):
        if shortlist_mask[idx]==True:
            cluster_labels.append(clusterer.labels_[count])
            count+=1
        else:
            cluster_labels.append(None)
    return cluster_labels


@interact
def get_top_ligands(similarity=(0.2, 1, 0.01), 
                    sa_score=4, 
                    qeds=0.25, 
                    logp=6, 
                    num_mols=(1, 25, 1),
                   dist_threshold=(0.1,1, 0.01)):
    
    shortlist_mask = ((predicted['similarities']<similarity) & 
                      (predicted['sa_score']<sa_score) &
                      (predicted['qeds'] > qeds) &
                      (predicted['logp']<logp) &
                      (~brenk) )
    
    clusterer = AgglomerativeClustering(n_clusters=None,
                                distance_threshold=dist_threshold, 
                                affinity='precomputed', 
                                linkage='average')
    clusterer.fit(pairwise_distances_all[shortlist_mask])
    predicted['cluster_labels']=get_cluster_ids(clusterer, shortlist_mask)

    return Draw.MolsToGridImage(predicted[shortlist_mask].drop_duplicates(subset='cluster_labels',keep='first')['molecules'].iloc[:num_mols],
                                maxMols=num_mols,
                               molsPerRow=6)
interactive(children=(FloatSlider(value=0.6000000000000001, description='similarity', max=1.0, min=0.2, step=0…