Download active ligands from ChEMBL in Python
GitHub repo
People often ask "How many / what kinds of ligands are known to hit Protein X?" Answering this involves querying a database of ligand activities such as ChEMBL, BindingDB, or PubChem. These offer .csv downloads after a manual search, or full database downloads for running SQL queries.
Alternatively, it's nice to have an automated approach that can slot into a workflow. I've rewritten this code enough times that it's worth writing up as a reference, even if just for myself. The code below queries the ChEMBL server through its web services API and downloads the associated active ligands, where 'active' is defined by a pChEMBL threshold (pChEMBL is the negative log10 of the measured molar activity, so a value of 5 corresponds to 10 µM). It doesn't require installing any uncommon packages and runs in about a minute (depending on the number of ligands), producing a pandas DataFrame with SMILES strings, pChEMBL values, and some other ligand properties.
Just Ctrl-C + Ctrl-V, set the ChEMBL accession code and the pChEMBL cutoff, and run:
import requests
import json
import pandas as pd
pchembl = 5  # pChEMBL cutoff; 5 is equivalent to 10 uM
chembl_accession = 'CHEMBL4333'  # S1P receptor
n = 10_000  # try loading n records first; if more than n are listed, they are queried again below
print('Counting bioactivities:')
urlString = "https://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl"
urlString += "_id__exact=%s&pchembl_value__gt=%s&limit=%s" % (chembl_accession, pchembl, n)
webQuery = json.loads(requests.get(urlString).content)
total_count = webQuery['page_meta']['total_count']
print(f'{total_count} ligands listed')
if total_count > n:
    print('Lots of ligands, loading more...')
    urlString = "https://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl_id"
    urlString += "__exact=%s&pchembl_value__gt=%s&limit=%s" % (chembl_accession, pchembl, total_count)
    webQuery = json.loads(requests.get(urlString).content)
print('Done')
activities = webQuery['activities']
df = pd.DataFrame(activities)
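# A single response only contains one page of results; follow the 'next'
# links in page_meta until every activity record has been collected.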
while len(df) < total_count:
    urlString = "https://www.ebi.ac.uk" + webQuery['page_meta']['next']
    print(urlString)
    webQuery = json.loads(requests.get(urlString).content)
    activities = webQuery['activities']
    df = pd.concat([df, pd.DataFrame(activities)])
    print(f'Loaded {len(df)} of {total_count}')
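The resulting DataFrame has one row per reported activity, so the same compound can appear several times. Below is a minimal clean-up sketch, assuming the standard activity fields molecule_chembl_id, canonical_smiles, and pchembl_value are present; treat the column selection and the "keep the most potent measurement" rule as illustrative choices rather than part of the API.

# keep just the identifier, structure, and potency columns
df = df[['molecule_chembl_id', 'canonical_smiles', 'pchembl_value']]

# pchembl_value may come back as a string; make it numeric for sorting and filtering
df['pchembl_value'] = pd.to_numeric(df['pchembl_value'], errors='coerce')

# one row per compound, keeping the most potent reported measurement
df = (df.sort_values('pchembl_value', ascending=False)
        .drop_duplicates('molecule_chembl_id')
        .reset_index(drop=True))

df.to_csv(f'{chembl_accession}_actives.csv', index=False)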