Download active ligands from ChEMBL in Python
GitHub repo
People often ask "How many / what kinds of ligands are known to hit Protein X?" Answering this involves querying a database of ligand activities such as ChEMBL, BindingDB, or PubChem. These offer .csv downloads after a manual search, or full database downloads for running SQL queries.
Alternatively, it's nice to have an automated approach that can slot into a workflow. I've rewritten this code enough times that it's worth writing up as a reference, even if just for myself. The code below queries the ChEMBL server through its web services API and downloads the associated active ligands, where 'active' is defined by a pChEMBL threshold (pChEMBL is the negative log10 of the measured molar activity, so a value of 5 corresponds to 10 µM). It doesn't require installing any uncommon packages and runs in about a minute (depending on the number of ligands), producing a pandas DataFrame with SMILES strings, pChEMBL values, and some other ligand properties.
Just Ctrl-C + Ctrl-V, set the ChEMBL accession code and the pChEMBL cutoff, and run:
import requests
import json
import pandas as pd
pchembl = 5  # pChEMBL cutoff; 5 is equivalent to 10 uM
chembl_accession = 'CHEMBL4333'  # S1P receptor
n = 10_000  # try loading n records first; if more than n are listed, they are queried again below
print('Counting bioactivities:')
urlString = "https://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl"
urlString += "_id__exact=%s&pchembl_value__gt=%s&limit=%s" % (chembl_accession, pchembl, n)
webQuery = json.loads(requests.get(urlString).content)
total_count = webQuery['page_meta']['total_count']
print(f'{total_count} ligands listed')
if total_count > n:
    print('Lots of ligands, loading more...')
    urlString = "https://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl_id"
    urlString += "__exact=%s&pchembl_value__gt=%s&limit=%s" % (chembl_accession, pchembl, total_count)
    webQuery = json.loads(requests.get(urlString).content)
print('Done')
activities = webQuery['activities']
df = pd.DataFrame(activities)
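# A single response only contains one page of results; follow the 'next'
# links in page_meta until every activity record has been collected.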
while len(df) < total_count:
    urlString = "https://www.ebi.ac.uk" + webQuery['page_meta']['next']
    print(urlString)
    webQuery = json.loads(requests.get(urlString).content)
    activities = webQuery['activities']
    df = pd.concat([df, pd.DataFrame(activities)])
    print(f'Loaded {len(df)} of {total_count}')
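The resulting DataFrame has one row per reported activity, so the same compound can appear several times. Below is a minimal clean-up sketch, assuming the standard activity fields molecule_chembl_id, canonical_smiles, and pchembl_value are present; treat the column selection and the "keep the most potent measurement" rule as illustrative choices rather than part of the API.

# keep just the identifier, structure, and potency columns
df = df[['molecule_chembl_id', 'canonical_smiles', 'pchembl_value']]

# pchembl_value may come back as a string; make it numeric for sorting and filtering
df['pchembl_value'] = pd.to_numeric(df['pchembl_value'], errors='coerce')

# one row per compound, keeping the most potent reported measurement
df = (df.sort_values('pchembl_value', ascending=False)
        .drop_duplicates('molecule_chembl_id')
        .reset_index(drop=True))

df.to_csv(f'{chembl_accession}_actives.csv', index=False)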