Fetch PDBs and bound ligands programmatically

See the gist on GitHub.

A common starting point in research on protein-ligand interactions is to take stock of all the X-ray and cryo-EM structures deposited in the public database, the RCSB PDB. Advanced search options exist, but a normal RCSB search often returns 'incorrect' results because of non-standard naming (e.g. proteins with multiple common names), and it's not clear how deep you need to search to find every relevant entry.

An alternative is to do this programmatically in Python, searching the RCSB REST API by gene name. This fits into existing code workflows, is repeatable, and takes about ten seconds. And because you are already in Python, you can integrate with RDKit and parse the co-crystallized ligands directly.

The code below is a copy-and-pastable snippet to do this. First, set the gene name to the protein of interest. The script queries the RCSB REST API for all associated protein structures, including co-crystallized ligands (called non-polymer entities), then queries the RCSB GraphQL interface for each ligand's SMILES. The output is a pandas dataframe with PDB codes, SMILES strings, and ligand pictures for your perusal.
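To make the two-step flow concrete, here is a minimal standard-library sketch of it. This is not the code from the gist. The search attribute `rcsb_entity_source_organism.rcsb_gene_name.value`, the GraphQL field names, and every helper name are assumptions that should be checked against the current RCSB API documentation.

```python
import json
import urllib.request

# Endpoints as I understand them; verify against the RCSB docs before relying on them.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"
GRAPHQL_URL = "https://data.rcsb.org/graphql"


def build_search_query(gene_name):
    """Search payload: every entry whose source gene name matches exactly."""
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                # Assumed attribute name for the gene-name field.
                "attribute": "rcsb_entity_source_organism.rcsb_gene_name.value",
                "operator": "exact_match",
                "value": gene_name,
            },
        },
        "return_type": "entry",
        "request_options": {"return_all_hits": True},
    }


def build_graphql_query(pdb_ids):
    """GraphQL query for the non-polymer entities (ligands) of each entry."""
    ids = json.dumps(list(pdb_ids))  # a JSON list of strings is also valid GraphQL
    return (
        "{ entries(entry_ids: %s) { rcsb_id nonpolymer_entities { nonpolymer_comp {"
        " chem_comp { id name } rcsb_chem_comp_descriptor { SMILES } } } } }" % ids
    )


def flatten_ligands(graphql_response):
    """Turn the nested GraphQL response into one flat row per (entry, ligand)."""
    rows = []
    for entry in (graphql_response.get("data") or {}).get("entries") or []:
        if not entry:
            continue  # skip IDs the service could not resolve
        for entity in entry.get("nonpolymer_entities") or []:
            comp = entity.get("nonpolymer_comp") or {}
            chem = comp.get("chem_comp") or {}
            descriptor = comp.get("rcsb_chem_comp_descriptor") or {}
            rows.append({
                "pdb_id": entry["rcsb_id"],
                "ligand_id": chem.get("id"),
                "ligand_name": chem.get("name"),
                "smiles": descriptor.get("SMILES"),
            })
    return rows


def post_json(url, payload):
    """POST a JSON payload and decode the JSON reply (empty body -> {})."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    return json.loads(body) if body else {}


def fetch_ligands(gene_name):
    """Gene name -> list of ligand rows; performs two network calls."""
    hits = post_json(SEARCH_URL, build_search_query(gene_name))
    pdb_ids = [hit["identifier"] for hit in hits.get("result_set", [])]
    if not pdb_ids:
        return []
    response = post_json(GRAPHQL_URL, {"query": build_graphql_query(pdb_ids)})
    return flatten_ligands(response)
```

Calling `fetch_ligands` with a gene name returns a list of dicts, which can be passed straight to `pandas.DataFrame` to get the table. RDKit's PandasTools can then render the SMILES column as ligand pictures. Keeping the payload builders and the flattening step free of network access makes them easy to test offline.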

This workflow could easily be extended by, for example, filtering ligands for drug-like properties to remove things like lipids or sulfates, or integrating with py3Dmol to visualize binding sites. Although the latter is done really well on the RCSB website, it's a pain to hold 15 structures in tabs while trying to visualize them together.
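As a cruder stand-in for a true drug-likeness filter, one dependency-free option is an exclusion list of PDB chemical component IDs for common buffer and crystallization additives. The sketch below assumes each ligand is a dict with a `ligand_id` key holding the component ID. That key name, the helper name, and the deliberately incomplete list are all illustrative choices, not part of the original gist.

```python
# Illustrative, incomplete exclusion list of PDB chemical component IDs:
# sulfate, phosphate, acetate, glycerol, ethylene glycol, diethylene glycol,
# DMSO, and a few common ions.
COMMON_ADDITIVES = frozenset({
    "SO4", "PO4", "ACT", "GOL", "EDO", "PEG", "DMS",
    "CL", "NA", "MG", "CA", "ZN",
})


def drop_common_additives(ligands, excluded=COMMON_ADDITIVES):
    """Keep only ligands whose component ID is not in the exclusion list."""
    return [
        ligand for ligand in ligands
        if (ligand.get("ligand_id") or "").upper() not in excluded
    ]
```

An exclusion list will not catch every lipid or detergent. A property-based filter computed with RDKit from the SMILES (molecular weight, heavy-atom count, and so on) would be closer to a real drug-likeness filter, at the cost of an extra dependency.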