Preparing protein PDBs for docking

Repo
Notebook

'DockStream' was released recently. It is software from a team at AstraZeneca for driving reinforcement learning models to generate molecules that dock well. Part of its pipeline is preparing PDB files for docking.

This post shares my workflow for preparing PDB files. Like DockStream, it uses PDBFixer. In addition, it rotates the target coordinates so that the ligand's principal axes of inertia lie along the xyz axes. Because the docking box is itself aligned to the xyz axes, this lets a smaller box fully cover the ligand cavity, which reduces docking time slightly. The actual conversion to PDBQT is a relatively small step, since OpenBabel does it reliably in a one-liner from the CLI.
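The alignment step can be sketched with NumPy. This is a minimal illustration rather than the script from the repo: the function names (`principal_axes_frame`, `apply_frame`, `box_volume`, `obabel_pdbqt_command`), the 4 Å padding, and the unit-mass default are all assumptions made for the example. The OpenBabel helper shows one plausible form of the one-liner, using the `-xr` flag to write a rigid receptor.

```python
import numpy as np


def principal_axes_frame(ligand_xyz, masses=None):
    """Return (center, rotation) for the ligand's principal-axis frame.

    Rows of `rotation` are the principal axes, ordered from smallest to
    largest moment of inertia, so the ligand's longest direction maps to x.
    """
    xyz = np.asarray(ligand_xyz, dtype=float)
    # Assumption: unit masses unless real atomic masses are supplied.
    m = np.ones(len(xyz)) if masses is None else np.asarray(masses, dtype=float)
    center = np.average(xyz, axis=0, weights=m)
    d = xyz - center
    # Inertia tensor: I = sum_i m_i (|r_i|^2 * Id - r_i r_i^T)
    second_moment = np.einsum("i,ij,ik->jk", m, d, d)
    inertia = np.trace(second_moment) * np.eye(3) - second_moment
    _, evecs = np.linalg.eigh(inertia)  # eigenvalues ascending, axes in columns
    if np.linalg.det(evecs) < 0:
        evecs[:, 2] *= -1  # keep a proper (right-handed) rotation, no mirroring
    return center, evecs.T


def apply_frame(xyz, center, rotation):
    """Express coordinates in the frame returned by principal_axes_frame."""
    return (np.asarray(xyz, dtype=float) - center) @ rotation.T


def box_volume(xyz, padding=4.0):
    """Volume of an axis-aligned box around xyz, padded on every side."""
    xyz = np.asarray(xyz, dtype=float)
    extent = xyz.max(axis=0) - xyz.min(axis=0)
    return float(np.prod(extent + 2.0 * padding))


def obabel_pdbqt_command(pdb_in, pdbqt_out):
    """Build an OpenBabel CLI call that writes a rigid-receptor PDBQT."""
    return ["obabel", pdb_in, "-O", pdbqt_out, "-xr"]
```

The same `center` and `rotation` must be applied to the receptor atoms as well, for example `apply_frame(receptor_xyz, center, rotation)`, so that protein and ligand stay in register. The command list can be passed to `subprocess.run(..., check=True)` once the aligned PDB has been written.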

It also demonstrates further processing with OpenMM. PDB files often have issues such as missing atoms and bonds, and PDBFixer repairs these using known bonding templates for protein residues. Sometimes, however, the coordinates themselves are nonphysical, which calls for minimization with a force field. I've noticed that minimizing without a bound ligand can 'crush' the binding cavity, so it helps to have the ligand present. That in turn means the ligand needs force field parameters of its own! Luckily the OpenMM folks thought of that and provide openmmforcefields. The script uses GAFF as the ligand force field.
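A rough sketch of that PDBFixer plus OpenMM route is below. It is not the script from the repo, and several things in it are assumptions: the function names, the `amber14-all.xml` protein force field, vacuum minimization with `NoCutoff`, pH 7.0, and a ligand supplied as a separate SDF file with explicit hydrogens and 3D coordinates. It also assumes a recent openff-toolkit (where conformers carry `openff.units` quantities) and an AmberTools installation so GAFF can assign charges. The imports are done inside the functions only because these are heavy optional dependencies.

```python
def fix_receptor(pdb_path, ph=7.0):
    """Repair a receptor PDB with PDBFixer and return the fixer object."""
    from pdbfixer import PDBFixer

    fixer = PDBFixer(filename=pdb_path)
    fixer.findMissingResidues()
    fixer.findNonstandardResidues()
    fixer.replaceNonstandardResidues()
    # Drops waters and the bound ligand; the ligand is re-added from the SDF.
    fixer.removeHeterogens(keepWater=False)
    fixer.findMissingAtoms()
    fixer.addMissingAtoms()
    fixer.addMissingHydrogens(ph)
    return fixer


def minimize_with_ligand(pdb_path, ligand_sdf, out_pdb, ph=7.0):
    """Minimize the repaired receptor with the ligand present (GAFF ligand)."""
    from openff.toolkit.topology import Molecule
    from openff.units.openmm import to_openmm
    from openmm import LangevinMiddleIntegrator, unit
    from openmm.app import ForceField, Modeller, NoCutoff, PDBFile, Simulation
    from openmmforcefields.generators import GAFFTemplateGenerator

    fixer = fix_receptor(pdb_path, ph)
    # Assumption: the SDF holds a single ligand in its bound pose.
    ligand = Molecule.from_file(ligand_sdf)

    # Merge protein and ligand into one topology.
    modeller = Modeller(fixer.topology, fixer.positions)
    modeller.add(ligand.to_topology().to_openmm(), to_openmm(ligand.conformers[0]))

    # Protein parameters from Amber14; ligand templates generated with GAFF.
    forcefield = ForceField("amber14-all.xml")
    gaff = GAFFTemplateGenerator(molecules=ligand)
    forcefield.registerTemplateGenerator(gaff.generator)

    system = forcefield.createSystem(modeller.topology, nonbondedMethod=NoCutoff)
    integrator = LangevinMiddleIntegrator(
        300 * unit.kelvin, 1 / unit.picosecond, 0.002 * unit.picoseconds
    )
    simulation = Simulation(modeller.topology, system, integrator)
    simulation.context.setPositions(modeller.positions)
    simulation.minimizeEnergy()

    state = simulation.context.getState(getPositions=True)
    with open(out_pdb, "w") as handle:
        PDBFile.writeFile(modeller.topology, state.getPositions(), handle)
```

A call would look like `minimize_with_ligand("target.pdb", "ligand.sdf", "minimized.pdb")`, where all three file names are placeholders. Keeping the ligand in the `Modeller` topology is what stops the cavity from collapsing during minimization.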

A system prepared this way can also go directly into MD. It is an equally valid way to take a docked ligand into MD, right after Matching Bond Topology to PDBQT Files.