PDF transcription in OpenCode

Gist link to transcription 'skill'.

One of my biggest use-cases for LLMs has been PDF summary and review. For example: "did this crystal structure publication report a potency? What assay did they use? Is it commercial or bespoke? Cellular or cell-free?" and so on. Sure, Claude is great for this - drop in related open-access PDFs and you can summarise a whole field in a few hours. For whatever reason, though, Anthropic's content filter has become too sensitive, and legitimate research into chemistry and drug discovery topics now kills these review sessions quite often.

So it would be nice to have a backup approach. Internally, Claude converts an academic PDF into text before tokenizing that text and holding it in its context for downstream summary or review. Remarkably, the best way to extract the text is not to search inside the PDF for the text that's already there, but to simply take a PNG snapshot of each page (using, e.g., pdftoppm), run these image files one by one through a vision model, and ask that model for a transcription. It sounds like this would be really noisy, but somehow open weights/code models have reached a level where it's not. A vision model also handles the strange PDF-specific arrangements of multiple columns, text boxes, intervening page numbers and footers, and can of course summarise figure contents.
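
To make the snapshot step concrete, a single pdftoppm call renders one PNG per page (paper.pdf and the "page" prefix here are just placeholder names):

# render every page of paper.pdf at 150 dpi, producing
# page-1.png, page-2.png, ... (zero-padded for longer documents)
pdftoppm -png -r 150 paper.pdf page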

For my own particular setup I chose OpenCode, with which I have no affiliation, but which I can recommend: it works. They advertise that their cloud providers process your data in the US and have a zero data retention policy. Nonetheless it's worth saying: don't upload non-public data that would break a CDA! OpenCode offers open weights models, but the data still leaves your computer. I set up billing as pay-per-token (which they call Zen) to avoid being throttled, and for PDF (PNG) transcription I have been using Kimi K2.5, which is a vision model. This approach requires a 'SKILL.md' file that instructs the model how to perform the transcription from the pre-processed PNG files - I posted my version of this 'skill' in the gist above, and a rough sketch of its shape appears after the script below. Extracting the PDF pages to PNG is very easy to script and doesn't require natural language processing, so I do it outside the skill, ask opencode to transcribe the PNG files on the fly, and then clean up, like this:


for pdf in ./pdfs/*.pdf; do
    base=$(basename "$pdf" .pdf)
    echo "$base"
    [ -f "$base.md" ] && continue   # skip if already processed
    mkdir -p "$base"
    # render each page of the PDF to a numbered PNG at 150 dpi
    pdftoppm -png -r 150 "$pdf" "$base/page"
    # the transcription skill picks up the page images in $base
    opencode run --model opencode/kimi-k2.5 "transcribe the PDF contents in ./$base"
    rm -r "$base"   # remove the page images, keeping only the markdown
done
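
My actual skill is in the gist above, but to give a flavour of its shape, a minimal SKILL.md might look something like this (an illustrative sketch with placeholder name and wording, not my exact version):

---
name: pdf-transcription
description: Transcribe pre-rendered PDF page images into clean markdown
---

When asked to transcribe a directory of page images:

1. List the PNG files and process them in page order.
2. Transcribe all text faithfully, merging multi-column layouts into a single natural reading order.
3. Drop running headers, footers and page numbers; render section headings as markdown headings.
4. Replace each figure with its caption plus a short description of its contents.
5. Concatenate the pages into one markdown file named after the directory.

The reading-order instruction is the important one: a sensible linear ordering of columns and text boxes is exactly what naive PDF text extraction gets wrong.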

The transcribed output looks good to me and, anecdotally, provides a much more pleasant reading experience than the original PDF. As an example, I picked my first published paper, martin2014, rendered on sdoc here. It's also available on GitHub here.

Of course the real utility is in what you do next with the transcribed works. I've found that several papers ingested as markdown fit well within the context budget of the open weights models on OpenCode, so a natural next step is to ask for a Q&A document that answers all your questions, pointing to quotes or locations in the source text for posterity. What's particularly nice is that this isn't a hypothetical use case - even if LLMs don't improve at all from today, it works well, now, and has been a huge timesaver when tackling a large body of new literature.
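
That follow-up is another one-liner. The prompt below is only a sketch - the questions are placeholders, so substitute your own - but it shows the shape:

# answer a fixed question list against all transcriptions, citing sources
opencode run --model opencode/kimi-k2.5 "Read every .md transcription in this directory and write QA.md: for each paper, answer the questions (reported potency? which assay? commercial or bespoke? cellular or cell-free?), quoting the supporting passage from the source text."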