GlaxoSmithKline and Cerebras are advancing the state of the art in AI for drug discovery

Training complex epigenomic models with a previously prohibitively large dataset was made possible for the first time by a partnership with Cerebras Systems

By Kim Branson, Senior VP & Global Head of Artificial Intelligence and Machine Learning, Meredith Trotter, and Stephen Young, GSK.

By Natalia Vassilieva, Director of Product, Machine Learning, and Rebecca Lewington, Technology Evangelist, Cerebras Systems.

Artificial intelligence has the potential to transform the speed, sophistication and safety of drug discovery, yielding better medicines and vaccines. AI models can help us understand the biological “languages” that govern gene regulation and function. Understanding these languages is key to working out which protein in a cell a medicine should target.

AI plays a key role at GSK and we have invested heavily in the intersection of human genetics, functional genomics and AI. AI is what allows us to analyze and understand the data from genetic databases and this means we can take a more predictive approach. Strong evidence has shown that drug targets with genetic validation are twice as likely to succeed. [i] And that’s good news for patients because the more targets we can validate, the more potential medicines we can make.

An exciting result of this work is our new paper, “Epigenomic Language Models Powered by Cerebras”, which describes a novel technique that allowed us to train more sophisticated AI models for genetic data than was previously practical. To do this, we used some serious AI computing muscle: the Cerebras CS-1 system, which is powered by the largest silicon chip ever made.

Using epigenomics in drug discovery

Human beings are incredibly complex organisms: the human genome contains about 30,000 genes. We used to think of the genome as a complete blueprint, but that picture was incomplete. How those genes translate to about 200 different types of cells, which are organized into sentient, mobile beings made up of some 30 trillion cells, is far more complex.

We know that, for example, the immune systems of people that share the same genes don’t work the same way. Some people get sick and others don’t. And we know that the same medicine can affect people differently. The big question is, why?

The answer involves nuances such as the way our DNA is folded, which allows some genes to be expressed while excluding others. Different cells fold our DNA differently, wrapping it around proteins called histones. This means that in different cell types some sections of the DNA are open and some are tightly wrapped around these histones. Genes on the open sections can be expressed; genes on the sections wrapped around histones cannot. The cell determines which sections of DNA should be open or closed by modifying the DNA code. We call these modifications epigenetic (epi meaning “above,” and genetic referring to the genome), and we term the modified sequence the epigenome.

Unlike the DNA sequence, the epigenetic modifications the cell makes are reversible. Understanding the epigenome is key to understanding which genes can be expressed in which parts of the body. We need to understand the epigenome to help us interpret the genetic data we have in databases like the UK Biobank. These biobanks give us clues about which genes may be involved in a disease, and epigenetics helps us understand which cell types (e.g., skin, eye, liver) a gene may be expressed in. This information, along with other data, helps us work out what our medicine should do and which genes it should target to treat a disease.

Speeding up the process

Trying to write a computer program to accurately describe these intricate processes from first principles would be a Herculean, possibly even futile, task. Fortunately, AI gives us a shortcut. We have enough real-world examples of the effects of epigenomics to teach a computer to do the same thing, creating a model that can then be used to predict many important biological processes. This moves us closer to a digital twin of those processes.

It’s one of those happy accidents of science that the same algorithms used to build Natural Language Processing systems, which underpin search engines and machine translators, can also be used to model biological structures like proteins and DNA. For this work, GSK researchers are repurposing a family of neural network models called BERT (because “Bidirectional Encoder Representations from Transformers” takes too long to say). The new model is called “epigenomic BERT”, or EBERT for short.
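To make the language analogy concrete, here is a minimal Python sketch of how a DNA string could be turned into "words" and prepared for BERT-style masked-token training. It is an illustration under stated assumptions, not the tokenization or masking scheme used for EBERT, which this article does not describe. The function names `kmer_tokenize` and `mask_tokens`, the choice of overlapping 3-mers, and the 15% mask rate are all hypothetical choices made for the example.

```python
import random


def kmer_tokenize(sequence, k=3):
    """Split a DNA string into overlapping k-mers (stride 1).

    Assumption for illustration: each k-mer plays the role of a "word".
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Hide a random subset of tokens, as in masked-language-model training.

    Returns the masked token list and a dict mapping each masked position
    to its original token (the target a model would learn to predict).
    """
    rng = random.Random(seed)
    # Mask at least one token, but never more than are available.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets


tokens = kmer_tokenize("ACGTTGCAAC", k=3)
masked, targets = mask_tokens(tokens)
print(tokens)   # the sequence as overlapping 3-mer "words"
print(masked)   # same list with one token hidden
print(targets)  # position -> original token to be predicted
```

One caveat on this design: with overlapping k-mers, a single hidden token can largely be read off its neighbors, so a practical scheme would likely mask contiguous spans or use non-overlapping tokens instead.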

We know that more complex NLP models, trained using more data, give more accurate predictions. Our hypothesis is that the same will be true in our field: that epigenomic models will give us more accurate genetic validation than simpler genome-only models.

However, it’s an unfortunate fact of AI life that more complexity inevitably requires more computing horsepower. Much more. Until now, it has not been practical to train models using massive datasets. Using conventional computing systems comprising clusters of graphics processing units (GPUs) takes too long. And building bigger clusters doesn’t help much: there’s a law of diminishing returns at work that means that trying to go ten times faster might take hundreds of extra GPUs and a major reprogramming effort. Cerebras has a better way.

At the heart of the Cerebras system is the Wafer-Scale Engine. The Engine at GSK has a whopping 400,000 AI-optimized compute cores. They’re housed on one enormous chip, running one program. It’s not surprising this is faster than trying to break up a program among many smaller processors with long and slow communication paths between them.

As the paper says, “The training speedup afforded by the Cerebras system enabled us to explore architecture variations, tokenization schemes and hyperparameter settings in a way that would have been prohibitively time and resource intensive on a typical GPU cluster.”

How much faster? We were able to train the EBERT model in about 2.5 days, compared to an estimated 24 days with a GPU cluster with 16 nodes. This dramatic reduction in training time makes the new models actually useful in a real-world research environment, which is very exciting.
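As a quick check on what those figures imply, the short Python sketch below computes the speedup from the two training times quoted above. Only the 2.5-day and 24-day values come from the text; the helper name `speedup` is hypothetical, and the 24-day figure is itself an estimate.

```python
def speedup(baseline_days, new_days):
    """Return how many times faster the new system is than the baseline."""
    return baseline_days / new_days


gpu_days = 24.0      # estimated training time on the 16-node GPU cluster (from the text)
cerebras_days = 2.5  # reported training time on the Cerebras system (from the text)

factor = speedup(gpu_days, cerebras_days)
days_saved = gpu_days - cerebras_days
print(f"speedup: {factor:.1f}x, days saved per training run: {days_saved:.1f}")
```

In other words, the quoted times correspond to a speedup of a little under ten times, or roughly three weeks saved on every training run.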

Now that we have the compute horsepower to train our new models, we can test our hypothesis and ask: does the new model work? Does EBERT give us more accurate genetic validation? The answer, like everything else in our field, is complicated. The fine-tuned EBERT model achieved the highest prediction accuracy on four of the 13 datasets in an industry benchmark called ENCODE-DREAM. This is a strong performance, so the results are very promising.

Future work

What’s next for this work? More speed! Tests on the latest CS-2 system, which features more than twice the compute cores and memory of the CS-1, demonstrate double the pre-training throughput for the smaller EBERTBASE model. In addition, we’ve shown that the CS-2 will be able to pre-train the ambitious EBERTLARGE model at approximately the same throughput as EBERTBASE on the CS-1. We can’t wait.

To learn more about the study, read the paper “Epigenomic Language Models Powered by Cerebras” by Meredith Trotter, Cuong Nguyen, Stephen Young, Rob Woodruff and Kim Branson from the Artificial Intelligence and Machine Learning group at GlaxoSmithKline.

[i] Nelson et al. “The support of human genetic evidence for approved drug indications” in Nature Genetics, 2015. https://www.nature.com/articles/ng.3314