
Published as a workshop paper at ICLR 2019
GENERATIVE MODELS FOR GRAPH-BASED PROTEIN DESIGN
John Ingraham, Vikas K. Garg, Regina Barzilay, Tommi Jaakkola
CSAIL, MIT
ABSTRACT
Engineered proteins offer the potential to solve many problems in biomedicine, energy, and materials science, but creating designs that succeed is difficult in practice. A significant aspect of this challenge is the complex coupling between protein sequence and 3D structure, and the task of finding a viable design is often referred to as the inverse protein folding problem. We develop generative models for protein sequences conditioned on a graph-structured specification of the design target. Our approach efficiently captures the complex dependencies in proteins by focusing on those that are long-range in sequence but local in 3D space. Our framework significantly improves upon prior parametric models of protein sequences given structure, and takes a step toward rapid and targeted biomolecular design with the aid of deep generative models.
1 INTRODUCTION
A central goal for computational protein design is to automate the invention of protein molecules
with defined structural and functional properties. This field has seen tremendous progress in the past
two decades (Huang et al., 2016), including the design of novel 3D folds (Kuhlman et al., 2003),
enzymes (Siegel et al., 2010), and complexes (Bale et al., 2016). However, the current practice often
requires multiple rounds of trial-and-error, with first designs frequently failing (Koga et al., 2012;
Rocklin et al., 2017). Several of these challenges stem from the bottom-up nature of contemporary approaches, which rely both on the accuracy of energy functions to describe protein physics and on the efficiency of sampling algorithms to explore the protein sequence and structure space.
Here, we explore an alternative, top-down framework for protein design that directly learns a conditional generative model for protein sequences given a specification of the target structure, which
is represented as a graph over the sequence elements. Specifically, we augment the autoregressive
self-attention of recent sequence models (Vaswani et al., 2017) with graph-based descriptions of the
3D structure. By composing multiple layers of structured self-attention, our model can effectively
capture higher-order, interaction-based dependencies between sequence and structure, in contrast to
previous parametric approaches (O’Connell et al., 2018; Wang et al., 2018) that are limited to first-order effects.
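To make this concrete, the following is a minimal sketch of one layer of graph-restricted self-attention, in the spirit of the approach described above but not the authors' exact architecture: the class name, tensor shapes, single attention head, and the concatenation of node and edge features are all illustrative assumptions, and the causal masking required for autoregressive decoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class SparseGraphAttention(nn.Module):
    """Self-attention restricted to each residue's spatial neighborhood (illustrative sketch)."""

    def __init__(self, d_model: int, d_edge: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model + d_edge, d_model)
        self.W_v = nn.Linear(d_model + d_edge, d_model)
        self.scale = d_model ** -0.5

    def forward(self, h, e, neighbor_idx):
        # h: [L, d_model] residue embeddings; e: [L, k, d_edge] edge features
        # neighbor_idx: [L, k] indices of each residue's k nearest 3D neighbors
        h_nbr = h[neighbor_idx]                    # gather neighbor embeddings: [L, k, d_model]
        kv = torch.cat([h_nbr, e], dim=-1)         # join node and edge features
        q = self.W_q(h).unsqueeze(1)               # [L, 1, d_model]
        logits = (q * self.W_k(kv)).sum(-1) * self.scale   # [L, k] attention logits
        att = torch.softmax(logits, dim=-1)
        return (att.unsqueeze(-1) * self.W_v(kv)).sum(1)   # [L, d_model]
```

Because each position attends only to a fixed set of k neighbors rather than the full sequence, stacking several such layers composes local structural interactions into the higher-order dependencies discussed above.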
The graph-structured conditioning of a sequence model affords several benefits, including favorable computational efficiency, inductive bias, and representational flexibility. We accomplish the first two by leveraging a well-evidenced finding in protein science, namely that long-range dependencies in sequence are generally short-range in 3D space (Marks et al., 2011; Morcos et al., 2011; Balakrishnan et al., 2011). By making the graph and self-attention similarly sparse and localized in 3D space, we achieve computational scaling that is linear in sequence length. Additionally, graph-structured inputs offer representational flexibility, as they accommodate both coarse, ‘flexible backbone’ (connectivity and topology) and fine-grained (precise atom locations) descriptions of structure.
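As a rough illustration of where the sparsity comes from, the snippet below builds a k-nearest-neighbor graph from per-residue 3D coordinates; restricting each layer's attention to these k neighbors yields a per-layer cost of O(Lk), linear in sequence length L. This is a sketch under stated assumptions (one representative atom per residue, e.g. the alpha carbon, and an illustrative default k); the brute-force distance computation shown here is itself quadratic and is used only for clarity, whereas a spatial index would avoid it for long chains.

```python
import torch

def knn_graph(coords: torch.Tensor, k: int = 30) -> torch.Tensor:
    """Indices of each residue's k nearest neighbors in 3D (coords: [L, 3])."""
    dist = torch.cdist(coords, coords)      # [L, L] pairwise Euclidean distances
    dist.fill_diagonal_(float("inf"))       # exclude self-edges
    _, neighbor_idx = dist.topk(k, dim=-1, largest=False)
    return neighbor_idx                     # [L, k]
```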
We demonstrate the merits of our approach via a detailed empirical study. Specifically, we evaluate
our model on structural generalization to sequences of protein folds that were outside of the training set. Our model achieves considerably improved generalization performance over recent deep models of protein sequence given structure as well as over structure-naïve language models.