Aired:
July 25, 2024
Category:
Podcast

Using AI to Make Biology Easier to Engineer

In This Episode

Host Dr. Amar Drawid engages with Ankit Gupta, Head of AI/ML and Senior Director of Machine Learning Research at Ginkgo Bioworks, to delve into groundbreaking advancements in bioengineering beyond traditional healthcare applications. Tune in to discover how Ginkgo is leveraging generative AI and large language models (LLMs) to push the boundaries of bioengineering, particularly through its collaboration with Google's Vertex AI.

Episode highlights
  • Spotlight on Ginkgo Bioworks and its recent collaboration with Google to develop large language models that read DNA.
  • Gain insights into how these innovations are set to transform the field, offering rapid and predictable methods to program biology for diverse applications, from producing goods and ensuring food security to detecting emerging threats.
  • Delve into broader applications of bioengineering beyond healthcare.

Transcript

Daniel Levine (00:00)

The Life Sciences DNA podcast is sponsored by Agilisium Labs, a collaborative space where Agilisium works with its clients, ranging from early-stage biotechs to pharmaceutical giants, to co-develop and incubate POCs, products, and solutions that improve patient outcomes and accelerate the development of therapies to the market. To learn how Agilisium Labs can use the power of its generative AI for life sciences analytics to help you turn your visionary ideas into realities, visit them at labs.agilisium.com.

You're tuned to Life Sciences DNA with Dr. Amar Drawid.

Amar, we've got Ankit Gupta on the show today. For viewers not familiar with Ankit, who is he? Ankit is the head of AI/ML Advancement and Senior Director of Machine Learning Research for Ginkgo Bioworks. He joined Ginkgo through its acquisition of Reverie Labs, a company he built that used AI/ML tools to accelerate drug discovery. Ankit supports Ginkgo's growing AI/ML infrastructure and the future development and application of AI/ML-based techniques to engineer biology.

He was a founder and CTO of Reverie Labs, and he has a bachelor's and a master's in computer science from Harvard. And for people not familiar with Ginkgo Bioworks, what is Ginkgo Bioworks? Ginkgo Bioworks is a leading platform for cell programming. It provides end-to-end services that solve challenges for organizations across diverse markets, from food and agriculture to pharmaceuticals to industrial

and specialty chemicals. Ginkgo's biosecurity unit is building global infrastructure to help governments, communities, and public health leaders prevent, detect, and respond to a wide variety of biological threats. And what are you hoping to hear from Ankit today? Ginkgo is doing some interesting things to incorporate generative AI into what it does.

Notably, this includes a collaboration with Google's Vertex AI that seeks to harness its large language models. As biotechnology pushes beyond healthcare to virtually every industry, I'm hoping to get a sense of how AI can transform biotechnology into rapid and predictable means of programming biology to produce goods, provide food security, and detect threats. Before we begin, I just want to...

remind our viewers, if they'd like to keep up on the latest episodes, to hit the subscribe button. If you enjoy the content, hit the like button and let us know your thoughts in the comments section. And if you want to listen to the Life Sciences DNA podcast on the go, an audio-only version is available on your preferred podcast platform. With that, let's welcome Ankit to the show.

Ankit, thanks for joining us. We're going to talk today about bioengineering, Ginkgo Bioworks, and its recent collaboration with Google to build out large language models that read DNA. We've talked a lot on this show about the application of AI to drug discovery and other aspects of the biopharmaceutical value chain. But we haven't really explored bioengineering and the application of biotechnology beyond healthcare. Perhaps you can begin by explaining to listeners

What is bioengineering? Yeah, absolutely. It's a pleasure to meet you and a pleasure to be on here. I can start by talking a bit about what bioengineering is. The way I think about bioengineering is that it's about applying engineering principles to biology. And so it encompasses a very wide range of potential things. Ultimately, the goal is to apply those engineering principles to biology to make useful products. And what is interesting about the current state of biology is that a huge amount

of what people do and interact with and use and build involves a biologically engineered system, whether that's the food they eat, medicines they take, the chemicals that are used for their clothing or for natural processes. We have a wide variety of biological systems that are involved in what we do. And so fundamentally, I think about bioengineering as the sets of methods that allow us to build these types of products using a biological system along the way. And of course, there's more nuance to that.

But that's how I think about it.

Okay, so when you talk about that, we know a lot about using biology for making new medicines or medical devices. But outside of that, I've heard about things like bacteria that can clean up plastic. So what are some of the big applications that you see for bioengineering outside of

the drugs and the medical devices? Yeah, there's a huge number of applications. So, you know, maybe even just to begin with drugs and medical devices for one second, even within what drugs look like today, so many drugs have taken an incredibly different form from what drugs looked like 30 or 40 years ago, right? There was a time when virtually the only type of drug people took was small molecule therapeutics, where it was a chemical entity that was largely made through a chemical process being

dosed into a human being. That remains the case; it's still the dominant type of drug taken. And it's a fantastic modality for drugs, one I have a lot of experience in. But there are increasingly new modalities that look more like biologically engineered systems, where you have a small protein being delivered, and that protein itself was manufactured via a bioengineering process. You might have a cell therapy where you've re-engineered a cell itself to have a different behavior, like a CAR-T. You might have a gene therapy where you're delivering a gene into a human being or

into an ex vivo format, meaning into cells taken out of a human being. And that gene therapy may have been engineered via a bioengineered system itself. So even within medicine, we're seeing this sort of Cambrian explosion in the types of ways we can design medicines. And a lot of those look like bioengineering. Then outside of medicine, there are huge opportunities to use bioengineering as well. And Ginkgo is involved with many of these, for example in biosecurity. We think about how we can build tools that allow us to detect new pathogens

and be able to generate data sets that give us information about new variants of concern, for example. In agriculture, we work with Bayer, for example, on research and development around crops, to make more efficient and more effective crops. In food, being able to design new types of food that might allow us to have cultured meats, for example. And then also in sustainability: I think you mentioned an example of being able to clean up plastics. Another example would be an enzyme that you can use to

clean up and sort of ingest forever chemicals like PFAS. One of our collaborations works along those lines. And so there's an incredible number of ways you can build a biological system to solve a problem that's useful. Biopharma is really just the beginning of it, though it's an area with incredible value and one that I think about quite a lot. Okay. And what is the status of bioengineering in terms of transforming

the different ranges of industries that you talked about? Is it in the beginning stages or in advanced stages, and how is that setting the stage for a new bioeconomy? Yeah, great question. I mean, I think there are two aspects to that. One is to what degree each of these industries uses bioengineering at all, and two is to what degree they use it in an efficient way. And on the former, actually many of them are very mature.

Like I mentioned, in pharma there's a huge number of modalities that I would call bioengineered. In crops, it's not like we're the first people to think about how to make better crops; people like Bayer have been thinking about that for a very long time. And for a long time, we've been thinking about things like genetically modified organisms. I would consider those to be bioengineered systems for how we go about production. At the same time, a lot of how companies have approached this has largely been internal-facing R&D organizations that largely

focus on the one problem that is most important to them and solve that problem by engineering a particular system. And then when they've solved that problem, most likely dismantling the tools they built to engineer that particular system and moving on. A great example happens in therapeutics. We see this all the time, where people have built a platform to design therapeutics, and then ultimately they take their molecule into clinical development, and all of the company's money goes into clinical development because that costs a ton of money. It makes sense.

And so this is not actually a critique of the industry; it's a reflection of the reality of the industry, which is that people tend to make bespoke tools and carry them forward, and not necessarily build reusable systems in the way we think about an engineering discipline, where you build reusable systems that you can apply from one domain to the other. And that's really how we think about the opportunity in bioengineering: if we can really make this an engineering discipline, we should be able to have technology stacks that you can copy from one domain to another,

from one pharma company to another, and ideally from one vertical to another, where next generation sequencing is next generation sequencing. We should be able to use that for all sorts of different application areas. Okay. So you've started hinting at what Ginkgo is doing, but for people who are not familiar with Ginkgo Bioworks, can you explain a bit more about how it sits in the world of bioengineering and how

you are helping to drive this transformation across the various fields of bioengineering? Yeah, absolutely. Ginkgo's mission, as we say, is to make biology easier to engineer. And so we really think of ourselves as a bioengineering company; it's literally our mission. And fundamentally, when we think about what it means to make biology easier to engineer, it comes down to understanding biological systems, building tools that allow us to inspect and measure biological systems, and specifically building them in a way that allows us to make it

feel like engineering, meaning you have systems that are repeatable and composable, that can be used from one domain to another. And for us, what that looks like functionally is a significant amount of robotic automation we've built in our laboratory here in Boston; a massive investment in AI, which I can speak to much more, around a team that I'm leading to build out our capacity to build models for biological systems; and then a partnership-based business model that allows us to work with external parties across the world, across

all the industries I mentioned, to use those roboticized assays and those computational methods I mentioned, and combine them to ultimately solve problems. And so we don't have any therapeutic products of our own. Our job is to be an enabler for the industry, and to really push the industry towards one in which a startup doesn't have to build its own lab with the first $50 million it has. They can come to us instead, and we can help enable them and get them to their next set of milestones. So you are building these

engineering systems that your partner companies can then use. That's exactly right. And so a company may come to us and say: here's a protein that we've found; we want to make this protein 3x better along these two or three axes; Ginkgo, can you help us do that? And we'll apply our technology, our scaled-up assays, and our machine learning methods to approach that problem and solve it for that customer. And then increasingly, what our customers are coming to us for is just scalable data.

In the modern era of machine learning, customers are often very data hungry, and we're looking to provide them a kind of lab data as a service, where they can just come to us and say: here are the types of data I want at this particular scale. And we can just crank our robots, basically, and get that data for them. So when you talk about biological systems, are you using something like systems biology to create, let's say,

biological systems of specific microbes that can help with cleaning up pollution, or, when you're talking about medical devices, something more like a human organ? Can you give some examples of the types of biological systems you're engineering and how easy or hard that is? Because human systems are extremely complex. Can you tell us a bit about that?

Absolutely. We have a pretty significant history of working with a variety of microbial systems that have a lot of nice properties: they're relatively easy to engineer, they're relatively easy to modify, and they grow pretty fast. That has allowed us historically to build a pretty significant practice, especially around protein expression and enzyme optimization, in the context of those types of microbial systems. As an example, and this might be one of the ones I mentioned earlier, say we want to design an enzyme

that can clean up PFAS. I don't know the exact details of how we're approaching that particular problem, but I would imagine what that looks like ultimately is taking a particular microbial system, probably one that expresses a particular enzyme of interest, and then optimizing that system to express an enzyme that is better along the two or three axes that we care about. And we can do that because we can express a number of these systems at really large scale on our robotic automation, measure those readouts, and also guide that process with computational methods like machine learning.

And so there's a mix of factors that come in that allow us to express those. Now, as you're alluding to, as you move towards mammalian cell engineering, things get a lot more complicated. Mammalian cells are just more complex systems. They have more going on in them. They're harder to grow. They're often slower to grow. They often don't have the same properties that we're used to in microbial systems. But at the same time, understanding them is incredibly impactful, because as fellow mammals, they're very useful for solving problems in humans and related species. And so

we're now investing a significant amount of resources in being able to apply a lot of what we know how to do well in microbial systems to mammalian systems. As an example of this, we are investing really heavily in genetic engineering: being able to do large-scale, let's say CRISPR-based, knockouts of these cellular systems in a number of cell types of interest, and being able to use that to understand the effect of large-scale perturbations. And I think one of the things that's really interesting about that particular example is

the sort of paradigm shift that's happening in this industry around how people think about what you can do with some of these tools. In the past, people were very individual-target, hypothesis driven. They may have said: I want to know what happens if this particular gene in this particular cell type is modified. Today, the kind of AI-hungry companies may still be interested in that, and they should be if they're very target driven, but they might first be interested in something like: what if I were to knock out all of the potential genes, one at a time, in these five cell types of interest?

What can I learn from the readouts that I measure, with some kind of language model, let's say? And so we're thinking really hard about how to enable those kinds of companies, as well as build those types of models ourselves, so that we can really use models to ask these hypothesis-driven questions; use the model to say, it looks like these three genes might be implicated in a pathway together, and let's now do a targeted knockout and understand what those do.

Unfortunately, these mammalian systems are very challenging to scale up in that way, but that's one of the things we're dedicating a lot of resources to, so that we can tackle these really interesting problems in biology. So even if, let's say, we take these simple microbial systems, with our current engineering expertise, to what extent are we able to replicate and engineer that? When you say replicate that, what exactly do you mean?

So, the way these microbes work, the protein expression from them: can we build systems where we're able to generate, en masse, some of those microbes that can, let's say, solve a particular problem? To what extent are we able to create that system by engineering it in your lab and then scaling it up? What's the maturity there?

Yeah, I would say that aspect of the business is actually quite mature. That's very much been our bread-and-butter business for the last five to ten years. And I think one thing that's really interesting about that particular problem is that exact question of scale-up, right? How do you design a small-scale system, a small-scale experiment, that gives you good confidence about the much larger fermentation experiment that you might want to do to scale up? And so one of the things we often do is actually try to model

full-scale fermentation scaled down, so that it can be run at higher throughput. I forget what the names of all these tools are, but we have these capsules, basically, that simulate the environment of a full-scale fermentation tank. And it allows us to multiplex the readouts associated with a number of different conditions. So we can still do our arrayed understanding and mapping of a system, and understand what the impact of a particular set of genetic variants is, for example,

but do that in a context that we have high confidence will scale to a larger system. And so that's one of the areas where we've thought hard, in the context of microbial fermentation, about how we can enable that kind of scale.

Okay, wow. So you mentioned earlier that Ginkgo is very involved in the world of biosecurity. Now, over the last 20 years we've heard a lot about cybersecurity, but biosecurity might be a new concept for our audience. So can you explain what that means and what you are doing there? Yeah, totally. I mean, I think of biosecurity as the set of measures

that allow us to protect against the spread of infectious disease, whether in humans or, increasingly, in the animals or plants that a lot of diseases originally come from. So, really broadly, across the kinds of biological systems we care most about. And it includes practices like quarantine and hygiene and whatnot, but also surveillance systems that allow us to capture when there are variants of concern in a particular area and to monitor those. But critically, I think as a society we have learned a lot from

COVID-19, which showed how the places in the world that invested heavily in surveillance technologies were able to weather that storm substantially better than the ones that didn't. And we're really thinking about how we can apply the methods and the lessons from COVID-19 to whatever the next generation of threats are. Many of them will be natural. Some of them might be non-natural and driven by malicious actors. And a lot of the ultimate ability to deter those comes from being able to continuously monitor

and respond to those types of biological changes. Okay. And you talked about the surveillance systems; what other kinds of things do we need to do for biosecurity? I think of it as a few things. Surveillance is a massive part of it. When we say surveillance, I'm not talking about NSA-style spooky surveillance. I mean things like wastewater monitoring, right? One of the things we work on, for example, is literally at airports: we take the wastewater coming out of airplanes,

and we sample it in a way that is collective across lots of airplanes, so there are no individual human beings assigned to it. But as a result, you can take that anonymized data and get a hugely impactful understanding of the types of viruses that are spreading in a particular population, and kind of where they're coming from and where they're going. And so we've started to expand that particular business across the world. And I think if you look broadly at

the world's ability to detect this kind of thing, you might intuitively think that we must be monitoring for viruses everywhere. Like, of course, after what we've seen in the last half decade, we would be doing that. But in fact, we're not. Almost nowhere in the world is anyone actively monitoring the viruses that are spreading. And that's why, when a new virus pops up, we usually have no clue where it came from. So this is a big challenge, and that's where we're looking to expand our capabilities. And then on top of that,

I think there's the opportunity to build tools that allow people to respond to that. You might discover there's a virus, but can you actually predict, with reasonably high accuracy, how bad it is? How viral is it going to be? How deadly could it be? How different is it from things you know? How similar is it to things you've seen in the past? And can you use that to allow policymakers or NGOs or whoever to respond appropriately to those concerns? And the reality is there are new viruses and new mutants of things every single day.

The question is whether they're worth doing something about. Evolution can kind of sort itself out most of the time, and our immune systems can sort themselves out most of the time. But occasionally, you end up with COVID, or something much worse. You can imagine that if COVID were 30% more deadly and 2x more viral, it would probably have killed dozens of times more people. And so thinking about how we can prevent and detect those types of things is really important. Yeah. So let's say something like COVID, God forbid, were

to come as a pandemic in 2025. Do you think we are much better prepared now than in 2020? I would say no, we are not. I think the world is still incredibly under-invested in the mechanisms to detect these types of threats, and in the political courage to respond to them. I think we're better off in that, after the last five years, we all at least have a

clearer memory of what happened, and there may be more willingness to respond at a faster speed. But in terms of whether we have built the fundamental monitoring infrastructure at sufficient scale that we could reasonably expect to mitigate it? Not even close. I think there's an incredible opportunity to expand that, and I'm excited we're playing a part in it, but ultimately it will take a commitment by the world's governments to make that happen. Yeah, absolutely. That makes sense. And it's kind of scary to say that

we are still not prepared, right? Even after this experience, we should be investing in this area. Moving to another thing that you said, which was about data. One of Ginkgo's big strengths is the data that you have amassed. Can you give us some sense of the scope and the scale of the data that you have and how that differentiates you?

Absolutely. So I would say there are two aspects to what I think of as the strength of our data. One is the data we have, and two is the data we can generate. And I'll talk a little bit about both of those, because I think they're important in different ways. One is the data we've generated over the years from our own programs, for the programs where we're allowed to retain data, which is a significant portion of them, as well as data that we've sourced from a variety of partners. We have a very large collection of data about

protein sequences, DNA sequences, and what DNA expresses in what systems, effectively. And to give you a sense of scale, we have many billions of protein sequences occurring in the wild, for example. So there is an incredibly rich pool of data that we could use, and that we are using, to train really large-scale machine learning models that allow us to understand some of these core, fundamental biological systems. At the same time, my belief is very much that

the future of biology will involve a lot more data than the present of biology. If you look at the trend lines in this field, we're inventing new ways to inspect biological systems every day. I think it's pretty clear that there's much more complexity there than the existing data that anyone has could really capture. And so we're thinking hard about how we can build really scalable data generation tools. In biology, there's always going to be a physical measurement aspect to what you're doing that is going to limit what you can do. But we think hard, from first principles, about

what are the things we can actually scale up and get tons of data about. And I think that's going to be really important for the next generation of biology, which I think will be driven much less by people generating data for whatever their programs were, and much more by people generating data purely for the purpose of building models on top of it. I think that's going to be a really exciting period of exploration and scientific discovery. And I'm assuming that you're generating a lot of data in all the different types of model systems that you're dealing with.

That's exactly right. Yeah. OK, that's data. What about machine learning? How are you using machine learning to make engineering biology easier? Yeah, this is definitely very near and dear to me, because it's what I think about pretty much every day. I think of what it takes to make a machine learning system work well and have impact as a sort of trifecta of properties. You need to have a lot of data.

More importantly, your marginal cost per new data point needs to be low. And it has to be the case that if you can generate data for a relevant property, you can use the resulting model for something interesting and useful. If you have all three, then I think there's an opportunity to build some really interesting things. And that is how we, as a machine learning team, approach the problems we work on: we ask, where do we have data? And perhaps more interestingly, what is the marginal cost of a new data point relative to the complexity of the thing we're trying to model?

So as an example, in small molecules, the world our previous company came from, data sets are very expensive to generate. Synthesizing a new small molecule, best case scenario, costs several hundred dollars. If it's an on-demand molecule, it's probably more like several thousand dollars to make one molecule. And on the flip side, small molecules are an incredibly complex landscape to understand computationally; they have lots of weird properties, like you can flip one atom and it has a completely different effect. And so it's almost the most challenging domain to work in:

the most complex and the most expensive per data point. Conversely, in certain types of genetic engineering and certain types of protein engineering, you have incredibly high-throughput assays that can generate very large amounts of data. For example, making tons of genetic edits and then measuring something like RNA expression profiles, or doing something like phage display to generate antigen information. There are a number of these where you have really large throughput in what you can do, and similarly with proteomics readouts too. And so

we're now thinking from first principles about what data sets we can generate at the largest scale, on a per-dollar basis, relative to complexity. And sometimes the data you might want to generate for training machine learning models is not the same as what a biologist might care about. You might be okay with slightly more noise or slightly less accuracy in exchange for way more data scale. And so we're really thinking hard about what those data sets look like. And I think where I'm most excited is

in protein engineering and in genetic medicine design. I think there's incredible opportunity to design better proteins for all sorts of properties. In antibody engineering there's a lot of excitement right now, for example. And then in genetic medicine design, being able to design RNA therapies, for example, is a really exciting domain where a lot of the pieces are there for all the machinery to make sense. And hopefully we'll have a lot to share along those lines very soon. The last thing I'll say is that a big area we think about as well is how we can enable other AI-driven companies.

In the current world, like I was mentioning a little earlier, a very small fraction of dollars is spent on data for machine learning in the broader biopharma pool of dollars. Probably less than 1% is spent on that. Almost certainly less than 10% is spent on discovery relative to clinical, and within discovery, a really tiny fraction is spent on data for machine learning models. In the last couple of years, though, if you look at the trend lines we're seeing, there are incredibly powerful machine learning tools

that make biology feel like much more of a computationally driven design discipline, in some subsets of it at least. A great example is a particular model called RFdiffusion, in the context of protein engineering. It is one model of a class of models that is making a lot of people excited about the potential to use machine learning to design protein therapies. And what that means is that if the bet is that the future companies in pharma, in protein engineering and biologic design and antibody design, are going to be

machine learning driven, it follows that they're also going to be very data hungry. And we're thinking hard about how we can enable those types of partners. In this case, we're not competing with them. We're looking to be the data provider, the kind of Scale AI to their Tesla or OpenAI, so that they look to us to define those types of experiments and scale them up effectively. And so that's where I, leading the ML team, am working really closely with our mammalian engineering teams

to think hard about what that kind of joint data and machine learning package might be, to appeal to some of these machine-learning-driven companies. And I think that's really exciting. I think the future of pharma especially is really exciting in that domain. Yeah. So in 2023, Ginkgo announced a partnership with Google Cloud to build a generative AI system for engineering biology and biosecurity. What role do you see generative AI playing in accelerating this transformation?

Yeah, absolutely. I mean, I think there are incredible possibilities with generative machine learning. So first, on the partnership itself: the critical thing this partnership enables us to have is access to immense amounts of compute, both GPU and TPU compute, that allows us to really scale up our ability to train really top-in-class machine learning models. One thing that's notable is that most academic labs right now do not have the resources, and frankly most pharma labs probably don't have the resources, to train

the largest types of machine learning models that are making the biggest difference. One of the most popular open-source protein models, called ESM, was trained by Facebook, of all people. And so I think this is where we see a real opportunity to make a huge impact in our space: by being able to take our biological insights, our data availability,

and also new data we can generate, and combine that with tons of compute to generate lots of useful systems. And working with the Google team has been great in terms of getting access to their experts to really think about how we can, for example, scale up training on TPUs, which has been a lot of fun. In terms of the broader question of generative AI, I think if you look at what has made those machine learning models really effective, a lot of those aspects apply very effectively to the life sciences domain. You're often dealing with sequences of letters,

right? In language, it's text; in biology, it's usually DNA or protein sequences that you're working with. You have a lot of data: we have billions of them, just like we have billions of text sequences. And a lot of what we do (I shouldn't say most) in biological engineering is sequence design. Designing a DNA sequence or an amino acid sequence is ultimately what we often want to do to express a particular system. And so

that is where all the pieces look right. Now, if you look more carefully, it's quite a bit more complicated, right? DNA has a complex 3D folding structure that involves long-range dependencies between elements that are very far apart. Text sort of has that, but not really to the same extent, and not with the same kind of context windows. And so you often have to innovate to think about how you can apply some of those models from different domains to the domain of biology. Sometimes they just transfer. A lot of times you have to think about new innovations to make them actually work.

I'm very curious. You're pioneering these large language models in areas, as you said, like genomics, protein function, and synthetic biology. We have so far seen these large language models used for English, but now you're talking about using them on the language of DNA, which is only four letters, right? A, G, T, C. Or even for proteins, it's 20 letters, the 20 amino acids.

I'm very curious to know how these large language models that you're training and using work for this very limited-vocabulary type of sequence. Can you tell us more about that? Yeah, so I guess there are sort of two parts to that. On the one hand, in a sense, nothing really changes. You can have a smaller vocabulary, but nothing really breaks.

Inherently nothing breaks, especially if you have enough data to understand the underlying patterns. That said, it's worth noting that even in English, while we often use words, there are really usually these kinds of subword featurizations that people use. Oftentimes people have character-level tokenizers too, which actually do shockingly well. And even in that case, that's a language of 26 characters, and you'd be surprised at the power of these models to make powerful predictions even with a relatively small vocabulary. That said,

as you are correctly alluding to, this can be quite challenging as well. And so there are a number of ways you might deal with this. You might have n-gram featurizations, where you're actually tokenizing at the level of multiple characters. There are ways you can be even more clever and tokenize at the level of other information we have, like gene-level information, for example, tokenizing particular substrings of a protein in a smart way. But I think one of the big lessons of the generative machine learning world in text

is that often taking a simpler approach to your representation and scaling up the amount of data you have is better than doing a lot of manual feature engineering with a very clever, smart representation that doesn't necessarily transfer well. As an example, the ESM model from Facebook, the protein model I was mentioning earlier, operates on amino acid characters directly. And it's basically just a transformer. There are some small tweaks, but it's basically a BERT-style transformer. And so

it shows that even in a very complex domain, you can get really great results by basically taking the same architectures and applying them here.
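
To make the tokenization point concrete, here is a minimal sketch (our illustration, not Ginkgo's code) of the two simplest featurizations mentioned above, character-level tokens and overlapping n-gram ("k-mer") tokens, over a DNA sequence. The vocabulary and function names are hypothetical.

```python
# Minimal sketch (illustrative, not Ginkgo's code): two simple ways to
# tokenize a DNA sequence before feeding it to a language model.

DNA_VOCAB = {ch: i for i, ch in enumerate("ACGT")}  # the 4-letter vocabulary

def tokenize_chars(seq):
    """Character-level tokenization: one integer id per nucleotide."""
    return [DNA_VOCAB[ch] for ch in seq.upper()]

def tokenize_kmers(seq, k=3):
    """n-gram ('k-mer') featurization: overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ATGGCGTACGTT"
print(tokenize_chars(seq))       # [0, 3, 2, 2, 1, ...]
print(tokenize_kmers(seq, k=3))  # ['ATG', 'TGG', 'GGC', ...]
```

The same pattern extends to proteins by swapping in the 20-letter amino acid alphabet.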

Interesting. But then as you train on, let's say, these DNA sequences, are the large language models able to learn from them that, OK, here are transcription factor binding sites for a specific transcription factor, or here are regulatory regions, or here are intron-exon boundaries? What information is it able to learn? And then, as it's doing the generation, how can you ask it to basically spell out sequences for specific functions? Yeah, there are a few things I can speak to there. One is, I think of a lot of the examples you gave there, like candidate transcription factor binding sites, as ways we might design a validation experiment, as opposed to

something I can explicitly prompt it for as a result of training. I may not have a really great way to say: give me a transcription factor binding site. Because the model doesn't have any understanding of English; it's only looked at DNA sequences. But I might hope that it can do something like this: I take a general-purpose DNA foundation model and then fine-tune it on data about transcription factor binding, let's say,

where a classic data set might look like: you have a crop of DNA, and you need to predict where on that DNA sequence the transcription factor binding site is. You would hope that a model that has been pre-trained on a larger data set would do better on that downstream task than one that has only seen the downstream task data, presumably because it has some broader understanding of the nature of DNA sequences that allows it to do better with smaller amounts of fine-tuning data.
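
As a hedged sketch of that pretrain-then-fine-tune pattern, assuming PyTorch: a small transformer encoder stands in for the pretrained DNA foundation model, and a per-position classification head is fine-tuned to predict which bases fall inside a binding site. All names, dimensions, and data below are hypothetical stand-ins, not Ginkgo's actual models.

```python
# Hedged sketch of the pretrain-then-fine-tune pattern (hypothetical names
# and sizes, not Ginkgo's models): a transformer encoder stands in for a
# pretrained DNA foundation model; a per-position head predicts whether
# each base falls inside a transcription factor binding site.
import torch
import torch.nn as nn

class DNAEncoder(nn.Module):
    """Stand-in for a pretrained DNA foundation model."""
    def __init__(self, vocab_size=4, d_model=64, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        return self.encoder(self.embed(tokens))   # (batch, seq_len, d_model)

encoder = DNAEncoder()
# In the real pattern, the encoder's weights would come from large-scale
# pretraining, e.g. (hypothetical file name):
# encoder.load_state_dict(torch.load("pretrained_dna_lm.pt"))
head = nn.Linear(64, 2)  # per-position label: inside binding site or not

tokens = torch.randint(0, 4, (8, 100))   # toy batch: 8 sequences, length 100
labels = torch.randint(0, 2, (8, 100))   # toy per-position labels

logits = head(encoder(tokens))           # (8, 100, 2)
loss = nn.functional.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))
loss.backward()                          # fine-tunes encoder and head jointly
print(float(loss))
```

The bet described above is that the pretrained weights let this fine-tuning succeed with far fewer labeled examples than training from scratch would need.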

That is one way we go about measuring those types of things, and we are very deep in those experiments. I think we'll have a lot of exciting things to report along those lines. But that is exactly the promise that makes us excited about these types of models. One thing I'll add is that a huge thing that's challenging about DNA-specific models is context. With proteins, there's sort of an assumption that an amino acid sequence will fold pretty much the same way regardless of cell context. I guess there's

thermostability and stuff, but largely speaking, under normal conditions, we'd expect it to fold similarly whether a particular amino acid sequence is expressed in one cell type or another. That is definitely not the case for DNA. We know for sure that is not the case for DNA. There's incredible structural variability of DNA from one cell type to another, and that is believed to be what defines the differences between those cell types. That's our modern cell theory. This poses some challenges:

how do you encode that cell-type-specific information into these sequences? And I'll be totally frank with you: this is something we're trying very hard to work out now. If you look at the DNA foundation models in the open literature, they're also grappling with this question. Many of them don't have cell-type-specific information, or they focus on lower species like prokaryotes, where there's lower variability to encode, as opposed to the full diversity of mammalian cell systems. And so obviously for us, thinking about how to engineer mammalian cells,

it's really important to think about how you encode that cell-type-specific information. And that is where we believe some focused data generation can be incredibly powerful. If we can modulate and sequence the results across lots of cell types of interest, in lots of interesting ways, we should be able to use that to generate useful machine learning models. Okay. And so then, as you said, it'd be great to have the structural information there.

What about functional information? Have you seen that supplementing your large language models with functional information gives much better results when you're generating sequences? Great question. The challenge with most functional readouts (it depends on what kind of functional readout you're talking about, of course) is that often we don't have the data scale to train a large language model with functional readouts alone. For example, take an

antibody thermostability readout, let's say. I might have a thousand, or if I'm lucky, 10,000 or 100,000 data points about something like antibody thermostability. I probably don't have a billion. I certainly don't have a trillion. And this is where it gets really challenging to directly train a large language model with data like that. But again, to my earlier point, this is where pre-training on one of these more sequence-driven data sets

can be really useful for doing well on these types of downstream tasks. And that's actually exactly how we evaluate our large language models: we come up with the functional readouts we care most about, often framed as downstream tasks, and those are how we gauge the progress of our pre-trained model training. And what you generally find, and what we are looking to do more and more of, is build these types of pre-trained models that allow us to do better at these functional tasks. Because ultimately, my ability to predict the next character in a DNA sequence is not really that useful if I can't design

a useful DNA molecule with it. And so thinking about those functional readouts ends up being very important.
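
One minimal way to picture "evaluating a pre-trained model by its downstream functional tasks", under our own assumptions rather than anything Ginkgo specified: freeze the pretrained model's embeddings and fit a simple probe on a small functional data set, using the cross-validated score to compare pre-trained models. The data below is synthetic; only the evaluation pattern is the point.

```python
# Minimal sketch (synthetic data, hypothetical setup) of scoring a
# pretrained model by a small downstream functional task: freeze its
# embeddings and fit a simple probe on, say, antibody thermostability.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_seqs, d_embed = 2000, 128  # a few thousand labels, not billions

# Stand-in for frozen, mean-pooled embeddings from a pretrained protein LM.
embeddings = rng.normal(size=(n_seqs, d_embed))
# Fake melting temperatures (deg C) standing in for the functional readout.
tm_values = rng.normal(loc=70.0, scale=5.0, size=n_seqs)

# Cross-validated R^2 of a linear probe serves as the downstream-task score;
# comparing this number across pretrained models gauges pretraining progress.
scores = cross_val_score(Ridge(alpha=1.0), embeddings, tm_values,
                         cv=5, scoring="r2")
print("mean R^2:", scores.mean())  # near 0 here, since the data is random
```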

Okay. Now, one of the things that Ginkgo has said, especially about the partnership, is that it reflects the company's conviction that AI tools and biological data should be developed in tandem. Can you explain what that means? Sorry, in tandem with what? Maybe I missed that. AI tools and biological data: as you're generating biological data, the AI tools should be developed at the same time. I'm just wondering why that is. Got it. I think the way I think about this is that we often think about machine learning models as

trained on data that already exists, as opposed to data intended for training those machine learning models. And to some degree, that's true. When you look at what OpenAI did, for example, it's true that they started by training on a corpus of the entirety of the internet. But then, critically, they also did reinforcement learning with human feedback to iterate on that model and get it to produce human-like responses. And there's incredible impact you can have by having focused data sets for your biological

domain of interest, especially in biological domains where the data that's out there, while interesting, often has a wide variety of biases. The data doesn't look like independent and identically distributed samples from a Gaussian distribution across all of data space; it tends to look like focused samples around the particular little nuggets where people found the most biological interest. But what's most biologically interesting to a human being is not necessarily the most informative for a machine learning model. And so when we think about

data generation and AI together, we really want to be thinking about the types of data that allow a model to capture the underlying data distribution well. What we've learned in the last few years is that these generative AI methods especially are really good at generating samples from a given data distribution, which means that if your data distribution is biased in some particular way, the model is going to be good at generating samples from that biased data distribution, not the one you might want. And so we think really hard about how to build

data that follows the distribution that you actually want to be able to generate samples from. And I think that's probably the best way to think about why it's really important to think about those in tandem: the model is really good at replicating whatever data distribution you give it.
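
A toy illustration of that last point (ours, not from the episode): if the "generative model" is just an empirical sampler over a biased training pool, its generations mirror the bias exactly.

```python
# Toy illustration (ours): a generative model gives back the distribution
# it was trained on. Here the "model" is just an empirical sampler; a
# biased training pool yields equally biased generations.
import random
from collections import Counter

# Biased "training set": GC-rich sequences over-represented 9 to 1.
train = ["GCGCGC"] * 900 + ["ATATAT"] * 100

def generate(n):
    """Stand-in generator: sample from the empirical training distribution."""
    return [random.choice(train) for _ in range(n)]

print(Counter(generate(1000)))  # roughly 9:1 GC-rich, mirroring the bias
```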

Yeah, absolutely. And so you're collaborating with Google Cloud's Vertex AI platform. What is Vertex AI? Vertex AI is, sort of, Google Cloud's system for model building and development. We actually work with Google Cloud pretty broadly, so Vertex AI is one component of where we work with them, in the context of data scientists having access to a Jupyter notebook. Much more so, actually, we're working closely with them on TPU-based training. And so we have a really strong relationship with them that we're looking to build and expand. And I think being able to take models built in PyTorch, so the same kind of

engineering system that everyone else uses, the same libraries that everyone else uses, but then have them compile so they can run on Google's TPUs at really significant scale in a very cost-effective manner, ends up allowing us to train much bigger models than we would otherwise have the budget to train. Even for a company with a really large, multimillion-dollar budget, it's very easy for the compute bill to 5x and become too big. And so we're really looking to maximize what we can do with those types of systems. And so how do you expect generative AI to change bioengineering in the long term?

I mean, in the long term, I think a lot of bioengineering will look like generative machine learning. Machine learning has already been very central to biology; it's not like we're the first people to do machine learning in biology. Hidden Markov models have been part of biology for a very long time. As I mentioned earlier, a lot of engineering biology looks like engineering sequences. It looks like designing sequences. And as we've seen in other domains, especially if you start to think about

new types of generative design techniques like diffusion models, there's a wide variety of ways to do conditional generation of sequences. And that is a really common set of tasks that we care about in biology: I want to generate a sequence that matches this data distribution but has these conditions, such as it folds in this way, binds to this particular target, expresses to this degree, and so on. And what's very exciting about generative machine learning is that it allows us to

define both parts of that: sampling from a particular data distribution, and conditioning in a particular set of ways. And I think if we can do that well, it makes sense that a lot more bioengineering is going to involve generative machine learning.
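
For intuition, here is a deliberately crude sketch of conditional sequence design, our illustration rather than Ginkgo's method. Real systems condition inside the model (for example, guided diffusion); the simplest stand-in is to sample from an unconditional generator and keep only the sequences that pass the condition checks.

```python
# Crude sketch of conditional sequence design (ours, not Ginkgo's method):
# sample from an unconditional generator and keep sequences that pass the
# conditions. Real systems condition inside the model instead.
import random

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

def sample_sequence(length=50):
    """Stand-in for a draw from a trained generative model."""
    return "".join(random.choice(AA) for _ in range(length))

def satisfies_conditions(seq):
    """Stand-in property predictors: toy net-charge and cysteine constraints."""
    charge = sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")
    return 0 <= charge <= 3 and seq.count("C") <= 2

designs = [s for s in (sample_sequence() for _ in range(10_000))
           if satisfies_conditions(s)]
print(f"{len(designs)} candidates satisfy all conditions")
```

In practice the sampler would be a trained generative model and the conditions would be learned property predictors; rejection sampling just makes the "sample, then condition" decomposition visible.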

Okay, and do you think that, in the long term, we'll be able to program biology the way we program computers? I hope so. I mean, I think there are components of biology that will always be challenges, as much as I would love to just say: hell yeah, for sure. I also understand that many biologists would listen to that and kind of roll their eyes. I think biological systems are incredibly complex. What I can say for sure is that we're at the start of an incredible wave of data about biological systems. Every two months, there's a paper in Nature Methods, like: here's another way we can, at scale, generate a pooled readout of this complex biology in this particular way. There was one recently about

using AAV to do edits in vivo inside a mouse and do a pooled readout of the responses, for example. The point being that we have increasingly powerful ways to understand existing biological systems. And at the same time, there's a revolution happening in ways to design biological systems, express them at scale, and measure the responses. And so it follows that we are going to get much better at making it feel like a programming discipline. Now, will it literally be deterministic

code that I write that always functions the same way? I think we're some ways away from that. But will it be a world where I can think of a biological system as consisting of components that are each engineerable? That's exactly the kind of mission we're setting out to accomplish. So I'm not sure if it's going to be computer code, but I would love it if it got closer. Looking forward to that. So that was Ankit Gupta, head of AI/ML and senior director of machine learning research for Ginkgo Bioworks. Ankit, thank you for your time today.

Absolutely. Thanks so much for the time. Well, Amar, what did you think? It was pretty interesting to hear from him about the different applications of bioengineering and the different industries where it can be used. We talked about biosecurity quite a bit; coming out of the pandemic, we are very well aware that we need to do something about it. And there's also just the incredible

transformation that is happening there in terms of now using generative AI and large language models, applying them to gene and protein sequences. I mean, that's definitely a new world. You asked about Ginkgo's data, and Ankit talked about data from its own programs and partners that it can use to train large-scale AI systems. In the world of AI, does the value of data change?

I mean, I think many of these newer techniques in AI, which are more neural network-based and transformer-based, including all these large language models, just require more and more data. I know they're giving us incredible results. But one of the reasons why some of these techniques were not developed 20 years ago and are being developed now is that there just wasn't enough data 20 years ago,

and limited data was the limitation for the progress of neural networks 20 or 30 years ago. But now, because so much data is becoming available, these neural network-based, or deep learning-based, techniques are really leading the way. So yes, data is essential. An incredible amount of data is needed, and that's just going to continue. And then,

as the models get more sophisticated, I believe we're just going to continue to need more data. Well, it does seem to feed on itself. You talked about the changing nature of machine learning and protein design and how that's going to make companies more data hungry. What did you think about that? Yeah, that's true. And I would also add one thing to what he said about the data that needs to be produced.

We need to produce more and more of the data that is needed to train the machine learning systems. And this is something where, I believe, in the pharma industry what I still find is that people just generate data and say: okay, use it for your machine learning systems. That's not the best way of doing it, because for machine learning systems, you're looking for certain types of data, certain high-quality data, that is needed

to solve a specific problem. So I think this paradigm is going to start changing: we're going to start generating more and more of the data that is needed for the machine learning system, rather than just what the biologist is looking for. We often talk about data, the right data, and the quality of data. One of the things Ankit talked about with regard to the Ginkgo-Google agreement is

the massive amount of computing power that it will give Ginkgo to train machine learning models for generative AI. How rate-limiting is a lack of computing power for academic labs and pharma labs today? Is that an issue? I don't think the pharmas right now have access to the kind of computing that's needed if they want to do large-scale identification of new drugs and new targets.

The pattern I'm seeing with the pharma companies is that they are actually very actively forming partnerships with some of these large computing powerhouses so that they get access to the compute. The pharmas have realized that they need that access. And I've seen that quite a bit over the last three or four months: pharmas

forming partnerships with companies like Nvidia so that they start getting access to this computing power. So they don't have that kind of computing power today, and they know that. But I think we're moving in the right direction, where they will get access to it through these kinds of partnerships. You also asked about the application of natural language models to DNA and proteins. How do you think that's going to change what's possible with AI?

It will change a lot. My PhD was in computational analysis of genomic sequences, using traditional machine learning methods. And of course, that involved a lot of natural language processing, because sequences are a language and you're processing a lot of that text. But now what we're seeing is that, instead of traditional

natural language processing, or NLP, as we use the large language models, we're seeing a tremendous amount of new generation that was not possible with traditional natural language processing, in any of the fields that I've seen. So I think the same is going to be applicable here: beyond what we were able to produce with traditional natural language processing, we're going to be able to see a lot more

insights generated. I'm actually very curious to hear how these large language models learn about the different aspects and functions of gene sequences or protein sequences. I believe some of these new insights will be things that we don't know yet. So it's going to be interesting to see how they train the large language models and how the large language models are able to

gain insights from these sequences, provide those insights to us, and then create new sequences with those insights, so that you can see some new functional modalities that will really help in creating not only new drugs but also the different aspects of bioengineering that Ankit talked about. So I do believe there's tremendous potential in using the large language models,

to really gain insights that we have not been able to get with the traditional techniques. It's exciting to see how Ginkgo is pushing generative AI, not just in medicine, but beyond that into all sorts of new applications for biotechnology. Looking forward to our next discussion. Absolutely. Thank you, Danny.

Thanks again to our sponsor, Agilisium Labs. Life Sciences DNA is a bi-monthly podcast produced by the Levine Media Group with production support from Fullview Media. Be sure to follow us on your preferred podcast platform. Music for this podcast is provided courtesy of the Jonah Levine Collective. We'd love to hear from you. Pop us a note at danny@levinemediagroup.com. For Life Sciences DNA and Dr. Amar Drawid,

I'm Daniel Levine. Thanks for joining us.

Our Host

Dr. Amar Drawid, an industry veteran who has worked in data science leadership with top biopharmaceutical companies. He explores the evolving use of AI and data science with innovators working to reshape all aspects of the biopharmaceutical industry from the way new therapeutics are discovered to how they are marketed.

Our Speaker

Ankit came to Ginkgo through its acquisition of Reverie Labs, a company he co-founded and served as CTO. At Reverie Labs, he leveraged AI/ML tools to accelerate drug discovery. At Ginkgo, Ankit supports the company’s expanding AI/ML infrastructure and drives the development and application of advanced AI/ML techniques in bioengineering. He holds a bachelor's and a master's degree in computer science from Harvard, showcasing his extensive expertise in the field.