The Computational Biologist - a short story

A new beginning
It was Becky's first week at MidFerment Corp., which she'd joined as a computational biologist on a small team of five, herself included, within a larger organization of around 200 employees, mostly wet lab. She'd been headhunted for the position after her boss saw her presentation at a conference. After a few days of onboarding and project-management training, she was ready for her first lead project.
The big project
The company had been developing an internal pipeline to bring functional proteins to market faster. Given a target protein function, the pipeline would return a protein sequence having the desired function, and the strain and media conditions to achieve a high titer of that secreted protein.
A separate team led the downstream processing (DSP), although they expressed interest in any additional information about what byproducts to expect. Very high priority was placed on ensuring protein secretion, to make DSP more cost effective.
Becky was tasked with developing a project charter and leading the entire pipeline development. Nate, the existing bioinformatician, would support genomic data processing and software development.
The problem was, Becky had never run a project quite like this before. Sure, she'd worked with various machine learning models built on protein sequences, and had taken a graduate course on genome-scale metabolic modeling. She could also perform differential expression and pathway enrichment analyses on transcriptomic data.
But this was the first time she'd had to develop a pipeline end-to-end. She also didn't know which of the techniques from grad school would apply at each step of the pipeline. It was daunting.
Then, she remembered the ebook she'd picked up a week ago, "The Computational Biologist". She'd received an early Kickstarter edition, with the full release to include working code samples and more chapters. (She thought the title could be pithier but guessed the author would find a better one upon official launch.)
The Trinity of Techniques
She flipped through the PDF, and rather quickly, an early chapter caught her eye:
Chapter 1: Anatomy of a computational biology pipeline
We can distill computational biology to three approaches, no matter the goal.
This we refer to as the trinity of techniques (or just the trinity):
1. Design optimal experiments: collect the most informative data in the most intelligent way possible.
2. Use a data-driven model: leverage data to make informed decisions.
3. Use a mechanistic model: leverage known mechanisms to make informed decisions.
Wow, she thought. That's simple enough.
She started by breaking down the pipeline into two big goals:
- Given function → design protein sequence
- Given protein sequence → design strain
At each component, she applied the trinity of techniques:
- Given function → design protein sequence
- Design optimal experiment: appropriate if high-throughput protein expression and a rapid screening assay are already developed. An assay was available, but with a throughput of only 20 proteins per day.
- Use a data-driven model: appropriate if either lots of data could be generated for training, or a database already exists. A quick literature search revealed several databases mapping natural protein sequences to function.
- Use a mechanistic model: appropriate if physico-chemical mechanisms known, and a model exists. Becky knew docking simulators were available but wasn't sure if the extra fidelity from physics-based simulation was appropriate for the initial screening stages.
- Given host cell and protein sequence → design strain and media
- Design optimal experiment: the lab could construct and grow 4 strains per day. After 3 days, the titer was measured.
- Use a data-driven model: Becky knew of several published machine learning models that predicted expression from sequence. She wasn't sure if they would work in the context of their in-house strains.
- Use a mechanistic model: annotated genome sequences of several host strains were available. Becky also knew that many of the host organisms had close relatives on NCBI, and suspected that published metabolic network models for either the in-house organism or a close relative would be available.
From this initial overview, Becky could now begin planning out an angle of attack.
Function to sequence
Becky had already decided that some kind of predictive model would be built. It wouldn't make much sense to collect data, test performance, keep the best sequence, and then start from scratch for every new target function. There needed to be a model that would learn from every experiment and get increasingly accurate at designing sequences that worked.
Experiments would inevitably be performed when testing candidate sequences. The question was, how many of the 20 proteins per day would be reserved for improving the model versus testing protein function? If too many samples were used strictly for model training, that might seem 'wasteful' without some guarantee that training the model would eventually improve performance.
Meanwhile, constructing only the samples predicted to perform best, without addressing the model's underlying uncertainties, could let those uncertainties accumulate and prevent the model from ever becoming truly predictive.
This was a challenging dilemma. Especially since Becky needed to present a strong case for allocating limited resources either way.
Luckily, the book had a chapter dedicated to Designing Experiments.
The chapter read:
There are three key techniques a computational biologist should be familiar with:
- Active Learning
- Optimal Experiment Design (OED)
- Bayesian Optimization
They all share the same goal:
achieve your project outcome using provably fewer experiments.
She particularly liked the phrase "provably fewer experiments".
This terminology suggested the author understood her predicament – not only developing a pipeline that would work, but being accountable for modeling decisions in a budget-conscious environment. Perhaps he'd been in her shoes...
After reading the chapter, she learned several concepts that helped her decision making. She decided to focus for now on Active Learning and OED, as her main goal was to design experiments to improve either a data-driven or mechanistic model:
- Active Learning is flexible and works well with a variety of machine learning models.
- OED provides rigorously optimal experiment designs when applicable, but requires more theoretical model constructs or approximations to the model.
In a nutshell, both approaches could be used to iteratively improve the model.
For a mechanistic model, OED would typically be used to estimate parameters as efficiently as possible. However, unless she was using a relatively simple model, she'd likely need additional derivations or simplifications to benefit from OED's rigor. For a machine learning model, OED would not typically be used because the information matrix is not readily constructed, or would require additional approximation steps, such as local approximations and linearization.
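For intuition, the information matrix is easy to see in the linear case. A minimal sketch, assuming a hypothetical linear model y = Xβ with synthetic candidate conditions: the Fisher information is proportional to XᵀX, and D-optimal design greedily picks the candidates that maximize its determinant.

```python
import numpy as np

# Toy D-optimal OED for a linear model y = X @ beta + noise.
# For linear models, the Fisher information matrix is proportional to X^T X,
# and D-optimality selects design points maximizing det(X^T X).

rng = np.random.default_rng(0)
# Hypothetical candidate experimental conditions (50 candidates, 3 factors).
candidates = rng.uniform(-1, 1, size=(50, 3))

def d_optimal_greedy(candidates, n_select):
    """Greedily pick candidate rows that maximize det of the information matrix."""
    selected, remaining = [], list(range(len(candidates)))
    for _ in range(n_select):
        best_i, best_det = None, -np.inf
        for i in remaining:
            X = candidates[selected + [i]]
            # Small ridge term keeps the determinant defined for early picks.
            info = X.T @ X + 1e-6 * np.eye(X.shape[1])
            d = np.linalg.det(info)
            if d > best_det:
                best_i, best_det = i, d
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

design = d_optimal_greedy(candidates, n_select=5)
```

For nonlinear mechanistic models, X would be replaced by a sensitivity (Jacobian) matrix obtained by linearizing around current parameter estimates – exactly the extra derivation step noted above.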
While the mathematical rigor of OED was enticing, Becky wanted to keep things simple for now. More importantly, she didn't want to feel constrained in her choice of model because of potential concerns of complicating the OED steps later.
So, she decided to base her experiment design on Active Learning.
She found that pool-based sampling fit her situation naturally: suggest what sequences to test in a batch of 20 samples, get results, update model, design the next round.
She also determined that the biggest challenge was how precisely the model could predict activity. In other words, how certain was the model about the predicted function? And if the model was highly uncertain, experiments would ideally be designed to minimize prediction uncertainty with the fewest experiments.
For this reason, she chose uncertainty sampling as the active learning objective.
Becky came up with the following experimental plan:
- five iterations total of experiments, with 20 samples tested each iteration
- first four iterations focused on model improvement through uncertainty sampling
- last iteration focused on performance, testing the top 20 predicted proteins.
From a preliminary analysis of the company's historical data, Becky estimated that 40-60% fewer samples would be needed to reach 80% model accuracy compared to random sampling. This estimate could be demonstrated with a simple retrospective, in silico experiment.
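Such a retrospective experiment can be sketched in a few lines. This is a minimal illustration, not Becky's actual analysis: it assumes synthetic features and labels standing in for historical assay data, uses scikit-learn's random forest as the surrogate model, and compares pool-based uncertainty sampling against a random-sampling baseline over batches of 20.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
# Hypothetical stand-ins for historical data: featurized sequences and
# a binary "functional" label.
X_pool = rng.normal(size=(400, 8))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

def run_rounds(strategy, n_rounds=4, batch=20, seed=0):
    """Retrospective pool-based active learning: label in batches, refit, repeat."""
    rng = np.random.default_rng(seed)
    # Seed the labeled set with one example of each class, then random picks.
    init = [int(np.flatnonzero(y_pool == 0)[0]), int(np.flatnonzero(y_pool == 1)[0])]
    rest = np.setdiff1d(np.arange(len(X_pool)), init)
    labeled = init + rng.choice(rest, size=batch - 2, replace=False).tolist()
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    for _ in range(n_rounds):
        model.fit(X_pool[labeled], y_pool[labeled])
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        if strategy == "uncertainty":
            # Most uncertain = predicted probability closest to 0.5.
            proba = model.predict_proba(X_pool[unlabeled])[:, 1]
            picks = unlabeled[np.argsort(np.abs(proba - 0.5))[:batch]]
        else:  # random-sampling baseline
            picks = rng.choice(unlabeled, size=batch, replace=False)
        labeled = labeled + picks.tolist()
    model.fit(X_pool[labeled], y_pool[labeled])
    return model.score(X_pool, y_pool)

acc_uncertainty = run_rounds("uncertainty")
acc_random = run_rounds("random")
```

Plotting accuracy per round for each strategy gives the learning curves from which a "samples to reach 80% accuracy" comparison can be read off.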
Strain design
For strain design, Becky knew she wanted to use a mechanistic, genome-scale metabolic network. For one, she'd found a published metabolic model for the microbe the wet lab team preferred to use. The model seemed to be of high quality, which she confirmed by downloading it and running it through MEMOTE.
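At its core, a genome-scale metabolic model is analyzed by flux balance analysis (FBA): a linear program that maximizes an objective flux (typically biomass) subject to steady-state mass balance S·v = 0 and flux bounds. In practice one would load the published SBML model with a package like COBRApy; the sketch below instead solves a hypothetical three-reaction toy network with SciPy so the linear program itself is visible.

```python
import numpy as np
from scipy.optimize import linprog

# Toy flux balance analysis (FBA): maximize a "biomass" flux subject to
# steady-state mass balance S @ v = 0 and flux bounds.
# Hypothetical 2-metabolite, 3-reaction network:
#   R1: -> A   (substrate uptake, capped at 10)
#   R2: A -> B (conversion)
#   R3: B ->   (proxy for biomass/secretion objective)
S = np.array([
    [1, -1,  0],   # mass balance on metabolite A
    [0,  1, -1],   # mass balance on metabolite B
])
c = [0, 0, -1]                       # linprog minimizes, so negate the objective
bounds = [(0, 10), (0, None), (0, None)]
res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds)
fluxes = res.x
# Expected: steady state forces v1 = v2 = v3, so all fluxes hit the uptake cap of 10.
```

The pcFBA and ME variants in the tables below extend this same linear program with proteome-allocation and gene-expression constraints, which is where the extra data requirements come from.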
Having chosen to use a mechanistic metabolic model, she now turned to the strain design strategy.
From the book, Becky recalled the chapter on strain design. She knew she could apply the Wishful Thinking workflow. This meant starting with the outcomes. She could think of several:
- maximize titer
- maximize protein expression
- optimize protein secretion
- maintain host cell fitness
What wasn't clear yet was whether the goals were all aligned or if any competed with each other. She decided that maximizing titer would be the top-level goal. To maximize titer, the cell would need to increase protein expression, secretion, and cell fitness. And in order to optimize all of these objectives, the cell would need to allocate limited cellular resources. She also knew that titer depended on the rate of both biomass synthesis and protein secretion for a defined culture time.
Based on all these considerations, Becky decided that a dynamic, proteome allocation model would be ideal. She also needed to incorporate protein secretion pathways.
The only problem: there was no detailed multi-scale model available for the chosen host organism. She decided to map out the expected outcomes and required time and resources for each option.
Table 1. Requirements for models. FBA: flux balance analysis, pcFBA: protein constrained FBA, ME: metabolism and macromolecular expression (multi-scale)
Factor | FBA | pcFBA | ME |
---|---|---|---|
Time to build model | Available now | 2-4 weeks | 3-6 months |
Time per simulation | 0.1 sec | 1 sec | 10 min |
Uniqueness of asset | Common | Less common | Very rare |
Data needed | model available | gene-protein-reaction association, rate constants | pcFBA data + protein complex stoichiometry, translocation information, proteostasis network, transcription unit organization |
The requirements increased sharply across the options. Notably, pcFBA could be put to use quickly by assuming "basal" rate constants for most reactions. That said, additional time could be spent estimating rate constants more precisely.
The multi-scale ME-model, while the most comprehensive in scope, clearly required additional time to reconstruct that expanded scope.
Table 2. Scope of mechanistic models
Mechanism | FBA | pcFBA | ME |
---|---|---|---|
Metabolism | ✔ | ✔ | ✔ |
Proteome allocation | ✘ | ✔ | ✔ |
Protein expression | ✘ | ✘ | ✔ |
Protein modification | ✘ | ✘ | ✔ |
Protein folding | ✘ | ✘ | ✔ |
Protein secretion | ✘ | ✘ | ✔ |
Transcriptomics analysis | Indirect | Semi-direct | Direct |
Proteomics analysis | Indirect | Direct | Direct |
RNA vs. protein bottleneck differentiation | ✘ | ✘ | ✔ |
In terms of mechanistic scope, the winner was clear: multi-scale ME models.
The question now was whether the additional predictive capability was worth 3-6 months of development!
Becky knew it would be a hard sell.
It was exceedingly rare to get 3-6 months of protected time to build any platform technology. Especially if it was decoupled from wet lab data.
Wait. Decoupled from data.
That was the key – whichever model was being developed, it had to interface tightly with the experiments.
And it had to be developed iteratively, as results and insights were being generated.
The breakthrough
Brimming with excitement, Becky drafted a plan for strain design, inspired by the trinity of techniques:
- Design optimal experiments using pcFBA and active learning. This would require a lead time of 2 weeks for initial model testing.
- Build a draft ME model and compare it with the pcFBA model. The ME model draft would share many of the same kinetic parameters as the pcFBA model, and the gene expression pathways would be semi-automatically generated.
- Design experiments to identify the parameters the two models agree least on.
- Update both models using newly collected fermentation data.
- Schedule RNA-Seq experiments at iteration 3.
- Perform another iteration, having updated both models with fermentation and RNA-Seq data.
- Perform final iteration to maximize strain performance, based on a "committee" of the two models.
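The "agree least" selection step in the plan above is essentially query-by-committee. A minimal sketch, assuming hypothetical titer predictions from the two models over a pool of candidate strain designs, with batches of 4 matching the lab's build capacity:

```python
import numpy as np

# Query-by-committee sketch: given titer predictions for candidate strain
# designs from two models (hypothetical pcFBA and draft ME outputs),
# select the batch of designs the models disagree on most.
rng = np.random.default_rng(2)
pred_pcfba = rng.uniform(0, 5, size=100)                  # hypothetical titers (g/L)
pred_me = pred_pcfba + rng.normal(0, 0.5, size=100)       # correlated but noisy

def select_disagreement_batch(pred_a, pred_b, batch=4):
    """Return indices of the candidate designs where the two models disagree most."""
    disagreement = np.abs(pred_a - pred_b)
    return np.argsort(disagreement)[::-1][:batch]

batch = select_disagreement_batch(pred_pcfba, pred_me)
```

Testing these designs yields the data most likely to discriminate between the two models, so each fermentation round improves both assets at once.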
Becky was excited to pitch this project charter to her boss.
She'd always wanted to upgrade her own skills by developing a cutting-edge modeling technology. But she'd always been wary of the time commitment and alignment with her company's milestones. Now, she'd found a way to achieve both.
She was also confident that she'd arrived at an iterative pipeline with distinct advantages:
- provably efficient experiments to improve model accuracy
- reaching predictable performance improvements sooner
- achieving project outcomes while developing a unique asset – dual value creation
Becky bolded these statements in the executive summary and emailed her charter describing the combined pipeline.
Closing
Despite it being her second week on the job, she'd already figured out a creative way to build an asset that would help the company and that excited her. She wondered what other useful tidbits were in "The Computational Biologist." She promised herself she'd read another chapter tomorrow. She couldn't wait for the full release.
Curious about The Computational Biologist ebook that helped Becky thrive in her role from week 1?
Join the newsletter
Get access to exclusive chapter previews, computational biology insights, and be first to know when the full book launches.