The Side Project - a short story

Two months had passed since Becky had joined MidFerment Corp. as a Computational Biologist. Her project charter, improving biotherapeutics titer using active learning combined with data-driven and mechanistic modeling, had been approved.
The project was moving well, with the first round of experiment designs being built and tested. Becky was eager to see how well her experiment designs would reduce her model's prediction uncertainty.
In the meantime, a new side-project had been placed on her desk.
Her company had been experimenting with a new chassis strain, which could convert industrial waste carbon into valuable compounds. The organism's metabolic network was characterized, and a genome-scale metabolic network model was already available.
The challenge was that the transcriptional regulatory network (TRN) was largely uncharacterized. Their strain had over 100 transcription factors, and multiple sigma factors, all organized in a complex hierarchical network.
Because of this, the wetlab team was always a little suspicious about any model-based designs. "Your model doesn't account for regulation. Regulation is the main factor," they'd often say.
Becky knew that wasn't entirely true. Even though the TRN wasn't explicitly modeled, the metabolic simulations still accounted for the 'outcome' of regulation – the fittest phenotype.
But Becky wasn't content with this argument. She wanted to model the TRN, and combine it with the metabolic model.
She'd read up on the latest advances in TRN modeling. She found a vibrant field in studying regulation using single-cell transcriptomics in human cells. A lot of exciting work was being done with generative AI over there.
She was envious because she knew there was little chance the team could get single-cell omics for this organism. The organism's distinct morphology alone created R&D challenges for both droplet-based and probe-based single-cell work.
Nonetheless, the company had accumulated around 8 samples of bulk transcriptomic data across historical projects. Of course, none of the historical data had been collected for the purpose of model development, so Becky knew they wouldn't be optimally informative. But it was still probably enough to create a proof of concept model.
Using the "Wishful thinking" paradigm, Becky wrote her high-level program for the side project.
Stakeholder outcomes
- suggest which transcription factors to edit to increase product yield
High-level program
def suggest_TF_edits(strain, omic_data, fermentation_data):
    best_TF_edits = optimize_TF_edits(strain, omic_data, fermentation_data)
    predicted_yield = predict_yield(strain, best_TF_edits)
    return best_TF_edits, predicted_yield

def optimize_TF_edits(strain, omic_data, fermentation_data):
    # First, learn the TRN model from available data
    trn_model = learn_TRN(strain, omic_data, fermentation_data)
    # Combine the TRN model with the metabolic model
    # Suggest optimal TF edits to maximize yield
    pass

def predict_yield(strain, TF_edits):
    expression = predict_gene_expression(strain, TF_edits)
    # Map predicted expression to flux constraints and simulate yield
    pass

def learn_TRN(strain, omic_data, fermentation_data):
    pass
Becky knew that the crux of the problem would be to learn the TRN from the limited transcriptomics and fermentation data. She was excited to see what advanced AI methods could be applied to model regulatory networks.
Microbial regulatory network inference
With the help of Perplexity, Becky quickly came up with a few candidate approaches for learning the TRN from omics and fermentation data.
Perplexity's Output using "DeepSeek model hosted in US"
Here's a comparative analysis of the transcriptional regulatory network inference methods from the provided research, focusing on sample/feature scalability and methodological innovations:
Method | Samples Tested | Features Tested | Key Innovation
---|---|---|---
PGBTR [1] | 50 conditions | ~100 TFs | CNN architecture with PDGD input generation for stable TRN prediction
ROSE [2] | 128 samples | 2,146 genes | In vitro transcription coupled with multi-omic validation for promoter characterization
GRNN [3][15] | 8 experimental setups | 22 genes | Neural network mapping of gene interaction dynamics in P. aeruginosa
CoryneRegNet 7 [4] | 224 strains | 82,268 regulatory interactions | Genome-scale TRN transfer across Corynebacterium species
mosGraphGPT [5] | 3,592 (pre-train), 128 (fine-tune) | 2,121 genes, 26,305 relations | Foundation model for multi-omic graph integration using attention mechanisms
SPREd [6] | 50-100 conditions, 250 GRNs | 100-5,000 features, 500 genes/GRN | Simulation-supervised neural network with combinatorial regulation modeling
Key Observations:
- Scale Variance: Methods range from focused studies (GRNN's 22 genes) to large-scale analyses (CoryneRegNet's 82k interactions)
- Data Strategies:
- PGBTR and SPREd use synthetic data augmentation (50-100 conditions)
- mosGraphGPT employs transfer learning from cancer to Alzheimer's data
- ROSE combines in vitro/in vivo validation with multi-omic integration
- Architectural Trends:
- Shift from traditional ML (SVM in earlier methods) to deep learning (CNN/GNN)
- Emerging use of foundation models (mosGraphGPT) and neural network analogies (GRNN)
- Biological Validation:
- ROSE achieved 67.4% agreement with RegulonDB
- GRNN showed 89% accuracy matching experimental pyocyanin production data
- PGBTR demonstrated 15-20% performance improvement over existing methods
Recent methods show improved handling of:
- Feature-to-sample ratios up to 1:75 (5k features:65 samples)[13]
- Non-linear regulatory relationships through deep learning architectures
- Multi-omic integration capabilities
- Cross-species regulatory pattern transfer
A few methods caught Becky's eye.
- The transformer-based method (mosGraphGPT) looked exciting – she'd been meaning to apply transformers in her projects. However, the foundation model had been trained on human data – The Cancer Genome Atlas, and fine-tuned for Alzheimer's Disease. She was quite certain there wasn't enough data to train her own. Skip for now.
- Several methods (PGBTR, GRNN, SPREd) used neural networks, in some cases leveraging graph neural networks. That seemed intuitive. Also, some of the studies used data sets of a size similar to her own, for example, 50 conditions and 100 TFs. While she had fewer data points (8), she was optimistic the approach could work with some adaptation.
Excited to have found a promising avenue, Becky downloaded the open-source code from the papers. She was determined to have a working prototype in a week, and then start training in earnest in two weeks.
The time crunch
On Wednesday, Becky arrived at her home office (a short 20 steps from the kitchen – it was a remote work day) and opened up her email.
She found an email from Luke, the lead strain developer, with the subject line,
Urgent: Please send your list of TF edits by Friday.
Becky froze. She had spent all of last night fighting with python, torch_geometric, and other dependencies. She had allotted all of today to reproducing the GRNN paper, and after verifying the code, she'd apply it to the real company data set tomorrow.
Even she knew this was an optimistic plan. She'd worked with open-source publication code enough times to know her real timeline for the first step could range from 2 days to 2 weeks. And then, making the method work with in-house data could take another week or two.
There was no way she'd get suggestions in by Friday.
Becky was disheartened. She really didn't want to let the team down. She also wanted to show them that AI could extract fresh insights from historical data sets. That through the eyes of AI, they'd learn something new that the human eye hadn't picked up.
That was the key.
Becky realized she just had to provide a minimum additional insight (MAI) from the data set in hand. After all, any new insight had to come from analyzing all the data sets in combination. That's what was so challenging to do manually, and what AI excelled at: extracting patterns from big data.
With this new angle, Becky thought, "What if I use something simpler?"
And then she remembered a conversation she'd had a few days ago with Ray, a friend and kind of mentor, when they met for coffee at the Crow Rock Cafe by the beaches.
Understanding simple
"What's the most advanced machine learning method you can think of?" Ray asked Becky while sipping his coffee.
"Umm..." Becky thought for a moment. "Maybe DeepSeek?"
"I see." said Ray. "Why do you believe it's the most advanced method?"
"Oh..." Becky said. "Probably because of its reasoning capabilities. And the fact that it could beat benchmarks with fewer resources."
"Okay. What do you think makes DeepSeek so advanced?" Ray asked.
"Oh, umm..." Becky thought hard. "It certainly has a lot of parameters. And like the o1 model, it thinks through problems in chunks using chain-of-thought reasoning. I also read that it uses a mixture of experts system."
Becky admitted, "I'm actually not entirely sure."
"I'll have to read the paper one of these days," she chuckled.
Ray smiled. "Of course."
"Now, what's the simplest machine learning method you can think of?" he asked.
"Hmm..." Becky paused.
"Probably a linear model", Becky said, this time with more certainty.
"And what makes it simple?" said Ray.
"All the parameters appear linearly." Becky said, this time with confidence.
"That's right." said Ray. "You know exactly what makes the linear model a linear model. So you know why it works, and when it doesn't work – when linear isn't enough."
He continued. "Compare that to our discussion about the advanced models. We see that they work. But do we really know why, as users? More importantly, if it doesn't work for some problem, do we know why not? And what we can do about it?"
"As a computational biologist, you're job is to apply the ideal tool that creates value for your organization. Sometimes this means vetting a new and exciting tool like DeepSeek. Other times, it means knowing exactly why a technique isn't working, and finding the shortest path to a solution."
That day, Ray imparted a valuable lesson to Becky:
- It's hard to know why a complex model works, and easy to know why a simple model works.
- It's harder to know why a complex model does NOT work. It's easier to know why a simple model does NOT work.
If you understand a model deeply enough, when it doesn't work, you can find the shortest path to a working solution.
Leveraging simplicity
Becky knew what to do now. She decided to come up with the simplest model of the TRN that could be learned from the available data. She'd then use that simple model to come up with predictions she could explain fully.
What was the simplest model of a TRN? A graph convolutional network? An autoencoder? A transformer? Probably not.
Perhaps a Bayesian network. She'd had some experience with those in a paper she'd authored.
She then remembered.
A probabilistic Boolean network. It didn't get much simpler than that.
The nodes would represent genes and TFs; the edges, the regulatory interactions. The network was Boolean because genes and TFs could be 0 (off) or 1 (on). Not expressed or expressed. Active or not active.
And the probabilistic nature came from the edges, which represented the probability of an interaction between nodes.
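To make that concrete, here is a toy illustration, not Becky's eventual model: a single activating edge from a TF to a target gene, which "fires" with its edge probability on each update. The names and the 0.8 probability are made up.

import random

edge_probs = {("TF_A", "gene_B"): 0.8}   # P(the edge is active on a given update)
state = {"TF_A": 1, "gene_B": 0}         # Boolean expression states: 1 = on, 0 = off

def update(state, edge_probs):
    new_state = dict(state)
    for (tf, target), p in edge_probs.items():
        if random.random() < p:            # the edge fires with its probability
            new_state[target] = state[tf]  # an activating edge copies the TF's state
    return new_state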
This was it. She could build one of these quickly, and using only the 8 samples she had.
Best of all, the method was lightweight enough that she could integrate it readily with the metabolic model – Chandrasekaran and Price had shown this in 2010.
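Their PROM approach softens the flux bounds of reactions controlled by a regulated gene in proportion to the probability that the gene stays on when its regulator is perturbed. A simplified sketch of that coupling, assuming a cobrapy model and a pre-computed conditional probability (apply_tf_knockout and target_reactions are illustrative names, not the published implementation):

import cobra

def apply_tf_knockout(model, target_reactions, p_on_given_tf_off):
    """PROM-style (simplified): scale the flux bounds of reactions controlled by
    a target gene by P(target = on | TF knocked out)."""
    for rxn_id in target_reactions:
        rxn = model.reactions.get_by_id(rxn_id)
        rxn.upper_bound = rxn.upper_bound * p_on_given_tf_off
        if rxn.lower_bound < 0:  # also soften the reverse direction if reversible
            rxn.lower_bound = rxn.lower_bound * p_on_given_tf_off
    return model.optimize()  # run FBA with the softened constraints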
She proceeded to build her basic model.
import torch

class PROBBaseline:
    def __init__(self):
        self.tf_states = {}
        self.edge_probs = {}
        self.expression_thresholds = {}

    def fit(self, data):
        num_nodes = data.x.shape[0]
        # Global threshold, same for all genes
        expression_threshold = torch.median(torch.abs(data.x))
        # Calculate continuous edge probabilities
        for i in range(num_nodes):
            for j in range(num_nodes):
                if i != j:
                    # Negative edge: P(target=on | TF=off)
                    numerator = (
                        (torch.abs(data.x[i]) < expression_threshold)
                        & (torch.abs(data.x[j]) >= expression_threshold)
                    ).sum()
                    denominator = (torch.abs(data.x[i]) < expression_threshold).sum()
                    P_on_off = numerator / denominator
                    # Positive edge: P(target=on | TF=on)
                    numerator = (
                        (torch.abs(data.x[i]) >= expression_threshold)
                        & (torch.abs(data.x[j]) >= expression_threshold)
                    ).sum()
                    denominator = (torch.abs(data.x[i]) >= expression_threshold).sum()
                    P_on_on = numerator / denominator
                    # Combined edge weight
                    edge_weight = P_on_on - P_on_off
                    if torch.isnan(edge_weight):
                        edge_weight = 0.
                    self.edge_probs[(i, j)] = edge_weight

    def predict(self, data):
        num_nodes = data.x.shape[0]
        adj_pred = torch.zeros((num_nodes, num_nodes))
        for i in range(num_nodes):
            for j in range(num_nodes):
                if i != j:
                    adj_pred[i, j] = self.edge_probs.get((i, j), 0.0)
        return adj_pred
To validate the model, Becky loaded an E. coli TRN data set, randomly sampled smaller subsets of the data, and computed the AUROC and precision-recall curve.
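In sketch form, that validation loop might have looked something like this, assuming the E. coli gold standard is loaded as a 0/1 adjacency matrix and expression comes as a genes-by-samples tensor; the evaluate_subset helper and the SimpleNamespace stand-in for a data object are illustrative, not Becky's exact code.

from types import SimpleNamespace
import torch
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_subset(expr, true_adj, n_samples=8, seed=0):
    """Fit the baseline on a random subset of samples and score edge recovery."""
    torch.manual_seed(seed)
    cols = torch.randperm(expr.shape[1])[:n_samples]       # mimic the 8-sample regime
    data = SimpleNamespace(x=expr[:, cols])                 # minimal stand-in for a data object
    model = PROBBaseline()
    model.fit(data)
    pred = model.predict(data)
    mask = ~torch.eye(true_adj.shape[0], dtype=torch.bool)  # ignore self-edges
    y_true = true_adj[mask].numpy()
    y_score = pred[mask].abs().numpy()
    return roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score)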
In two hours, she had a verified prototype.
It was only 1pm on a Wednesday. She felt elated. She'd have all of tomorrow to analyze the in-house data and generate new TF edits.
She anticipated challenges, of course. For example, she guessed class imbalances would arise and throw a wrench in the precision-recall curve because of the sparse nature of TRNs. And she'd need to either do a lot of sampling or run some kind of metaheuristic optimization to find "optimal" edits to suggest.
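The sampling fallback could be as simple as random search over small sets of TF edits, scored by whatever yield predictor the combined model exposes. A sketch, where score_edits is a hypothetical wrapper around the TRN-plus-metabolic model:

import random

def random_search(tf_names, score_edits, n_iter=1000, max_edits=3, seed=0):
    random.seed(seed)
    best_edits, best_yield = None, float("-inf")
    for _ in range(n_iter):
        edits = random.sample(tf_names, k=random.randint(1, max_edits))
        y = score_edits(edits)            # predicted product yield for this edit set
        if y > best_yield:
            best_edits, best_yield = edits, y
    return best_edits, best_yield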
Becky was confident she could address these challenges. After all, these were all known unknowns. And she knew the computational biology techniques to address them, in no small part thanks to The Computational Biologist book.
Most importantly, she knew that fresh insights that had gone untapped for months were now within reach.
Today was a win for simplicity.
Enjoyed the article? Curious about The Computational Biologist ebook that helped Becky cut through complexity and leverage simplicity?
Join the newsletter
Get access to exclusive chapter previews, computational biology insights, and be the first to know when the full book launches.