The Side Project - a short story


Two months had passed since Becky had joined MidFerment Corp. as a Computational Biologist. Her project charter for improving biotherapeutic titer using active learning, data-driven modeling, and mechanistic modeling had been approved.

The project was moving well, with the first round of experiment designs being built and tested. Becky was eager to see how well her experiment designs would reduce her model's prediction uncertainty.

In the meantime, a new side project had been placed on her desk.

Her company had been experimenting with a new chassis strain, which could convert industrial waste carbon into valuable compounds. The organism's metabolic network was characterized, and a genome-scale metabolic network model was already available.

The challenge was that the transcriptional regulatory network (TRN) was largely uncharacterized. Their strain had over 100 transcription factors, and multiple sigma factors, all organized in a complex hierarchical network.

Because of this, the wetlab team was always a little suspicious about any model-based designs. "Your model doesn't account for regulation. Regulation is the main factor," they'd often say.

Becky knew that wasn't entirely true. Even though the TRN wasn't explicitly modeled, the metabolic simulations still accounted for the 'outcome' of regulation – the fittest phenotype.

But Becky wasn't content with this argument. She wanted to model the TRN, and combine it with the metabolic model.

She'd read up on the latest advances in TRN modeling. She found a vibrant field studying regulation in human cells using single-cell transcriptomics. A lot of exciting work was being done there with generative AI.

She was envious because she knew there was little chance the team could get single-cell omics for this organism. Their distinct morphology alone created R&D challenges for both droplet and probe-based single cell work.

Nonetheless, the company had accumulated eight samples of bulk transcriptomic data across historical projects. Of course, none of the historical data had been collected for the purpose of model development, so Becky knew they wouldn't be optimally informative. But it was still probably enough to create a proof-of-concept model.

Using the "Wishful thinking" paradigm, Becky wrote her high-level program for the side project.

Stakeholder outcomes

  • suggest which transcription factors to edit to increase product yield

High-level program

def suggest_TF_edits(strain, omic_data, fermentation_data):
    best_TF_edits = optimize_TF_edits(strain, omic_data, fermentation_data)
    predicted_yield = predict_yield(strain, best_TF_edits)
    return best_TF_edits, predicted_yield

def optimize_TF_edits(strain, omic_data, fermentation_data):
    # First, learn the TRN model from available data
    trn_model = learn_TRN(strain, omic_data, fermentation_data)
    # Combine the TRN model with the metabolic model
    # Suggest optimal TF edits to maximize yield
    pass

def predict_yield(strain, TF_edits):
    expression = predict_gene_expression(strain, TF_edits)
    pass

def learn_TRN(strain, omic_data, fermentation_data):
    pass

Becky knew that the crux of the problem would be to learn the TRN from the limited transcriptomics and fermentation data. She was excited to see what advanced AI methods could be applied to model regulatory networks.

Microbial regulatory network inference

With the help of Perplexity, Becky quickly came up with a few candidate approaches for learning the TRN from omics and fermentation data.

Perplexity's Output using "DeepSeek model hosted in US"

Here's a comparative analysis of the transcriptional regulatory network inference methods from the provided research, focusing on sample/feature scalability and methodological innovations:

Method | Samples Tested | Features Tested | Key Innovation
--- | --- | --- | ---
PGBTR[1] | 50 conditions | ~100 TFs | CNN architecture with PDGD input generation for stable TRN prediction
ROSE[2] | 128 samples | 2,146 genes | In vitro transcription coupled with multi-omic validation for promoter characterization
GRNN[3][15] | 8 experimental setups | 22 genes | Neural network mapping of gene interaction dynamics in P. aeruginosa
CoryneRegNet 7[4] | 224 strains | 82,268 regulatory interactions | Genome-scale TRN transfer across Corynebacterium species
mosGraphGPT[5] | 3,592 (pre-train), 128 (fine-tune) | 2,121 genes, 26,305 relations | Foundation model for multi-omic graph integration using attention mechanisms
SPREd[6] | 50-100 conditions, 250 GRNs | 100-5,000 features, 500 genes/GRN | Simulation-supervised neural network with combinatorial regulation modeling

Key Observations:

  1. Scale Variance: Methods range from focused studies (GRNN's 22 genes) to large-scale analyses (CoryneRegNet's 82k interactions)
  2. Data Strategies:
    • PGBTR and SPREd use synthetic data augmentation (50-100 conditions)
    • mosGraphGPT employs transfer learning from cancer to Alzheimer's data
    • ROSE combines in vitro/in vivo validation with multi-omic integration
  3. Architectural Trends:
    • Shift from traditional ML (SVM in earlier methods) to deep learning (CNN/GNN)
    • Emerging use of foundation models (mosGraphGPT) and neural network analogies (GRNN)
  4. Biological Validation:
    • ROSE achieved 67.4% agreement with RegulonDB
    • GRNN showed 89% accuracy matching experimental pyocyanin production data
    • PGBTR demonstrated 15-20% performance improvement over existing methods

Recent methods show improved handling of:

  • Feature-to-sample ratios up to ~75:1 (5k features : 65 samples)[13]
  • Non-linear regulatory relationships through deep learning architectures
  • Multi-omic integration capabilities
  • Cross-species regulatory pattern transfer

A few methods caught Becky's eye.

  • The transformer-based method (mosGraphGPT) looked exciting – she'd been meaning to apply transformers in her projects. However, the foundation model had been trained on human data – The Cancer Genome Atlas, and fine-tuned for Alzheimer's Disease. She was quite certain there wasn't enough data to train her own. Skip for now.
  • Several methods (PGBTR, GRNN, SPREd) used neural networks, in some cases leveraging graph neural networks. That seemed intuitive. Also, some of the studies used data sets of a size similar to her own, for example, 50 conditions and 100 TFs. While she had fewer data points (8), she was optimistic it could work with some adaptation.

Excited to have found a promising avenue, Becky downloaded the open-source code from the papers. She was determined to have a working prototype in a week, and then start training in earnest in two weeks.

The time crunch

On Wednesday, Becky arrived at her home office (a short 20 steps from the kitchen – it was a remote work day) and opened up her email.

She found an email from Luke, the lead strain developer, with the subject line,

Urgent: Please send your list of TF edits by Friday.

Becky froze. She had spent all of last night fighting with Python, torch_geometric, and other dependencies. She had allotted all of today to reproducing the GRNN paper, and after verifying the code, she'd apply it to the real company data set tomorrow.

Even she knew this was an optimistic plan. She'd worked with open-source publication code enough times to know her real timeline for the first step could range between 2 days and 2 weeks. And then, making the method work with in-house data could take another week or two.

There was no way she'd get suggestions in by Friday.

Becky was disheartened. She really didn't want to let the team down. She also wanted to show them that AI could extract fresh insights from historical data sets. That through the eyes of AI, they'd learn something new that the human eye hadn't picked up.

That was the key.

Becky realized she just had to provide a minimum additional insight (MAI) from the data at hand. After all, any new insight had to come from analyzing all the data sets in combination. That's what was so challenging to do manually, and what AI excelled at: extracting patterns from big data.

With this new angle, Becky thought, "What if I use something simpler?"

And then she remembered a conversation she'd had a few days ago with Ray, a friend and kind of mentor, when they met for coffee at the Crow Rock Cafe by the beaches.

Understanding simple

"What's the most advanced machine learning method you can think of?" Ray asked Becky while sipping his coffee.

"Umm..." Becky thought for a moment. "Maybe DeepSeek?"

"I see." said Ray. "Why do you believe it's the most advanced method?"

"Oh..." Becky said. "Probably because of its reasoning capabilities. And the fact that it could beat benchmarks with fewer resources."

"Okay. What do you think makes DeepSeek so advanced?" Ray asked.

"Oh, umm..." Becky thought hard. "It certainly has a lot of parameters. And like the o1 model, it thinks through problems in chunks using chain-of-thought reasoning. I also read that it uses a mixture of experts system."

"I'm actually not entirely sure," Becky admitted. "I'll have to read the paper one of these days," she chuckled.

Ray smiled. "Of course."
"Now, what's the simplest machine learning method you can think of?" he asked.

"Hmm..." Becky paused.
"Probably a linear model", Becky said, this time with more certainty.

"And what makes it simple?" said Ray.

"All the parameters appear linearly." Becky said, this time with confidence.

"That's right." said Ray. "You know exactly what makes the linear model a linear model. So you know why it works, and when it doesn't work – when linear isn't enough."

He continued. "Compare that to our discussion about the advanced models. We see that they work. But do we really know why, as users? More importantly, if it doesn't work for some problem, do we know why not? And what we can do about it?"

"As a computational biologist, you're job is to apply the ideal tool that creates value for your organization. Sometimes this means vetting a new and exciting tool like DeepSeek. Other times, it means knowing exactly why a technique isn't working, and finding the shortest path to a solution."

That day, Ray imparted a valuable lesson to Becky:

  • It's hard to know why a complex model works, and easy to know why a simple model works.
  • It's harder to know why a complex model does NOT work. It's easier to know why a simple model does NOT work.

If you understand a model deeply enough, when it doesn't work, you can find the shortest path to a working solution.

Leveraging simplicity

Becky knew what to do now. She decided to come up with the simplest model of the TRN that could be learned from the available data. She'd then use that simple model to come up with predictions she could explain fully.

What was the simplest model of a TRN? A graph convolutional network? An autoencoder? A transformer? Probably not.

Perhaps a Bayesian network. She'd had some experience with those in a paper she'd authored.

She then remembered.

A probabilistic Boolean network. It didn't get much simpler than that.

The nodes would represent genes and TFs. The edges the regulatory interactions. The network was Boolean because genes and TFs could be 0 (off) or 1 (on). Not expressed or expressed. Active or not active.

And the probabilistic nature came from the edges, which represented the probability of an interaction between nodes.

This was it. She could build one of these quickly, using only the 8 samples she had.
Best of all, the method was lightweight enough that she could integrate it readily with the metabolic model – Chandrasekaran and Price had shown this with PROM in 2010.
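
In PROM, the learned conditional probabilities, P(gene=on | TF state), scale the flux bounds of the affected reactions in the metabolic model. A minimal sketch of that idea with cobrapy follows; the model file, reaction ID, and probability values here are hypothetical placeholders.

import cobra

# PROM-style constraint (after Chandrasekaran & Price, 2010):
# scale a reaction's flux bound by P(gene=on | TF knocked out).
model = cobra.io.read_sbml_model("chassis_strain.xml")  # hypothetical model file

p_gene_on = 0.2  # P(gene=on | TF=off), from the probabilistic Boolean network
v_max = 10.0     # the reaction's unconstrained upper bound

rxn = model.reactions.get_by_id("PFK")  # hypothetical reaction mapped to the gene
rxn.upper_bound = p_gene_on * v_max     # soften the bound instead of a hard knockout

solution = model.optimize()  # FBA with regulation-informed bounds
print(solution.objective_value)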

She proceeded to build her basic model.

import torch

class PROBBaseline:
    def __init__(self):
        # Signed edge weights keyed by (TF index, target index)
        self.edge_probs = {}
        self.expression_threshold = None

    def fit(self, data):
        # data.x: (num_genes, num_samples) expression matrix
        num_nodes = data.x.shape[0]

        # Global threshold, the same for all genes
        self.expression_threshold = torch.median(torch.abs(data.x))
        on = torch.abs(data.x) >= self.expression_threshold  # Boolean on/off states

        for i in range(num_nodes):      # i: candidate TF
            for j in range(num_nodes):  # j: candidate target
                if i == j:
                    continue

                # Evidence for a negative edge: P(target=on | TF=off)
                P_on_off = ((~on[i]) & on[j]).sum() / (~on[i]).sum()

                # Evidence for a positive edge: P(target=on | TF=on)
                P_on_on = (on[i] & on[j]).sum() / on[i].sum()

                # Combined signed edge weight in [-1, 1]
                edge_weight = P_on_on - P_on_off
                if torch.isnan(edge_weight):  # TF always (or never) on
                    edge_weight = torch.tensor(0.0)

                self.edge_probs[(i, j)] = edge_weight

    def predict(self, data):
        num_nodes = data.x.shape[0]
        adj_pred = torch.zeros((num_nodes, num_nodes))

        for i in range(num_nodes):
            for j in range(num_nodes):
                if i != j:
                    adj_pred[i, j] = self.edge_probs.get((i, j), 0.0)

        return adj_pred

To validate the model, Becky loaded an E. coli TRN data set, randomly sampled smaller subsets of the data, and computed the AUROC and precision-recall curve.
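
Her check looked roughly like the sketch below: fit on expression data, score every candidate edge, and evaluate against a gold-standard adjacency matrix with scikit-learn. The toy data and stand-in gold standard here are hypothetical; she used the real E. coli TRN.

import torch
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

class Data:  # minimal stand-in for a torch_geometric-style data object
    def __init__(self, x):
        self.x = x

num_genes, num_samples = 30, 8
data = Data(torch.randn(num_genes, num_samples))          # toy expression matrix
gold = (torch.rand(num_genes, num_genes) < 0.05).float()  # sparse stand-in "true" TRN

model = PROBBaseline()
model.fit(data)
scores = model.predict(data)

# Rank off-diagonal edges by |weight|; the sign encodes activation vs. repression
mask = ~torch.eye(num_genes, dtype=torch.bool)
y_true = gold[mask].numpy()
y_score = scores[mask].abs().numpy()

print("AUROC:", roc_auc_score(y_true, y_score))
precision, recall, _ = precision_recall_curve(y_true, y_score)
print("AUPR:", auc(recall, precision))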

In two hours, she had a verified prototype.

It was only 1pm on a Wednesday. She felt elated. She'd have all of tomorrow to analyze the in-house data and generate new TF edits.

She anticipated challenges, of course. For example, she guessed class imbalance would arise and throw a wrench in the precision-recall curve because of the sparse nature of TRNs. And she'd need to either do a lot of sampling or run some kind of metaheuristic optimization to find "optimal" edits to suggest, along the lines of the sketch below.
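
Even a naive random search could serve as a starting point for the optimization piece. A sketch, assuming some predict_yield_fn scoring function is available (the function and its interface are hypothetical):

import random

def random_search_TF_edits(tf_names, predict_yield_fn,
                           num_candidates=1000, max_edits=3):
    # Sample small TF edit sets at random; keep the best by predicted yield.
    best_edits, best_yield = None, float("-inf")
    for _ in range(num_candidates):
        k = random.randint(1, max_edits)
        edits = tuple(sorted(random.sample(tf_names, k)))  # e.g., TFs to knock out
        predicted = predict_yield_fn(edits)
        if predicted > best_yield:
            best_edits, best_yield = edits, predicted
    return best_edits, best_yield

A genetic algorithm or simulated annealing could later replace the random sampler without changing the interface.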

Becky was confident she could address these challenges. After all, these were all known unknowns. And she knew the computational biology techniques to address them, not in small part due to The Computational Biologist book.

Most importantly, she knew that fresh insights that had gone untapped for months were now within reach.

Today was a win for simplicity.


Enjoyed the article? Curious about The Computational Biologist ebook that helped Becky cut through complexity and leverage simplicity?

Join the newsletter

Get access to exclusive chapter previews, computational biology insights, and be first to know when the full book launches.