
How would an LLM attempt to cure cancer?

github.com/CGATCG/digital_tissue

It is becoming common in biotech to use LLMs to help decide what experiments to run, and sometimes even what higher-level scientific conclusions to draw. That made me want to look more closely at how LLMs make decisions when solving biological problems, such as discovering a cure for a disease.

Different kinds of intelligence are good at different kinds of problems, and humans are still remarkably bad at solving biological ones. We can measure almost everything, generate huge amounts of data, and describe pathways in enormous detail. But when the task is to understand a disease well enough to actually fix it, progress is often slow. So it seemed worth asking whether a system that reasons differently might also search differently.

Test number 1: Solve cancer, make no mistakes…

The obvious test would be to give a model a real disease and let it propose experiments until it arrived at a treatment. But that runs into two problems. If the disease is already well understood, the model can lean on things it absorbed during training. If the disease is unsolved, there is no quick or affordable way to know whether its proposed solution is any good. In either case, it is hard to learn much about how the model is actually reasoning.

So I built a simulation instead.

The simulation is an intestine made of digital cells. Each cell contains genes, RNA, proteins, nutrients, and pathways, and updates its internal state over time. Those internal states then drive growth, division, nutrient transport, aging, and other behaviors. The goal is not to reproduce biology in full detail. The goal is to build a system with the kinds of properties that make biology hard: many interacting parts, noise, and dynamics that feel evolved rather than engineered.

Then I made it sick.

The disease version of this digital intestine contains a cell carrying a hyperactive growth factor. That one change is enough to drive uncontrolled division. A tumor forms, tissue function declines, and eventually the organism dies.
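As a toy illustration of how a single overactive parameter can produce that behavior, here is a sketch of an update-and-divide loop. This is not code from digital_tissue: the production rates, decay constant, and division threshold are invented, and one growth-signal number stands in for the full internal state of a cell.

```python
def simulate(production, steps=20, decay=0.1, threshold=20.0):
    """Toy tissue: each cell is reduced to one growth-signal level.

    Each step, every cell produces growth signal, part of the signal
    decays, and any cell whose signal crosses the threshold divides
    into two daughters with the signal reset.
    """
    cells = [0.0]
    for _ in range(steps):
        next_cells = []
        for signal in cells:
            signal = signal + production - decay * signal
            if signal > threshold:
                next_cells += [0.0, 0.0]  # division: two daughters, signal reset
            else:
                next_cells.append(signal)
        cells = next_cells
    return cells

healthy = simulate(production=1.0)   # signal plateaus at 10, below the threshold
diseased = simulate(production=5.0)  # hyperactive production: runaway division
```

With these invented numbers the healthy tissue stays at a single cell, while the diseased one divides every five steps and ends with 16 cells after 20 steps. That is the qualitative gap between the two conditions, which the real simulation produces in far more detail.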

To win, the model has to restore lifespan to the healthy baseline, and it has a range of tools for doing so. It cannot inspect the hidden rules directly. It has to work more like a scientist. It can run bulk RNA sequencing, proteomics, metabolomics, spatial assays, in vitro experiments on healthy or diseased gut cells, in vivo studies tracking biomarkers and lifespan, intervention tests, drug screens, and coding-based analyses. It can also activate or inhibit specific proteins and observe the consequences.

The model gets only 40 total steps, including both experiments and analysis. So the problem is not just to find a cure, but to find one under constraints: limited information, limited budget, and limited chances to change course.

The task is solvable in more than one way. One straightforward path is to compare healthy and diseased tissue, notice that the growth factor is overproduced, inhibit it, and restore lifespan. That may sound too simple. But real cancers are sometimes like that. Some are driven largely by a single dominant abnormality, and once that driver is identified, targeting it can produce a dramatic benefit.

The point is to ask what a model does when dropped into a biology-like world with incomplete information, messy signals, and real tradeoffs. What does it test first? What does it infer too early? When does it get lost in complexity? And how does this affect its ability to solve the problem?

To make it more fun, I ran models from multiple providers against the problem and turned it into a competition. Here are the results.

LLM cancer benchmark results

Models that question their assumptions can solve the puzzle.

One of the clearest patterns was that many models failed because they didn't question their assumptions.

One of the available experiments was a cell culture screen in which the models could perturb proteins and observe readouts such as cell growth, cell death, and biomarker levels. On its face, this seemed like a powerful way to narrow the search. But the screen did not cover every possible target in the cells. Some secondary proteins were missing, and one of them turned out to be the key to solving the disease.

The models could have discovered this. They had access both to the list of known proteins in the organism and to the list of proteins targeted by the screen. But almost none of them stopped to compare the two. Instead, once the screen produced a set of hits, they tended to accept its boundaries as if they defined the whole problem. From there the pattern was familiar: test the hits, test combinations, vary the dose, move into in vivo work, and keep pushing forward without ever asking whether the screen itself had ruled out the true answer.
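The comparison the models skipped would have been nearly a one-liner. A sketch, with made-up protein identifiers standing in for the simulation's real ones:

```python
# Hypothetical identifiers, for illustration only.
known_proteins = {"GF1", "GFR1", "KIN2", "TF3", "SEC9"}  # everything the organism contains
screen_targets = {"GF1", "GFR1", "KIN2", "TF3"}          # everything the screen perturbs

# Proteins the organism is known to contain but the screen never tests.
untested = known_proteins - screen_targets
print(sorted(untested))  # the screen's blind spot: ['SEC9']
```

One set difference between two lists the models already had access to would have revealed the blind spot before a single screening step was spent.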

Real screens have this same limitation. A compound library or a CRISPR library can easily omit the one perturbation that matters most. But what stood out here was not that the screen was incomplete. It was how rarely the models showed the curiosity to notice. They behaved as though the first frame they were given must also be the correct one. In practice, that bias was severe enough that once a model chose the cell culture screen, its chance of losing rose to about 85%.

There were important exceptions. Gemini 3 Pro and OpenAI with high thinking effort did fall into the screen trap initially, but they were occasionally able to recognize that they were searching inside the wrong frame and pivot in time. That ability, more than anything else, seemed to separate the models that merely pursued leads from the ones that could recover from a bad start.

Models that win study the disease before jumping to treatments.

The easiest way to solve the puzzle is to run proteomics on healthy and diseased tissue, identify the overexpressed protein, and inhibit it.
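In code terms, that winning move amounts to ranking proteins by fold change between conditions. A minimal sketch, with invented protein names and abundance values (the real assay returns per-sample measurements for far more proteins, and a real analysis would also test significance):

```python
from statistics import mean

# Invented per-sample protein abundances, for illustration only.
healthy = {"GF1": [10.2, 9.8, 10.5], "KIN2": [5.1, 4.9, 5.3], "TF3": [2.0, 2.2, 1.9]}
diseased = {"GF1": [41.0, 38.5, 44.2], "KIN2": [5.4, 5.0, 5.2], "TF3": [2.1, 2.0, 2.3]}

# Rank proteins by mean fold change between diseased and healthy tissue.
fold_changes = {p: mean(diseased[p]) / mean(healthy[p]) for p in healthy}
top_hit = max(fold_changes, key=fold_changes.get)  # the candidate to inhibit
```

Here the hypothetical growth factor "GF1" stands out at roughly fourfold overexpression while the other proteins sit near 1.0, which is the kind of signal the straightforward path relies on.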

Some models took this route, but they guessed too much. They chose sampling times almost arbitrarily, sometimes sampling so early that the disease state had not diverged enough to reveal meaningful differences. The models that did better usually began by characterizing the disease itself. They tried to identify when tissue dysfunction began, when it became lethal, and where the useful window for measurement was. Then they sampled around that window. That was one of the most sensible things any of the models did.

Everyone used mediocre experimental design, but the winners were better at extracting signal from underpowered experiments.

Once a model had chosen a plausible time point, it often moved on to omics experiments. But the design was usually weak. Most models picked three or four samples per group and ran the experiment once, which is more or less what many humans would do. Because the simulation included individual-level variation, the results often came back without formal statistical significance. No model performed a power analysis to decide how many samples to use. No model repeated the experiment properly after a pilot run. Faced with ambiguous evidence, they usually did one of two things: discard the finding because it was not significant, or pursue it anyway because it looked biologically interesting.
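The power analysis the models skipped is not exotic. A standard two-sample z-approximation, for instance, gives the per-group sample size needed to detect a given effect; the effect size and noise level below are placeholders, not values from the simulation:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(n, effect, sigma, z_alpha=1.959964):
    """Approximate power of a two-sided, two-sample z-test, n per group."""
    return normal_cdf(effect / (sigma * sqrt(2.0 / n)) - z_alpha)

def samples_needed(effect, sigma, target=0.80):
    """Smallest per-group n reaching the target power."""
    n = 2
    while power(n, effect, sigma) < target:
        n += 1
    return n

# Detecting a 1-sigma shift at 80% power takes about 16 samples per group,
# not the 3-4 most models used.
n = samples_needed(effect=1.0, sigma=1.0)
```

Even this back-of-the-envelope version would have told the models that three or four samples per group can only detect very large effects, and that ambiguous p-values were the expected outcome of their design rather than evidence against a lead.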

The winners were more willing to do the second.

In some sense the winners were less rigorous. But they were better at noticing when a weak signal might still be the right one, and more willing to bet on an effect that was suggestive even if underpowered. In this benchmark, that often turned out to be enough, though my intuition tells me it would be problematic in most real-world scenarios.

For reference, the volcano plots from those experiments usually looked like this:

Volcano plot from omics experiments

LLMs were better than humans.

To get a sense of how the LLMs' decisions compared with those of humans, I asked three friends, all with PhDs in biology, to play the game and try to solve the disease. Surprisingly, none of them was able to win within the 40-step limit.

That result makes me think there is something worth studying here. My next steps are to develop an agentic system that can create its own simulations and improve on mine, as well as a self-recursive research system that can help build an AI capable of learning to control these complex systems without priors.