Anthropic shows Claude as a bioinformatics partner, but governance is the real story
On April 29, Anthropic published BioMysteryBench, a new evaluation of how far Claude has moved into practical bioinformatics work.
This is not a normal multiple-choice model benchmark. BioMysteryBench contains 99 questions created by domain experts on real bioinformatics data. Claude works in a container with standard bioinformatics tools, can install packages through pip and conda, and can access databases such as NCBI and Ensembl.
Fact: Anthropic says current Claude generations perform on par with human experts on several tasks, and that the newest models solved some problems a panel of experts could not. After quality control, Anthropic had 76 human-solvable and 23 “human-difficult” tasks. On the difficult set, Claude Mythos Preview reached a 30 percent solve rate.
The leadership point is not the number alone. The point is that AI is moving from text assistant to active research worker: it can read data, choose methods, write analysis code, fetch reference resources, and combine several lines of evidence before answering. For Norwegian and European organizations in healthcare, aquaculture, industrial biotech, universities, and public R&D, AI needs to be managed as research infrastructure, not as an IT side project.
Anthropic also shows the limitation. On hard tasks, the wins were less stable. Counting a task as solved only when the model got it right in at least four of five attempts, Opus 4.6 reached 86 percent on the human-solvable set but only 44 percent on the human-difficult set. For Sonnet 4.6, the same reliability measure fell from 75 to 22 percent. That is the governance signal: a model may find an answer a human expert missed, but the answer is not automatically robust.
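The reliability criterion above, counting a task as solved only when the model gets it right in at least four of five attempts, can be illustrated with a minimal sketch. The run data below is invented for illustration and is not Anthropic's data:

```python
# Minimal sketch of a "correct in at least 4 of 5 runs" reliability metric.
# The per-task run outcomes below are invented for illustration only.

def reliable_solve_rate(runs_per_task, threshold=4):
    """Fraction of tasks solved in at least `threshold` of their runs."""
    solved = sum(1 for runs in runs_per_task if sum(runs) >= threshold)
    return solved / len(runs_per_task)

# Each inner list is one task's five runs: 1 = correct, 0 = incorrect.
example_runs = [
    [1, 1, 1, 1, 1],  # 5/5 correct -> counts as reliably solved
    [1, 1, 1, 1, 0],  # 4/5 correct -> counts
    [1, 1, 0, 1, 0],  # 3/5 correct -> does not count
    [0, 0, 1, 0, 0],  # 1/5 correct -> does not count
]

print(reliable_solve_rate(example_runs))  # 2 of 4 tasks -> 0.5
```

The point of the stricter cut-off is visible even in this toy data: tasks the model "can" solve sometimes drop out once you demand consistency across repeated runs.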
The CIO consequence is concrete. Build closed research sandboxes with logging, data lineage, approved tools, export controls, and human scientific accountability before agents are allowed to work on internal research data. Do not measure the value only in saved hours. Measure reproducibility, validation, source use, and whether the model actually improves decision quality.
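The sandbox checklist above can be sketched as a hypothetical pre-flight policy gate. Every field name and rule here is an assumption for illustration, not a real vendor API or any specific product:

```python
# Hypothetical pre-flight check before an agent may touch internal research
# data. All control names and the request format are illustrative assumptions.

REQUIRED_CONTROLS = {
    "logging",                # every agent action is recorded
    "data_lineage",           # inputs and outputs are traceable
    "approved_tools",         # only vetted packages and databases
    "export_controls",        # no uncontrolled data leaving the sandbox
    "human_accountability",   # a named scientist signs off on results
}

def sandbox_approved(request: dict) -> bool:
    """Approve an agent run only if every governance control is in place."""
    controls = set(request.get("controls", []))
    return REQUIRED_CONTROLS <= controls

run_request = {
    "project": "internal-genomics-analysis",  # invented example
    "controls": ["logging", "data_lineage", "approved_tools",
                 "export_controls", "human_accountability"],
}

print(sandbox_approved(run_request))  # True: all required controls present
```

The design choice worth noting is the direction of the default: the gate rejects anything that does not explicitly declare every control, rather than allowing runs until a problem is found.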
The board and CEO consequence is equally clear. If the organization holds biological data, health data, or sensitive R&D, the AI policy should be updated before researchers bring their own tools. This is about productivity and risk at the same time: wrong conclusions, data leakage, dual-use exposure, and unclear IP can matter as much as model cost.
Assessment: BioMysteryBench is a strong signal that frontier models are becoming relevant in specialized R&D. It is not proof that AI can replace scientists. It is proof that leaders need a controlled workspace where researchers, data, models, and validation are managed together.
📬 Did you like this?
AI news for leaders. Curated by a CIO who builds it himself. Daily in your inbox.