BiomniBench: How Do We Know When AI Agents Do Good Biomedical Science?
A landmark 2026 benchmark from Stanford and Harvard evaluates LLM agents on 100 real biomedical tasks. Key finding: agent architecture matters more than model choice, and all frontier models consistently fail at method selection and biological reasoning.