Harvard Study: LLMs Match or Beat ER Doctors in Triage — What Hospital Leaders Should Do

In a Harvard study, AI offered more accurate diagnoses than emergency room doctors

A Harvard study found that modern large language models (LLMs) matched or exceeded attending physicians on first‑pass ER triage in a narrow, retrospective test — and that should change how hospitals plan AI pilots.

What the researchers did and what they found

Researchers from Harvard Medical School and Beth Israel Deaconess Medical Center compared two OpenAI model variants (called o1 and 4o) to attending emergency physicians using 76 real emergency department cases. The models were fed the exact, unedited text available in each patient’s electronic medical record (EMR) at the time clinicians made their initial triage decisions — no curated prompts, no images, and no extra data.

Diagnoses from the two LLMs and from two attending physicians were rated blind by two additional attending physicians. At the earliest diagnostic touchpoint (triage), the o1 model produced an exact or very close diagnosis in 67% of cases; the two human baselines reached that standard in 55% and 50% of cases, respectively.

“The model outperformed earlier models and the physician baselines across nearly every benchmark the team used.”
— Arjun Manrai, lead author, Harvard Medical School

The authors emphasize that these are retrospective results using text‑only inputs and call for prospective, real‑world trials and governance frameworks before clinical deployment.

“There is currently no formal accountability framework for AI diagnoses, and patients still want human clinicians to guide critical treatment decisions.”
— Adam Rodman, study coauthor, Beth Israel Deaconess Medical Center

Why this matters for hospital leaders

For CIOs, CMOs, and COOs, the finding is not a license to hand triage over to AI agents. It is, however, both an operational alarm bell and an opportunity. A model that matches clinician performance on initial triage, the moment when decisions are made with sparse information, can safely augment throughput, prioritize scarce resources faster, and reduce clinician cognitive load when deployed correctly.

Think of an LLM as a very fast second pair of eyes: it can surface likely diagnoses from pattern recognition and fast access to medical knowledge, especially when notes are short or incomplete. That advantage may matter most during busy shifts, mass-casualty surges, or understaffed nights.

Practical business impacts to evaluate now include potential reductions in time‑to‑diagnosis, improved triage prioritization (and downstream capacity gains), and staffing efficiencies that can be modeled for ROI. But these gains depend on careful pilot design, prospective validation, and a governance program that addresses safety, liability and patient trust.

Limitations and important caveats

The study’s design contains several constraints that limit generalizability:

  • Retrospective and text‑only: Researchers looked back at past cases and used only the written EMR notes — no scans, ECGs, continuous monitors or bedside exams. Real‑time care combines multimodal data.
  • Small sample: The dataset was 76 patients. Effect sizes may change in larger, more diverse populations or different health systems.
  • Scope of “match”: “Exact or very close diagnosis” was the rating metric — how that maps to clinical outcomes, treatment changes or harm avoidance needs prospective outcome data.
  • Missing context: The report does not fully describe the case mix (age, acuity, rare vs. common conditions) that drives how models perform across subgroups.
  • Known LLM failure modes: Hallucination (making up facts), overconfidence, poor calibration, and vulnerability to distributional shift (different documentation styles across hospitals) remain risks.

Concrete micro‑scenario

Example: a patient arrives with brief chest discomfort and non‑specific notes. An LLM flags the case as “possible acute coronary syndrome” and escalates priority. Clinician review confirms high risk and speeds ECG and troponin testing, shortening door‑to‑diagnosis. That single early nudge can materially change throughput and outcomes — but if the model produces false positives too often, it could create alert fatigue or resource waste. Pilot metrics must measure both sensitivity and operational burden.

Pilot checklist for hospital leaders

  • Define scope and success metrics: Set explicit endpoints (e.g., sensitivity for high‑risk conditions, time‑to‑first‑assessment, clinician override rate, patient outcomes).
  • Start in shadow mode: Run the model in parallel to clinicians (no live actions) to measure concordance, false negatives, and workflows without patient exposure.
  • Require multimodal testing before expansion: After text‑only validation, test with imaging, ECG feeds and vitals streams to assess real‑world performance.
  • Human‑in‑the‑loop by default: Design workflows so clinicians see model suggestions with provenance and can accept, modify or dismiss recommendations.
  • Logging and feedback: Implement real‑time logging, clinician reporting of errors, and a feedback loop for continuous improvement.
  • Legal and compliance: Engage legal counsel early to address liability, HIPAA data flow, and Business Associate Agreements (BAAs) with vendors.
  • Equity testing: Stratify performance by age, race, language and socioeconomic factors to detect bias.
  • Budget for governance: Allocate resources for prospective trials, monitoring, red‑team testing, and a post‑market surveillance plan.
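The shadow-mode and logging steps above can be sketched as a minimal concordance tracker. This is an illustrative sketch only, not a vendor API: the `ShadowCase` schema, field names, and sample cases are all hypothetical assumptions about what a pilot might log.

```python
from dataclasses import dataclass

@dataclass
class ShadowCase:
    """One triage encounter logged in shadow mode (hypothetical schema)."""
    case_id: str
    model_dx: str                 # model's suggestion, never shown to clinicians live
    clinician_dx: str             # clinician's independent diagnosis
    high_risk_truth: bool         # adjudicated afterward: truly high-risk?
    model_flagged_high_risk: bool # did the model escalate priority?

def concordance_rate(cases: list[ShadowCase]) -> float:
    """Fraction of cases where model and clinician diagnoses agree."""
    return sum(c.model_dx == c.clinician_dx for c in cases) / len(cases)

def false_negative_rate(cases: list[ShadowCase]) -> float:
    """Primary safety metric: share of true high-risk cases the model missed."""
    high_risk = [c for c in cases if c.high_risk_truth]
    if not high_risk:
        return 0.0
    return sum(not c.model_flagged_high_risk for c in high_risk) / len(high_risk)

# Hypothetical shadow-mode log entries for illustration
cases = [
    ShadowCase("a", "ACS", "ACS", True, True),
    ShadowCase("b", "GERD", "ACS", True, False),  # model missed a high-risk case
    ShadowCase("c", "GERD", "GERD", False, False),
    ShadowCase("d", "PE", "PE", True, True),
]
```

Because the model's output is logged but never acted on, metrics like these can be reviewed weekly without any patient exposure.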

Vendor due‑diligence questions

  • What is the model’s training data provenance? Can you document sources and de‑identification practices?
  • Do you provide calibration metrics and confidence scores tied to model outputs?
  • What known failure modes exist, and how were they stress‑tested (red‑team, adversarial testing)?
  • Can the system run within our secure environment (on‑prem or VPC) to satisfy HIPAA/BAA requirements?
  • What audit logs are available for inputs, outputs, timestamps and clinician interactions?
  • Do you commit to post‑market monitoring and timely model updates with documented change control?

Recommended KPIs to track in pilots

  • Sensitivity and specificity for pre‑defined high‑risk conditions.
  • False negative rate for life‑threatening diagnoses (primary safety metric).
  • Clinician override/acceptance rate and reasons for overrides.
  • Time‑to‑decision or time‑to‑treatment (e.g., door‑to‑ECG).
  • Operational impact: ED throughput, left‑without‑being‑seen (LWBS) rates.
  • Patient‑facing metrics: satisfaction and trust measures for AI‑augmented care.

Regulatory, privacy and liability considerations

Using patient data with third‑party AI requires HIPAA compliance and likely BAAs with vendors that handle protected health information. The FDA and other regulators are actively shaping approaches to AI/ML in medical devices and clinical decision support; expect evolving guidance and the need for structured post‑market surveillance. Legal liability for AI‑augmented decisions is unresolved in many jurisdictions — institutions should work with counsel to define responsibility and informed‑consent language for pilots.

Risks and counterpoints

Several hard questions remain. Models trained and validated at one institution may underperform elsewhere because documentation language and patient populations differ (distributional shift). Performance across demographic subgroups is often unreported; unknown biases could worsen health disparities. Clinician adoption is not guaranteed — a helpful model that interrupts workflows or produces low‑value alerts will be ignored. Finally, retrospective diagnostic parity does not automatically translate to better patient outcomes; only prospective trials can show that.

Practical next steps for leaders

If you run a hospital or health system, begin with a controlled plan: commission a shadow‑mode pilot, require multimodal validation, build an audit and feedback loop, budget for governance, and involve legal and compliance from day one. Make patients and clinicians part of the design to preserve trust; where decisions are life‑critical, maintain clinician final authority.

LLMs and AI agents are no longer theoretical curiosities for healthcare: they are approaching clinically meaningful performance for triage and decision support. That creates a narrow window for pragmatic pilots that can safely explore value while building the governance frameworks that regulators, clinicians and patients will soon demand.

Questions for leaders

Can current LLMs replace clinicians in the ER?
No. The study shows promising diagnostic parity on a narrow, retrospective, text‑only task. Prospective, multimodal validation and accountability frameworks are required before any autonomous use.

Should hospitals begin pilot deployments now?
Yes — but only controlled pilots in shadow mode or with human‑in‑the‑loop designs, clear success metrics, and strong logging and governance.

What should vendors be required to disclose?
Training data provenance, calibration metrics, known failure modes, red‑team results, audit logs, and commitments to post‑market monitoring and secure deployments.

For C‑suite leaders evaluating AI for business and clinical automation: act deliberately and fast. Commission a shadow pilot, require vendor transparency, budget for governance, and keep clinicians and patients at the center of design. The technology is moving from lab to clinic — your governance and pilot choices will determine whether it becomes a safe productivity multiplier or a costly liability.