The study evaluates safety and accuracy in emergency medicine

In a recently published study, investigators developed and evaluated the accuracy, safety, and usability of large language model (LLM)-generated emergency medicine (EM) handover notes, aiming to reduce the burden of medical documentation without compromising patient safety.

The key role of handovers in health care

Handovers are critical communication points in health care and a known source of medical errors. Consequently, organizations such as the Joint Commission and the Accreditation Council for Graduate Medical Education (ACGME) advocate standardized handover processes to improve safety.

Transferring patients from the emergency department to inpatient (IP) care presents unique challenges, including medical complexity, time constraints, and diagnostic uncertainty; yet these handovers are poorly standardized and inconsistently implemented. Electronic health record (EHR)-based tools have attempted to overcome these limitations, but they remain understudied in emergency settings.

LLMs have emerged as potential tools for streamlining clinical documentation. However, concerns about factual inconsistencies require further investigation to ensure safety and reliability in critical clinical workflows.

About the study

This study was conducted at an 840-bed urban academic hospital in New York City. EHR data were analyzed for 1,600 EM patient encounters that resulted in hospital admission between April and September 2023. Because an updated EM-to-IP handover system was implemented in April 2023, only encounters after that date were included.

Retrospective data were used under a waiver of informed consent, as the study posed minimal risk to patients. Handover notes were generated using a combination of a fine-tuned LLM and rule-based heuristics, in adherence with standard reporting guidelines.

The handover note template closely resembled the existing manual structure, integrating rule-based components such as laboratory tests and vital signs with LLM-generated components such as the history of present illness and differential diagnoses. Informatics experts and EM physicians curated data to fine-tune the LLM and improve its output quality, excluding racial attributes to avoid bias.

Two models were used for content selection and abstractive summarization: the Robustly Optimized BERT Pretraining Approach (RoBERTa) and Meta AI's Llama-2 large language model. Data processing included heuristic prioritization and severity modeling to address potential model limitations.

Researchers evaluated automated metrics such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bidirectional Encoder Representations from Transformers Score (BERTScore), as well as a novel evaluation framework focused on patient safety. A clinical review of 50 handover notes assessed completeness, readability, and safety to ensure rigorous validation.

Research results

Among the 1,600 patient cases included in the analysis, the mean age was 59.8 years (standard deviation, 18.9 years), and 52% of patients were women. Automated assessment metrics showed that LLM-generated summaries were superior to physician-written summaries in several respects.

ROUGE-2 scores were significantly higher for LLM-generated summaries than for physician summaries, at 0.322 versus 0.088. Similarly, BERT precision scores were higher, at 0.859 versus 0.796. The Source Chunking Approach for Large-scale Inconsistency Evaluation (SCALE) likewise favored LLM-generated summaries, at 0.691 versus 0.456. These results indicate that LLM-generated summaries showed greater lexical similarity, greater fidelity to source notes, and more detailed content than their human-written counterparts.
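To illustrate what a ROUGE-2 score measures, the sketch below computes word-bigram overlap between a candidate summary and a reference summary. This is a minimal, self-contained illustration, not the study's actual evaluation pipeline, which would typically use a standard scoring library with tokenization and stemming options.

```python
from collections import Counter

def rouge2_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-2 F1: overlap of word bigrams between two texts."""
    def bigrams(text):
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())  # shared bigram count
    if not cand or not ref:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example texts (not from the study data)
print(rouge2_f1("patient admitted with chest pain",
                "patient admitted with chest pain and dyspnea"))  # 0.8
```

Higher scores indicate greater lexical similarity to the reference, which is why the LLM summaries' 0.322 versus the physicians' 0.088 reflects closer word-level overlap with the source notes.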

In clinical assessments, the quality of LLM-generated summaries was comparable to that of physician-written summaries, though slightly inferior on several dimensions. On a Likert scale of one to five, LLM-generated summaries scored lower for usefulness, completeness, accessibility, readability, accuracy, and patient safety. Despite these differences, the automated summaries were generally judged acceptable for clinical use, and none of the identified issues were considered life-threatening or a direct threat to patient safety.

When assessing worst-case scenarios, clinicians identified potential level 2 safety risks, including incompleteness and faulty logic, at rates of 8.7% and 7.3%, respectively, for LLM-generated summaries; no such risks were associated with clinician-written summaries. Hallucinations were rare in LLM-generated summaries, and the five identified cases received safety ratings of four or five, suggesting the safety risk was low or negligible. Overall, LLM-generated notes had a higher error rate (9.6%) than physician-written notes (2%), although these inaccuracies rarely had significant safety consequences.

Inter-rater reliability was calculated using intraclass correlation coefficients (ICCs). The ICCs showed good agreement among the three expert raters for completeness, accessibility, correctness, and usefulness, at 0.79, 0.70, 0.76, and 0.74, respectively. Readability achieved moderate reliability, with an ICC of 0.59.
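The ICC quantifies how consistently multiple raters score the same items. The study does not specify which ICC variant was used; the sketch below assumes the common two-way mixed-effects, single-rater, consistency form (often written ICC(3,1)) as an illustration.

```python
def icc3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single rater.

    ratings: list of lists, one row per rated item (e.g. a handover
    note), one column per rater. Purely illustrative; real analyses
    would use a statistics package.
    """
    n = len(ratings)       # number of rated items
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    subj_means = [sum(row) / k for row in ratings]
    rater_means = [sum(ratings[i][j] for i in range(n)) / n
                   for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_rater = n * sum((m - grand) ** 2 for m in rater_means)
    ss_err = ss_total - ss_subj - ss_rater
    ms_subj = ss_subj / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)

# Hypothetical ratings: three notes, three raters in full agreement
print(icc3_1([[4, 4, 4], [5, 5, 5], [3, 3, 3]]))  # 1.0
```

Values near 1.0 indicate near-perfect agreement; the study's values of 0.70 to 0.79 fall in the range conventionally described as good agreement.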

Conclusions

In the current study, EM-to-IP handover notes were successfully generated using a combined fine-tuned LLM and rule-based approach within a user-developed template.

Traditional automated metrics suggested excellent LLM performance. However, manual clinical assessment showed that although most LLM-generated notes achieved promising quality ratings of four to five, they were generally inferior to physician-written notes. Identified errors, including incompleteness and faulty logic, occasionally posed a moderate safety risk, with fewer than 10% judged likely to cause serious problems, compared with none for physicians' notes.
