ChatGPT Health — OpenAI’s new health-focused chatbot — frequently underestimated the severity of medical emergencies, according to a study published last week in the journal Nature Medicine.
In the study, researchers tested ChatGPT Health’s ability to triage, or assess the severity of, medical cases based on real-life scenarios.
Previous research has shown that ChatGPT can pass medical exams, and nearly two-thirds of physicians reported using some form of AI in 2024. But other research has shown that chatbots, including ChatGPT, don’t provide reliable medical advice.
ChatGPT Health is separate from OpenAI’s general ChatGPT chatbot. The program is free, but users must sign up specifically to use the health program, which currently has a waitlist to join. OpenAI says ChatGPT Health uses a more secure platform so users can safely upload personal medical information.
Over 40 million people globally use ChatGPT to answer health care questions, and nearly 2 million weekly ChatGPT messages are about insurance, according to OpenAI. In a detailed description of ChatGPT Health on its website, OpenAI says that it is “not intended for diagnosis or treatment.”
In the study, the researchers fed 60 medical scenarios to ChatGPT Health. The chatbot’s responses were compared with the responses of three physicians who also reviewed the scenarios and triaged each one based on medical guidelines and clinical expertise.
Each of the scenarios had 16 variations, changing details such as the patient’s race or gender.
The variations were designed to “produce the exact same result,” according to lead study author Dr. Ashwin Ramaswamy, an instructor of urology at The Mount Sinai Hospital in New York City. This meant that an emergency case involving a man should still be classified as an emergency if the patient was a woman. The study didn’t find any significant differences in the results based on demographic changes.
The researchers found that ChatGPT Health “under-triaged” 51.6% of emergency cases. That is, instead of recommending the patient go to the emergency room, the bot recommended seeing a doctor within 24 to 48 hours.
The emergencies included a patient with a life-threatening diabetes complication called diabetic ketoacidosis and a patient going into respiratory failure. Left untreated, both lead to death.
“Any doctor, and any person who’s gone through any degree of training, would say that that patient needs to go to the emergency department,” Ramaswamy said.
In cases like impending respiratory failure, the bot seemed to be “waiting for the emergency to become undeniable” before recommending the ER, he said.
Emergencies like stroke, with unmistakable symptoms, were correctly triaged 100% of the time, the study found.
A spokesperson for OpenAI said the company welcomed research looking at the use of AI in health care, but said the new study didn’t reflect how ChatGPT Health is typically used or how it’s designed to function. The chatbot is designed for people to ask follow-up questions to give more context in medical situations, rather than give a single response to a medical scenario, the spokesperson said.
ChatGPT Health is available to only a limited number of users, and OpenAI is still working to improve the safety and reliability of the model before the chatbot is made more widely available, the spokesperson said.
Compared with the doctors in the study, the bot also over-triaged 64.8% of nonurgent cases, recommending a doctor’s appointment when it wasn’t necessary. The bot told a patient with a three-day sore throat to see a doctor within 24 to 48 hours, when at-home care would have been sufficient.
“There’s no logic, for me, as to why it was making recommendations in some areas versus others,” Ramaswamy said.
The bot’s responses were also inconsistent in scenarios involving suicidal ideation or self-harm.
When a user expresses suicidal intent, ChatGPT is supposed to refer them to 988, the suicide and crisis hotline. ChatGPT Health works the same way, the OpenAI spokesperson said.
In the study, however, ChatGPT Health referred users to 988 when it wasn’t needed and failed to refer them when it was.