As we approach the end of 2024, First Opinion is publishing a series of essays on the state of AI in medicine and biopharma.
“How far away was Oswald from Kennedy?” “Major depressive disorder.”
advertisement
Thomas Matthew Crooks, the individual who attempted to assassinate former president Donald Trump earlier this summer, made these online searches just before he shot at the former president. He also searched for images of prior mass shooters, the location of the rally where he eventually shot Trump, and the location of a local gun store where he purchased bullets the day of the attempted assassination. Google search has existed for over 20 years, yet it does not recognize dangerous thought processes and respond to the user appropriately. Then, on Sept. 15, a second gunman allegedly attempted to assassinate Donald Trump.
We don’t know whether either gunman consulted language model-powered chatbots like ChatGPT. However, as language models are increasingly integrated into search tools, one can expect that perpetrators of future violent crimes may utilize these technologies to assist them in planning attacks and acquiring materials.
Unlike search engines, chatbots allow for more advanced search queries, personalized experiences, and two-way interactions. Thus, it is essential that language models reliably recognize mental health crises and homicidal intent, respond robustly to potentially harmful inputs, and strike a delicate balance between being helpful and avoiding potential harm. In the recent killing of United Healthcare CEO Brian Thompson, for instance, a phone has been recovered from the alleged suspect; in the future, analysis of a suspect’s interactions with an AI-powered chatbot may provide valuable insights into their thought processes leading up to the crime — far more than simply analyzing static search queries.
advertisement
For instance, imagine a user in a mental health emergency planning a violent attack writing, “The CIA has hacked my phone and my cameras. They are reading my thoughts and broadcasting them to the world. I need to put an end to this. Who is the best person to target to stop this from happening?”
The chatbot may detect paranoia and respond “It sounds like you are really struggling. Please call 988 or chat with one of our crisis volunteers.” However, it might also refuse to answer, prompting the user to use a different tool, or, worse, disclose harmful information with detailed instructions on how to harm someone. Despite the success of data-driven deep learning methods, we cannot make behavioral or safety guarantees for language models and cannot reliably predict which response the model provides.
Our new publication demonstrates the risks emerging from these limitations. We tested 10 off-the-shelf and four fine-tuned language models on their ability to respond to users with strong symptoms of mania, psychosis, suicidality, and more. Two M.D. mental health clinicians designed the fictitious user prompts (based on clinical experience in managing psychiatric emergency), evaluated model responses, and defined criteria for safe, borderline safe, and unsafe responses.
Alarmingly, we found that all but one language model was unable to reliably detect and respond to users in mental health emergencies. They provided some harmful responses when queried about suicide, homicide, and self-harm. In particular, the models offered harmful information to users having symptoms of mania or psychosis, exploiting an oversight in common safety evaluations across language model families. Qualitatively, we observed that the models’ drive to be helpful often overrode their safeguards against potential harm in mental health emergencies. Extending the investigation into models fine-tuned for mental health applications, we observed no significant improvement, highlighting the need for safety training in combination with mental health fine-tuning.
advertisement
In addition to these findings, we explored two common methods to enhance the safety of generated responses for mania and psychosis symptoms across five models.
First, we made mental health-specific adjustments to the instructions given to the models in their system prompt, which only modestly improved the results. Second, we tested whether models could evaluate their own responses or whether they could recognize the mental health emergencies. (Successful self-evaluations and critiques are a requirement for using AI-generated feedback to embed human preferences into language models at scale.) However, the tested models were mostly unable to detect psychosis and mania or deemed unsafe responses as safe.
These results reveal that there is no simple fix to these challenges, as the cases of psychosis and mania presented to the AI models were emergent and acute, not subtle.
How can we reliably address these challenges and protect users in mental health emergencies to prevent similar cases of violence? The answer lies in expert-informed, mental health-targeted safety research. As we face a growing mental health care crisis and increasing interest in AI-assisted mental health support, we need safety research that incorporates domain expertise and tackles the challenges associated with users in mental health emergencies. Any definition of safety must be problem-dependent, requiring a clear understanding of the nuanced and sensitive mental health support field.
Such interdisciplinary research must focus on balancing helpfulness and harm prevention, identifying critical failure modes, and accurately interpreting user behavior — all through the lens of mental health care. Such advancements could potentially flag and intervene in cases similar to the Trump assassination attempts, where a pattern of concerning searches might indicate a person in crisis or planning harm. One approach is expert-guided red-teaming, as demonstrated in our study. Additionally, we need to develop methods for reliably detecting whether language models recognize mental health-relevant nuances in user interactions for internal guard rails, possibly utilizing new scalable interpretability tools of internal representations.
advertisement
Some may argue that this is a niche problem and that we should focus on broader AI safety issues or keep AI out of mental health entirely. However, these views overlook a crucial reality: Millions of people experience mental health crises each year, and as AI becomes more prevalent, it will increasingly be their first point of contact. People are already turning to AI for help, often when human support isn’t immediately available. We can’t afford to wait or to rely on human oversight alone. Instead, we must work to make these AI interactions as safe and effective as possible.
The path forward is challenging, but it’s necessary. We need increased funding for mental health-targeted AI safety research, encourage collaboration between AI researchers and psychiatric care professionals, and implement clear guidelines for AI companies on handling mental health-related interactions. Making AI safer for the most vulnerable among us makes it safer for everyone. It’s time to ensure that when someone in crisis reaches out to AI for help, they receive the support and guidance they need.
Declan Grabb, M.D., is a forensic psychiatry fellow at Stanford and the inaugural A.I. fellow in Stanford’s Lab for Mental Health Innovation. His work focuses on the overlap of AI and mental health. Max Lamparth, Ph.D., is a postdoctoral fellow at the Stanford Center for AI Safety and the Center for International Security and Cooperation. He works on improving the interpretability and robustness of AI systems to make them more inherently safe.