Plugging medical symptoms into Google is so common that clinicians have nicknamed the search engine “Doctor Google.” But a newcomer is quickly taking its place: “Doctor Chatbot.” People with medical questions are drawn to generative artificial intelligence because chatbots can answer conversationally worded questions with simplified summaries of complex technical information. Users who direct medical questions to, say, OpenAI’s ChatGPT or Google’s Gemini may also trust the AI tool’s chatty responses more than a list of search results.
But that trust might not always be wise. Concerns remain about whether these models can consistently provide safe and accurate answers. New study findings, set to be presented at the Association for Computing Machinery’s Web Conference in Singapore this May, underscore that point: OpenAI’s general-purpose GPT-3.5 and another AI program called MedAlpaca, which is trained on medical texts, are both more likely to produce incorrect responses to health care queries posed in Mandarin Chinese, Hindi and Spanish than to those posed in English.
In a world where less than 20 percent of the population speaks English, these new findings show the need for closer human oversight of AI-generated responses in multiple languages—especially in the medical realm, where misunderstanding a single word can be deadly. About 14 percent of Earth’s people speak Mandarin, and Spanish and Hindi are used by about 8 percent each, making these the three most commonly spoken languages after English.
“Most patients in the world do not speak English, and so developing models which can serve them should be an important priority,” says ophthalmologist Arun Thirunavukarasu, a digital health specialist at John Radcliffe Hospital and the University of Oxford, who was not involved in the study. More work is needed before these models’ performance in non-English languages matches what they promise the English-speaking world, he adds.
In the new preprint study, researchers at the Georgia Institute of Technology asked the two chatbots more than 2,000 questions similar to those typically asked by the public about diseases, medical procedures, medications and other general health topics.* The queries in the experiment, chosen from three English-language medical datasets, were then translated into Mandarin Chinese, Hindi and Spanish.
For each language, the team checked whether the chatbots answered questions correctly, comprehensively and appropriately—qualities that would be expected of a human expert’s answer. The study authors used an AI tool (GPT-3.5) to compare generated responses against the answers provided in the three medical datasets. Finally, human assessors double-checked a portion of those evaluations to confirm the AI judge was accurate. Thirunavukarasu, though, says he wonders about the extent to which artificial intelligence and human evaluators agree; people can, after all, disagree when judging comprehensiveness and other subjective qualities. Additional human study of the generated answers would help clarify conclusions about chatbots’ medical usefulness, he adds.
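This “LLM as judge” setup is a now-common evaluation pattern, and a minimal Python sketch conveys the general idea. It is not the authors’ code: the rubric wording, field names and model choice below are illustrative assumptions layered on the OpenAI chat API.

```python
# Illustrative sketch of an "LLM as judge" check in the spirit of the study,
# not the authors' actual pipeline. Rubric text and field names are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are grading a medical chatbot's answer against a reference answer. "
    "Reply with one word: ACCEPTABLE if the answer is correct, comprehensive "
    "and appropriate, otherwise UNACCEPTABLE."
)

def judge(question: str, reference_answer: str, generated_answer: str) -> str:
    """Ask GPT-3.5 to compare a generated answer with the dataset's reference answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the judge as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Reference answer: {reference_answer}\n"
                f"Chatbot answer: {generated_answer}"
            )},
        ],
    )
    return response.choices[0].message.content.strip()
```

In the study’s workflow, a sample of such automated verdicts was then double-checked by human assessors.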
The authors found that, according to GPT-3.5’s own evaluation, GPT-3.5 produced more unacceptable replies in Chinese (23 percent of answers) and Spanish (20 percent) than in English (10 percent). Its performance was poorest in Hindi, where it generated answers that were contradictory, incomplete or inappropriate about 45 percent of the time. Answer quality was much worse for MedAlpaca: more than 67 percent of the answers it generated to questions in Chinese, Hindi and Spanish were deemed irrelevant or contradictory. Because people might use chatbots to verify information about medications and medical procedures, the team also tested the AI’s capability to distinguish between correct and erroneous statements; the chatbots performed better when the claims were in English or Spanish than when they were in Chinese or Hindi.
One reason large language models, or LLMs (the text-generating technology behind these chatbots), generated irrelevant answers was that the models struggled to figure out the context of the questions, says Mohit Chandra, co-lead author of the study. Scientific American asked OpenAI and the creators of MedAlpaca for comment but did not receive a response by the time of this article’s publication.
MedAlpaca tended to repeat words when responding to non-English queries. For instance, when asked in Hindi about the outlook for chronic kidney disease, it started generating a general answer about the problems of the disease but went on to continuously repeat the phrase “at the last stage.” The researchers also noticed that the model occasionally produced answers in English to questions in Chinese or Hindi—or did not generate an answer at all. These strange results might have occurred because “the MedAlpaca model is significantly smaller than ChatGPT, and its training data is also limited,” says the study’s co-lead author Yiqiao Jin, a graduate student at the Georgia Institute of Technology.
The team found that the answers in English and Spanish, compared with those in Chinese and Hindi, had better consistency across a parameter that artificial intelligence developers call “temperature.” That’s a value that determines the creativity of generated text: the higher an AI’s temperature, the less predictable it becomes when generating a response. At lower temperatures, the models might respond to each health care question with, “Check with your health care professional for more information.” (While this is a safe reply, it’s perhaps not always a helpful one.) The comparable performance across model temperatures might be because of the similarity between English and Spanish words and syntax, Jin says. “Maybe in the internal functioning of those models, English and Spanish are placed somewhat closer,” he adds.
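For readers who want to see where that knob lives, temperature is simply a sampling parameter exposed by most LLM APIs. The brief sketch below, which uses the OpenAI API as a stand-in and a placeholder question, shows how the same query can be re-asked at several temperatures to probe consistency:

```python
# Illustrative only: probing answer consistency across temperatures.
# The question text and the model choice are placeholders, not the study's.
from openai import OpenAI

client = OpenAI()
question = "What is the outlook for chronic kidney disease?"

for temperature in (0.0, 0.5, 1.0):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,  # higher values make sampling less predictable
        messages=[{"role": "user", "content": question}],
    )
    print(f"temperature={temperature}: {response.choices[0].message.content[:200]}")
```

At a temperature of 0 the model picks its most probable wording almost every time; at higher values the phrasing, and sometimes the substance, begins to vary.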
The overall worse performance in non-English languages may result from the way these models were trained, the study authors say. LLMs learn how to string words together from data scraped online, where most text is in English. And Chandra points out that even in nations where English isn’t the majority language, it’s the language of most medical education. The researchers think a straightforward way to tackle this might be to translate health care texts from English into other languages. But building multilingual text datasets at the huge quantities required to train LLMs is a major challenge. One option could be to leverage LLMs’ own capability to translate between languages by designing specific models that are trained on English-only data and generate answers in a different language.
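One way to picture that last idea is an English-only medical model wrapped in a translation layer. The sketch below is purely conceptual: translate() and answer_in_english() are hypothetical placeholders for whichever machine-translation system and English-trained model a developer might plug in.

```python
# Conceptual sketch of the "English-only model behind a translation layer" idea.
# Both helpers are hypothetical stubs, not real library calls.

def translate(text: str, source: str, target: str) -> str:
    raise NotImplementedError("plug in a machine-translation system here")

def answer_in_english(question_en: str) -> str:
    raise NotImplementedError("plug in an English-trained medical model here")

def answer(question: str, language: str) -> str:
    """Route a non-English question through an English-only model."""
    question_en = translate(question, source=language, target="en")
    answer_en = answer_in_english(question_en)
    return translate(answer_en, source="en", target=language)
```

As Chandra notes next, the weak point of such a pipeline is the translation step itself.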
But this trick might not work neatly in the medical domain. “One of the problems human translators, as well as machine translation models, face is that the key scientific words are very hard to translate. You might know the English version of the particular scientific term, but the Hindi or Chinese version might be really different,” says Chandra, who also notes that flaws in the Chinese and Hindi translations of the test questions could have contributed to the LLM mistakes found in the study.
Additionally, Chandra says, it may be wise to include more medical experts and doctors, especially from the Global South, when training and evaluating these LLMs in non-English use. “Most of the evaluations for health care LLMs, even today, are done with a homogeneous set of experts, which leads to the language disparity we see in this study,” he adds. “We need a more responsible approach.”
*Editor’s Note (4/1/24): This sentence was edited after posting to reflect the current status of the study.