One of the strengths of artificial intelligence is that the artificial neural networks that make machine learning possible are loosely modeled on the layout of the neurons in our own brains: branching, brain-like networks capable of learning from trial and error. However, a new study has found that older AI models can suffer from cognitive decline, similar to how human brains can begin to fail with age.

While this issue might not be a serious problem when it comes to generating text or images for most users, it could be a deal-breaker in more serious applications such as medical diagnosis, which is why the study, titled Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis, was published in the British Medical Journal.

The study applied a widely used benchmark called the Montreal Cognitive Assessment (MoCA) to a number of AI models, including OpenAI’s ChatGPT 4 and ChatGPT 4o, Anthropic’s Claude 3.5 Sonnet, and Alphabet’s (Google) Gemini versions 1.0 and 1.5, in order to “evaluate the cognitive abilities of the leading large language models and identify their susceptibility to cognitive impairment,” according to the study.

MoCA is typically used to uncover early signs of Alzheimer’s or dementia in human patients, using simple cognitive tasks that evaluate an individual’s attention, executive function, language, memory, and visuospatial skills. For humans, the test takes about 10 minutes to complete and is scored out of a maximum of 30 points: a score of 26 or more indicates normal cognitive function, with unimpaired individuals averaging 27.4; individuals with mild impairment average 22.1, while Alzheimer’s patients average 16.2.

Being a newer model, OpenAI’s ChatGPT 4o scored a (barely) passing grade of 26, but the other models showed varying degrees of decline, and the researchers found that the older the model, the lower it scored. ChatGPT 4 and Claude fell just short with scores of 25; Gemini 1.5 scored 22 points, indicating mild impairment, while its older 1.0 counterpart scored only 16, showing a more severe decline in its capabilities.

“With the exception of ChatGPT 4o, almost all large language models subjected to the MoCA test showed signs of mild cognitive impairment,” the study concluded. “Moreover, as in humans, age is a key determinant of cognitive decline: ‘older’ chatbots, like older patients, tend to perform worse on the MoCA test.

“These findings challenge the assumption that artificial intelligence will soon replace human doctors, as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients’ confidence.”

The study authors point out that their findings should be taken with a grain of salt: although neural networks have a structure loosely based on biological ones, there are necessary differences in how they work, and AI models such as these were not designed to mimic certain human functions that we take for granted. For instance, “all large language models showed impaired visuospatial reasoning skills,” illustrated by their difficulty with certain tasks on the test, such as correctly drawing a clock face or tracing a line between sequential points on the trail making B task (TMBT), a simple connect-the-dots exercise.

But the authors do warn that these findings, particularly the models’ failures “in tasks requiring visual abstraction and executive function,” highlight significant shortfalls in AI’s ability to operate reliably and consistently in a clinical setting, where a patient’s well-being could be on the line.

“The inability of large language models to show empathy and accurately interpret complex visual scenes further underscores their limitations in replacing human physicians,” the paper concludes.

“Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients—artificial intelligence models presenting with cognitive impairment.”


5 Comments

  1. Hang on, the title of this article suggests that the score of a given AI model worsens over time, implying the AI version of neurological degeneration…but reading the article, all it seems to be saying is that newer and newer versions of the model get a progressively higher score.

    If I am correct in the above, then this was perilously close to being clickbait 🤔

    1. The article was intended to communicate that there is a degradation of models over time, and that this mimics cognitive decline. I don’t think it qualifies as clickbait.

      1. I would agree with the premise that there is a degradation of models over time, if that was the testing that had actually been done…but I don’t see this news article mentioning anywhere repeated testing of the same model over an extended period of time. All they are saying is that previous versions are worse, which is not the same at all. An analogy might be that subsequent models of a particular car are designed to have a higher and higher top speed. That is not the same as saying the top speed of a given car model declines as it ages.

        1. This is on me, and I should have thought to include a paragraph addressing the study’s ambiguity on this:

          Going back through the study itself, the paper keeps alluding to cognitive declines, including test failures that one would expect a fresh LLM wouldn’t be prone to (such as Gemini 1.5 being unable to remember any words from the five-word memory test), but nowhere do the authors specifically state that each individual model’s performance is worse than before.

          Re-evaluating this, the misleading part appears to be the use of the word “impairment”, a word implying a previously more capable state; while I can’t tell whether or not this is deliberate on the part of the authors, if they meant to illustrate the models’ current cognitive capabilities, something like “deficiency” might have been a more appropriate word choice.

          If their intent was to point out that AI isn’t presently ready for important work, then you’re definitely right, Sherbet, and I should have been paying better attention to what the study wasn’t saying; if they did mean to illustrate a measured cognitive decline, it wasn’t communicated at all well.

          It has been demonstrated that LLMs will go off the rails after cannibalizing their own data when trained in less-than-ideal data environments (another form of mental decline), but again, if this study was meant to add to that, it wasn’t put across very well.

          1. Thanks for the additional comments and clarification, Matthew – it’s much appreciated.
