In a case that seems for all the world like a silicon-based ouroboros, the unprecedented proliferation of AI-generated material on the internet could rapidly degrade the quality of the output of AI that trains itself on this synthetic content, an effect dubbed ‘model collapse’ by the researchers studying the phenomenon. In such a scenario, chatbots that once produced coherent text could see their discourse devolve into incoherent babble.
To understand why this is a problem for generative AI, it is important to understand how programs like ChatGPT work: although it might be a bit reductive to describe them this way, they essentially make an educated guess, within the context of whatever the program has been asked, as to which word should follow the previous one when constructing a sentence.
Needless to say, these “guesses” are extremely well-informed ones that rely on a very convincing trick of mathematics to generate content as cogent as what these programs have produced thus far: using ChatGPT as an example, each word in this large language model’s (LLM) 50,257-entry vocabulary has a series of numbers assigned to it representing its association with every other word in that expansive vocabulary, numbers that rank the statistical likelihood of that word following any other given word.
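To make that idea a little more concrete, here is a minimal sketch in Python, using a made-up four-word vocabulary and invented scores rather than anything from ChatGPT itself, of how a set of numbers attached to one word can be converted into a ranking of which word is likely to come next.

```python
import math

# Hypothetical "weights" for the word "the": a score against every other word
# in a toy vocabulary, where a higher score means "more likely to come next".
scores_after_the = {"cat": 3.1, "mat": 2.4, "sat": -1.5, "the": -2.0}

def to_probabilities(scores: dict) -> dict:
    """Convert raw scores into probabilities that sum to 1 (a softmax)."""
    total = sum(math.exp(s) for s in scores.values())
    return {word: math.exp(s) / total for word, s in scores.items()}

for word, p in sorted(to_probabilities(scores_after_the).items(),
                      key=lambda item: -item[1]):
    print(f"the -> {word}: {p:.3f}")  # "cat" ranks highest, then "mat"
```

In a real LLM these scores are computed from the full context of the conversation rather than looked up in a fixed table, and the vocabulary runs to roughly 50,000 entries rather than four.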
These associations, or “weights”, were calculated in a months-long process in which the LLM’s AI was let loose on a large database called Common Crawl, composed of hundreds of billions of bytes of text data (over 570 gigabytes’ worth): a massive snapshot of the internet representing billions of pages of human-generated text that Wikipedia describes as “a conglomerate of copyrighted articles, internet posts, web pages, and books scraped from 60 million domains over a period of 12 years.” The training data also included over 500 million books that had been digitized for online access, chat forums such as Reddit, and Wikipedia itself, amongst numerous other sources. The dataset is large enough that it is estimated an ordinary human would need over 300 years to read it in its entirety.
ChatGPT’s analysis of this database, conducted over the course of a number of months on a series of supercomputers running in parallel, produced over 175 billion of these associations between the roughly 50,000 words in its vocabulary, allowing the LLM to guess, with a high degree of accuracy, which word should come next within the confines of the context set out by the user, essentially turning what would have been alphabet soup into coherent sentences that make sense to a human being reading the output.
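For readers who want to see the whole loop at toy scale, the sketch below “trains” by counting which word follows which in a few human-written sentences, then generates text by repeatedly guessing the most common follower. It is a deliberately tiny stand-in, with an invented corpus, for what ChatGPT does with billions of learned weights; it is not a description of the program’s actual code.

```python
from collections import Counter, defaultdict

# A toy "training" pass: count which word follows which in a tiny, invented,
# human-written corpus. Real LLMs learn billions of weights over tokens
# instead of counting pairs, but the guess-the-next-word idea is the same.
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog . the dog chased the cat ."
).split()

follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def generate(start: str, length: int = 8) -> str:
    """Build a sentence by greedily picking the most common follower."""
    words = [start]
    for _ in range(length):
        followers = follow_counts[words[-1]]
        if not followers:
            break
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

print(generate("the"))  # "the cat sat on the cat sat on the"
```

On such a tiny corpus the greedy guesser quickly falls into a repetitive loop, which is part of why a real model needs an enormous and varied body of human writing, and a far more sophisticated notion of context, to sound natural.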
This leads us to the model collapse problem that generative AI appears to be headed for: the isolated database that ChatGPT was trained on was composed almost entirely of human-generated text, giving it a reliable source on which to base its massive network of linguistic associations. However, many generative AI models are currently learning from the open internet, an environment that is quickly becoming inundated with AI-generated content, meaning that in addition to the human-made works they analyze, they will also be learning from content that AI itself may have previously produced.
“Those mistakes will migrate into” future iterations of the programs, explained Oxford University machine-learning researcher Ilia Shumailov, in an interview with The Atlantic. “If you imagine this happening over and over again, you will amplify errors over time.”
The researchers studying the phenomenon describe model collapse as “a degenerative process whereby, over time, models forget” the human-based information they were initially trained on, with the encroaching AI-made content creating a feedback loop of sorts, somewhat like repeatedly making photocopies of a photocopy: with each iteration, a little more of the information that made the output appear human in the first place is lost.
The effect of model collapse serves to amplify biases and errors already exhibited by a generative model, resulting in the program’s output becoming incoherent to the humans reading or viewing the product. As an example, a study led by Shumailov found that a test model designed to display a grid of numbers, when left to feed on its own output, degraded to the point of displaying an array of blurry zeros after only 20 generations.
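The feedback loop behind that result is easy to simulate at toy scale. The sketch below, which is only an illustration of the general mechanism and not the researchers’ code, repeatedly fits a simple statistical model (a normal distribution) to data, samples synthetic data from that fit, and then refits on the synthetic samples; over many generations the estimated spread tends to drift toward zero, loosely analogous to the ‘blurry zeros’ the study observed.

```python
import random
import statistics

# A toy simulation of the feedback loop (an illustration only, not the
# study's code): fit a normal distribution to data, sample synthetic data
# from the fit, refit on the synthetic data, and repeat.
random.seed(0)

SAMPLES_PER_GENERATION = 20
GENERATIONS = 100

# Generation 0: "human" data drawn from a distribution with spread 1.0.
data = [random.gauss(0.0, 1.0) for _ in range(SAMPLES_PER_GENERATION)]

for generation in range(1, GENERATIONS + 1):
    mu = statistics.fmean(data)       # fitted model: mean...
    sigma = statistics.pstdev(data)   # ...and spread of the current data
    data = [random.gauss(mu, sigma) for _ in range(SAMPLES_PER_GENERATION)]
    if generation % 20 == 0:
        print(f"generation {generation:3d}: estimated spread = {sigma:.4f}")

# The estimated spread tends to shrink toward zero as the generations pass:
# each refit loses a little of the original variation, the statistical
# analogue of photocopying a photocopy.
```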
Another language model, tasked with discussing English Gothic architecture, went from being able to cogently discuss the subject to babbling incoherently about jackrabbits within nine generations. The study paper provided examples of this descent into what might appear to be mechanical madness: The model’s first response was “Revival architecture such as St. John’s Cathedral in London. The earliest surviving example of Perpendicular Revival architecture is found in the 18th @-@ century Church of Our Lady of Guernsey, which dates from the late 19th century.”
However, by the ninth iteration the output had devolved into “architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.”
While a failure such as this might seem like a mere inconvenience to users of large language models like ChatGPT or image generators such as Stable Diffusion, errors amplified by the model collapse process could result in bad legal advice being given to a user, or even potentially life-threatening incorrect medical diagnoses. But researchers are working on AI training methods that might mitigate the progressive growth of such errors, measures that may also address the potential privacy issues presented by many publicly-accessible generative AI models.
“Filtering is a whole research area right now,” according to University of Texas computer scientist Alex Dimakis, who is also a co-director of the National AI Institute for Foundations of Machine Learning. “And we see it has a huge impact on the quality of the models.”
Basically, an AI trained on a narrower database containing high-quality data would not only avoid the pitfall presented by the model collapse problem, but would also exclude low-quality information from human sources that might cloud the program’s decision-making process. Such carefully curated databases could also avoid privacy issues, for instance by excluding the personal identification of individual patients from medical datasets and focusing instead on the relevant conditions, their diagnoses, and acceptable treatment methods.
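As a rough illustration of that kind of curation, the sketch below (with hypothetical record fields, source labels, and example entries) keeps only entries from trusted human sources and strips direct patient identifiers before the data would ever reach a training pipeline.

```python
from dataclasses import dataclass

@dataclass
class MedicalRecord:
    patient_name: str   # direct identifier: never reaches the training set
    patient_id: str     # direct identifier: never reaches the training set
    condition: str
    diagnosis: str
    treatment: str
    source: str         # e.g. "peer_reviewed", "clinic_notes", "web_scrape"

def curate(records: list) -> list:
    """Keep trusted human sources only and drop identifying fields."""
    trusted_sources = {"peer_reviewed", "clinic_notes"}
    curated = []
    for record in records:
        if record.source not in trusted_sources:
            continue  # exclude web-scraped (possibly AI-generated) material
        curated.append({
            "condition": record.condition,
            "diagnosis": record.diagnosis,
            "treatment": record.treatment,
        })
    return curated

records = [
    MedicalRecord("Jane Doe", "P-001", "type 2 diabetes",
                  "HbA1c above threshold", "metformin", "clinic_notes"),
    MedicalRecord("John Roe", "P-002", "hypertension",
                  "elevated blood pressure", "lifestyle changes", "web_scrape"),
]
print(curate(records))  # only the de-identified clinic record survives
```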
Sounds a lot like humanity. If you’ve ever played the game of ‘Gossip’, then you know what I’m talking about. We see this kind of thing every day, with the way the world is connected now, national and international politics, the media, science, etc. A story takes on a life of its own, words aren’t heard correctly or in the correct context, or are deliberately manipulated. After all, the model for all this is the human brain. Filtering information applies to people too, not just AI—and most people are awful at filtering information. Even history has been so distorted as to be unrecognizable, depending on who you are, how you read it or tell it, and how it affects whatever group of people are/were affected.
Q Star is the stepping stone around it.
That has produced the very awesome power that scared those introduced to Q Star recently in a closed-door introduction. Musk is even wondering what it was that scared the participants of that introduction so badly.
I welcome it as one not so easily deceived.
Deceive the wicked it shall but not the innocent.