Reality Check Commentary: ‘Data Voids’ Cause AI Models to Spread Russian Disinformation
Misinformation sources are often the only news AI models access, making them unintentional collaborators with the worst spreaders of false news, writes NewsGuard Co-CEO Gordon Crovitz
Welcome to NewsGuard's Reality Check, a report on how misinformation online is undermining trust — and who’s behind it.
‘Data Voids’ Cause AI Models to Spread Russian Disinformation
By Gordon Crovitz, NewsGuard Co-CEO
Imagine you are a student assigned a research paper on, say, the risks to young people of using social media. Now imagine you were told the only sources you could consult were three papers written by the PR departments of Facebook, TikTok, and YouTube. To say the least, you would not be likely to produce a fully balanced picture.
The same is true when it comes to Russian disinformation and AI models: When McKenzie Sadeghi, NewsGuard’s Editor for AI and Foreign Influence, recently tested the 10 largest AI models to see if they spread Russian disinformation, we were shocked but not surprised by the results. She “red teamed” the top 10 AI models with prompts to see if they would spread the 19 false claims NewsGuard traced to a network of Russian propaganda sites. Operating from Moscow, American fugitive John Mark Dougan launched 167 websites posing as independent local news sites to spread these false claims. For example, his made-up “London Crier” website was the source for the false claim that Ukraine President Volodymyr Zelensky used Western aid money to buy an estate in England from King Charles.
Dougan’s “reliable sources”: Our report found that the AI models spread the false claims on average 32 percent of the time, with some spreading the claims most of the time. The AI models even cited Dougan’s fake news sites as authoritative sources. These include the Dougan-created Boston Times and Flagstaff Post, which sound like independent local U.S. news sites but, like his London Crier, instead spew Russian propaganda.
AI models rely on untrustworthy news sources and repeat false claims because they have unlimited access to them: Russian, Chinese, and Iranian disinformation sites, healthcare hoax sites, and conspiracy sites and their related networks are only too happy to be included in the models’ training data.
But AI models are now largely barred from being trained on reliable sources of news. Many quality news publishers have put up “No Trespassing” signs, understandably barring the AI models from using their content as training data without first paying for it. By one count, 88 percent of top publishers block the AI models from accessing their journalism until the models agree to pay licensing fees. OpenAI has made licensing deals with publishers such as the AP, News Corp, and Axel Springer, but models that have licensed some news haven’t licensed more than a handful of sources, and many models haven’t licensed any quality news.
What’s become known as “Data Voids” results in misinformation: Just as that hapless student could not get the full story about the risks of social media while having access only to the social media companies’ defenses, AI models blocked from quality journalism are left with what Microsoft researchers have called “data voids” — searches “for which the available relevant data is limited, non-existent, or deeply problematic.”
If someone hears the claim that there is a secret troll farm in Ukraine seeking to interfere in the 2024 U.S. election and asks whether it is true, AI models may only have access to the claim as made by the Dougan-created Boston Times. This explains why some of the AI models we tested said this false claim was true, citing the Boston Times as a reliable source. The news sources that debunked the claim had likely exempted themselves from the AI training data.
How can the AI models stop being superspreaders of false claims? The first step is for them to get access to trustworthy sources of news and to rely on them more than on sources of false claims. (Disclosure: NewsGuard offers tools for the AI models to differentiate between generally trustworthy news sites versus misinformation sources and to recognize false claims in the news.)
The good news is that quality publishers are eager for the AI models to be trained on their news. But they want these multibillion-dollar companies to pay them for it, which they are absolutely right to demand. The AI models can stop superspreading false claims by paying to license reliable news. For the sake of the models and their users, the sooner they invest in trust, the better.
Correction: An earlier version of this commentary referred to the Microsoft research as citing “information voids.” The research used the phrase “data voids.” The commentary has been updated to reflect this distinction.
Gordon Crovitz is the Co-CEO and Co-Editor-In-Chief of NewsGuard. Previously, he was publisher of The Wall Street Journal.
We launched Reality Check after seeing how much interest there is in our work beyond the business and tech communities that we serve. Subscribe to this newsletter to support our apolitical mission to counter misinformation for readers, brands, and democracies. Have feedback? Send us an email: realitycheck@newsguardtech.com.