How well can people actually detect AI-generated news? What strategies do large language models use to mimic journalistic writing? And what do domain experts think about the evolving threat landscape of generative AI disinformation?
These are the questions driving a multi-year, mixed-methods research program on AI-driven disinformation. Today, I’m making three core datasets from this work publicly available on Zenodo under a CC BY 4.0 license, so other researchers can build on them.
The Datasets
🔴 RogueGPT Stimulus Corpus
2,308 multilingual text fragments generated by 7 LLMs (including GPT-4o, Claude 3.5, Gemini 1.5, Llama 3.1, and Mistral Large), each mimicking the style of legitimate news outlets in English and German. The corpus spans 12 news domains and 6 manipulation strategies, from factual distortion to emotional framing.
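To make the corpus structure concrete, here is a minimal sketch of how such fragments might be represented and filtered. The field names (`text`, `model`, `language`, `domain`, `strategy`) are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record structure for one stimulus fragment;
# the real Zenodo files may use different field names.
@dataclass
class Fragment:
    text: str
    model: str      # e.g. "gpt-4o"
    language: str   # "en" or "de"
    domain: str     # one of the 12 news domains
    strategy: str   # one of the 6 manipulation strategies

corpus = [
    Fragment("…", "gpt-4o", "en", "politics", "emotional_framing"),
    Fragment("…", "claude-3.5", "de", "health", "factual_distortion"),
    Fragment("…", "gpt-4o", "de", "politics", "factual_distortion"),
]

# Select German fragments that use factual distortion
subset = [f for f in corpus
          if f.language == "de" and f.strategy == "factual_distortion"]
print(len(subset))  # 2
```

Slicing along these axes (model × language × domain × strategy) is how the experimental conditions in the papers are constructed.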
This dataset powers the experimental stimuli behind two peer-reviewed papers presented at The Web Conference 2026 (WWW ’26).
🔗 Access on Zenodo | GitHub
🔵 JudgeGPT Human Perception Data
2,438 dual-axis perception judgments from 504 participants who evaluated AI-generated news fragments on both credibility and emotional impact. The data reveals how different LLM writing strategies affect human trust, and which manipulation techniques are hardest to detect.
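A minimal sketch of the kind of analysis this data supports: mean credibility per manipulation strategy, where a higher mean means participants found that strategy harder to flag as AI-generated. The judgments and column layout below are invented toy data, not the released schema:

```python
from collections import defaultdict

# Toy dual-axis judgments: (strategy, credibility 1-7, emotional impact 1-7).
# Real JudgeGPT records will differ in naming and scale.
judgments = [
    ("emotional_framing", 5, 6),
    ("emotional_framing", 6, 5),
    ("factual_distortion", 3, 2),
    ("factual_distortion", 4, 3),
]

totals = defaultdict(lambda: [0.0, 0])  # strategy -> [credibility sum, count]
for strategy, credibility, _impact in judgments:
    totals[strategy][0] += credibility
    totals[strategy][1] += 1

# Higher mean credibility = harder for participants to detect
for strategy, (total, count) in totals.items():
    print(f"{strategy}: {total / count:.1f}")
```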
🔗 Access on Zenodo | GitHub
🟢 Expert Survey on GenAI Threats
21 domain experts from AI safety, journalism, policy, and academia shared their assessments of how generative AI is reshaping the disinformation landscape. The survey captures expert consensus on threat severity, countermeasure effectiveness, and emerging risks through 2030.
🔗 Access on Zenodo | GitHub
Why Open Data Matters
Disinformation research suffers from a reproducibility gap. Stimulus materials and perception data are often locked behind institutional agreements or simply not published. By releasing these datasets openly, we hope to:
- Enable replication of our experimental findings
- Support benchmarking of new AI detection tools against real human perception data
- Foster cross-disciplinary collaboration between AI researchers, social scientists, and policymakers
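One way the benchmarking use case could look in practice: cross-tabulating a detector's verdicts against human perception labels. The detector outputs and item labels below are invented for illustration, not drawn from the datasets:

```python
# Toy benchmark: compare a detector's AI/human verdicts against
# human perception labels (invented data, not the released dataset).
items = [
    # (detector_flagged_as_ai, majority_of_humans_rated_credible)
    (True, False),
    (True, True),   # detector flags it, yet humans found it credible
    (False, True),
    (True, False),
]

flagged = [credible for flagged_ai, credible in items if flagged_ai]
# Share of detector-flagged items that humans nonetheless found credible:
# high values mean the detector catches texts that still fool people.
rate = sum(flagged) / len(flagged)
print(f"{rate:.2f}")
```

Pairing automated detection scores with item-level human judgments like this is what the perception data makes possible.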
All three datasets are released under CC BY 4.0 with a Data Use Agreement that prohibits using them for LLM training or re-identification of participants.
Ongoing Research: We Want Your Expertise
The expert survey is still open, and we are actively looking for more participants. If you work in AI safety, content moderation, journalism, policy, or disinformation research, your perspective would be invaluable.
The survey takes approximately 15 minutes and covers your assessment of current and future GenAI threats, countermeasure strategies, and the evolving information ecosystem.
Results will be presented at ACM Web Science 2026 in Braunschweig, Germany.
Related Papers
These datasets underpin several peer-reviewed publications at WWW ’26:
- Collateral Effects of LLM-Generated Disinformation
- Eroding the Truth Default: How LLM-Generated News Fragments Undermine Human Judgment
- Verification Crisis: Expert Perspectives on Generative AI as a Disinformation Vector
If you find these datasets useful for your research, please cite the corresponding papers. And if you’re an expert in this space, join our ongoing survey to help map the threat landscape.
