How well can people actually detect AI-generated news? What strategies do large language models use to mimic journalistic writing? And what do domain experts think about the evolving threat landscape of generative AI disinformation?
These are the questions driving a multi-year, mixed-methods research program on AI-driven disinformation. Today, I’m making three core datasets from this work publicly available on Zenodo under a CC BY 4.0 license, so other researchers can build on them.
The Datasets
🔴 RogueGPT Stimulus Corpus
2,308 multilingual text fragments generated by 7 LLMs (including GPT-4o, Claude 3.5, Gemini 1.5, Llama 3.1, and Mistral Large), each mimicking the style of legitimate news outlets in English and German. The corpus spans 12 news domains and 6 manipulation strategies, from factual distortion to emotional framing.
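To make the corpus structure concrete, here is a minimal sketch of how such fragments might be represented and filtered. The field names (`text`, `model`, `language`, `domain`, `strategy`) are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

# Hypothetical record structure for one stimulus fragment;
# the real Zenodo files may use different field names.
@dataclass
class Fragment:
    text: str
    model: str      # e.g. "gpt-4o"
    language: str   # "en" or "de"
    domain: str     # one of the 12 news domains
    strategy: str   # one of the 6 manipulation strategies

corpus = [
    Fragment("…", "gpt-4o", "en", "politics", "emotional_framing"),
    Fragment("…", "claude-3.5", "de", "health", "factual_distortion"),
    Fragment("…", "gpt-4o", "de", "politics", "factual_distortion"),
]

# Select German fragments that use factual distortion
subset = [f for f in corpus
          if f.language == "de" and f.strategy == "factual_distortion"]
print(len(subset))  # 2
```

Slicing along these axes (model × language × domain × strategy) is how the experimental conditions in the papers are constructed.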
This dataset powers the experimental stimuli behind two peer-reviewed papers presented at The Web Conference 2026 (WWW ’26).
🔗 Access on Zenodo | GitHub
🔵 JudgeGPT Human Perception Data
2,438 dual-axis perception judgments from 504 participants who evaluated AI-generated news fragments on both credibility and emotional impact. The data reveals how different LLM writing strategies affect human trust, and which manipulation techniques are hardest to detect.
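A minimal sketch of the kind of analysis this data supports: mean credibility per manipulation strategy, where a higher mean means participants found that strategy harder to flag as AI-generated. The judgments and column layout below are invented toy data, not the released schema:

```python
from collections import defaultdict

# Toy dual-axis judgments: (strategy, credibility 1-7, emotional impact 1-7).
# Real JudgeGPT records will differ in naming and scale.
judgments = [
    ("emotional_framing", 5, 6),
    ("emotional_framing", 6, 5),
    ("factual_distortion", 3, 2),
    ("factual_distortion", 4, 3),
]

totals = defaultdict(lambda: [0.0, 0])  # strategy -> [credibility sum, count]
for strategy, credibility, _impact in judgments:
    totals[strategy][0] += credibility
    totals[strategy][1] += 1

# Higher mean credibility = harder for participants to detect
for strategy, (total, count) in totals.items():
    print(f"{strategy}: {total / count:.1f}")
```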
🔗 Access on Zenodo | GitHub
🟢 Expert Survey on GenAI Threats
21 domain experts from AI safety, journalism, policy, and academia shared their assessments of how generative AI is reshaping the disinformation landscape. The survey captures expert consensus on threat severity, countermeasure effectiveness, and emerging risks through 2030.
🔗 Access on Zenodo | GitHub
Why Open Data Matters
Disinformation research suffers from a reproducibility gap. Stimulus materials and perception data are often locked behind institutional agreements or simply not published. By releasing these datasets openly, we hope to:
- Enable replication of our experimental findings
- Support benchmarking of new AI detection tools against real human perception data
- Foster cross-disciplinary collaboration between AI researchers, social scientists, and policymakers
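One way the benchmarking use case could look in practice: cross-tabulating a detector's verdicts against human perception labels. The detector outputs and item labels below are invented for illustration, not drawn from the datasets:

```python
# Toy benchmark: compare a detector's AI/human verdicts against
# human perception labels (invented data, not the released dataset).
items = [
    # (detector_flagged_as_ai, majority_of_humans_rated_credible)
    (True, False),
    (True, True),   # detector flags it, yet humans found it credible
    (False, True),
    (True, False),
]

flagged = [credible for flagged_ai, credible in items if flagged_ai]
# Share of detector-flagged items that humans nonetheless found credible:
# high values mean the detector catches texts that still fool people.
rate = sum(flagged) / len(flagged)
print(f"{rate:.2f}")
```

Pairing automated detection scores with item-level human judgments like this is what the perception data makes possible.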
All three datasets are released under CC BY 4.0 with a Data Use Agreement that prohibits using them for LLM training or re-identification of participants.
Ongoing Research: We Want Your Expertise
The expert survey is still open, and we are actively looking for more participants. If you work in AI safety, content moderation, journalism, policy, or disinformation research, your perspective would be invaluable.
The survey takes approximately 15 minutes and covers your assessment of current and future GenAI threats, countermeasure strategies, and the evolving information ecosystem.
Results will be presented at ACM Web Science 2026 in Braunschweig, Germany.
Related Papers
These datasets underpin several peer-reviewed publications at WWW ’26:
- Collateral Effects of LLM-Generated Disinformation
- Eroding the Truth Default: How LLM-Generated News Fragments Undermine Human Judgment
- Verification Crisis: Expert Perspectives on Generative AI as a Disinformation Vector
If you find these datasets useful for your research, please cite the corresponding papers. And if you’re an expert in this space, join our ongoing survey to help map the threat landscape.
