How do you know whether a news source is trustworthy? That question is at the heart of my latest research project. Today, I am releasing CRED-1, an open multi-signal domain credibility dataset that assigns credibility scores to 2,672 domains known for publishing misinformation, conspiracy theories, or other unreliable content.
Why Another Dataset?
Existing source lists like OpenSources.co or Iffy.news provide valuable labels, but they are binary: a domain is either flagged or not. Real-world credibility is more nuanced. A satirical outlet is different from a state propaganda channel, and a site flagged by multiple independent lists is more concerning than one flagged by a single source.
CRED-1 addresses this by combining multiple openly licensed source lists with five computed enrichment signals to produce a continuous credibility score between 0.0 (least credible) and 1.0 (most credible).
Five Independent Signals
Each domain in the dataset is enriched with signals from independent sources:
- Source Category (50% weight) — Consensus label from OpenSources.co and Iffy.news (fake, unreliable, conspiracy, satire, mixed)
- Iffy.news Score (15%) — Credibility rating from the Iffy.news index, derived from Media Bias/Fact Check assessments
- Fact-Check Frequency (15%) — Number of fact-check claims found via the Google Fact Check Tools API. More claims suggest more scrutiny from fact-checkers.
- Web Popularity (5%) — Tranco Top-1M rank. Higher reach means higher potential impact.
- Domain Age (5%) — Registration date via RDAP/WHOIS. Freshly registered domains are a common indicator for disposable misinformation sites.
Additionally, Google Safe Browsing acts as an override: any domain flagged for malware or social engineering has its score capped at 0.05.
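To make the weighting concrete, here is a minimal sketch of how such a weighted combination with a Safe Browsing cap could look. It is an illustration under stated assumptions, not the pipeline's actual code: the names `WEIGHTS`, `SAFE_BROWSING_CAP`, and `credibility_score` are hypothetical, each signal is assumed to be already mapped to a sub-score in [0, 1] where higher means more credible, and because the weights listed above add up to 0.90, the sketch renormalizes by their sum.

```python
from typing import Mapping

# Weights as listed above. They sum to 0.90, so this sketch divides by the
# total; that renormalization is an assumption, not the published method.
WEIGHTS = {
    "category": 0.50,
    "iffy": 0.15,
    "fact_check": 0.15,
    "popularity": 0.05,
    "domain_age": 0.05,
}
SAFE_BROWSING_CAP = 0.05  # hard cap for domains flagged by Safe Browsing


def credibility_score(signals: Mapping[str, float],
                      flagged_by_safe_browsing: bool = False) -> float:
    """Combine per-signal sub-scores (assumed to lie in [0, 1]) into one score."""
    total_weight = sum(WEIGHTS.values())
    # Clamp each sub-score to [0, 1] before weighting it
    weighted = sum(
        WEIGHTS[name] * min(1.0, max(0.0, signals[name])) for name in WEIGHTS
    )
    score = weighted / total_weight
    if flagged_by_safe_browsing:
        # The override never raises a score, it only caps it
        score = min(score, SAFE_BROWSING_CAP)
    return round(score, 3)


example = {"category": 0.5, "iffy": 0.0, "fact_check": 0.0,
           "popularity": 0.0, "domain_age": 0.0}
print(credibility_score(example))
print(credibility_score(example, flagged_by_safe_browsing=True))
```

How each raw signal (a Tranco rank, a registration date, a claim count) is turned into a sub-score is left out here on purpose; the companion paper is the place to look for the actual mapping.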
Key Numbers
- 2,672 domains with credibility scores
- Score range: 0.000 to 0.962 (mean: 0.299)
- Category breakdown: 50% mixed, 22% unreliable, 18.4% fake, 5.7% conspiracy, 3.5% satire
- Tranco matches: 704 domains (26.3%), including 56 in the Top 10K
- Fact-check claims: 67 domains with 332 total claims
Designed for On-Device Deployment
One of the core design goals is privacy. The compact JSON format (117 KB) is small enough to ship inside a browser extension or mobile app. No server calls are needed, and no browsing history leaves the device. This makes CRED-1 well suited to pre-bunking: warning users before they engage with unreliable content, right at the delivery stage of the misinformation kill chain.
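A local lookup of this kind could be sketched as follows. The flat `{"domain": score}` layout, the sample domains, and the helper names `load_scores` and `lookup` are assumptions made for illustration; the real compact file may use a different schema, so check the repository before relying on this.

```python
import json
from typing import Dict, Optional

# Hypothetical compact layout: {"domain": score, ...}. The real CRED-1 schema
# may differ; these two entries are made-up sample data.
SAMPLE_JSON = '{"example-fake-news.test": 0.08, "satire.example": 0.41}'


def load_scores(raw: str) -> Dict[str, float]:
    """Parse the bundled JSON once, at extension or app start-up."""
    return {domain.lower(): float(score) for domain, score in json.loads(raw).items()}


def lookup(hostname: str, scores: Dict[str, float]) -> Optional[float]:
    """Return the score for a hostname or its closest listed parent domain."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the full hostname first, then drop leading labels one at a time.
    # The loop stops before the bare top-level label, so "example" alone never matches.
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in scores:
            return scores[candidate]
    return None  # not in the dataset: show no warning


scores = load_scores(SAMPLE_JSON)
print(lookup("www.satire.example", scores))
print(lookup("unlisted.example", scores))
```

Everything happens in memory against the bundled file, which is the point of the design: the visited hostname is never sent anywhere.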
I am already integrating CRED-1 into Trackless Links, my Safari extension for tracker removal, to add credibility warnings for flagged domains.
Fully Reproducible
The entire pipeline is open source. Two Python scripts rebuild the dataset from scratch using only the standard library (no external dependencies). You can reproduce every score, extend the dataset with new sources, or adapt the scoring model for your own research.
python3 pipeline/build_dataset.py # Fetch & merge sources
python3 pipeline/enrich_dataset.py # Enrich with 5 signals
Get the Dataset
- GitHub: github.com/aloth/cred-1
- Zenodo (DOI): 10.5281/zenodo.18769460
- License: CC BY 4.0 (free to use, even commercially, with attribution)
Paper
A companion paper describing the methodology, scoring model, and limitations has been submitted to Data in Brief (Elsevier) and is available as an arXiv preprint. If you use CRED-1 in your research, please cite:
@article{loth2026cred1,
  title   = {CRED-1: An Open Multi-Signal Domain Credibility Dataset for Automated Pre-Bunking of Online Misinformation},
  author  = {Loth, Alexander},
  journal = {Data in Brief},
  year    = {2026},
  doi     = {10.5281/zenodo.18769460}
}
What’s Next
CRED-1 v1.0 is a starting point. I am working on automated monthly updates via a CI/CD pipeline that detects new domains from upstream sources, enriches them incrementally, and publishes delta releases to Zenodo. If you have ideas for additional signals or data sources, open an issue or reach out.
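The "detect new domains" step of such an update could be as simple as a set difference between the upstream lists and the last published release. The sketch below is only an illustration of that idea: `normalize` and `find_new_domains` are hypothetical names, and the normalization rules (lowercasing, dropping a leading `www.`) are assumptions rather than what the pipeline does.

```python
from typing import Iterable, List


def normalize(domain: str) -> str:
    """Lowercase, trim whitespace and a trailing dot, drop a leading 'www.'."""
    d = domain.strip().lower().rstrip(".")
    return d[4:] if d.startswith("www.") else d


def find_new_domains(upstream: Iterable[str], published: Iterable[str]) -> List[str]:
    """Return upstream domains that are missing from the published release."""
    known = {normalize(d) for d in published}
    fresh = {normalize(d) for d in upstream} - known
    fresh.discard("")  # ignore blank lines in upstream lists
    return sorted(fresh)


# Only the domains returned here would need to be enriched for a delta release
print(find_new_domains(["WWW.New-Site.test", "old.example", " "], ["old.example"]))
```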
Fighting misinformation is a collective effort. I hope CRED-1 makes it a little easier.
