Update (March 23, 2026): Since this study was conducted, a ChatGPT API model update has led the API to cite Reddit in approximately 12% of queries - reversing the zero-citation pattern we observed. This suggests the API-level Reddit suppression may have been model-version-specific rather than a permanent platform policy. Perplexity and Claude API citation rates have not yet been re-tested. The original findings below reflect data collected January-February 2026.
If you're building on the ChatGPT, Claude, or Perplexity API, you're seeing a fundamentally different information environment than what end users see in the browser.
I discovered this while running a companion study to a larger AI citation project. I scraped 6,699 URLs cited by ChatGPT and Perplexity across 120 product recommendation queries via their APIs. Reddit occupied 38.3% of Google's Top-3 organic positions for these same queries. The number of Reddit citations returned by the APIs? Zero. Across every platform. Every query. Not one.
Then I ran the same queries through the web UIs.
| Platform | Reddit Citation Rate (API) | Reddit Citation Rate (Web UI) |
|---|---|---|
| Google AI Mode | N/A | 44% |
| Perplexity | 0% | 20% |
| ChatGPT | 0% | 17% |
| Claude | 0% | 0% |
Google AI Mode cited Reddit in 44% of queries through its web interface. Perplexity hit 20%. ChatGPT 17%. Through the APIs (Google AI Mode had no API to test): zero for all three.
Claude was the only platform that was consistent - it never cited Reddit through either channel.
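For anyone who wants to reproduce the rates in the table, here is a minimal sketch of how a per-query Reddit citation rate could be computed. It assumes you have already collected the cited URLs for each query. The `citations_by_query` structure and the sample URLs are hypothetical, and this is not the study's actual pipeline. A query counts as a hit if at least one cited URL is on reddit.com or a subdomain.

```python
from urllib.parse import urlparse


def is_reddit_url(url: str) -> bool:
    """Return True if the URL's host is reddit.com or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return host == "reddit.com" or host.endswith(".reddit.com")


def reddit_citation_rate(citations_by_query: dict[str, list[str]]) -> float:
    """Share of queries whose citation list contains at least one Reddit URL."""
    if not citations_by_query:
        return 0.0
    hits = sum(
        any(is_reddit_url(u) for u in urls)
        for urls in citations_by_query.values()
    )
    return hits / len(citations_by_query)


# Hypothetical sample: two queries, one of which cites a Reddit thread.
sample = {
    "best running shoes": [
        "https://www.reddit.com/r/running/comments/abc/",
        "https://example.com/review",
    ],
    "best coffee maker": ["https://example.org/guide"],
}
print(reddit_citation_rate(sample))  # 0.5
```

Note that this counts queries, not URLs, and it ignores short links such as redd.it. Either choice would change the numbers, so state which definition you use when comparing against the table.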
This isn't a minor discrepancy
For certain query types, the divergence is extreme. Validation queries - "is X worth it?", "should I buy Y?" - surfaced Reddit at these rates through web UIs:
- Google AI Mode: 71% of validation queries cited Reddit
- Perplexity: 46%
- ChatGPT: not broken out by intent in web UI data
So if you're building a product recommendation tool on the Perplexity API, my data suggests your users won't see Reddit threads in your output. But someone asking the same question on perplexity.ai will see Reddit nearly half the time for validation queries. These are two different products pretending to be the same one.
Reddit shapes AI recommendations even without citations
Here's where it gets weirder. Even though the APIs never cited Reddit in this study, the AI models appear to have learned from Reddit during training.
I collected 12,187 posts and 103,696 comments from 60 subreddits across 12 consumer product categories (headphones, running shoes, coffee makers, etc.). I ranked brands by Reddit community consensus - basically, which brands get upvoted most in recommendation threads.
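As a rough illustration of "community consensus", here is one way a brand ranking could be derived from scored comments. This is an assumption on my part for illustration, not the scoring used in the paper: it simply sums the upvote score of every comment that mentions a brand. The `rank_brands` helper, the sample comments, and the brand list are all hypothetical.

```python
import re
from collections import defaultdict


def rank_brands(comments, brands):
    """Rank brands by the summed score of comments that mention them."""
    totals = defaultdict(int)
    for text, score in comments:
        for brand in brands:
            # Whole-word, case-insensitive match on the brand name.
            if re.search(rf"\b{re.escape(brand)}\b", text, re.IGNORECASE):
                totals[brand] += score
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(brands, key=lambda b: (-totals[b], b))


# Hypothetical (text, score) pairs from a headphone recommendation thread.
comments = [
    ("Sony XM5 all the way", 120),
    ("Bose is comfier than Sony", 45),
    ("I returned my Bose", -3),
    ("Sennheiser for sound quality", 60),
]
print(rank_brands(comments, ["Bose", "Sennheiser", "Sony"]))
# ['Sony', 'Sennheiser', 'Bose']
```

This sketch does not look at sentiment, so a heavily upvoted complaint would still count in a brand's favor. A real pipeline would need to handle that, along with brand aliases and model names.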
Then I asked each AI platform "what's the best [product category]?" via API and compared their brand rankings to Reddit's.
The mean Spearman rank correlation was 0.554 across all 12 categories. Every single category reached statistical significance. The brands Reddit upvotes are the brands AI recommends - even when Reddit is never cited as a source.
This means Reddit functions as what I'm calling a "shadow corpus." Its influence is baked into the model weights during training and shapes the outputs, but it never appears in the citation list. If you only look at citations (which is what most AI SEO tools do), you'd conclude Reddit doesn't matter. The data says otherwise.
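For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation computed on ranks rather than raw values. Below is a dependency-free sketch that compares two brand rankings for a single category. The example ranks are made up for illustration and are not taken from the study's data.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical positions of the same five brands in each ranking.
reddit_rank = [1, 2, 3, 4, 5]
ai_rank = [2, 1, 3, 5, 4]
print(round(spearman(reddit_rank, ai_rank), 3))  # 0.8
```

A cross-category figure like the 0.554 above would then be the mean of the per-category values. The sketch returns only the coefficient. Significance testing needs a separate step, for example a permutation test or a library routine such as SciPy's `spearmanr`, which also returns a p-value.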
Three channels of Reddit influence
The data supports a three-channel model:
Channel 1 - Training data: Reddit content is absorbed into model weights during pre-training. It influences brand recommendations, product rankings, and general opinions without being cited. This channel is invisible to any citation-tracking methodology. Evidence: ρ = 0.554 correlation between Reddit consensus and AI outputs.
Channel 2 - Web UI citations: When users interact through browser interfaces, AI platforms actively retrieve and cite Reddit threads. This channel is visible to anyone using the consumer products. Evidence: 17-44% citation rates across web UIs (excluding Claude).
Channel 3 - API suppression: When developers access the same models through APIs, Reddit citations drop to exactly zero. This channel is what developers and API-based tools observe. Evidence: 0% Reddit citations across 6,699 URLs from 120 queries.
No single research methodology can observe all three channels simultaneously. If you only use the API, you see Channel 3. If you only scrape the web UI, you see Channel 2. If you only analyze outputs for influence patterns, you see Channel 1. Most studies only look at one.
Why this matters for developers
If you're building anything that relies on AI-generated citations or recommendations:
- Your API output is not representative of user experience. Whatever citation behavior you're seeing through the API can differ from what the web UI produces. Reddit is the most dramatic example, but it may not be the only divergence.
- Testing your app through the web UI is misleading. If you prototype by chatting with ChatGPT in a browser and then ship an API-based product, the citation behavior will differ. Test with the same interface your users will use.
- Reddit sentiment matters even if you can't see it. If your product category has active Reddit communities, the consensus opinions there are likely influencing what AI platforms recommend - regardless of whether Reddit appears in citations. A product that Reddit hates will face an uphill battle in AI recommendations, and you won't be able to trace why from citation data alone.
- Claude is the exception. It cited Reddit through neither channel. If Reddit-free recommendations matter to your use case, that's worth noting.
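If you want to check how far your own API output drifts from the consumer product, a simple starting point is to compare the sets of cited domains from each channel for the same query. The sketch below assumes you have already captured both URL lists yourself. The `channel_divergence` helper and the sample URLs are hypothetical, and Jaccard overlap is just one reasonable way to summarize the gap.

```python
from urllib.parse import urlparse


def cited_domains(urls):
    """Collect lowercase hostnames, dropping a leading 'www.'."""
    domains = set()
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if host:
            domains.add(host.removeprefix("www."))
    return domains


def channel_divergence(api_urls, web_urls):
    """Summarize which domains appear in only one channel, plus overlap."""
    api, web = cited_domains(api_urls), cited_domains(web_urls)
    union = api | web
    return {
        "api_only": sorted(api - web),
        "web_only": sorted(web - api),
        # Jaccard similarity: 1.0 means identical domain sets.
        "jaccard": len(api & web) / len(union) if union else 1.0,
    }


# Hypothetical citations for one query, captured from each channel.
api_urls = ["https://www.example.com/a", "https://example.org/b"]
web_urls = ["https://www.reddit.com/r/x", "https://example.com/c"]
result = channel_divergence(api_urls, web_urls)
print(result["api_only"], result["web_only"])
# ['example.org'] ['reddit.com']
```

Running this across a batch of queries and looking at which domains recur under `web_only` is a quick way to spot sources, Reddit or otherwise, that your API-based product never surfaces. It requires Python 3.9+ for `str.removeprefix`.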
Why the divergence exists
I can't say definitively why APIs suppress Reddit, but there are plausible explanations:
- Licensing and legal risk. Google has an explicit content licensing deal with Reddit. Surfacing Reddit content through an API that third parties build products on creates different licensing implications than showing it in Google's own UI.
- Content quality filtering. API outputs may have stricter content filtering to reduce liability for downstream applications.
- Retrieval architecture differences. The web UI may use a different retrieval pipeline with access to additional search indexes that include Reddit.
The December 2025 OpenAI documentation changes (removing language about ChatGPT-User respecting robots.txt) suggest that the boundary between "user-initiated browsing" and "API-mediated retrieval" is one these companies are actively thinking about - and drawing differently depending on the access channel.
Limitations
- The API data comes from 120 product recommendation queries. Other query types may show different patterns.
- Web UI citation scraping is inherently noisier than API extraction. I used browser automation to capture citations, but UI changes could affect accuracy.
- The training data influence analysis (Spearman correlation) shows association, not causation. Models may have learned similar patterns from sources other than Reddit.
- Reddit community consensus was measured at a point in time. As threads age and new ones appear, the correlation may shift.
- This data is from January-February 2026. Platform behavior changes frequently.
The paper
"Reddit Doesn't Get Cited (Through the API): Training Data Influence, Access-Channel Divergence, and the Shadow Corpus in AI Brand Recommendations"
Lee, A. (2026). Preprint v3.
DOI: 10.5281/zenodo.18679003
PDF: aiplusautomation.com/research/reddit-training-data-influence.pdf
The companion study on page-level citation predictors:
"Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior"
DOI: 10.5281/zenodo.18653093
Happy to discuss methodology or findings.