Reddit Doesn’t Get Cited (Through the API): Training Data Influence, Access-Channel Divergence, and the Shadow Corpus in AI Brand Recommendations
Anthony Lee — AI+Automation
Preprint — February 2026 (v3) | Not yet peer-reviewed
Abstract
AI chatbots functionally never cite Reddit—through their APIs. In a companion study of 6,699 URLs cited by ChatGPT and Perplexity across 120 product recommendation queries, we observed zero Reddit citations in our sample—despite Reddit occupying 38.3% of Google’s Top-3 organic positions for those same queries. This paper investigates Reddit’s influence on AI through two complementary analyses: a training data correlation study and a systematic comparison of Reddit citation behavior across API and web UI access channels.
For the training data analysis, we collected 12,187 posts and 103,696 comments from 60 subreddits spanning 12 consumer product categories and extracted brand mentions using an upvote-weighted scoring system. We then correlated Reddit’s brand consensus rankings against AI brand recommendation rankings derived from four major platforms (ChatGPT, Claude, Perplexity, and Gemini), each queried three times across 50 product recommendation queries. The correlation was strong, consistent, and statistically significant across every category tested. The mean Spearman rank correlation was ρ = .554 across all 12 consumer categories, with all 12 reaching significance at p < .05 and 8 of 12 surviving Bonferroni correction. Fisher’s combined probability test confirmed the aggregate effect (χ²(22) = 188.42, p < 10⁻⁸). Three robustness analyses (weighting sensitivity, independent brand extraction via NER, and partial correlation controlling for market popularity) confirmed the reliability of these findings.
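The statistics named in this paragraph can be illustrated with a minimal, standard-library Python sketch. It is not the study's code: the function names and the example scores are hypothetical, and the brand-scoring step is omitted because the exact upvote weighting is not given here. The sketch shows Spearman's ρ computed as the Pearson correlation of tie-averaged ranks, a Bonferroni check against α divided by the number of tests, and Fisher's statistic −2 Σ ln pᵢ, which is referred to a χ² distribution with twice as many degrees of freedom as p-values combined.

```python
import math


def average_ranks(values):
    """Rank values (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def bonferroni_survivors(p_values, alpha=0.05):
    """Flag each p-value that stays significant at alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def fisher_combined(p_values):
    """Fisher's method: returns (statistic, degrees of freedom, combined p)."""
    k = len(p_values)
    stat = -2.0 * sum(math.log(p) for p in p_values)
    # Chi-square survival function has a closed form when df = 2k is even.
    half = stat / 2.0
    tail = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return stat, 2 * k, tail


# Hypothetical scores for five brands in one category (illustration only).
reddit_scores = [310.0, 120.5, 98.0, 45.2, 12.0]  # assumed upvote-weighted scores
ai_scores = [9, 11, 6, 4, 1]                      # assumed AI mention counts
print(spearman_rho(reddit_scores, ai_scores))
```

In practice a statistics library would normally replace these helpers; the hand-rolled versions are used here only to keep the example self-contained.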
For the access-channel analysis, we built browser automation scrapers that collected citation data from the web UIs of four platforms (Google AI Mode, Perplexity, ChatGPT, and Claude) across 100 queries spanning 13 domains and five intent types, then compared these against API results for the same queries. The divergence was stark: APIs produced 0% Reddit citation rates across all platforms, while web UIs produced 44% (Google AI Mode), 20% (Perplexity), and 17% (ChatGPT). Validation queries—those seeking opinions and comparisons—surfaced Reddit at the highest rates (71% on Google AI Mode, 46% on Perplexity). Only Claude maintained zero Reddit citations across both access channels.
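A per-channel comparison of this kind can be sketched as follows. The sketch assumes that a platform's Reddit citation rate is the share of its responses citing at least one Reddit URL; that definition, the function names, the record layout, and the sample records are all illustrative assumptions, not the study's pipeline.

```python
from collections import defaultdict
from urllib.parse import urlparse


def is_reddit(url):
    """True if the URL's host is reddit.com, a subdomain of it, or redd.it."""
    host = (urlparse(url).hostname or "").lower()
    return host == "reddit.com" or host.endswith(".reddit.com") or host == "redd.it"


def reddit_citation_rates(responses):
    """Share of responses citing Reddit, keyed by (platform, channel).

    `responses` is an iterable of (platform, channel, cited_urls) tuples,
    one tuple per query response (an assumed record layout).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for platform, channel, urls in responses:
        key = (platform, channel)
        totals[key] += 1
        if any(is_reddit(u) for u in urls):
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical records for illustration only.
sample = [
    ("Perplexity", "web", ["https://www.reddit.com/r/headphones/", "https://example.com/a"]),
    ("Perplexity", "web", ["https://example.com/b"]),
    ("Perplexity", "api", ["https://example.com/c"]),
    ("Perplexity", "api", []),
]
rates = reddit_citation_rates(sample)
print(rates[("Perplexity", "web")], rates[("Perplexity", "api")])
```

Matching on the parsed hostname rather than a substring avoids counting unrelated domains that merely contain the string "reddit".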
These findings support a three-channel model of Reddit’s influence on AI: (1) a training data pathway through which Reddit’s community consensus is absorbed into model weights during pre-training (ρ = .554); (2) a web UI citation pathway through which Reddit is actively retrieved and cited in consumer-facing interfaces (27% aggregate rate); and (3) an API citation pathway that categorically suppresses Reddit (0% rate). Reddit functions as what we term a shadow corpus—a source whose influence on AI outputs operates through multiple channels that cannot be fully observed through any single research methodology.
Keywords
Reddit, AI brand recommendations, training data influence, access-channel divergence, API vs web UI, Generative Engine Optimization, GEO, Spearman correlation, community consensus, shadow corpus, ChatGPT, Claude, Perplexity, Gemini
Citation
Lee, A. (2026). Reddit doesn’t get cited (through the API): Training data influence, access-channel divergence, and the shadow corpus in AI brand recommendations. Preprint v3, AI+Automation.