The most-cited paper in Generative Engine Optimization made three claims: add statistics, add citations, add quotations. We tested all three on 3,205 real-world pages across ChatGPT, Claude, Perplexity, and Google AI Mode. Statistics replicate on every platform. Citations and quotations go the wrong direction on most platforms. The combined Princeton features add zero predictive value beyond basic technical page features (AUC 0.882 with or without them).
Aggarwal et al. (2024), "GEO: Generative Engine Optimization" (KDD 2024), is the paper that launched an industry. Their central claim: enriching content with statistics (+30-40%), citations (+30-40%), and quotations (+25-30%) significantly improves visibility in generative engine responses.
This claim is repeated in virtually every GEO guide, agency pitch deck, and AI SEO strategy document. It underpins consulting packages selling for $5,000 to $50,000+.
There is one problem: the Princeton paper used a custom generative engine built for their study, not production ChatGPT, Google AI Mode, Perplexity, or Claude. Nobody had tested whether their findings transfer to the platforms people actually use.
We tested all three claims across four production platforms with 3,205 pages. One of three claims holds. Two are contradicted.
For context on what page-level features DO predict AI citation, see our 7 significant predictors study. For the complete research roundup, see every AI citation study worth reading in 2026.
KEY NUMBERS AT A GLANCE
| Metric | Value | What It Means |
|---|---|---|
| Total pages tested | 3,205 | 2,520 cited, 685 not-cited across 4 platforms |
| Platforms tested | 4 | ChatGPT, Claude, Perplexity, Google AI Mode |
| Statistics density | Replicates | Significant on all 4 platforms, correct direction (p < 0.001 on all) |
| Citation density | Does not replicate | Wrong direction on 4/4 platforms (significant on Google AI Mode, p = 0.008) |
| Quotation density | Does not replicate | Wrong direction on 4/4 platforms (significant on Perplexity p = 0.0009, Google AI p = 0.036) |
| Princeton-only model AUC | 0.652 to 0.712 | Above chance, driven almost entirely by statistics density |
| Base + Princeton model AUC | 0.882 | Same as base-only (0.882). Zero improvement from adding Princeton features |
| Partial correlation (after base) | rho = 0.036 | Technically significant (p = 0.044) but explains only ~0.1% of variance |
The Bottom Line: The Princeton paper is one-third right. Statistics-rich pages are more likely to be cited. But adding citations and quotations is either irrelevant or counterproductive. The combined Princeton features add zero practical predictive value beyond technical page features.
STUDY DESIGN
Data Collection (Four Phases)
We conducted this replication in four progressively larger phases to ensure robustness and address critiques at each stage.
| Phase | Pages | Platforms | Purpose |
|---|---|---|---|
| 1 | 430 | Perplexity (159), ChatGPT (40), Both (17), Not-cited (214) | Proof of concept |
| 2 | 913 | Google AI Mode (312), ChatGPT (128), Both (2), Not-cited (471) | Balanced multi-platform test |
| 3 | 1,343 | Phase 1 + Phase 2 merged | Address Perplexity-exclusion critique |
| 4 | 3,205 | Google AI (1,113), Perplexity (779), ChatGPT (329), Multi-platform (180), Claude (119), Not-cited (685) | Definitive test with all 4 platforms |
Phase 4 data: a fresh export (March 16-26, 2026) from our production citation-scraper database running across all 4 platforms: 5,750 citations from 682 scrapes, yielding 2,159 unique real-website URLs crawled via Playwright (1,862 successfully returning text).
UGC exclusion: All phases exclude Reddit, YouTube, Facebook, Twitter/X, Instagram, TikTok, Pinterest, Quora, LinkedIn, Medium, and other user-generated content platforms to prevent systematic confounds.
Feature Extraction
We extracted the three core Princeton features plus secondary features using regex-based NLP, all normalized per 1,000 words:
| Feature Category | Patterns Used | Examples |
|---|---|---|
| Statistics density | 9 patterns | Percentages, currency figures, multipliers, large numbers, year references, figures/tables, sample sizes, ratios, statistical terms |
| Citation density | 5 patterns | Attribution phrases ("according to"), parenthetical citations, organizational attributions, URL citations, footnote markers |
| Quotation density | 5 patterns | Double-quoted text, smart quotes, single quotes, blockquote indicators, speech verbs |
| Authoritative tone | Markers per 1k words | Authority language patterns |
| Technical terminology | Terms per 1k words | Domain-specific vocabulary density |
| Structured formatting | Items per 1k words | Bullets, numbered lists, heading density |
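The per-1,000-word densities above can be sketched in a few lines of Python. The regexes below are illustrative assumptions standing in for the study's 19 patterns, and `density_per_1k` is a hypothetical helper, not the study's actual code:

```python
import re

# Illustrative subsets of the pattern categories (the study used 19 patterns
# in total); these specific regexes are assumptions, not the exact ones used.
STAT_PATTERNS = [
    r"\b\d+(?:\.\d+)?%",               # percentages: "37.2%"
    r"[$€£]\s?\d[\d,]*",               # currency figures: "$5,000"
    r"\b\d+(?:\.\d+)?x\b",             # multipliers: "3.7x"
]
CITATION_PATTERNS = [
    r"\baccording to\b",               # attribution phrases
    r"\(\w+(?: et al\.)?,\s?\d{4}\)",  # parenthetical cites: "(Lee, 2026)"
]
QUOTE_PATTERNS = [
    r'"[^"]{10,}"',                    # double-quoted passages
    r"[“”][^“”]{10,}[“”]",             # smart-quoted passages
]

def density_per_1k(text: str, patterns: list) -> float:
    """Count pattern matches in text, normalized per 1,000 words."""
    words = len(text.split())
    if words == 0:
        return 0.0
    hits = sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in patterns)
    return 1000 * hits / words

page = "Revenue grew 42% in 2024, according to internal data, reaching $1.2M."
stat_density = density_per_1k(page, STAT_PATTERNS)
```

Each page's feature vector is just these densities computed over its extracted body text, which is why the normalization per 1,000 words matters: without it, longer pages would dominate every category.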
Statistical Methods
- Univariate: Mann-Whitney U (continuous features), Chi-squared and Fisher exact (binary features)
- Multiple comparisons: Benjamini-Hochberg FDR correction at alpha = 0.05
- Multivariate: Logistic Regression, Random Forest, and Gradient Boosting with 5-fold cross-validation
- Nested model comparison: Likelihood ratio test comparing base features alone vs. base + Princeton features
- Platform stratification: Separate analyses for each of the 4 platforms
- Collinearity check: Spearman correlations + partial correlation (residualizing base features)
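The univariate step of this pipeline can be sketched as follows. The data is synthetic, and the rank-biserial formula is one common choice for the effect-size r reported in the tables below, assumed here rather than confirmed by the study:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Synthetic stand-in samples: cited pages drawn with higher statistics density.
cited = rng.gamma(shape=2.0, scale=3.0, size=300)
not_cited = rng.gamma(shape=2.0, scale=2.2, size=100)

# Univariate comparison, as run for each continuous feature.
u_stat, p = mannwhitneyu(cited, not_cited, alternative="two-sided")

# Rank-biserial effect size; negative when the cited group is larger,
# matching the sign convention in the result tables (an assumption).
r_rb = 1 - 2 * u_stat / (len(cited) * len(not_cited))

# Benjamini-Hochberg FDR correction across all feature p-values at alpha = 0.05.
pvals = [p, 0.03, 0.40]  # placeholder p-values for two other features
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```

The BH step is what the "FDR-significant features" counts later in this post refer to: a feature only counts if its adjusted p-value survives the correction, not just its raw p-value.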
THE THREE PRINCETON CLAIMS
Claim 1: Statistics Density (+30-40%) -- REPLICATES
| Platform | n cited | Cited Median | Not-Cited Median | Difference | r | p |
|---|---|---|---|---|---|---|
| Claude | 119 | 9.74 | 4.41 | +120.9% | -0.343 | < 0.0001 |
| ChatGPT | 329 | 6.86 | 4.41 | +55.6% | -0.168 | < 0.0001 |
| Perplexity | 779 | 6.05 | 4.41 | +37.2% | -0.162 | < 0.0001 |
| Google AI Mode | 1,113 | 5.34 | 4.41 | +21.1% | -0.098 | 0.0005 |
| Multi-platform | 180 | 6.50 | 4.41 | +47.4% | -0.201 | < 0.0001 |
Significant on every platform, correct direction, strongest on Claude (r = -0.343). The Princeton paper's claim about statistics is supported by production platform data.
But the mechanism likely isn't what the Princeton paper implies. Pages with more statistics tend to be more substantive. Statistics density correlates with content quality. AI platforms are probably selecting for substance, not for the presence of numbers per se.
Claim 2: Citation Density (+30-40%) -- DOES NOT REPLICATE (Wrong Direction)
| Platform | Cited Median | Not-Cited Median | r | p | Direction |
|---|---|---|---|---|---|
| Claude | 0.00 | 0.20 | +0.038 | 0.483 | Wrong |
| ChatGPT | 0.00 | 0.20 | +0.057 | 0.120 | Wrong |
| Perplexity | 0.00 | 0.20 | +0.055 | 0.053 | Wrong |
| Google AI Mode | 0.00 | 0.20 | +0.069 | 0.008 | Wrong |
| Multi-platform | 0.15 | 0.20 | -0.013 | 0.782 | Neutral |
Not-cited pages have more source attributions than cited pages on every platform. On Google AI Mode, this reversal is statistically significant (p = 0.008).
The likely explanation: citation-heavy pages are aggregation content (roundups, listicles, meta-analyses) that cite many sources but aren't themselves authoritative. AI platforms prefer to cite primary sources, not pages that cite other sources.
The Princeton paper's advice to "add citations" is contradicted by production platform data.
Claim 3: Quotation Density (+25-30%) -- DOES NOT REPLICATE (Wrong Direction)
| Platform | Cited Median | Not-Cited Median | r | p | Direction |
|---|---|---|---|---|---|
| Claude | 0.33 | 0.57 | +0.076 | 0.174 | Wrong |
| ChatGPT | 0.45 | 0.57 | +0.046 | 0.226 | Wrong |
| Perplexity | 0.32 | 0.57 | +0.097 | 0.0009 | Wrong |
| Google AI Mode | 0.40 | 0.57 | +0.057 | 0.036 | Wrong |
| Multi-platform | 0.58 | 0.57 | -0.026 | 0.577 | Neutral |
On Perplexity and Google AI Mode, cited pages have significantly fewer quotations than not-cited pages. The effect runs in the wrong direction on all four individual platforms.
Quotation-heavy pages may correlate with news articles and interviews, which are time-sensitive and less likely to be selected as authoritative references for general queries.
The Princeton paper's advice to "add quotations" is contradicted by production platform data.
MODEL COMPARISON
We tested whether Princeton features add any predictive value beyond our existing page-level predictors (Lee, 2026a).
| Model | Base Features Only | Princeton Only | Base + Princeton |
|---|---|---|---|
| Logistic Regression | 0.788 | 0.652 | 0.801 |
| Random Forest | 0.882 | 0.712 | 0.882 |
| Gradient Boosting | 0.876 | 0.694 | 0.868 |
Key findings:
Princeton-only models are above chance but weak. AUC 0.652 to 0.712, driven almost entirely by statistics density. This confirms the statistics signal is real.
Adding Princeton features to the base model provides zero improvement. Random Forest AUC stays at 0.882. Gradient Boosting actually drops from 0.876 to 0.868.
The likelihood ratio test is statistically significant (p < 0.001) but practically meaningless. At n = 3,205, the test has enough power to detect even tiny effects. The partial correlation of Princeton composite after controlling for all base features is rho = 0.036, explaining approximately 0.1% of variance.
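A nested likelihood-ratio comparison of this kind can be sketched on synthetic data as follows; the sklearn-based near-unpenalized fit and the placeholder features are assumptions about implementation, not the study's code:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)
n = 3205
base = rng.normal(size=(n, 5))       # stand-ins for the base features
extra = rng.normal(size=(n, 3))      # stand-ins for the 3 Princeton features
y = (base @ rng.normal(size=5) + rng.normal(size=n) > 0).astype(int)

def loglik(X, y):
    # Near-unpenalized logistic fit; returns the total log-likelihood.
    m = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
    return -log_loss(y, m.predict_proba(X)) * len(y)

ll_base = loglik(base, y)
ll_full = loglik(np.hstack([base, extra]), y)
lr_stat = 2 * (ll_full - ll_base)    # ~ chi-squared with df = 3 under the null
p_value = chi2.sf(lr_stat, df=3)
```

Note the power issue the paragraph above describes: at n = 3,205 even a trivial improvement in log-likelihood can clear the chi-squared threshold, which is why the partial correlation (effect size) matters more than the test's p-value.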
Class imbalance caveat: Phase 4 has 2,520 cited vs 685 not-cited pages. This inflates absolute AUC values. The per-platform comparisons (each using the full 685 not-cited pool) are more trustworthy than aggregate model metrics.
The Bottom Line: The statistics signal is real but redundant. The base technical features already capture whatever the statistics signal measures. Adding Princeton features on top provides zero practical improvement.
HOW THE PICTURE EVOLVED ACROSS FOUR PHASES
| Metric | Phase 1 (430) | Phase 2 (913) | Phase 3 (1,343) | Phase 4 (3,205) |
|---|---|---|---|---|
| FDR-significant features | 4 / 28 | 0 / 16 | 0 / 14 | 16 / 16 |
| Stats density p | 0.115 | 0.078 | 0.027 | < 0.0001 |
| Citations direction | Correct (weak) | Wrong | Wrong | Wrong |
| Quotations direction | Correct (weak) | NS | NS | Wrong |
| Princeton-only AUC (RF) | 0.577 | 0.544 | 0.523 | 0.712 |
| Base + Princeton AUC (RF) | 0.640 | 0.665 | 0.617 | 0.882 |
| Base-only AUC (RF) | 0.627 | 0.665 | 0.620 | 0.882 |
The statistics signal grew stronger with more data. The citation and quotation signals also grew stronger, but in the wrong direction. More data didn't rescue the Princeton claims. It strengthened the contradiction.
COLLINEARITY CHECK
A legitimate concern: do base features absorb the Princeton signal? We checked Spearman correlations:
| Base Feature | stats/1k | citations/1k | quotes/1k | composite/1k |
|---|---|---|---|---|
| word_count | +0.362 | +0.269 | +0.118 | +0.225 |
| content_to_html_ratio | -0.096 | +0.159 | +0.071 | +0.013 |
| load_time_ms | +0.121 | +0.028 | +0.045 | +0.053 |
| internal_link_count | +0.301 | +0.031 | -0.033 | +0.150 |
These correlations are modest, not high enough to explain away the results. The partial correlation of Princeton composite with citation status after residualizing ALL base features is rho = +0.036 (p = 0.044). There is a real but tiny independent signal, driven by statistics and offset by the wrong-direction citation and quotation effects.
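The residualization step can be sketched as a partial correlation on synthetic data. The `residualize` helper and the linear-regression residualizer are implementation assumptions; here the composite and the outcome share only the base features, so the partial correlation should be near zero:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
base = rng.normal(size=(n, 4))  # stand-ins for the 4 base features above
# Composite and citation status both driven by base features plus noise only.
composite = base @ np.array([0.4, 0.1, 0.0, 0.2]) + rng.normal(size=n)
cited = ((base @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=n)) > 0
         ).astype(float)

def residualize(target, covariates):
    """Return what is left of target after regressing out the covariates."""
    fit = LinearRegression().fit(covariates, target)
    return target - fit.predict(covariates)

# Partial correlation: Spearman on the two residual series.
rho, p = spearmanr(residualize(composite, base), residualize(cited, base))
```

Because all shared signal flows through the base features in this toy setup, `rho` lands near zero, which is the same logic behind reading the study's rho = 0.036 as a tiny independent signal rather than a meaningful one.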
ROBUSTNESS CHECKS
Class Imbalance
Phase 4 has 2,520 cited vs 685 not-cited pages (3.7:1 imbalance). To verify this doesn't inflate results, we downsampled the cited class to 685 (matching not-cited) and re-ran across 10 random seeds:
| Feature | Balanced Cited Median | Not-Cited Median | p | Direction |
|---|---|---|---|---|
| Statistics / 1k | 5.93 | 4.41 | < 0.0001 | Correct |
| Citations / 1k | 0.00 | 0.20 | 0.009 | Wrong |
| Quotations / 1k | 0.40 | 0.57 | 0.067 | Wrong (NS) |
| Composite / 1k | 8.50 | 6.73 | 0.0001 | Correct |
Model comparison across 10 balanced downsamples: Base RF AUC 0.879 (+/- 0.007), Base + Princeton RF AUC 0.878 (+/- 0.008). Improvement: -0.0005. The pattern is identical on balanced data.
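The balanced-downsampling check can be sketched like this, again on synthetic data with one deliberately shifted feature; the feature construction is purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_cited, n_not = 2520, 685           # Phase 4 class sizes
X = rng.normal(size=(n_cited + n_not, 6))
y = np.array([1] * n_cited + [0] * n_not)
X[y == 1, 0] += 0.8                  # give cited pages one shifted feature

aucs = []
for seed in range(10):               # 10 random downsamples, as in the check
    r = np.random.default_rng(seed)
    keep_cited = r.choice(np.where(y == 1)[0], size=n_not, replace=False)
    idx = np.concatenate([keep_cited, np.where(y == 0)[0]])
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    scores = cross_val_score(clf, X[idx], y[idx], cv=5, scoring="roc_auc")
    aucs.append(scores.mean())

mean_auc, sd_auc = np.mean(aucs), np.std(aucs)
```

Downsampling to a 1:1 ratio removes the inflation that class imbalance causes in absolute AUC; reporting the mean and spread across seeds guards against a lucky draw of which cited pages were kept.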
Citation Construct Mismatch
A critic might argue our NLP measures "attribution phrases" while Princeton's intervention added formal hyperlinked citations. We tested every citation subcategory independently:
| Citation Type | Cited Median | NC Median | r | p | Direction |
|---|---|---|---|---|---|
| Attribution phrases ("according to") | 0 | 0 | +0.069 | 0.002 | Wrong |
| Parenthetical cites (Author, 2024) | 0 | 0 | -0.000 | 0.975 | Null |
| Organization attributions | 0 | 0 | -0.026 | 0.008 | Marginal correct |
| URL citations (hyperlinks) | 0 | 0 | +0.047 | 0.001 | Wrong |
| Footnote markers [1] | 0 | 0 | +0.007 | 0.110 | Null |
| External link count (closest to Princeton) | 16 | 20 | +0.100 | 0.0001 | Wrong |
External links, the closest proxy to Princeton's "add hyperlinked source citations," go significantly in the wrong direction (cited pages have fewer: 16 vs 20). No operationalization of "cites sources" predicts AI citation in the correct direction.
Perplexity Inclusion/Exclusion
Phase 2 excluded Perplexity, which showed weak signals in Phase 1. Phase 4 includes Perplexity with a large sample (779 cited pages). Result: Perplexity statistics density replicates (p < 0.0001) but Perplexity quotation density is significantly wrong (p = 0.0009). Including or excluding Perplexity does not change the overall conclusion.
WHY TWO OF THREE CLAIMS DON'T TRANSFER
1. Custom Engine vs Production Platforms
The Princeton paper built their own generative engine for evaluation. This is a legitimate research choice, but their engine may respond to source attributions and quotation marks in ways that production platforms do not. Production platforms have different retrieval pipelines, ranking algorithms, and source selection logic.
2. Interventional vs Observational
The Princeton study modified existing content and measured changes. Our approach measured whether pages that naturally contain more of these features are more likely to be cited. These test different questions. Our observational design cannot prove that adding citations or quotations to a specific page won't help. But it can show that pages with more citations and quotations are not more likely to be cited, and in fact are less likely to be cited. That is what practitioners need to know when evaluating their existing content.
3. Why Citations Go Wrong
Citation-heavy pages are likely aggregation content: roundups, "best of" lists, meta-analyses that reference many sources. AI platforms prefer to cite the primary source, not the page that summarizes other sources. Adding "according to Harvard Business Review" to your content doesn't make your page more authoritative. It may signal that your page is derivative.
4. Why Quotations Go Wrong
Quotation-heavy pages likely correlate with news articles, interviews, and opinion pieces. These are time-sensitive content types that AI platforms deprioritize for general informational queries. A page full of "Dr. Smith said..." signals journalism, not reference material.
5. Why Statistics Work
Statistics-dense pages tend to be more substantive. Data tables, percentages, and research findings correlate with thoroughness and depth. The mechanism is content quality, not formatting. AI platforms aren't detecting numbers. They're selecting for the kind of page that naturally contains numbers.
WHAT PRINCETON GOT RIGHT
This is not a blanket dismissal. The Aggarwal et al. (2024) paper made genuine contributions:
They coined "GEO." The term "Generative Engine Optimization" gave the field a name and a framework for discussion.
They created GEO-bench. The first systematic benchmark for evaluating content optimization strategies in generative search.
They proved the field exists. By demonstrating that content modifications can change generative engine outputs, they validated that optimization is possible in principle.
The statistics claim holds. Pages with more data and evidence are more likely to be cited. This is their strongest finding and it replicates across all four production platforms.
WHAT THIS MEANS FOR THE GEO INDUSTRY
The GEO industry has operationalized the Princeton paper as a three-part formula: add statistics, add citations, add quotations. Two of those three recommendations are unsupported by production platform data and may be counterproductive.
What to stop doing
- Stop adding source citations ("according to...") to improve AI visibility. This signals aggregation content, not authority.
- Stop adding expert quotations to improve AI visibility. This correlates with news-style content that AI platforms deprioritize.
- Stop citing "+40% visibility improvement" from Aggarwal et al. as evidence for a three-part optimization formula. Only one of three parts holds.
What to keep doing
- Statistics and data. Pages with real numbers, research findings, and evidence are more likely to be cited across all platforms. This is the one Princeton claim that holds.
- Technical page features. The 7 page-level predictors from Lee (2026a) remain far stronger than any content-level feature:
| Feature | Effect | Direction |
|---|---|---|
| Internal link count | r = 0.127, fewer = cited | Positive |
| Self-referencing canonical | OR = 1.92 | Positive |
| Schema markup presence | OR = non-significant (p=0.78) for generic presence | Positive |
| Word count | Median 1,799 vs 2,114 | Positive |
| Content-to-HTML ratio | 0.086 vs 0.065 | Positive |
| Schema attribute completeness | OR = 1.21 | Positive |
| Total link count | OR = 0.47 | Negative (when external-heavy) |
FREQUENTLY ASKED QUESTIONS
"Isn't this observational while Princeton's was interventional?"
Yes, and both designs are necessary. Princeton proved you CAN modify content features on a custom benchmark engine and change visibility. We tested whether those features MATTER on real pages across production platforms. Our finding: statistics density does predict citation (supporting Princeton), but citation and quotation density predict against citation (contradicting Princeton).
The observational/interventional gap cuts both ways. If the gap invalidates our citation and quotation nulls, it also invalidates our statistics positive. You cannot selectively invoke the gap only for findings you disagree with. More importantly, the GEO industry operationalized the Princeton advice as "add statistics, citations, and quotations to your pages." Our data shows that the outcome this intervention aims to produce (pages rich in these features) is not consistently associated with AI citation. If the intervention produces the outcome but the outcome doesn't predict citation, the intervention doesn't work.
"Your regex NLP probably isn't accurate enough to measure these features."
We use 19 patterns (9 statistics, 5 citations, 5 quotations). The accuracy argument actually strengthens the statistics finding: even with noisy measurement, the signal is strong across all 4 platforms (p < 0.001 on every platform). For citations and quotations, the effect is not just absent but reversed. Measurement noise could mask a signal, but it cannot invert one. The wrong-direction finding for citations and quotations is not a measurement artifact.
"3,205 pages is still just one study."
True. But at 3,205 pages across 4 platforms with consistent results, the burden shifts. The Princeton paper used a single custom engine. We tested on the four platforms that cover essentially the entire AI search market. If the features don't predict citation on any of them, the practical relevance of the Princeton findings is limited to their custom engine.
"Different platforms. You can't compare."
We tested on all four major production AI platforms individually. Statistics density replicates on all four. Citation density goes the wrong direction on all four. Quotation density goes the wrong direction on all four. The cross-platform consistency is a strength, not a weakness.
"KDD peer review validates these results. Your blog post doesn't carry the same weight."
KDD reviewed the methodology and novelty of the GEO framework within its defined scope. Peer review does not make results permanent; replication does. Approximately 60% of published psychology findings failed replication in the Open Science Collaboration (2015). Our data is open, our methodology is documented, and anyone can replicate our analysis. We welcome that.
"The statistics finding supports Princeton. So their paper isn't wrong."
Partially correct. The statistics claim holds. But the GEO industry didn't operationalize "add statistics." It operationalized all three features as a package: statistics AND citations AND quotations. Two of the three go the wrong direction. Telling practitioners to add citations and quotations is not supported by data and may be actively harmful.
Also note: the statistics signal likely reflects content substance, not formatting. A practitioner who adds three random percentages to a thin blog post will not replicate the effect. Pages that are natively data-rich (research reports, analysis pieces, data-driven guides) are the ones being cited more. The mechanism is quality, not decoration.
"Your NLP measures 'attribution phrases,' not formal citations. Princeton meant something different."
We tested every citation subcategory independently, including external link count (the closest proxy to Princeton's "add hyperlinked source citations"). External links go significantly in the wrong direction: cited pages have fewer external links (median 16 vs 20, p = 0.0001). No operationalization of "cites sources" predicts AI citation in the correct direction. The construct mismatch argument fails on every variant we tested.
"The partial correlation (rho = 0.036) is statistically significant. You can't dismiss it."
We don't dismiss it. It's real. It explains approximately 0.1% of variance after controlling for all base features. No practitioner should invest in an optimization that explains 0.1% of the outcome when load time, internal links, and schema markup each explain orders of magnitude more. Statistical significance and practical significance are different things.
"You're contradicting your own published content."
Yes, and we are updating every post. We previously cited Aggarwal et al.'s three-part claim in over 25 posts because it was the best available evidence at the time. Now we have better evidence from 3,205 pages across 4 production platforms. We would rather correct 25+ posts than continue promoting claims that are two-thirds wrong.
"Your sample excludes UGC. That changes the composition."
UGC exclusion is a methodological improvement, not a limitation. User-generated content (Reddit threads, forum posts) introduces systematic confounds: different authorship patterns, different structural features, different intent profiles. Princeton's GEO-bench also did not use UGC content. Our sample is cleaner and more comparable to the content types that practitioners actually optimize.
"The class imbalance (2,520 cited vs 685 not-cited) inflates your results."
Yes, and we acknowledge this. The absolute AUC values in the aggregate model are inflated. That is why the per-platform comparisons (each using the same 685 not-cited pool) are more important than the aggregate model metrics. The per-platform results tell a consistent story: statistics work, citations and quotations do not.
REFERENCES
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). GEO: Generative Engine Optimization. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. DOI: 10.48550/arXiv.2311.09735
- Lee, J. (2026a). Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior. DOI: 10.5281/zenodo.18653093
- Open Science Collaboration. (2015). Estimating the Reproducibility of Psychological Science. Science, 349(6251), aac4716.
THE BOTTOM LINE
The Princeton GEO paper is one-third right.
Statistics density replicates. Pages with more data, percentages, and evidence are more likely to be cited across all four production AI platforms. The effect is strongest on Claude (+120.9%) and consistent everywhere. If you take one piece of Princeton advice, make it this one.
Citation density does not replicate and goes the wrong direction. Pages with more source attributions are less likely to be cited, significantly so on Google AI Mode. Adding "according to [source]" signals aggregation, not authority.
Quotation density does not replicate and goes the wrong direction. Pages with more quotations are less likely to be cited, significantly so on Perplexity and Google AI Mode. Expert quotes signal news content, not reference material.
Combined practical value: zero. Adding all Princeton features to a technical page feature model produces no improvement (AUC 0.882 with or without them). The partial independent signal is rho = 0.036, explaining 0.1% of variance.
The GEO industry's three-part formula is two-thirds wrong. Use data and evidence in your content. Skip the citation padding and quotation stuffing. And focus your optimization budget on the technical features that actually predict citation.