The most-cited paper in Generative Engine Optimization made three claims: add statistics, add citations, add quotations. We tested all three on 3,205 real-world pages across ChatGPT, Claude, Perplexity, and Google AI Mode. Statistics replicate on every platform. Citations and quotations go the wrong direction on most platforms. The combined Princeton features add zero predictive value beyond basic technical page features (AUC 0.882 with or without them).
Aggarwal et al. (2024), "GEO: Generative Engine Optimization" (KDD 2024), is the paper that launched an industry. Their central claim: enriching content with statistics (+30-40%), citations (+30-40%), and quotations (+25-30%) significantly improves visibility in generative engine responses.
This claim is repeated in virtually every GEO guide, agency pitch deck, and AI SEO strategy document. It underpins consulting packages selling for $5,000 to $50,000+.
There is one problem: the Princeton paper used a custom generative engine built for their study, not production ChatGPT, Google AI Mode, Perplexity, or Claude. Nobody had tested whether their findings transfer to the platforms people actually use.
We tested all three claims across four production platforms with 3,205 pages. One of three claims holds. Two are contradicted.
For context on what page-level features DO predict AI citation, see our 7 significant predictors study. For the complete research roundup, see every AI citation study worth reading in 2026.
KEY NUMBERS AT A GLANCE
| Metric | Value | What It Means |
|---|---|---|
| Total pages tested | 3,205 | 2,520 cited, 685 not-cited across 4 platforms |
| Platforms tested | 4 | ChatGPT, Claude, Perplexity, Google AI Mode |
| Statistics density | Replicates | Significant on all 4 platforms, correct direction (p < 0.001 on all) |
| Citation density | Does not replicate | Wrong direction on 4/4 platforms (significant on Google AI Mode, p = 0.008) |
| Quotation density | Does not replicate | Wrong direction on 4/4 platforms (significant on Perplexity p = 0.0009, Google AI p = 0.036) |
| Princeton-only model AUC | 0.652 to 0.712 | Above chance, driven almost entirely by statistics density |
| Base + Princeton model AUC | 0.882 | Same as base-only (0.882). Zero improvement from adding Princeton features |
| Partial correlation (after base) | rho = 0.036 | Technically significant (p = 0.044) but explains only ~0.1% of variance |
The Bottom Line: The Princeton paper is one-third right. Statistics-rich pages are more likely to be cited. But adding citations and quotations is either irrelevant or counterproductive. The combined Princeton features add zero practical predictive value beyond technical page features.
STUDY DESIGN
Data Collection (Four Phases)
We conducted this replication in four progressively larger phases to ensure robustness and address critiques at each stage.
| Phase | Pages | Platforms | Purpose |
|---|---|---|---|
| 1 | 430 | Perplexity (159), ChatGPT (40), Both (17), Not-cited (214) | Proof of concept |
| 2 | 913 | Google AI Mode (312), ChatGPT (128), Both (2), Not-cited (471) | Balanced multi-platform test |
| 3 | 1,343 | Phase 1 + Phase 2 merged | Address Perplexity-exclusion critique |
| 4 | 3,205 | Google AI (1,113), Perplexity (779), ChatGPT (329), Multi-platform (180), Claude (119), Not-cited (685) | Definitive test with all 4 platforms |
Phase 4 data: a fresh export (March 16-26, 2026) from our production citation-scraper database running across all 4 platforms: 5,750 citations from 682 scrapes, yielding 2,159 unique real-website URLs crawled via Playwright (1,862 successfully returning text).
UGC exclusion: All phases exclude Reddit, YouTube, Facebook, Twitter/X, Instagram, TikTok, Pinterest, Quora, LinkedIn, Medium, and other user-generated content platforms to prevent systematic confounds.
Feature Extraction
We extracted the three core Princeton features plus secondary features using regex-based NLP, all normalized per 1,000 words:
| Feature Category | Patterns Used | Examples |
|---|---|---|
| Statistics density | 9 patterns | Percentages, currency figures, multipliers, large numbers, year references, figures/tables, sample sizes, ratios, statistical terms |
| Citation density | 5 patterns | Attribution phrases ("according to"), parenthetical citations, organizational attributions, URL citations, footnote markers |
| Quotation density | 5 patterns | Double-quoted text, smart quotes, single quotes, blockquote indicators, speech verbs |
| Authoritative tone | Markers per 1k words | Authority language patterns |
| Technical terminology | Terms per 1k words | Domain-specific vocabulary density |
| Structured formatting | Items per 1k words | Bullets, numbered lists, heading density |
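The per-1,000-word densities above can be sketched in a few lines of Python. The regexes below are illustrative assumptions standing in for the study's 19 patterns, and `density_per_1k` is a hypothetical helper, not the study's actual code:

```python
import re

# Illustrative subsets of the pattern categories (the study used 19 patterns
# in total); these specific regexes are assumptions, not the exact ones used.
STAT_PATTERNS = [
    r"\b\d+(?:\.\d+)?%",               # percentages: "37.2%"
    r"[$€£]\s?\d[\d,]*",               # currency figures: "$5,000"
    r"\b\d+(?:\.\d+)?x\b",             # multipliers: "3.7x"
]
CITATION_PATTERNS = [
    r"\baccording to\b",               # attribution phrases
    r"\(\w+(?: et al\.)?,\s?\d{4}\)",  # parenthetical cites: "(Lee, 2026)"
]
QUOTE_PATTERNS = [
    r'"[^"]{10,}"',                    # double-quoted passages
    r"[“”][^“”]{10,}[“”]",             # smart-quoted passages
]

def density_per_1k(text: str, patterns: list) -> float:
    """Count pattern matches in text, normalized per 1,000 words."""
    words = len(text.split())
    if words == 0:
        return 0.0
    hits = sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in patterns)
    return 1000 * hits / words

page = "Revenue grew 42% in 2024, according to internal data, reaching $1.2M."
stat_density = density_per_1k(page, STAT_PATTERNS)
```

Each page's feature vector is just these densities computed over its extracted body text, which is why the normalization per 1,000 words matters: without it, longer pages would dominate every category.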
Statistical Methods
- Univariate: Mann-Whitney U (continuous features), Chi-squared and Fisher exact (binary features)
- Multiple comparisons: Benjamini-Hochberg FDR correction at alpha = 0.05
- Multivariate: Logistic Regression, Random Forest, and Gradient Boosting with 5-fold cross-validation
- Nested model comparison: Likelihood ratio test comparing base features alone vs. base + Princeton features
- Platform stratification: Separate analyses for each of the 4 platforms
- Collinearity check: Spearman correlations + partial correlation (residualizing base features)
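The univariate step of this pipeline can be sketched as follows. The data is synthetic, and the rank-biserial formula is one common choice for the effect-size r reported in the tables below, assumed here rather than confirmed by the study:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Synthetic stand-in samples: cited pages drawn with higher statistics density.
cited = rng.gamma(shape=2.0, scale=3.0, size=300)
not_cited = rng.gamma(shape=2.0, scale=2.2, size=100)

# Univariate comparison, as run for each continuous feature.
u_stat, p = mannwhitneyu(cited, not_cited, alternative="two-sided")

# Rank-biserial effect size; negative when the cited group is larger,
# matching the sign convention in the result tables (an assumption).
r_rb = 1 - 2 * u_stat / (len(cited) * len(not_cited))

# Benjamini-Hochberg FDR correction across all feature p-values at alpha = 0.05.
pvals = [p, 0.03, 0.40]  # placeholder p-values for two other features
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```

The BH step is what the "FDR-significant features" counts later in this post refer to: a feature only counts if its adjusted p-value survives the correction, not just its raw p-value.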
THE THREE PRINCETON CLAIMS
Claim 1: Statistics Density (+30-40%) -- REPLICATES
| Platform | n cited | Cited Median | Not-Cited Median | Difference | r | p |
|---|---|---|---|---|---|---|
| Claude | 119 | 9.74 | 4.41 | +120.9% | -0.343 | < 0.0001 |
| ChatGPT | 329 | 6.86 | 4.41 | +55.6% | -0.168 | < 0.0001 |
| Perplexity | 779 | 6.05 | 4.41 | +37.2% | -0.162 | < 0.0001 |
| Google AI Mode | 1,113 | 5.34 | 4.41 | +21.1% | -0.098 | 0.0005 |
| Multi-platform | 180 | 6.50 | 4.41 | +47.4% | -0.201 | < 0.0001 |
Significant on every platform, correct direction, strongest on Claude (r = -0.343). The Princeton paper's claim about statistics is supported by production platform data.
But the mechanism likely isn't what the Princeton paper implies. Pages with more statistics tend to be more substantive. Statistics density correlates with content quality. AI platforms are probably selecting for substance, not for the presence of numbers per se.
Claim 2: Citation Density (+30-40%) -- DOES NOT REPLICATE (Wrong Direction)
| Platform | Cited Median | Not-Cited Median | r | p | Direction |
|---|---|---|---|---|---|
| Claude | 0.00 | 0.20 | +0.038 | 0.483 | Wrong |
| ChatGPT | 0.00 | 0.20 | +0.057 | 0.120 | Wrong |
| Perplexity | 0.00 | 0.20 | +0.055 | 0.053 | Wrong |
| Google AI Mode | 0.00 | 0.20 | +0.069 | 0.008 | Wrong |
| Multi-platform | 0.15 | 0.20 | -0.013 | 0.782 | Neutral |
Not-cited pages have more source attributions than cited pages on every platform. On Google AI Mode, this reversal is statistically significant (p = 0.008).
The likely explanation: citation-heavy pages are aggregation content (roundups, listicles, meta-analyses) that cite many sources but aren't themselves authoritative. AI platforms prefer to cite primary sources, not pages that cite other sources.
The Princeton paper's advice to "add citations" is contradicted by production platform data.
Claim 3: Quotation Density (+25-30%) -- DOES NOT REPLICATE (Wrong Direction)
| Platform | Cited Median | Not-Cited Median | r | p | Direction |
|---|---|---|---|---|---|
| Claude | 0.33 | 0.57 | +0.076 | 0.174 | Wrong |
| ChatGPT | 0.45 | 0.57 | +0.046 | 0.226 | Wrong |
| Perplexity | 0.32 | 0.57 | +0.097 | 0.0009 | Wrong |
| Google AI Mode | 0.40 | 0.57 | +0.057 | 0.036 | Wrong |
| Multi-platform | 0.58 | 0.57 | -0.026 | 0.577 | Neutral |
On Perplexity and Google AI Mode, cited pages have significantly fewer quotations than not-cited pages. The effect runs in the wrong direction on all four individual platforms.
Quotation-heavy pages may correlate with news articles and interviews, which are time-sensitive and less likely to be selected as authoritative references for general queries.
The Princeton paper's advice to "add quotations" is contradicted by production platform data.
MODEL COMPARISON
We tested whether Princeton features add any predictive value beyond our existing page-level predictors (Lee, 2026a).
| Model | Base Features Only | Princeton Only | Base + Princeton |
|---|---|---|---|
| Logistic Regression | 0.788 | 0.652 | 0.801 |
| Random Forest | 0.882 | 0.712 | 0.882 |
| Gradient Boosting | 0.876 | 0.694 | 0.868 |
Key findings:
Princeton-only models are above chance but weak. AUC 0.652 to 0.712, driven almost entirely by statistics density. This confirms the statistics signal is real.
Adding Princeton features to the base model provides zero improvement. Random Forest AUC stays at 0.882. Gradient Boosting actually drops from 0.876 to 0.868.
The likelihood ratio test is statistically significant (p < 0.001) but practically meaningless. At n = 3,205, the test has enough power to detect even tiny effects. The partial correlation of Princeton composite after controlling for all base features is rho = 0.036, explaining approximately 0.1% of variance.
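A nested likelihood-ratio comparison of this kind can be sketched on synthetic data as follows; the sklearn-based near-unpenalized fit and the placeholder features are assumptions about implementation, not the study's code:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(1)
n = 3205
base = rng.normal(size=(n, 5))       # stand-ins for the base features
extra = rng.normal(size=(n, 3))      # stand-ins for the 3 Princeton features
y = (base @ rng.normal(size=5) + rng.normal(size=n) > 0).astype(int)

def loglik(X, y):
    # Near-unpenalized logistic fit; returns the total log-likelihood.
    m = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
    return -log_loss(y, m.predict_proba(X)) * len(y)

ll_base = loglik(base, y)
ll_full = loglik(np.hstack([base, extra]), y)
lr_stat = 2 * (ll_full - ll_base)    # ~ chi-squared with df = 3 under the null
p_value = chi2.sf(lr_stat, df=3)
```

Note the power issue the paragraph above describes: at n = 3,205 even a trivial improvement in log-likelihood can clear the chi-squared threshold, which is why the partial correlation (effect size) matters more than the test's p-value.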
Class imbalance caveat: Phase 4 has 2,520 cited vs 685 not-cited pages. This inflates absolute AUC values. The per-platform comparisons (each using the full 685 not-cited pool) are more trustworthy than aggregate model metrics.
The Bottom Line: The statistics signal is real but redundant. The base technical features already capture whatever the statistics signal measures. Adding Princeton features on top provides zero practical improvement.
HOW THE PICTURE EVOLVED ACROSS FOUR PHASES
| Metric | Phase 1 (430) | Phase 2 (913) | Phase 3 (1,343) | Phase 4 (3,205) |
|---|---|---|---|---|
| FDR-significant features | 4 / 28 | 0 / 16 | 0 / 14 | 16 / 16 |
| Stats density p | 0.115 | 0.078 | 0.027 | < 0.0001 |
| Citations direction | Correct (weak) | Wrong | Wrong | Wrong |
| Quotations direction | Correct (weak) | NS | NS | Wrong |
| Princeton-only AUC (RF) | 0.577 | 0.544 | 0.523 | 0.712 |
| Base + Princeton AUC (RF) | 0.640 | 0.665 | 0.617 | 0.882 |
| Base-only AUC (RF) | 0.627 | 0.665 | 0.620 | 0.882 |
The statistics signal grew stronger with more data. The citation and quotation signals also grew stronger, but in the wrong direction. More data didn't rescue the Princeton claims. It strengthened the contradiction.
COLLINEARITY CHECK
A legitimate concern: do base features absorb the Princeton signal? We checked Spearman correlations:
| Base Feature | stats/1k | citations/1k | quotes/1k | composite/1k |
|---|---|---|---|---|
| word_count | +0.362 | +0.269 | +0.118 | +0.225 |
| content_to_html_ratio | -0.096 | +0.159 | +0.071 | +0.013 |
| load_time_ms | +0.121 | +0.028 | +0.045 | +0.053 |
| internal_link_count | +0.301 | +0.031 | -0.033 | +0.150 |
These correlations are modest, not high enough to explain away the results. The partial correlation of Princeton composite with citation status after residualizing ALL base features is rho = +0.036 (p = 0.044). There is a real but tiny independent signal, driven by statistics and offset by the wrong-direction citation and quotation effects.
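The residualization step can be sketched as a partial correlation on synthetic data. The `residualize` helper and the linear-regression residualizer are implementation assumptions; here the composite and the outcome share only the base features, so the partial correlation should be near zero:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
base = rng.normal(size=(n, 4))  # stand-ins for the 4 base features above
# Composite and citation status both driven by base features plus noise only.
composite = base @ np.array([0.4, 0.1, 0.0, 0.2]) + rng.normal(size=n)
cited = ((base @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=n)) > 0
         ).astype(float)

def residualize(target, covariates):
    """Return what is left of target after regressing out the covariates."""
    fit = LinearRegression().fit(covariates, target)
    return target - fit.predict(covariates)

# Partial correlation: Spearman on the two residual series.
rho, p = spearmanr(residualize(composite, base), residualize(cited, base))
```

Because all shared signal flows through the base features in this toy setup, `rho` lands near zero, which is the same logic behind reading the study's rho = 0.036 as a tiny independent signal rather than a meaningful one.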
ROBUSTNESS CHECKS
Class Imbalance
Phase 4 has 2,520 cited vs 685 not-cited pages (3.7:1 imbalance). To verify this doesn't inflate results, we downsampled the cited class to 685 (matching not-cited) and re-ran across 10 random seeds:
| Feature | Balanced Cited Median | Not-Cited Median | p | Direction |
|---|---|---|---|---|
| Statistics / 1k | 5.93 | 4.41 | < 0.0001 | Correct |
| Citations / 1k | 0.00 | 0.20 | 0.009 | Wrong |
| Quotations / 1k | 0.40 | 0.57 | 0.067 | Wrong (NS) |
| Composite / 1k | 8.50 | 6.73 | 0.0001 | Correct |
Model comparison across 10 balanced downsamples: Base RF AUC 0.879 (+/- 0.007), Base + Princeton RF AUC 0.878 (+/- 0.008). Improvement: -0.0005. The pattern is identical on balanced data.
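The balanced-downsampling check can be sketched like this, again on synthetic data with one deliberately shifted feature; the feature construction is purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_cited, n_not = 2520, 685           # Phase 4 class sizes
X = rng.normal(size=(n_cited + n_not, 6))
y = np.array([1] * n_cited + [0] * n_not)
X[y == 1, 0] += 0.8                  # give cited pages one shifted feature

aucs = []
for seed in range(10):               # 10 random downsamples, as in the check
    r = np.random.default_rng(seed)
    keep_cited = r.choice(np.where(y == 1)[0], size=n_not, replace=False)
    idx = np.concatenate([keep_cited, np.where(y == 0)[0]])
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    scores = cross_val_score(clf, X[idx], y[idx], cv=5, scoring="roc_auc")
    aucs.append(scores.mean())

mean_auc, sd_auc = np.mean(aucs), np.std(aucs)
```

Downsampling to a 1:1 ratio removes the inflation that class imbalance causes in absolute AUC; reporting the mean and spread across seeds guards against a lucky draw of which cited pages were kept.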
Citation Construct Mismatch
A critic might argue our NLP measures "attribution phrases" while Princeton's intervention added formal hyperlinked citations. We tested every citation subcategory independently:
| Citation Type | Cited Median | NC Median | r | p | Direction |
|---|---|---|---|---|---|
| Attribution phrases ("according to") | 0 | 0 | +0.069 | 0.002 | Wrong |
| Parenthetical cites (Author, 2024) | 0 | 0 | -0.000 | 0.975 | Null |
| Organization attributions | 0 | 0 | -0.026 | 0.008 | Marginal correct |
| URL citations (hyperlinks) | 0 | 0 | +0.047 | 0.001 | Wrong |
| Footnote markers [1] | 0 | 0 | +0.007 | 0.110 | Null |
| External link count (closest to Princeton) | 16 | 20 | +0.100 | 0.0001 | Wrong |
External links, the closest proxy to Princeton's "add hyperlinked source citations," go significantly in the wrong direction (cited pages have fewer: 16 vs 20). No operationalization of "cites sources" predicts AI citation in the correct direction.
Perplexity Inclusion/Exclusion
Phase 2 excluded Perplexity, which showed weak signals in Phase 1. Phase 4 includes Perplexity with a large sample (779 cited pages). Result: Perplexity statistics density replicates (p < 0.0001) but Perplexity quotation density is significantly wrong (p = 0.0009). Including or excluding Perplexity does not change the overall conclusion.
WHY TWO OF THREE CLAIMS DON'T TRANSFER
1. Custom Engine vs Production Platforms
The Princeton paper built their own generative engine for evaluation. This is a legitimate research choice, but their engine may respond to source attributions and quotation marks in ways that production platforms do not. Production platforms have different retrieval pipelines, ranking algorithms, and source selection logic.
2. Interventional vs Observational
The Princeton study modified existing content and measured changes. Our approach measured whether pages that naturally contain more of these features are more likely to be cited. These test different questions. Our observational design cannot prove that adding citations or quotations to a specific page won't help. But it can show that pages with more citations and quotations are not more likely to be cited, and in fact are less likely to be cited. That is what practitioners need to know when evaluating their existing content.
3. Why Citations Go Wrong
Citation-heavy pages are likely aggregation content: roundups, "best of" lists, meta-analyses that reference many sources. AI platforms prefer to cite the primary source, not the page that summarizes other sources. Adding "according to Harvard Business Review" to your content doesn't make your page more authoritative. It may signal that your page is derivative.
4. Why Quotations Go Wrong
Quotation-heavy pages likely correlate with news articles, interviews, and opinion pieces. These are time-sensitive content types that AI platforms deprioritize for general informational queries. A page full of "Dr. Smith said..." signals journalism, not reference material.
5. Why Statistics Work
Statistics-dense pages tend to be more substantive. Data tables, percentages, and research findings correlate with thoroughness and depth. The mechanism is content quality, not formatting. AI platforms aren't detecting numbers. They're selecting for the kind of page that naturally contains numbers.
WHAT PRINCETON GOT RIGHT
This is not a blanket dismissal. The Aggarwal et al. (2024) paper made genuine contributions:
They coined "GEO." The term "Generative Engine Optimization" gave the field a name and a framework for discussion.
They created GEO-bench. The first systematic benchmark for evaluating content optimization strategies in generative search.
They proved the field exists. By demonstrating that content modifications can change generative engine outputs, they validated that optimization is possible in principle.
The statistics claim holds. Pages with more data and evidence are more likely to be cited. This is their strongest finding and it replicates across all four production platforms.
WHAT THIS MEANS FOR THE GEO INDUSTRY
The GEO industry has operationalized the Princeton paper as a three-part formula: add statistics, add citations, add quotations. Two of those three recommendations are unsupported by production platform data and may be counterproductive.
What to stop doing
- Stop adding source citations ("according to...") to improve AI visibility. This signals aggregation content, not authority.
- Stop adding expert quotations to improve AI visibility. This correlates with news-style content that AI platforms deprioritize.
- Stop citing "+40% visibility improvement" from Aggarwal et al. as evidence for a three-part optimization formula. Only one of three parts holds.
What to keep doing
- Statistics and data. Pages with real numbers, research findings, and evidence are more likely to be cited across all platforms. This is the one Princeton claim that holds.
- Technical page features. The 7 page-level predictors from Lee (2026a) remain far stronger than any content-level feature:
| Feature | Effect | Direction |
|---|---|---|
| Internal link count | r = 0.127, fewer = cited | Positive |
| Self-referencing canonical | OR = 1.92 | Positive |
| Schema markup presence | OR = non-significant (p=0.78) for generic presence | Positive |
| Word count | Median 1,799 vs 2,114 | Positive |
| Content-to-HTML ratio | 0.086 vs 0.065 | Positive |
| Schema attribute completeness | OR = 1.21 | Positive |
| Total link count | OR = 0.47 | Negative (when external-heavy) |
FREQUENTLY ASKED QUESTIONS
"Isn't this observational while Princeton's was interventional?"
Yes, and both designs are necessary. Princeton proved you CAN modify content features on a custom benchmark engine and change visibility. We tested whether those features MATTER on real pages across production platforms. Our finding: statistics density does predict citation (supporting Princeton), but citation and quotation density predict against citation (contradicting Princeton).
The observational/interventional gap cuts both ways. If the gap invalidates our citation and quotation nulls, it also invalidates our statistics positive. You cannot selectively invoke the gap only for findings you disagree with. More importantly, the GEO industry operationalized the Princeton advice as "add statistics, citations, and quotations to your pages." Our data shows that the outcome this intervention aims to produce (pages rich in these features) is not consistently associated with AI citation. If the intervention produces the outcome but the outcome doesn't predict citation, the intervention doesn't work.
"Your regex NLP probably isn't accurate enough to measure these features."
We use 19 patterns (9 statistics, 5 citations, 5 quotations). The accuracy argument actually strengthens the statistics finding: even with noisy measurement, the signal is strong across all 4 platforms (p < 0.001 on every platform). For citations and quotations, the effect is not just absent but reversed. Measurement noise could mask a signal, but it cannot invert one. The wrong-direction finding for citations and quotations is not a measurement artifact.
"3,205 pages is still just one study."
True. But at 3,205 pages across 4 platforms with consistent results, the burden shifts. The Princeton paper used a single custom engine. We tested on the four platforms that cover essentially the entire AI search market. If the features don't predict citation on any of them, the practical relevance of the Princeton findings is limited to their custom engine.
"Different platforms. You can't compare."
We tested on all four major production AI platforms individually. Statistics density replicates on all four. Citation density goes the wrong direction on all four. Quotation density goes the wrong direction on all four. The cross-platform consistency is a strength, not a weakness.
"KDD peer review validates these results. Your blog post doesn't carry the same weight."
KDD reviewed the methodology and novelty of the GEO framework within its defined scope. Peer review does not make results permanent; replication does. Approximately 60% of published psychology findings failed replication in the Open Science Collaboration (2015). Our data is open, our methodology is documented, and anyone can replicate our analysis. We welcome that.
"The statistics finding supports Princeton. So their paper isn't wrong."
Partially correct. The statistics claim holds. But the GEO industry didn't operationalize "add statistics." It operationalized all three features as a package: statistics AND citations AND quotations. Two of the three go the wrong direction. Telling practitioners to add citations and quotations is not supported by data and may be actively harmful.
Also note: the statistics signal likely reflects content substance, not formatting. A practitioner who adds three random percentages to a thin blog post will not replicate the effect. Pages that are natively data-rich (research reports, analysis pieces, data-driven guides) are the ones being cited more. The mechanism is quality, not decoration.
"Your NLP measures 'attribution phrases,' not formal citations. Princeton meant something different."
We tested every citation subcategory independently, including external link count (the closest proxy to Princeton's "add hyperlinked source citations"). External links go significantly in the wrong direction: cited pages have fewer external links (median 16 vs 20, p = 0.0001). No operationalization of "cites sources" predicts AI citation in the correct direction. The construct mismatch argument fails on every variant we tested.
"The partial correlation (rho = 0.036) is statistically significant. You can't dismiss it."
We don't dismiss it. It's real. It explains approximately 0.1% of variance after controlling for all base features. No practitioner should invest in an optimization that explains 0.1% of the outcome when load time, internal links, and schema markup each explain orders of magnitude more. Statistical significance and practical significance are different things.
"You're contradicting your own published content."
Yes, and we are updating every post. We previously cited Aggarwal et al.'s three-part claim in over 25 posts because it was the best available evidence at the time. Now we have better evidence from 3,205 pages across 4 production platforms. We would rather correct 25+ posts than continue promoting claims that are two-thirds wrong.
"Your sample excludes UGC. That changes the composition."
UGC exclusion is a methodological improvement, not a limitation. User-generated content (Reddit threads, forum posts) introduces systematic confounds: different authorship patterns, different structural features, different intent profiles. Princeton's GEO-bench also did not use UGC content. Our sample is cleaner and more comparable to the content types that practitioners actually optimize.
"The class imbalance (2,520 cited vs 685 not-cited) inflates your results."
Yes, and we acknowledge this. The absolute AUC values in the aggregate model are inflated. That is why the per-platform comparisons (each using the same 685 not-cited pool) are more important than the aggregate model metrics. The per-platform results tell a consistent story: statistics work, citations and quotations do not.
REFERENCES
- Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). GEO: Generative Engine Optimization. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. DOI: 10.48550/arXiv.2311.09735
- Lee, J. (2026a). Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior. DOI: 10.5281/zenodo.18653093
- Open Science Collaboration. (2015). Estimating the Reproducibility of Psychological Science. Science, 349(6251), aac4716.
THE BOTTOM LINE
The Princeton GEO paper is one-third right.
Statistics density replicates. Pages with more data, percentages, and evidence are more likely to be cited across all four production AI platforms. The effect is strongest on Claude (+120.9%) and consistent everywhere. If you take one piece of Princeton advice, make it this one.
Citation density does not replicate and goes the wrong direction. Pages with more source attributions are less likely to be cited, significantly so on Google AI Mode. Adding "according to [source]" signals aggregation, not authority.
Quotation density does not replicate and goes the wrong direction. Pages with more quotations are less likely to be cited, significantly so on Perplexity and Google AI Mode. Expert quotes signal news content, not reference material.
Combined practical value: zero. Adding all Princeton features to a technical page feature model produces no improvement (AUC 0.882 with or without them). The partial independent signal is rho = 0.036, explaining 0.1% of variance.
The GEO industry's three-part formula is two-thirds wrong. Use data and evidence in your content. Skip the citation padding and quotation stuffing. And focus your optimization budget on the technical features that actually predict citation.