
AI SEO EXPERIMENTS

7 Page Features That Predict AI Citation (And 5 That Don't Matter At All)

2026-03-24


The 7 page-level features that survived FDR correction explain who wins within a query intent pool. Internal links are the strongest predictor (OR = 2.75), but only when they are navigation links, not content links. Five commonly optimized features, including page speed and author attribution, have zero statistical relationship with AI citation.

Most "AI SEO" advice treats every page element as potentially important. Add author bios. Speed up load time. Remove popups. These recommendations sound reasonable, but none of them are backed by citation-level data.

Lee (2026) tested 19,556 queries across ChatGPT, Perplexity, Claude, and Gemini and identified exactly 7 page-level features that statistically predict whether a page gets cited. Everything else is noise. This post is a deep dive into each predictor: what the data says, why the mechanism works, and how to act on it. We also cover the 5 features that practitioners commonly optimize but that have zero predictive power.

If you want the full research roundup covering all major AI citation studies, see our complete 2026 research summary. For a practical implementation checklist, see the AI SEO audit checklist.

🔢 THE 7 PREDICTORS AT A GLANCE

Before the deep dive, here is the summary table. All values are from Lee (2026), tested across 19,556 queries with Benjamini-Hochberg FDR correction applied.

| # | Feature | Odds Ratio / Effect | Beta | Direction | Why It Matters |
|---|---------|---------------------|------|-----------|----------------|
| 1 | Internal link count | OR = 2.75 | 0.73 | Positive | Strongest single predictor; navigation links drive the effect |
| 2 | Self-referencing canonical | OR = 1.92 | -- | Positive | Signals a clean, authoritative primary page |
| 3 | Schema markup presence | OR = 1.69 | -- | Positive | Type matters far more than presence alone |
| 4 | Word count | Median 2,582 vs 1,859 | -- | Positive | Cited pages are ~39% longer |
| 5 | Content-to-HTML ratio | 0.086 vs 0.065 | -- | Positive | Higher content density signals substance |
| 6 | Schema count (attribute completeness) | OR = 1.21 | -- | Positive | More attributes per schema = better |
| 7 | Total link count | OR = 0.47 | -- | Negative (when external-heavy) | External link dominance suppresses citation |

The Bottom Line: These 7 features operate as a second-level selector. They only matter after the first filter (query intent) has already determined which content pool is eligible. Intent decides the pool. These features decide the winner within it.

🔗 PREDICTOR 1: INTERNAL LINK COUNT (BETA = 0.73, OR = 2.75)

Internal links are the single strongest page-level predictor of AI citation. Pages with higher internal link counts are 2.75 times more likely to be cited, all else equal.

But the mechanism is not what most people assume.

Navigation Links, Not Content Links

The effect is driven by navigation links (menus, sidebars, footer links, breadcrumbs), not by in-content editorial links. Pages with robust site-wide navigation structures signal to AI retrieval systems that the page is part of a well-organized, authoritative site. The internal link count acts as a proxy for site architecture quality.

This distinction matters for implementation. Adding 50 inline links to other blog posts will not replicate the effect. Building a proper navigation architecture with consistent menus, breadcrumb trails, and related-content sidebars will.
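To audit this distinction on your own pages, you can classify anchor tags by their container. The sketch below is a minimal, assumption-laden version: it treats links inside `<nav>`, `<header>`, `<footer>`, and `<aside>` as navigation links and everything else as content links. The container list is our own heuristic for a typical page layout, not a definition from Lee (2026).

```python
from html.parser import HTMLParser

# Assumed heuristic: these container tags hold navigation links.
NAV_CONTAINERS = {"nav", "header", "footer", "aside"}

class LinkClassifier(HTMLParser):
    """Count <a> tags inside vs. outside navigation containers."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0       # how many nav-type containers we are inside
        self.nav_links = 0
        self.content_links = 0

    def handle_starttag(self, tag, attrs):
        if tag in NAV_CONTAINERS:
            self.nav_depth += 1
        elif tag == "a":
            if self.nav_depth > 0:
                self.nav_links += 1
            else:
                self.content_links += 1

    def handle_endtag(self, tag):
        if tag in NAV_CONTAINERS and self.nav_depth:
            self.nav_depth -= 1

html = """
<nav><a href="/">Home</a><a href="/blog">Blog</a></nav>
<article><p>See <a href="/guide">the guide</a>.</p></article>
<footer><a href="/about">About</a></footer>
"""
parser = LinkClassifier()
parser.feed(html)
print(parser.nav_links, parser.content_links)  # 3 navigation links, 1 content link
```

A page whose content links dwarf its navigation links is the pattern this predictor says will not help you.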

Why Navigation Links Work

AI platforms that use live fetching (ChatGPT, Claude) parse the full HTML of your page. A dense navigation structure tells the retrieval system:

  • This page belongs to a real, maintained website (not a thin affiliate page)
  • There are related resources the model could also reference
  • The site has topical depth around the subject

Aggarwal et al. (2024) found that "citing sources" was among the most effective GEO strategies (up to +40% visibility). Internal navigation links serve a structural version of the same purpose: they demonstrate the page exists within a network of supporting content.

The Bottom Line: Audit your navigation architecture before adding inline content links. Breadcrumbs, category menus, sidebar navigation, and footer site maps all contribute to the internal link count that predicts citation. For a full schema and navigation audit, see our AI SEO audit checklist.

🏷️ PREDICTOR 2: SELF-REFERENCING CANONICAL (OR = 1.92)

Pages with a self-referencing canonical tag are 1.92 times more likely to be cited by AI platforms. This is the second strongest positive predictor.

A self-referencing canonical tag (<link rel="canonical" href="[this page's URL]">) tells crawlers and retrieval systems that this URL is the primary, authoritative version of the content. It signals:

  • No duplicate content issues
  • This is not a syndicated copy
  • This is not a paginated fragment or print version

AI retrieval pipelines, particularly those that crawl and index pages (like Perplexity's PerplexityBot), appear to use the canonical tag as a quality filter. Pages without a canonical, or with a canonical pointing elsewhere, may be deprioritized or skipped entirely.

Implementation

Every page on your site should have a self-referencing canonical tag. This is a one-time technical fix that takes minutes to implement in most CMS platforms.

<link rel="canonical" href="https://yoursite.com/current-page-url" />

If you have paginated content, make sure each page canonicalizes to itself (not to page 1). If you syndicate content, ensure the original always has a self-referencing canonical.
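A site-wide check is easy to script. This sketch compares a page's canonical `href` against its own URL; the trailing-slash and case normalization is our own assumption and should be adapted to your site's URL conventions.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Extract the href of <link rel="canonical"> from raw HTML."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            d = dict(attrs)
            if d.get("rel") == "canonical":
                self.canonical = d.get("href")

def is_self_referencing(page_url: str, html: str) -> bool:
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return False  # missing canonical: treat as a failure
    # Assumed normalization: ignore trailing slash and case.
    norm = lambda u: u.rstrip("/").lower()
    return norm(finder.canonical) == norm(page_url)

html = '<head><link rel="canonical" href="https://yoursite.com/current-page-url" /></head>'
print(is_self_referencing("https://yoursite.com/current-page-url", html))  # True
print(is_self_referencing("https://yoursite.com/other-page", html))        # False
```

Run a check like this across your sitemap and fix every page that returns False.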

The Bottom Line: This is the easiest win on the list. A missing or incorrect canonical tag costs you 1.92x citation probability. Fix it site-wide.

🧩 PREDICTOR 3: SCHEMA MARKUP (OR = 1.69, BUT TYPE MATTERS MORE)

Pages with schema markup are 1.69 times more likely to be cited. But this headline number conceals the real story: schema type matters far more than schema presence.

Our schema markup deep dive analyzed 3,251 real websites (UGC excluded) and found dramatic variation by schema type:

| Schema Type | Odds Ratio | Direction |
|-------------|------------|-----------|
| Product | 3.09 | Strong positive |
| Review | 2.24 | Strong positive |
| FAQPage | 1.39 | Moderate positive |
| Organization | 1.08 | Not significant |
| Breadcrumb | 0.99 | Not significant |
| Article | 0.76 | Negative |
| Any schema (generic) | 1.02 | Not significant |

The generic "any schema" measure is statistically indistinguishable from zero (OR = 1.02, p = 0.78). The aggregate OR = 1.69 from Lee (2026) reflects the fact that pages with schema tend to have the right kind of schema. But if you add Breadcrumb or Organization schema expecting a citation boost, the data says you will get nothing.

Article schema actually decreases citation probability (OR = 0.76). The likely mechanism: Article schema signals editorial/opinion content, while AI platforms preferentially cite factual, structured, extractable content.

The Bottom Line: Do not treat schema as a checkbox. Implement Product, Review, or FAQPage schema with high attribute completeness. Avoid relying on Article schema for citation visibility. See our complete schema guide for type-by-type implementation instructions.

📏 PREDICTOR 4: WORD COUNT (MEDIAN 2,582 VS 1,859)

Cited pages have a median word count of 2,582. Non-cited pages have a median of 1,859. That is a 39% difference.

This does not mean longer is always better. The word count effect reflects a threshold: pages need enough content to provide a comprehensive, citable answer to a query. Below approximately 1,800 words, the probability of citation drops meaningfully. Above approximately 2,500 words, the marginal benefit flattens.

The Content Length Sweet Spot

| Word Count Range | Citation Likelihood | Interpretation |
|------------------|---------------------|----------------|
| Under 1,000 | Low | Too thin to provide comprehensive answers |
| 1,000 to 1,800 | Below average | May lack depth for informational queries |
| 1,800 to 3,000 | Above average | Sweet spot for most query types |
| 3,000 to 5,000 | Above average | Good for comprehensive guides |
| Over 5,000 | Diminishing returns | Length alone does not help; structure matters more |

The important nuance: word count interacts with content-to-HTML ratio (Predictor 5). A 3,000-word page with a low content-to-HTML ratio (heavy boilerplate, ads, sidebar widgets) performs worse than a 2,500-word page with a high ratio. Raw length is a necessary but not sufficient condition.

Aggarwal et al. (2024) found that "adding statistics" and "citing sources" were among the most effective GEO strategies. These strategies naturally increase word count while also increasing information density, a combination that moves both Predictor 4 and Predictor 5 in the right direction.

The Bottom Line: Aim for 2,000 to 3,000 words of substantive content for pages targeting informational or discovery queries. Do not pad. Every word should add information density.

⚖️ PREDICTOR 5: CONTENT-TO-HTML RATIO (0.086 VS 0.065)

Cited pages have a mean content-to-HTML ratio of 0.086. Non-cited pages average 0.065. This 32% difference is statistically significant after FDR correction.

Content-to-HTML ratio measures the proportion of visible text content relative to total HTML markup. A higher ratio means more substance and less boilerplate. A lower ratio means the page is dominated by navigation markup, ad code, tracking scripts, widget HTML, and other non-content elements.

What Drives Low Ratios

  • Heavy ad placement (especially mid-content ad blocks)
  • Complex JavaScript frameworks that generate excessive wrapper HTML
  • Sidebar widgets, pop-up modals, and notification banners
  • Cookie consent overlays with large embedded HTML
  • Excessive inline styling and redundant div nesting

What Drives High Ratios

  • Clean, semantic HTML
  • Minimal non-content markup
  • Server-side rendered content (vs. client-side hydration)
  • Efficient CSS (external stylesheets, not inline styles)
  • Content-first page architecture

AI retrieval bots parse your raw HTML. When a large percentage of that HTML is non-content markup, the signal-to-noise ratio drops. The model's ability to extract useful, citable information is directly affected by how much of the page is actually content.
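You can approximate your own ratio with a few lines of code. The sketch below divides visible text characters by total HTML characters, skipping `<script>` and `<style>` bodies. Real analyzers differ in what they count (attributes, comments, whitespace), so treat this as a rough diagnostic rather than the exact metric from the study.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulate visible text length, ignoring script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.text_chars += len(data.strip())

def content_ratio(html: str) -> float:
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.text_chars / max(len(html), 1)

html = "<html><head><script>var x=1;</script></head><body><p>Some substantive content.</p></body></html>"
print(round(content_ratio(html), 3))
```

Pages scoring below roughly 0.07 on a measure like this are the ones the study flags as structurally suppressed.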

The Bottom Line: Strip unnecessary HTML bloat. If your content-to-HTML ratio is below 0.07, you have a structural problem that suppresses AI citation probability. Run your pages through a markup analyzer and remove non-essential wrapper elements.

📊 PREDICTOR 6: SCHEMA COUNT (ATTRIBUTE COMPLETENESS)

The schema count predictor (OR = 1.21) measures something different from schema presence (Predictor 3). Schema count captures the total number of schema attributes and nested properties on a page. More attributes mean higher completeness.

A page with Product schema that includes name, description, brand, price, availability, rating, review count, SKU, and image has a much higher schema count than a page with Product schema that only includes name and price. The completeness of your schema implementation matters.

Completeness Benchmarks

| Schema Type | Minimum Attributes | Target Attributes | Impact |
|-------------|--------------------|-------------------|--------|
| Product | name, price | name, description, brand, price, currency, availability, rating, reviewCount, sku, image | High |
| Review | reviewBody, rating | reviewBody, rating, author, datePublished, itemReviewed (with nested attributes) | High |
| FAQPage | 3 Q&A pairs | 5+ Q&A pairs with detailed answers | Moderate |

The mechanism is straightforward: AI platforms that process structured data can extract more useful facts from complete schema than from sparse schema. A product page with full specifications in structured data is a better citation source for comparison queries than one with only a name and price.
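A quick way to compare sparse and complete markup is to count filled attributes in a JSON-LD block, including nested objects. The counting rule below (non-empty leaf values, `@`-prefixed keys excluded) is our own assumption for illustration, not the metric used in the study.

```python
def attribute_count(node) -> int:
    """Count non-empty leaf attributes in a JSON-LD structure, recursively."""
    if isinstance(node, dict):
        # Skip @type/@context and similar JSON-LD keywords.
        return sum(attribute_count(v) for k, v in node.items() if not k.startswith("@"))
    if isinstance(node, list):
        return sum(attribute_count(v) for v in node)
    return 1 if node not in (None, "", []) else 0

# Hypothetical product pages: one sparse, one with full attributes.
sparse = {"@type": "Product", "name": "Widget",
          "offers": {"@type": "Offer", "price": "19.99"}}
complete = {
    "@type": "Product",
    "name": "Widget",
    "description": "A compact widget.",
    "brand": {"@type": "Brand", "name": "Acme"},
    "sku": "W-100",
    "image": "https://example.com/w.jpg",
    "aggregateRating": {"@type": "AggregateRating",
                        "ratingValue": "4.6", "reviewCount": "128"},
    "offers": {"@type": "Offer", "price": "19.99",
               "priceCurrency": "USD",
               "availability": "https://schema.org/InStock"},
}
print(attribute_count(sparse), attribute_count(complete))  # 2 vs 10
```

The sparse block scores 2; the complete block scores 10. That gap is what the schema count predictor is picking up.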

The Bottom Line: Do not just add schema. Fill it out completely. Every empty or missing attribute is a missed opportunity for the retrieval system to extract structured facts from your page.

🔗 PREDICTOR 7: TOTAL LINK COUNT (OR = 0.47, NEGATIVE WHEN EXTERNAL-HEAVY)

Total link count is the only negative predictor among the 7 significant features. Pages with high total link counts are less likely to be cited, but only when external links dominate.

This finding requires careful decomposition. Internal links (Predictor 1) are strongly positive (OR = 2.75). Total links are negative (OR = 0.47). The math resolves when you look at the link ratio.

The Link Ratio Decomposition

Lee (2026) decomposed total links into internal and external components and found a clear pattern:

| Link Profile | Internal Link % | External Link % | Citation Rate |
|--------------|-----------------|-----------------|---------------|
| High internal, low external | ~70%+ | ~30% or less | 59.7% |
| Balanced | ~50% | ~50% | 38.2% |
| Low internal, high external | ~30% or less | ~70%+ | 21.4% |

Pages with high internal links and low external links have a 59.7% citation rate. Pages with the inverse profile have a 21.4% rate. The difference is nearly 3x.

Why External Links Suppress Citation

The likely mechanism is twofold:

  1. Affiliate signal. Pages with many external links (especially to commercial sites) pattern-match to affiliate content, listicles, and sponsored posts. AI platforms may deprioritize these as lower-quality sources.
  2. Authority leakage signal. A page that links out heavily signals that the authoritative information lives elsewhere. AI retrieval systems may follow those external links to the primary source rather than citing the intermediary.

This does not mean you should never link externally. Citing academic sources, linking to primary data, and referencing official documentation all serve legitimate purposes. The key is proportion: external links should be a minority of your total link profile, with internal navigation links making up the majority.
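The audit itself is mechanical: classify every `href` as internal or external by hostname and compute the share. The sketch below treats relative URLs and same-host absolute URLs as internal; subdomain handling is deliberately naive and is our own simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

def link_ratio(html: str, site_host: str):
    """Return (internal_count, external_count, internal_share) for a page."""
    hrefs = []

    class AnchorCollector(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    hrefs.append(href)

    AnchorCollector().feed(html)
    internal = external = 0
    for href in hrefs:
        host = urlparse(href).netloc
        if host == "" or host == site_host:  # relative or same-host = internal
            internal += 1
        else:
            external += 1
    total = internal + external
    return internal, external, (internal / total if total else 0.0)

html = ('<a href="/guide">g</a>'
        '<a href="https://yoursite.com/faq">f</a>'
        '<a href="https://vendor.example/tool">t</a>')
print(link_ratio(html, "yoursite.com"))  # internal share of 2/3
```

Pages where the internal share falls below roughly 70% are candidates for restructuring under the table above.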

The Bottom Line: Audit your link ratio. If external links outnumber internal links on key pages, restructure. High internal + low external = 59.7% citation rate. The inverse = 21.4%.

🚫 5 FEATURES THAT DO NOT PREDICT AI CITATION

Equally important as what predicts citation is what does not. Lee (2026) tested numerous additional page features that failed to reach statistical significance after FDR correction. Five of these are commonly targeted by "AI SEO" advice:

| Feature | Common Advice | Actual Result | p-value |
|---------|---------------|---------------|---------|
| Page load time | "Faster pages get cited more" | Not significant | > 0.10 |
| Page size (bytes) | "Lighter pages rank better" | Not significant | > 0.10 |
| Pop-up/modal presence | "Remove popups for AI bots" | Not significant | > 0.10 |
| Author attribution | "Add author bios for E-E-A-T" | Not significant | > 0.10 |
| Affiliate link presence | "Remove affiliate links" | Not significant | > 0.10 |

Why These Do Not Matter

Page load time and page size: AI retrieval bots are not humans. They do not abandon slow-loading pages. ChatGPT's fetcher and Claude's retrieval bot wait for the full page to load, then parse the HTML. A 3-second load time has no effect on whether the bot can extract useful content.

Pop-ups and modals: AI bots parse the HTML source, not the rendered visual experience. A cookie consent popup that annoys human visitors is invisible to a bot reading raw HTML. The popup's HTML exists in the source code but does not block content extraction.

Author attribution: This is the most surprising non-finding for practitioners steeped in Google's E-E-A-T framework. Author bios, bylines, and author schema have zero predictive power for AI citation. AI platforms evaluate content substance, not authorship credentials. The content either answers the query in a structured, extractable way, or it does not.

Affiliate links: While external link dominance suppresses citation (Predictor 7), the mere presence of affiliate links does not. A page can contain affiliate links and still get cited, as long as the overall link profile remains internal-dominant and the content provides genuine informational value.

The Bottom Line: Stop optimizing for page speed, popup removal, or author bios if your goal is AI citation. These are valid optimizations for human UX and traditional SEO, but they have no measurable effect on whether AI platforms cite your content.

🏗️ THE TWO-LEVEL MODEL: INTENT FIRST, FEATURES SECOND

The 7 predictors do not operate in isolation. Lee (2026) demonstrated a two-level hierarchical model where query intent acts as the first filter and page features act as the second selector.

How It Works

Level 1: Query intent determines the eligible content pool. Intent distributions vary significantly by vertical (chi-squared(28) = 5,195, p < .001, Cramer's V = 0.258). A comparison query will never cite a pure informational article, regardless of how many internal links it has. A discovery query will never cite an academic paper, regardless of word count.

| Intent Type | Query Share | Typical Cited Sources |
|-------------|-------------|-----------------------|
| Informational | 61.3% | Wikipedia, .gov/.edu, tutorials |
| Discovery | 31.2% | Review aggregators, YouTube, listicles |
| Validation | 3.2% | Brand sites, Reddit (web UI only) |
| Comparison | 2.3% | Publisher/media, review sites |
| Review-seeking | 2.0% | YouTube, TechRadar/PCMag, Reddit |

Level 2: Page features select winners within the pool. Once intent has filtered the pool, the 7 page features determine which pages get cited. The logistic regression achieves AUC = 0.594 at this level.

The critical finding: adding intent features to the page-level model provides zero additional predictive power (likelihood ratio p = .78). This means intent and page features operate independently. Intent decides eligibility. Features decide selection. They do not interact.

This has a direct practical implication: optimizing page features without first matching query intent is wasted effort. You must be in the right pool before the 7 features can help you win.
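The two-level logic can be sketched as a hard filter followed by a score. In the toy model below, only the 0.73 coefficient for internal links comes from Lee (2026); every other weight, and the page records themselves, are invented placeholders to illustrate the structure, not the study's fitted model.

```python
import math

def eligible(page_intent: str, query_intent: str) -> bool:
    # Level 1: intent decides the pool; mismatched pages are never cited.
    return page_intent == query_intent

def feature_score(page: dict) -> float:
    # Level 2: logistic score over page features.
    z = (
        0.73 * math.log1p(page["internal_links"])  # log-linear term; beta from the study
        + 0.65 * page["self_canonical"]            # placeholder weight
        + 0.50 * page["schema_present"]            # placeholder weight
        - 1.5                                      # placeholder intercept
    )
    return 1 / (1 + math.exp(-z))

pages = [
    {"intent": "informational", "internal_links": 40,
     "self_canonical": 1, "schema_present": 1},
    {"intent": "discovery", "internal_links": 200,
     "self_canonical": 1, "schema_present": 1},
]
# Intent filters first; features only rank the survivors.
pool = [p for p in pages if eligible(p["intent"], "informational")]
ranked = sorted(pool, key=feature_score, reverse=True)
print(len(pool))  # only the intent-matched page survives level 1
```

Note that the discovery page's 200 internal links never enter the computation for an informational query, which is exactly the "wrong pool" failure mode described above.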

For a complete breakdown of query intent categories and how to align your content, see our query intent and AI citation research.

📋 IMPLEMENTATION PRIORITY TABLE

Based on the odds ratios, here is the recommended optimization order:

| Priority | Action | Expected Impact | Effort |
|----------|--------|-----------------|--------|
| 1 | Audit query intent alignment | Determines pool eligibility | Medium (content strategy) |
| 2 | Build internal navigation architecture | OR = 2.75 | Medium (site-wide) |
| 3 | Add self-referencing canonicals | OR = 1.92 | Low (one-time technical) |
| 4 | Implement correct schema types with full attributes | OR = 1.69 + 1.21 | Medium (per-page) |
| 5 | Expand thin pages to 2,000+ words | Median gap: 723 words | Medium (content) |
| 6 | Improve content-to-HTML ratio | 0.086 vs 0.065 target | Medium (technical) |
| 7 | Rebalance link ratio (reduce external) | 59.7% vs 21.4% citation rate | Low to medium |

For a hands-on audit of your site against these factors, use our free AI Visibility Quick Check or explore our full AI SEO audit service.

❓ FREQUENTLY ASKED QUESTIONS

Does page speed affect AI citations at all?

No. Lee (2026) found page load time had no statistically significant relationship with AI citation after FDR correction. AI retrieval bots parse HTML content regardless of load speed. Page speed matters for human UX and traditional Google ranking, but it has zero measurable effect on whether ChatGPT, Perplexity, Claude, or Gemini cite your page.

How many internal links should a page have to maximize citation probability?

The data does not identify a specific threshold. The relationship is log-linear (beta = 0.73), meaning each incremental internal link adds a diminishing marginal benefit. Focus on navigation architecture (menus, breadcrumbs, sidebars, footer links) rather than stuffing inline content links. Pages with robust site-wide navigation naturally accumulate the internal link counts that predict citation.

Why does Article schema hurt AI citation odds?

Article schema (OR = 0.76) signals editorial or opinion content. AI platforms preferentially cite factual, structured, extractable content when answering queries. Pages marked as Articles may be deprioritized in favor of pages with Product, Review, or FAQPage schema that provide structured data the model can directly extract and synthesize. See our schema markup guide for alternative schema strategies.

Is the word count effect just a proxy for content quality?

Partially. The median gap (2,582 vs 1,859 words) reflects the fact that comprehensive, citable content requires sufficient depth. But word count interacts with content-to-HTML ratio: a 3,000-word page with heavy boilerplate and low content density performs worse than a 2,500-word page with clean, content-focused HTML. The data supports a sweet spot of roughly 2,000 to 3,000 substantive words, not padding.

Can I have external links and still get cited?

Yes, but proportion matters. The link ratio decomposition shows that pages with 70%+ internal links and 30% or fewer external links achieve a 59.7% citation rate. The inverse ratio (external-dominant) drops to 21.4%. A few well-chosen external links to primary sources or official documentation will not hurt you. Dozens of affiliate or outbound links will.

📚 REFERENCES

  • Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). "GEO: Generative Engine Optimization." KDD 2024. DOI
  • Lee, A. (2026). "Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior." Preprint v5. DOI
  • Sellm (2025). "ChatGPT Citation Analysis." Industry report (400K+ pages analyzed).
  • Tian, Y. et al. (2025). "Diagnosing and Repairing Citation Failures in Generative Engine Optimization." Preprint.