
AI SEO EXPERIMENTS

7 Page Features That Predict AI Citation (And 5 That Don't Matter At All)

2026-03-24


The 7 page-level features that survived FDR correction explain who wins within a query intent pool. Internal links are the strongest predictor (OR = 2.75), but only when they are navigation links, not content links. Five commonly optimized features, including page speed and author attribution, have zero statistical relationship with AI citation.

Most "AI SEO" advice treats every page element as potentially important. Add author bios. Speed up load time. Remove popups. These recommendations sound reasonable, but none of them are backed by citation-level data.

Lee (2026) tested 19,556 queries across ChatGPT, Perplexity, Claude, and Gemini and identified exactly 7 page-level features that statistically predict whether a page gets cited. Everything else is noise. This post is a deep dive into each predictor: what the data says, why the mechanism works, and how to act on it. We also cover the 5 features that practitioners commonly optimize but that have zero predictive power.

If you want the full research roundup covering all major AI citation studies, see our complete 2026 research summary. For a practical implementation checklist, see the AI SEO audit checklist.

🔢 THE 7 PREDICTORS AT A GLANCE

Before the deep dive, here is the summary table. All values are from Lee (2026), tested across 19,556 queries with Benjamini-Hochberg FDR correction applied.

| # | Feature | Odds Ratio / Effect | Beta | Direction | Why It Matters |
|---|---------|---------------------|------|-----------|----------------|
| 1 | Internal link count | OR = 2.75 | 0.73 | Positive | Strongest single predictor; navigation links drive the effect |
| 2 | Self-referencing canonical | OR = 1.92 | -- | Positive | Signals a clean, authoritative primary page |
| 3 | Schema markup presence | OR = 1.69 | -- | Positive | Type matters far more than presence alone |
| 4 | Word count | Median 2,582 vs 1,859 | -- | Positive | Cited pages are ~39% longer |
| 5 | Content-to-HTML ratio | 0.086 vs 0.065 | -- | Positive | Higher content density signals substance |
| 6 | Schema count (attribute completeness) | OR = 1.21 | -- | Positive | More attributes per schema = better |
| 7 | Total link count | OR = 0.47 | -- | Negative (when external-heavy) | External link dominance suppresses citation |

The Bottom Line: These 7 features operate as a second-level selector. They only matter after the first filter (query intent) has already determined which content pool is eligible. Intent decides the pool. These features decide the winner within it.

🔗 PREDICTOR 1: INTERNAL LINK COUNT (BETA = 0.73, OR = 2.75)

Internal links are the single strongest page-level predictor of AI citation. Pages with higher internal link counts are 2.75 times more likely to be cited, all else equal.

But the mechanism is not what most people assume.

Navigation Links, Not Content Links

The effect is driven by navigation links (menus, sidebars, footer links, breadcrumbs), not by in-content editorial links. Pages with robust site-wide navigation structures signal to AI retrieval systems that the page is part of a well-organized, authoritative site. The internal link count acts as a proxy for site architecture quality.

This distinction matters for implementation. Adding 50 inline links to other blog posts will not replicate the effect. Building a proper navigation architecture with consistent menus, breadcrumb trails, and related-content sidebars will.
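To audit this distinction on your own pages, you can classify anchor tags by their container. The sketch below is a minimal, assumption-laden version: it treats links inside `<nav>`, `<header>`, `<footer>`, and `<aside>` as navigation links and everything else as content links. The container list is our own heuristic for a typical page layout, not a definition from Lee (2026).

```python
from html.parser import HTMLParser

# Assumed heuristic: these container tags hold navigation links.
NAV_CONTAINERS = {"nav", "header", "footer", "aside"}

class LinkClassifier(HTMLParser):
    """Count <a> tags inside vs. outside navigation containers."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0       # how many nav-type containers we are inside
        self.nav_links = 0
        self.content_links = 0

    def handle_starttag(self, tag, attrs):
        if tag in NAV_CONTAINERS:
            self.nav_depth += 1
        elif tag == "a":
            if self.nav_depth > 0:
                self.nav_links += 1
            else:
                self.content_links += 1

    def handle_endtag(self, tag):
        if tag in NAV_CONTAINERS and self.nav_depth:
            self.nav_depth -= 1

html = """
<nav><a href="/">Home</a><a href="/blog">Blog</a></nav>
<article><p>See <a href="/guide">the guide</a>.</p></article>
<footer><a href="/about">About</a></footer>
"""
parser = LinkClassifier()
parser.feed(html)
print(parser.nav_links, parser.content_links)  # 3 navigation links, 1 content link
```

A page whose content links dwarf its navigation links is the pattern this predictor says will not help you.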

Why Navigation Links Work

AI platforms that use live fetching (ChatGPT, Claude) parse the full HTML of your page. A dense navigation structure tells the retrieval system:

  • This page belongs to a real, maintained website (not a thin affiliate page)
  • There are related resources the model could also reference
  • The site has topical depth around the subject

Aggarwal et al. (2024) found that "citing sources" was among the most effective GEO strategies (up to +40% visibility). Internal navigation links serve a structural version of the same purpose: they demonstrate the page exists within a network of supporting content.

The Bottom Line: Audit your navigation architecture before adding inline content links. Breadcrumbs, category menus, sidebar navigation, and footer site maps all contribute to the internal link count that predicts citation. For a full schema and navigation audit, see our AI SEO audit checklist.

🏷️ PREDICTOR 2: SELF-REFERENCING CANONICAL (OR = 1.92)

Pages with a self-referencing canonical tag are 1.92 times more likely to be cited by AI platforms. This is the second strongest positive predictor.

A self-referencing canonical tag (<link rel="canonical" href="[this page's URL]">) tells crawlers and retrieval systems that this URL is the primary, authoritative version of the content. It signals:

  • No duplicate content issues
  • This is not a syndicated copy
  • This is not a paginated fragment or print version

AI retrieval pipelines, particularly those that crawl and index pages (like Perplexity's PerplexityBot), appear to use the canonical tag as a quality filter. Pages without a canonical, or with a canonical pointing elsewhere, may be deprioritized or skipped entirely.

Implementation

Every page on your site should have a self-referencing canonical tag. This is a one-time technical fix that takes minutes to implement in most CMS platforms.

<link rel="canonical" href="https://yoursite.com/current-page-url" />

If you have paginated content, make sure each page canonicalizes to itself (not to page 1). If you syndicate content, ensure the original always has a self-referencing canonical.
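A site-wide check is easy to script. This sketch compares a page's canonical `href` against its own URL; the trailing-slash and case normalization is our own assumption and should be adapted to your site's URL conventions.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Extract the href of <link rel="canonical"> from raw HTML."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            d = dict(attrs)
            if d.get("rel") == "canonical":
                self.canonical = d.get("href")

def is_self_referencing(page_url: str, html: str) -> bool:
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return False  # missing canonical: treat as a failure
    # Assumed normalization: ignore trailing slash and case.
    norm = lambda u: u.rstrip("/").lower()
    return norm(finder.canonical) == norm(page_url)

html = '<head><link rel="canonical" href="https://yoursite.com/current-page-url" /></head>'
print(is_self_referencing("https://yoursite.com/current-page-url", html))  # True
print(is_self_referencing("https://yoursite.com/other-page", html))        # False
```

Run a check like this across your sitemap and fix every page that returns False.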

The Bottom Line: This is the easiest win on the list. A missing or incorrect canonical tag costs you 1.92x citation probability. Fix it site-wide.

🧩 PREDICTOR 3: SCHEMA MARKUP (OR = 1.69, BUT TYPE MATTERS MORE)

Pages with schema markup are 1.69 times more likely to be cited. But this headline number conceals the real story: schema type matters far more than schema presence.

Our schema markup deep dive analyzed 3,251 real websites (UGC excluded) and found dramatic variation by schema type:

| Schema Type | Odds Ratio | Direction |
|-------------|------------|-----------|
| Product | 3.09 | Strong positive |
| Review | 2.24 | Strong positive |
| FAQPage | 1.39 | Moderate positive |
| Organization | 1.08 | Not significant |
| Breadcrumb | 0.99 | Not significant |
| Article | 0.76 | Negative |
| Any schema (generic) | 1.02 | Not significant |

The generic "any schema" measure is statistically indistinguishable from zero (OR = 1.02, p = 0.78). The aggregate OR = 1.69 from Lee (2026) reflects the fact that pages with schema tend to have the right kind of schema. But if you add Breadcrumb or Organization schema expecting a citation boost, the data says you will get nothing.

Article schema actually decreases citation probability (OR = 0.76). The likely mechanism: Article schema signals editorial/opinion content, while AI platforms preferentially cite factual, structured, extractable content.

The Bottom Line: Do not treat schema as a checkbox. Implement Product, Review, or FAQPage schema with high attribute completeness. Avoid relying on Article schema for citation visibility. See our complete schema guide for type-by-type implementation instructions.

📏 PREDICTOR 4: WORD COUNT (MEDIAN 2,582 VS 1,859)

Cited pages have a median word count of 2,582. Non-cited pages have a median of 1,859. That is a 39% difference.

This does not mean longer is always better. The word count effect reflects a threshold: pages need enough content to provide a comprehensive, citable answer to a query. Below approximately 1,800 words, the probability of citation drops meaningfully. Above approximately 2,500 words, the marginal benefit flattens.

The Content Length Sweet Spot

| Word Count Range | Citation Likelihood | Interpretation |
|------------------|---------------------|----------------|
| Under 1,000 | Low | Too thin to provide comprehensive answers |
| 1,000 to 1,800 | Below average | May lack depth for informational queries |
| 1,800 to 3,000 | Above average | Sweet spot for most query types |
| 3,000 to 5,000 | Above average | Good for comprehensive guides |
| Over 5,000 | Diminishing returns | Length alone does not help; structure matters more |

The important nuance: word count interacts with content-to-HTML ratio (Predictor 5). A 3,000-word page with a low content-to-HTML ratio (heavy boilerplate, ads, sidebar widgets) performs worse than a 2,500-word page with a high ratio. Raw length is a necessary but not sufficient condition.

Aggarwal et al. (2024) found that "adding statistics" and "citing sources" were among the most effective GEO strategies. These strategies naturally increase word count while also increasing information density, a combination that moves both Predictor 4 and Predictor 5 in the right direction.

The Bottom Line: Aim for 2,000 to 3,000 words of substantive content for pages targeting informational or discovery queries. Do not pad. Every word should add information density.

⚖️ PREDICTOR 5: CONTENT-TO-HTML RATIO (0.086 VS 0.065)

Cited pages have a mean content-to-HTML ratio of 0.086. Non-cited pages average 0.065. This 32% difference is statistically significant after FDR correction.

Content-to-HTML ratio measures the proportion of visible text content relative to total HTML markup. A higher ratio means more substance and less boilerplate. A lower ratio means the page is dominated by navigation markup, ad code, tracking scripts, widget HTML, and other non-content elements.

What Drives Low Ratios

  • Heavy ad placement (especially mid-content ad blocks)
  • Complex JavaScript frameworks that generate excessive wrapper HTML
  • Sidebar widgets, pop-up modals, and notification banners
  • Cookie consent overlays with large embedded HTML
  • Excessive inline styling and redundant div nesting

What Drives High Ratios

  • Clean, semantic HTML
  • Minimal non-content markup
  • Server-side rendered content (vs. client-side hydration)
  • Efficient CSS (external stylesheets, not inline styles)
  • Content-first page architecture

AI retrieval bots parse your raw HTML. When a large percentage of that HTML is non-content markup, the signal-to-noise ratio drops. The model's ability to extract useful, citable information is directly affected by how much of the page is actually content.
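You can approximate your own ratio with a few lines of code. The sketch below divides visible text characters by total HTML characters, skipping `<script>` and `<style>` bodies. Real analyzers differ in what they count (attributes, comments, whitespace), so treat this as a rough diagnostic rather than the exact metric from the study.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulate visible text length, ignoring script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.text_chars += len(data.strip())

def content_ratio(html: str) -> float:
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.text_chars / max(len(html), 1)

html = "<html><head><script>var x=1;</script></head><body><p>Some substantive content.</p></body></html>"
print(round(content_ratio(html), 3))
```

Pages scoring below roughly 0.07 on a measure like this are the ones the study flags as structurally suppressed.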

The Bottom Line: Strip unnecessary HTML bloat. If your content-to-HTML ratio is below 0.07, you have a structural problem that suppresses AI citation probability. Run your pages through a markup analyzer and remove non-essential wrapper elements.

📊 PREDICTOR 6: SCHEMA COUNT (ATTRIBUTE COMPLETENESS)

The schema count predictor (OR = 1.21) measures something different from schema presence (Predictor 3). Schema count captures the total number of schema attributes and nested properties on a page. More attributes mean higher completeness.

A page with Product schema that includes name, description, brand, price, availability, rating, review count, SKU, and image has a much higher schema count than a page with Product schema that only includes name and price. The completeness of your schema implementation matters.

Completeness Benchmarks

| Schema Type | Minimum Attributes | Target Attributes | Impact |
|-------------|--------------------|-------------------|--------|
| Product | name, price | name, description, brand, price, currency, availability, rating, reviewCount, sku, image | High |
| Review | reviewBody, rating | reviewBody, rating, author, datePublished, itemReviewed (with nested attributes) | High |
| FAQPage | 3 Q&A pairs | 5+ Q&A pairs with detailed answers | Moderate |

The mechanism is straightforward: AI platforms that process structured data can extract more useful facts from complete schema than from sparse schema. A product page with full specifications in structured data is a better citation source for comparison queries than one with only a name and price.
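A quick way to compare sparse and complete markup is to count filled attributes in a JSON-LD block, including nested objects. The counting rule below (non-empty leaf values, `@`-prefixed keys excluded) is our own assumption for illustration, not the metric used in the study.

```python
def attribute_count(node) -> int:
    """Count non-empty leaf attributes in a JSON-LD structure, recursively."""
    if isinstance(node, dict):
        # Skip @type/@context and similar JSON-LD keywords.
        return sum(attribute_count(v) for k, v in node.items() if not k.startswith("@"))
    if isinstance(node, list):
        return sum(attribute_count(v) for v in node)
    return 1 if node not in (None, "", []) else 0

# Hypothetical product pages: one sparse, one with full attributes.
sparse = {"@type": "Product", "name": "Widget",
          "offers": {"@type": "Offer", "price": "19.99"}}
complete = {
    "@type": "Product",
    "name": "Widget",
    "description": "A compact widget.",
    "brand": {"@type": "Brand", "name": "Acme"},
    "sku": "W-100",
    "image": "https://example.com/w.jpg",
    "aggregateRating": {"@type": "AggregateRating",
                        "ratingValue": "4.6", "reviewCount": "128"},
    "offers": {"@type": "Offer", "price": "19.99",
               "priceCurrency": "USD",
               "availability": "https://schema.org/InStock"},
}
print(attribute_count(sparse), attribute_count(complete))  # 2 vs 10
```

The sparse block scores 2; the complete block scores 10. That gap is what the schema count predictor is picking up.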

The Bottom Line: Do not just add schema. Fill it out completely. Every empty or missing attribute is a missed opportunity for the retrieval system to extract structured facts from your page.

🔗 PREDICTOR 7: TOTAL LINK COUNT (OR = 0.47, NEGATIVE WHEN EXTERNAL-HEAVY)

Total link count is the only negative predictor among the 7 significant features. Pages with high total link counts are less likely to be cited, but only when external links dominate.

This finding requires careful decomposition. Internal links (Predictor 1) are strongly positive (OR = 2.75). Total links are negative (OR = 0.47). The math resolves when you look at the link ratio.

The Link Ratio Decomposition

Lee (2026) decomposed total links into internal and external components and found a clear pattern:

| Link Profile | Internal Link % | External Link % | Citation Rate |
|--------------|-----------------|-----------------|---------------|
| High internal, low external | ~70%+ | ~30% or less | 59.7% |
| Balanced | ~50% | ~50% | 38.2% |
| Low internal, high external | ~30% or less | ~70%+ | 21.4% |

Pages with high internal links and low external links have a 59.7% citation rate. Pages with the inverse profile have a 21.4% rate. The difference is nearly 3x.

Why External Links Suppress Citation

The likely mechanism is twofold:

  1. Affiliate signal. Pages with many external links (especially to commercial sites) pattern-match to affiliate content, listicles, and sponsored posts. AI platforms may deprioritize these as lower-quality sources.
  2. Authority leakage signal. A page that links out heavily signals that the authoritative information lives elsewhere. AI retrieval systems may follow those external links to the primary source rather than citing the intermediary.

This does not mean you should never link externally. Citing academic sources, linking to primary data, and referencing official documentation all serve legitimate purposes. The key is proportion: external links should be a minority of your total link profile, with internal navigation links making up the majority.
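The audit itself is mechanical: classify every `href` as internal or external by hostname and compute the share. The sketch below treats relative URLs and same-host absolute URLs as internal; subdomain handling is deliberately naive and is our own simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

def link_ratio(html: str, site_host: str):
    """Return (internal_count, external_count, internal_share) for a page."""
    hrefs = []

    class AnchorCollector(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    hrefs.append(href)

    AnchorCollector().feed(html)
    internal = external = 0
    for href in hrefs:
        host = urlparse(href).netloc
        if host == "" or host == site_host:  # relative or same-host = internal
            internal += 1
        else:
            external += 1
    total = internal + external
    return internal, external, (internal / total if total else 0.0)

html = ('<a href="/guide">g</a>'
        '<a href="https://yoursite.com/faq">f</a>'
        '<a href="https://vendor.example/tool">t</a>')
print(link_ratio(html, "yoursite.com"))  # internal share of 2/3
```

Pages where the internal share falls below roughly 70% are candidates for restructuring under the table above.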

The Bottom Line: Audit your link ratio. If external links outnumber internal links on key pages, restructure. High internal + low external = 59.7% citation rate. The inverse = 21.4%.

🚫 5 FEATURES THAT DO NOT PREDICT AI CITATION

Equally important as what predicts citation is what does not. Lee (2026) tested numerous additional page features that failed to reach statistical significance after FDR correction. Five of these are commonly targeted by "AI SEO" advice:

| Feature | Common Advice | Actual Result | p-value |
|---------|---------------|---------------|---------|
| Page load time | "Faster pages get cited more" | Not significant | > 0.10 |
| Page size (bytes) | "Lighter pages rank better" | Not significant | > 0.10 |
| Pop-up/modal presence | "Remove popups for AI bots" | Not significant | > 0.10 |
| Author attribution | "Add author bios for E-E-A-T" | Not significant | > 0.10 |
| Affiliate link presence | "Remove affiliate links" | Not significant | > 0.10 |

Why These Do Not Matter

Page load time and page size: AI retrieval bots are not humans. They do not abandon slow-loading pages. ChatGPT's fetcher and Claude's retrieval bot wait for the full page to load, then parse the HTML. A 3-second load time has no effect on whether the bot can extract useful content.

Pop-ups and modals: AI bots parse the HTML source, not the rendered visual experience. A cookie consent popup that annoys human visitors is invisible to a bot reading raw HTML. The popup's HTML exists in the source code but does not block content extraction.

Author attribution: This is the most surprising non-finding for practitioners steeped in Google's E-E-A-T framework. Author bios, bylines, and author schema have zero predictive power for AI citation. AI platforms evaluate content substance, not authorship credentials. The content either answers the query in a structured, extractable way, or it does not.

Affiliate links: While external link dominance suppresses citation (Predictor 7), the mere presence of affiliate links does not. A page can contain affiliate links and still get cited, as long as the overall link profile remains internal-dominant and the content provides genuine informational value.

The Bottom Line: Stop optimizing for page speed, popup removal, or author bios if your goal is AI citation. These are valid optimizations for human UX and traditional SEO, but they have no measurable effect on whether AI platforms cite your content.

🏗️ THE TWO-LEVEL MODEL: INTENT FIRST, FEATURES SECOND

The 7 predictors do not operate in isolation. Lee (2026) demonstrated a two-level hierarchical model where query intent acts as the first filter and page features act as the second selector.

How It Works

Level 1: Query intent determines the eligible content pool. Intent distributions vary significantly by vertical (chi-squared(28) = 5,195, p < .001, Cramer's V = 0.258). A comparison query will never cite a pure informational article, regardless of how many internal links it has. A discovery query will never cite an academic paper, regardless of word count.

| Intent Type | Query Share | Typical Cited Sources |
|-------------|-------------|-----------------------|
| Informational | 61.3% | Wikipedia, .gov/.edu, tutorials |
| Discovery | 31.2% | Review aggregators, YouTube, listicles |
| Validation | 3.2% | Brand sites, Reddit (web UI only) |
| Comparison | 2.3% | Publisher/media, review sites |
| Review-seeking | 2.0% | YouTube, TechRadar/PCMag, Reddit |

Level 2: Page features select winners within the pool. Once intent has filtered the pool, the 7 page features determine which pages get cited. The logistic regression achieves AUC = 0.594 at this level.

The critical finding: adding intent features to the page-level model provides zero additional predictive power (likelihood ratio p = .78). This means intent and page features operate independently. Intent decides eligibility. Features decide selection. They do not interact.

This has a direct practical implication: optimizing page features without first matching query intent is wasted effort. You must be in the right pool before the 7 features can help you win.
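The two-level logic can be sketched as a hard filter followed by a score. In the toy model below, only the 0.73 coefficient for internal links comes from Lee (2026); every other weight, and the page records themselves, are invented placeholders to illustrate the structure, not the study's fitted model.

```python
import math

def eligible(page_intent: str, query_intent: str) -> bool:
    # Level 1: intent decides the pool; mismatched pages are never cited.
    return page_intent == query_intent

def feature_score(page: dict) -> float:
    # Level 2: logistic score over page features.
    z = (
        0.73 * math.log1p(page["internal_links"])  # log-linear term; beta from the study
        + 0.65 * page["self_canonical"]            # placeholder weight
        + 0.50 * page["schema_present"]            # placeholder weight
        - 1.5                                      # placeholder intercept
    )
    return 1 / (1 + math.exp(-z))

pages = [
    {"intent": "informational", "internal_links": 40,
     "self_canonical": 1, "schema_present": 1},
    {"intent": "discovery", "internal_links": 200,
     "self_canonical": 1, "schema_present": 1},
]
# Intent filters first; features only rank the survivors.
pool = [p for p in pages if eligible(p["intent"], "informational")]
ranked = sorted(pool, key=feature_score, reverse=True)
print(len(pool))  # only the intent-matched page survives level 1
```

Note that the discovery page's 200 internal links never enter the computation for an informational query, which is exactly the "wrong pool" failure mode described above.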

For a complete breakdown of query intent categories and how to align your content, see our query intent and AI citation research.

📋 IMPLEMENTATION PRIORITY TABLE

Based on the odds ratios, here is the recommended optimization order:

| Priority | Action | Expected Impact | Effort |
|----------|--------|-----------------|--------|
| 1 | Audit query intent alignment | Determines pool eligibility | Medium (content strategy) |
| 2 | Build internal navigation architecture | OR = 2.75 | Medium (site-wide) |
| 3 | Add self-referencing canonicals | OR = 1.92 | Low (one-time technical) |
| 4 | Implement correct schema types with full attributes | OR = 1.69 + 1.21 | Medium (per-page) |
| 5 | Expand thin pages to 2,000+ words | Median gap: 723 words | Medium (content) |
| 6 | Improve content-to-HTML ratio | 0.086 vs 0.065 target | Medium (technical) |
| 7 | Rebalance link ratio (reduce external) | 59.7% vs 21.4% citation rate | Low to medium |

For a hands-on audit of your site against these factors, use our free AI Visibility Quick Check or explore our full AI SEO audit service.

❓ FREQUENTLY ASKED QUESTIONS

Does page speed affect AI citations at all?

No. Lee (2026) found page load time had no statistically significant relationship with AI citation after FDR correction. AI retrieval bots parse HTML content regardless of load speed. Page speed matters for human UX and traditional Google ranking, but it has zero measurable effect on whether ChatGPT, Perplexity, Claude, or Gemini cite your page.

How many internal links should a page have to maximize citation probability?

The data does not identify a specific threshold. The relationship is log-linear (beta = 0.73), meaning each incremental internal link adds a diminishing marginal benefit. Focus on navigation architecture (menus, breadcrumbs, sidebars, footer links) rather than stuffing inline content links. Pages with robust site-wide navigation naturally accumulate the internal link counts that predict citation.

Why does Article schema hurt AI citation odds?

Article schema (OR = 0.76) signals editorial or opinion content. AI platforms preferentially cite factual, structured, extractable content when answering queries. Pages marked as Articles may be deprioritized in favor of pages with Product, Review, or FAQPage schema that provide structured data the model can directly extract and synthesize. See our schema markup guide for alternative schema strategies.

Is the word count effect just a proxy for content quality?

Partially. The median gap (2,582 vs 1,859 words) reflects the fact that comprehensive, citable content requires sufficient depth. But word count interacts with content-to-HTML ratio: a 3,000-word page with heavy boilerplate and low content density performs worse than a 2,500-word page with clean, content-focused HTML. The data supports a sweet spot of roughly 2,000 to 3,000 substantive words, not padding.

Can I have external links and still get cited?

Yes, but proportion matters. The link ratio decomposition shows that pages with 70%+ internal links and 30% or fewer external links achieve a 59.7% citation rate. The inverse ratio (external-dominant) drops to 21.4%. A few well-chosen external links to primary sources or official documentation will not hurt you. Dozens of affiliate or outbound links will.

📚 REFERENCES

  • Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). "GEO: Generative Engine Optimization." KDD 2024. DOI
  • Lee, A. (2026). "Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior." Preprint v5. DOI
  • Sellm (2025). "ChatGPT Citation Analysis." Industry report (400K+ pages analyzed).
  • Tian, Y. et al. (2025). "Diagnosing and Repairing Citation Failures in Generative Engine Optimization." Preprint.