Your site can rank #1 on Google and still be invisible to ChatGPT, Perplexity, and Google AI Mode. This checklist tells you exactly what to fix, in priority order, backed by data from 4,658 real pages.
Most "AI SEO" guides recycle generic advice: add schema, write longer content, optimize for "conversational queries." That advice is incomplete. It misses the specific technical signals that actually predict whether AI platforms cite your pages.
This checklist is built from two published research datasets. Lee (2026) analyzed 19,556 queries across 8 verticals and crawled 479 pages to identify 7 statistically significant page-level predictors of AI citation. Aggarwal et al. (2024) demonstrated that targeted optimization can boost generative engine visibility by up to 40%.
The Bottom Line: AI search visibility is not a mystery. It is a measurable set of technical and content features. This checklist gives you 47 specific pass/fail checks across 7 categories, ordered by impact.
For a full walkthrough of the research behind this checklist, see our Complete GEO Guide. If you want us to run this audit for you, check out our AI SEO Audit Service.
🤖 CRAWL ACCESS AND DISCOVERABILITY
Before AI platforms can cite your content, their crawlers need to reach it. There are now 15+ distinct AI crawlers operating on the web, each with different user-agent strings and crawl behaviors. Blocking even one of them means losing visibility on that platform entirely.
Here is every crawl access check you should run:
| # | Check Item | Pass Criteria | Priority |
|---|---|---|---|
| 1 | GPTBot allowed in robots.txt | No Disallow rule for GPTBot | Critical |
| 2 | ChatGPT-User allowed in robots.txt | No Disallow rule for ChatGPT-User | Critical |
| 3 | ClaudeBot allowed in robots.txt | No Disallow rule for ClaudeBot (session-cached) | Critical |
| 4 | PerplexityBot allowed in robots.txt | No Disallow rule for PerplexityBot | Critical |
| 5 | Google-Extended allowed in robots.txt | No Disallow rule for Google-Extended (AI Mode, Gemini) | Critical |
| 6 | Bytespider allowed in robots.txt | No Disallow for Bytespider (ByteDance AI) | Medium |
| 7 | Applebot-Extended allowed in robots.txt | No Disallow for Applebot-Extended (Apple Intelligence) | Medium |
| 8 | Meta-ExternalAgent allowed | No Disallow for Meta-ExternalAgent (Meta AI) | Medium |
| 9 | Amazonbot allowed | No Disallow for Amazonbot (Alexa, Amazon Q) | Low |
| 10 | cohere-ai allowed | No Disallow for cohere-ai (Cohere models) | Low |
| 11 | XML sitemap referenced in robots.txt | Sitemap: directive points to valid XML sitemap | Critical |
| 12 | XML sitemap contains all indexable pages | All target pages listed with <loc> tags | High |
| 13 | Sitemap lastmod tags are accurate | <lastmod> reflects actual content modification date | High |
| 14 | Server-side rendering for key pages | AI crawlers receive full HTML content (not empty JS shells) | Critical |
| 15 | Pages return 200 status to AI bots | No soft 404s, no redirect chains, no bot-specific blocks | High |
The Bottom Line: The minimum viable crawl configuration allows GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, and Google-Extended. Blocking any of these five means you are invisible on the corresponding platform.
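The five-crawler check above can be automated with Python's standard-library robots.txt parser. This is a minimal sketch, assuming you pass in the raw robots.txt text and a representative URL; the `ESSENTIAL_BOTS` list mirrors checks 1 through 5 and can be extended with the medium- and low-priority agents.

```python
from urllib.robotparser import RobotFileParser

# The five crawlers whose access is critical (checks 1-5).
ESSENTIAL_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot",
                  "PerplexityBot", "Google-Extended"]

def check_robots(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return {bot: allowed?} for each essential AI crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in ESSENTIAL_BOTS}

# Example: a robots.txt that blocks GPTBot but allows everything else.
sample = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(check_robots(sample))  # GPTBot -> False, the other four -> True
```

In practice you would fetch `https://yourdomain.com/robots.txt` first and run this against each page template you care about, since Disallow rules can be path-specific.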
For a detailed guide on identifying which AI bots are already hitting your site, see our post on how to see which AI bots crawl your website. We also cover advanced bot tracking in tracking AI bots effectively.
📊 TECHNICAL PAGE FEATURES (THE 7 PREDICTORS)
Our research identified exactly 7 page-level features that reach statistical significance (Benjamini-Hochberg FDR-corrected at alpha = .05) for predicting AI citation. These are not opinions. They are the output of logistic regression on 479 crawled pages (241 cited, 238 not cited) from real AI platform responses (Lee, 2026).
The model using these 7 features achieved AUC = 0.594, significantly above chance. Here is each feature translated into a concrete audit check:
| # | Check Item | Pass Criteria | Odds Ratio | Priority |
|---|---|---|---|---|
| 16 | Internal navigation links >= 100 | Site header/footer/sidebar contain 100+ internal links | OR = 2.75 | Critical |
| 17 | Self-referencing canonical tag | <link rel="canonical"> points to the page's own URL | OR = 1.92 | Critical |
| 18 | Schema markup present | At least one structured data block (JSON-LD or microdata) | OR = 1.69 | High |
| 19 | Word count >= 2,500 | Main content body contains 2,500+ words | Effect r = 0.194 | High |
| 20 | Content-to-HTML ratio >= 0.08 | Visible text / total HTML source > 0.08 | Effect r = 0.132 | Medium |
| 21 | Schema attribute completeness >= 76% | Average fill rate across schema properties >= 76% | See schema section | High |
| 22 | External link ratio is low | External links < internal links (avoid affiliate-heavy patterns) | OR = 0.47 (neg.) | High |
The internal links finding is the most important and the most counterintuitive. The OR of 2.75 means pages with high internal navigation link counts have nearly 3x the odds of being cited. But this signal comes from navigation links (p = 0.017), not in-content links (p = 0.497). The signal is about site architecture breadth: sites with robust navigation structures look authoritative to AI crawlers.
The link ratio decomposition makes this clear:
| Link Profile | Citation Rate |
|---|---|
| High internal + Low external | 59.7% |
| High internal + High external | 52.1% |
| Low internal + Low external | 45.6% |
| Low internal + High external | 42.5% |
Pages with many external links and few internal links resemble affiliate or aggregator content. AI platforms appear to discount these.
The Bottom Line: Fix your canonical tags (OR = 1.92, nearly 2x citation odds), add schema with high attribute completeness (OR = 1.69), and ensure your site architecture includes comprehensive internal navigation. These three changes produce the highest return per engineering hour.
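Checks 16 and 22 (internal navigation breadth and external-link ratio) can be approximated with a small stdlib link counter. This is a rough sketch, not the research methodology: it treats relative URLs and links to your own host as internal, and everything else as external; fragment, mailto, and javascript links are skipped.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCounter(HTMLParser):
    """Count internal vs. external <a href> links on a page."""
    def __init__(self, site_host: str):
        super().__init__()
        self.site_host = site_host
        self.internal = 0
        self.external = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href or href.startswith(("#", "mailto:", "javascript:")):
            return
        host = urlparse(href).netloc
        if not host or host == self.site_host:
            self.internal += 1
        else:
            self.external += 1

html = ('<nav><a href="/guide">Guide</a><a href="/audit">Audit</a></nav>'
        '<a href="https://other.com/x">ref</a>')
counter = LinkCounter("example.com")
counter.feed(html)
print(counter.internal, counter.external)  # 2 1
```

Feed it the full rendered HTML (header, footer, and sidebar included), since the citation signal comes from navigation links, then compare the internal count against the 100-link threshold and confirm internal links outnumber external ones.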
Run your pages through our AI Visibility Quick Check tool to test these features instantly.
🏷️ SCHEMA AUDIT (TYPE MATTERS MORE THAN PRESENCE)
Generic "add schema markup" advice misses a critical finding: schema type predicts citation probability, but generic schema presence does not. Our expanded analysis of 3,251 real websites (UGC excluded) found that adding the wrong schema type can actually hurt your citation odds.
| Schema Type | Odds Ratio | Effect | Action |
|---|---|---|---|
| Product | 3.09 | Strong positive | Add to product/service pages |
| Review | 2.24 | Strong positive | Add to review and comparison content |
| FAQPage | 1.39 | Moderate positive | Add to pages with Q&A sections |
| Article | 0.76 | Negative | Consider replacing with more specific types |
| Organization | 1.08 (p = 0.35) | Not significant | Keep but do not rely on it |
| Breadcrumb | 0.99 (p = 0.97) | Not significant | Keep for usability, no AI citation impact |
| Any schema (generic) | 1.02 (p = 0.78) | Not significant | Type matters, not mere presence |
Article markup (OR = 0.76) signals opinion or editorial content, which AI platforms may deprioritize for factual citations. Here are your schema audit checks:
| # | Check Item | Pass Criteria | Priority |
|---|---|---|---|
| 23 | Product schema on product/service pages | Product type in JSON-LD with price, name, description filled | Critical |
| 24 | FAQPage schema on FAQ sections | FAQPage type wrapping all Q&A pairs | High |
| 25 | Review schema on review content | Review type with rating, author, itemReviewed | High |
| 26 | Article schema evaluated for replacement | If only using Article, consider Product, FAQ, or HowTo | Medium |
| 27 | Schema attribute completeness >= 76% | Average fill rate across all schema properties >= 76% | High |
| 28 | No empty or stub schema blocks | Every schema block has meaningful data, not placeholder text | Medium |
| 29 | Schema validates in Google Rich Results Test | Zero errors in structured data validation | Medium |
The Bottom Line: Product (OR = 3.09) and Review (OR = 2.24) schemas are the highest-impact structured data you can add. If your pages only have Article schema, consider whether a more specific type applies. And regardless of type, filling at least 76% of available attributes matters more than adding more schema blocks.
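The attribute-completeness check (item 27) is easy to script once you decide which properties count toward the fill rate. The property list below is an illustrative assumption, not part of the research; swap in whatever attribute set you audit against.

```python
import json

# Illustrative property list only -- which attributes "count" toward the
# 76% completeness threshold is an assumption for this sketch.
PRODUCT_PROPS = ["name", "description", "image", "brand", "offers", "sku",
                 "aggregateRating", "review"]

def completeness(schema_block: dict, props: list[str]) -> float:
    """Fraction of expected properties that are present and non-empty."""
    filled = sum(1 for p in props if schema_block.get(p))
    return filled / len(props)

product = json.loads("""{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Acme Widget",
  "description": "A demonstration product.",
  "image": "https://example.com/widget.jpg",
  "brand": {"@type": "Brand", "name": "Acme"},
  "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"},
  "sku": "ACME-001",
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.6",
                      "reviewCount": "12"}
}""")

print(f"{completeness(product, PRODUCT_PROPS):.0%}")  # 7 of 8 filled -> 88%
```

Run this over every JSON-LD block extracted from your pages and flag anything below 0.76; a block passing the Rich Results Test can still fail this check if most optional properties are empty.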
📝 CONTENT STRUCTURE AND FORMAT
Aggarwal et al. (2024) found that GEO strategies including adding citations, quotations, and statistics to content can boost visibility by up to 40% in generative engine responses. Our own data confirms that content structure influences citation, particularly through word count (cited pages have a median of 2,582 words vs. 1,859 for non-cited) and content-to-HTML ratio.
Beyond the raw metrics, certain content formats align with how AI platforms extract and synthesize information:
| # | Check Item | Pass Criteria | Priority |
|---|---|---|---|
| 30 | Comparison tables present | At least one HTML <table> comparing options, features, or products | High |
| 31 | FAQ section with clear Q&A pairs | Visible FAQ section with distinct questions and answers | High |
| 32 | Key insight in first 30% of content | Primary conclusion or data point appears in first third | High |
| 33 | Statistics and specific numbers used | At least 3 specific data points (percentages, counts, measurements) | Medium |
| 34 | Semantic HTML structure | Proper H2/H3 hierarchy, <table>, <ul>/<ol> tags | Medium |
| 35 | Content-to-HTML ratio >= 0.08 | Minimize wrapper divs, ads, and boilerplate; maximize content | Medium |
| 36 | Word count >= 2,500 for target pages | Main body content (excluding nav, footer) exceeds 2,500 words | Medium |
AI platforms parse content structurally. Perplexity extracts comparison tables directly. ChatGPT matches FAQ sections to user queries. Front-loading key findings helps because AI models often prioritize content that appears early in a document.
The Bottom Line: Structure your content for machine parsing: comparison tables, FAQ blocks, specific numbers, and the key answer in the first 30%.
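The content-to-HTML ratio (checks 20 and 35) can be measured with a stdlib parser. This sketch counts visible text characters (excluding script and style contents) against total HTML length; the research's exact extraction method may differ, so treat the 0.08 cutoff as a guideline, not a precise replication.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text.append(data)

def content_ratio(html: str) -> float:
    """Visible-text characters divided by total HTML characters."""
    p = TextExtractor()
    p.feed(html)
    visible = "".join(p.text).strip()
    return len(visible) / max(len(html), 1)

page = ("<html><head><script>var x=1;</script></head><body><p>"
        + "content " * 50 + "</p></body></html>")
print(content_ratio(page) >= 0.08)  # True for this content-heavy sample
```

A heavily templated page (nested wrapper divs, inline SVGs, tracking scripts) drags this ratio down even when the article itself is long, which is why check 35 pairs the ratio with boilerplate reduction.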
For a deeper look at how different AI platforms handle content, see our ChatGPT vs Perplexity vs Gemini comparison.
🕐 FRESHNESS SIGNALS
Perplexity shows a 3.3x freshness bias compared to Google for medium-velocity topics. ChatGPT and Claude fetch pages live, meaning they always get current content. But they discover pages through indices (Bing for ChatGPT), where freshness signals affect crawl priority.
| # | Check Item | Pass Criteria | Priority |
|---|---|---|---|
| 37 | dateModified in schema markup | dateModified reflects actual last-edit date, not auto-generated | Critical |
| 38 | datePublished in schema markup | datePublished present with accurate publication date | High |
| 39 | Visible "Last updated" date on page | Human-readable date on the page matching schema dates | High |
| 40 | Sitemap lastmod matches schema dateModified | Sitemap <lastmod> and schema dateModified align | Medium |
| 41 | Medium-velocity content refreshed every 60 to 90 days | Pages targeting evolving topics show recent modification dates | Medium |
Freshness matters most for Perplexity (pre-crawled index, heavy recency bias). For ChatGPT, freshness signals affect how quickly Bing re-crawls your pages, which controls ChatGPT's update discovery.
The Bottom Line: Keep your dateModified schema, visible dates, and sitemap lastmod in sync. For Perplexity visibility, refresh content on evolving topics every 60 to 90 days. Stale dates signal stale content.
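The sitemap half of check 40 can be scripted with the stdlib XML parser. This sketch maps each sitemap URL to its lastmod value; the schema `dateModified` you compare against would be extracted from each page's JSON-LD separately (the value shown here is hypothetical).

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_lastmods(sitemap_xml: str) -> dict:
    """Map each <loc> URL to its <lastmod> value (checks 13 and 40)."""
    root = ET.fromstring(sitemap_xml)
    out = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        out[loc] = lastmod
    return out

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/guide</loc><lastmod>2025-01-15</lastmod></url>
</urlset>"""

# Compare against the page's schema dateModified, fetched separately.
schema_date_modified = "2025-01-15"  # hypothetical value read from JSON-LD
lastmods = sitemap_lastmods(sitemap)
print(lastmods["https://example.com/guide"] == schema_date_modified)  # True
```

Any URL where the two dates diverge, or where lastmod is missing entirely, fails the sync check.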
🔍 PLATFORM-SPECIFIC CHECKS
Each AI platform uses a different architecture for discovering and citing content. A one-size-fits-all approach leaves visibility gaps. Lee (2026) found that architectural differences (live fetching vs. pre-built indices) create divergent optimization needs.
| Platform | Architecture | Discovery Method | Key Optimization |
|---|---|---|---|
| ChatGPT | Live fetching | Bing index for URL discovery, ChatGPT-User for page fetch | Bing indexing is the gate |
| Claude | Live fetching | Claude-User checks robots.txt, fetches on demand | robots.txt compliance required |
| Perplexity | Pre-built index | PerplexityBot background crawling | Freshness signals, crawl frequency |
| Google AI Mode | Google infrastructure | Standard Googlebot crawling | Traditional Google SEO + structured data |
| Gemini | Google infrastructure | No identified AI-specific crawlers | Google Search grounding |
Here are the platform-specific audit checks:
| # | Check Item | Pass Criteria | Priority |
|---|---|---|---|
| 42 | Bing Webmaster Tools: pages indexed | Target pages appear in Bing's index (required for ChatGPT) | Critical |
| 43 | Bing IndexNow configured | IndexNow protocol set up for faster Bing indexing | Medium |
| 44 | Perplexity freshness: recent crawl dates | PerplexityBot has crawled key pages within 30 days | High |
| 45 | Google Search Console: pages indexed | Target pages indexed by Google (required for AI Mode, Gemini) | Critical |
| 46 | Server-side rendering verified for ChatGPT-User | ChatGPT-User receives full HTML (test with curl -A "ChatGPT-User") | High |
The Bing indexing check (item 42) is one of the most commonly missed steps. ChatGPT discovers URLs through Bing's index. If Bing has not indexed a page, ChatGPT cannot find it, regardless of content quality.
Lee (2026) also found a "Reddit paradox": despite Reddit occupying 38.3% of Google's top results, it received zero API citations from ChatGPT but 8.9% to 15.6% of web UI citations. Platform-dependent behavior means you must test on each platform individually.
The Bottom Line: At minimum, verify your pages are indexed in both Bing (for ChatGPT) and Google (for AI Mode/Gemini). Set up IndexNow for faster Bing crawling. Test your server-side rendering with each AI bot's user-agent string.
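The SSR verification in check 46 can also be scripted end to end. This is a sketch under two assumptions: the fetch helper simply sets the bot's user-agent string (it cannot prove how the real crawler renders), and the JS-shell heuristic (a crude tag strip plus a visible-character threshold) is an approximation; a real audit would use a proper HTML parser.

```python
import re
import urllib.request

def fetch_as(url: str, user_agent: str) -> str:
    """Fetch a page while presenting an AI bot's user-agent string."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=15) as resp:
        return resp.read().decode("utf-8", errors="replace")

def looks_like_js_shell(html: str, min_visible_chars: int = 500) -> bool:
    """Heuristic for checks 14/46: a body that is mostly empty (content
    injected later by JavaScript) fails server-side rendering."""
    body = html.split("<body", 1)[-1]
    visible = re.sub(r"<[^>]+>|\s+", " ", body)  # crude tag strip
    return len(visible.strip()) < min_visible_chars

# Usage (requires network):
# html = fetch_as("https://example.com/guide", "ChatGPT-User")
# print("FAIL: JS shell" if looks_like_js_shell(html) else "PASS: full HTML")

shell_page = ("<html><body><div id='root'></div>"
              "<script src='app.js'></script></body></html>")
print(looks_like_js_shell(shell_page))  # True
```

Repeat the fetch with each crawler's user-agent string, since bot-specific blocks (check 15) can serve different responses to different agents.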
For platform-specific citation behavior details, see our research on query intent and AI citations.
📡 MONITORING AND TRACKING SETUP
You cannot improve what you do not measure. AI search visibility tracking is fundamentally different from traditional rank tracking. Instead of monitoring keyword positions on one search engine, you need to track citation appearances across multiple AI platforms simultaneously.
| # | Check Item | Pass Criteria | Priority |
|---|---|---|---|
| 47 | AI bot log monitoring active | Server logs parsed for GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot | Critical |
Beyond this technical check, your monitoring stack should cover two layers:
AI Crawler Tracking. Parse server access logs for all 15+ AI bot user-agent strings. Track crawl frequency per page, and flag drops in crawl activity (a sudden PerplexityBot silence signals robots.txt or server issues).
Citation Monitoring. Run target queries through ChatGPT, Perplexity, Claude, and Google AI Mode weekly. Record which URLs appear in citations, and track changes over time. Crawler activity is a leading indicator; actual citations are the lagging indicator. Track both.
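The crawler-tracking layer can be sketched as a simple log tally. This assumes combined-format access logs and matches on raw user-agent substrings, which is a simplification: spoofed agents would need reverse-DNS verification in a production setup, and the match order matters if one bot name is a substring of another.

```python
from collections import Counter

AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "Bytespider", "Applebot-Extended"]

def count_ai_bot_hits(log_lines) -> Counter:
    """Tally hits per AI crawler from access-log lines (check 47)."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
                break  # attribute each line to at most one bot
    return hits

sample_log = [
    '1.2.3.4 - - [10/Jan/2025] "GET /guide HTTP/1.1" 200 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [10/Jan/2025] "GET /guide HTTP/1.1" 200 "-" '
    '"PerplexityBot/1.0"',
    '9.9.9.9 - - [10/Jan/2025] "GET / HTTP/1.1" 200 "-" '
    '"Mozilla/5.0 (regular browser)"',
]
print(count_ai_bot_hits(sample_log))  # GPTBot: 1, PerplexityBot: 1
```

Run this daily over rotated logs and alert on per-bot drops; a sudden zero for PerplexityBot is exactly the robots.txt-or-server signal described above.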
The Bottom Line: Start with server log monitoring for AI bots. Layer on citation tracking for your highest-priority queries. Our AI Visibility Monitoring service automates both.
✅ THE COMPLETE AUDIT SUMMARY
Critical (fix first): Allow 5 essential AI crawlers (checks 1 through 5), XML sitemap in robots.txt (11), server-side rendering (14), internal navigation links >= 100 (16), self-referencing canonical (17), Product schema (23), dateModified (37), Bing + Google indexing verified (42, 45), AI bot log monitoring (47).
High (significant impact): Sitemap completeness and lastmod accuracy (12, 13), 200 status returns (15), remaining page-level predictors (18 through 22), FAQ and Review schema (24, 25), content structure (30 through 32), freshness signals (38, 39), Perplexity crawl freshness (44), SSR verification (46).
Medium (incremental gains): Secondary AI crawlers (6 through 10), schema refinement (26 through 29), content format polish (33 through 36), freshness sync (40, 41), IndexNow (43).
If you want a professional audit against all 47 checks, our AI SEO Audit Service covers every item with a detailed remediation plan. For a quick self-assessment, run your top 5 pages through the AI Visibility Quick Check.
❓ FREQUENTLY ASKED QUESTIONS
How long does it take to see results after fixing audit items? It depends on the platform. Google AI Mode propagates within days. ChatGPT depends on Bing reindexing (1 to 4 weeks, faster with IndexNow). Perplexity updates within days for frequently crawled pages. Claude fetches pages live, so changes take effect on the next query. Overall, expect 2 to 6 weeks for full propagation across all platforms.
Should I block AI crawlers to prevent content scraping? That depends on your business model. Blocking AI crawlers prevents training use but also prevents citation in AI responses. For most businesses, the visibility benefit outweighs the concern. If you need selective control, allow ChatGPT-User (citation) while blocking GPTBot (training). See our guide on tracking AI bots effectively for the distinction between training and citation bots.
Does page speed matter for AI citation? Our research found no statistically significant effect of load time on AI citation probability. This contradicts popular advice. While page speed matters for user experience and traditional Google ranking, AI crawlers are patient: they will wait for your page to load. Focus your optimization effort on the 7 features that actually predict citation (internal navigation links, self-referencing canonicals, schema presence, schema attribute completeness, word count, content-to-HTML ratio, and a low external-link ratio) rather than shaving milliseconds off load time.
What about author bios and E-E-A-T signals? Author attribution showed no significant effect on AI citation in our dataset (p = .522). E-E-A-T may matter for Google ranking, but it does not appear to influence AI citation decisions. AI platforms parse content structure, not bylines.
How is this different from a traditional SEO audit? A traditional SEO audit focuses on factors like backlinks, Core Web Vitals, keyword density, and mobile-friendliness. An AI search visibility audit focuses on crawl access for 15+ AI bots, schema type selection, content structure for machine parsing, freshness signals, and platform-specific indexing requirements. The two audits overlap on technical basics (canonical tags, sitemaps, server rendering) but diverge significantly on what counts as a ranking factor. Google rank has essentially zero correlation with AI citation (rho = -0.02 to 0.11) across platforms (Lee, 2026).