
AI TOOLS

The AI SEO Audit Checklist: 47 Checks to Maximize Your AI Search Visibility

2026-03-24


Your site can rank #1 on Google and still be invisible to ChatGPT, Perplexity, and Google AI Mode. This checklist tells you exactly what to fix, in priority order, backed by data from 4,658 real pages.

Most "AI SEO" guides recycle generic advice: add schema, write longer content, optimize for "conversational queries." That advice is incomplete. It misses the specific technical signals that actually predict whether AI platforms cite your pages.

This checklist is built from two published research datasets. Lee (2026) analyzed 19,556 queries across 8 verticals and crawled 479 pages to identify 7 statistically significant page-level predictors of AI citation. Aggarwal et al. (2024) demonstrated that targeted optimization can boost generative engine visibility by up to 40%.

The Bottom Line: AI search visibility is not a mystery. It is a measurable set of technical and content features. This checklist gives you 47 specific pass/fail checks across 7 categories, ordered by impact.

For a full walkthrough of the research behind this checklist, see our Complete GEO Guide. If you want us to run this audit for you, check out our AI SEO Audit Service.

🤖 CRAWL ACCESS AND DISCOVERABILITY

Before AI platforms can cite your content, their crawlers need to reach it. There are now 15+ distinct AI crawlers operating on the web, each with different user-agent strings and crawl behaviors. Blocking even one of them means losing visibility on that platform entirely.

Here is every crawl access check you should run:

| # | Check Item | Pass Criteria | Priority |
| --- | --- | --- | --- |
| 1 | GPTBot allowed in robots.txt | No Disallow rule for GPTBot | Critical |
| 2 | ChatGPT-User allowed in robots.txt | No Disallow rule for ChatGPT-User | Critical |
| 3 | ClaudeBot allowed in robots.txt | No Disallow rule for ClaudeBot (session-cached) | Critical |
| 4 | PerplexityBot allowed in robots.txt | No Disallow rule for PerplexityBot | Critical |
| 5 | Google-Extended allowed in robots.txt | No Disallow rule for Google-Extended (AI Mode, Gemini) | Critical |
| 6 | Bytespider allowed in robots.txt | No Disallow rule for Bytespider (ByteDance AI) | Medium |
| 7 | Applebot-Extended allowed in robots.txt | No Disallow rule for Applebot-Extended (Apple Intelligence) | Medium |
| 8 | Meta-ExternalAgent allowed | No Disallow rule for Meta-ExternalAgent (Meta AI) | Medium |
| 9 | Amazonbot allowed | No Disallow rule for Amazonbot (Alexa, Amazon Q) | Low |
| 10 | cohere-ai allowed | No Disallow rule for cohere-ai (Cohere models) | Low |
| 11 | XML sitemap referenced in robots.txt | Sitemap: directive points to valid XML sitemap | Critical |
| 12 | XML sitemap contains all indexable pages | All target pages listed with <loc> tags | High |
| 13 | Sitemap lastmod tags are accurate | <lastmod> reflects actual content modification date | High |
| 14 | Server-side rendering for key pages | AI crawlers receive full HTML content (not empty JS shells) | Critical |
| 15 | Pages return 200 status to AI bots | No soft 404s, no redirect chains, no bot-specific blocks | High |
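To make the pass criteria concrete, here is a robots.txt that passes checks 1 through 5 and 11. The sitemap URL is a placeholder, and note that the checks only require the absence of a Disallow rule; the explicit Allow lines simply make the policy unambiguous:

```text
# Allow the five essential AI crawlers (checks 1-5)
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

# Check 11: point crawlers at your XML sitemap (placeholder URL)
Sitemap: https://www.example.com/sitemap.xml
```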

The Bottom Line: The minimum viable crawl configuration allows GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, and Google-Extended. Blocking any of these five means you are invisible on the corresponding platform.

For a detailed guide on identifying which AI bots are already hitting your site, see our post on how to see which AI bots crawl your website. We also cover advanced bot tracking in tracking AI bots effectively.

📊 TECHNICAL PAGE FEATURES (THE 7 PREDICTORS)

Our research identified exactly 7 page-level features that reach statistical significance (Benjamini-Hochberg FDR-corrected at alpha = .05) for predicting AI citation. These are not opinions. They are the output of logistic regression on 479 crawled pages (241 cited, 238 not cited) from real AI platform responses (Lee, 2026).

The model using these 7 features achieved AUC = 0.594, significantly above chance. Here is each feature translated into a concrete audit check:

| # | Check Item | Pass Criteria | Effect Size | Priority |
| --- | --- | --- | --- | --- |
| 16 | Internal navigation links >= 100 | Site header/footer/sidebar contain 100+ internal links | OR = 2.75 | Critical |
| 17 | Self-referencing canonical tag | <link rel="canonical"> points to the page's own URL | OR = 1.92 | Critical |
| 18 | Schema markup present | At least one structured data block (JSON-LD or microdata) | OR = 1.69 | High |
| 19 | Word count >= 2,500 | Main content body contains 2,500+ words | r = 0.194 | High |
| 20 | Content-to-HTML ratio >= 0.08 | Visible text / total HTML source >= 0.08 | r = 0.132 | Medium |
| 21 | Schema attribute completeness >= 76% | Average fill rate across schema properties >= 76% | See schema section | High |
| 22 | External link ratio is low | External links < internal links (avoid affiliate-heavy patterns) | OR = 0.47 (neg.) | High |
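Two of these features, the self-referencing canonical (check 17) and the content-to-HTML ratio (check 20), are easy to verify programmatically. Here is a minimal sketch using only the Python standard library; the class and function names are our own, not part of any standard tool:

```python
from html.parser import HTMLParser


class AuditParser(HTMLParser):
    """Collects visible text length and the canonical URL from page HTML."""

    SKIP = {"script", "style", "noscript"}  # tags whose content is invisible

    def __init__(self):
        super().__init__()
        self.text_chars = 0   # count of visible (non-whitespace) characters
        self.canonical = None
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in self.SKIP:
            self._skip_depth += 1
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def content_ratio(html: str) -> float:
    """Visible text / total HTML source; check 20 passes at >= 0.08."""
    parser = AuditParser()
    parser.feed(html)
    return parser.text_chars / max(len(html), 1)
```

Compare `parser.canonical` against the URL you fetched: they should match exactly for check 17 to pass.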

The internal links finding is the most important and the most counterintuitive. The OR of 2.75 means pages with high internal navigation link counts have nearly 3x the odds of being cited. But this signal comes from navigation links (p = 0.017), not in-content links (p = 0.497). The signal is about site architecture breadth: sites with robust navigation structures look authoritative to AI crawlers.

The link ratio decomposition makes this clear:

| Link Profile | Citation Rate |
| --- | --- |
| High internal + Low external | 59.7% |
| High internal + High external | 52.1% |
| Low internal + Low external | 45.6% |
| Low internal + High external | 42.5% |

Pages with many external links and few internal links resemble affiliate or aggregator content. AI platforms appear to discount these.

The Bottom Line: Fix your canonical tags (OR = 1.92, nearly 2x citation odds), add schema with high attribute completeness (OR = 1.69), and ensure your site architecture includes comprehensive internal navigation. These three changes produce the highest return per engineering hour.

Run your pages through our AI Visibility Quick Check tool to test these features instantly.

🏷️ SCHEMA AUDIT (TYPE MATTERS MORE THAN PRESENCE)

Generic "add schema markup" advice misses a critical finding: schema type predicts citation probability, but generic schema presence does not. Our expanded analysis of 3,251 real websites (UGC excluded) found that adding the wrong schema type can actually hurt your citation odds.

| Schema Type | Odds Ratio | Effect | Action |
| --- | --- | --- | --- |
| Product | 3.09 | Strong positive | Add to product/service pages |
| Review | 2.24 | Strong positive | Add to review and comparison content |
| FAQPage | 1.39 | Moderate positive | Add to pages with Q&A sections |
| Article | 0.76 | Negative | Consider replacing with more specific types |
| Organization | 1.08 (p = 0.35) | Not significant | Keep but do not rely on it |
| Breadcrumb | 0.99 (p = 0.97) | Not significant | Keep for usability, no AI citation impact |
| Any schema (generic) | 1.02 (p = 0.78) | Not significant | Type matters, not mere presence |

Article markup (OR = 0.76) signals opinion or editorial content, which AI platforms may deprioritize for factual citations. Here are your schema audit checks:

| # | Check Item | Pass Criteria | Priority |
| --- | --- | --- | --- |
| 23 | Product schema on product/service pages | Product type in JSON-LD with price, name, description filled | Critical |
| 24 | FAQPage schema on FAQ sections | FAQPage type wrapping all Q&A pairs | High |
| 25 | Review schema on review content | Review type with rating, author, itemReviewed | High |
| 26 | Article schema evaluated for replacement | If only using Article, consider Product, FAQ, or HowTo | Medium |
| 27 | Schema attribute completeness >= 76% | Average fill rate across all schema properties >= 76% | High |
| 28 | No empty or stub schema blocks | Every schema block has meaningful data, not placeholder text | Medium |
| 29 | Schema validates in Google Rich Results Test | Zero errors in structured data validation | Medium |
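Check 27's completeness threshold can be scripted. Here is a sketch that assumes you maintain your own list of expected properties per schema type; the lists below are illustrative examples of common schema.org fields, not an official specification:

```python
import json

# Illustrative property lists; tailor these to the schema.org
# properties that matter for your own pages.
EXPECTED_PROPS = {
    "Product": ["name", "description", "image", "brand", "offers",
                "aggregateRating", "sku", "url"],
    "Review": ["itemReviewed", "reviewRating", "author", "reviewBody"],
}


def schema_completeness(jsonld: str) -> float:
    """Fraction of expected properties that are present and non-empty."""
    data = json.loads(jsonld)
    expected = EXPECTED_PROPS.get(data.get("@type"), [])
    if not expected:
        return 0.0
    filled = sum(1 for prop in expected
                 if data.get(prop) not in (None, "", [], {}))
    return filled / len(expected)
```

A page passes check 27 when the average score across all its schema blocks is at least 0.76.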

The Bottom Line: Product (OR = 3.09) and Review (OR = 2.24) schemas are the highest-impact structured data you can add. If your pages only have Article schema, consider whether a more specific type applies. And regardless of type, filling at least 76% of available attributes matters more than adding more schema blocks.

📝 CONTENT STRUCTURE AND FORMAT

Aggarwal et al. (2024) found that GEO strategies including adding citations, quotations, and statistics to content can boost visibility by up to 40% in generative engine responses. Our own data confirms that content structure influences citation, particularly through word count (cited pages have a median of 2,582 words vs. 1,859 for non-cited) and content-to-HTML ratio.

Beyond the raw metrics, certain content formats align with how AI platforms extract and synthesize information:

| # | Check Item | Pass Criteria | Priority |
| --- | --- | --- | --- |
| 30 | Comparison tables present | At least one HTML <table> comparing options, features, or products | High |
| 31 | FAQ section with clear Q&A pairs | Visible FAQ section with distinct questions and answers | High |
| 32 | Key insight in first 30% of content | Primary conclusion or data point appears in first third | High |
| 33 | Statistics and specific numbers used | At least 3 specific data points (percentages, counts, measurements) | Medium |
| 34 | Semantic HTML structure | Proper H2/H3 hierarchy, <table>, <ul>/<ol> tags | Medium |
| 35 | Content-to-HTML ratio >= 0.08 | Minimize wrapper divs, ads, and boilerplate; maximize content | Medium |
| 36 | Word count >= 2,500 for target pages | Main body content (excluding nav, footer) exceeds 2,500 words | Medium |
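As a sketch, here is the kind of markup skeleton checks 30, 31, and 34 look for; the content is placeholder, but the pattern (a real <table> element plus distinct H3-level Q&A pairs under a proper heading hierarchy) is the point:

```html
<h2>Plan comparison</h2>
<table>
  <tr><th>Plan</th><th>Price</th><th>Storage</th></tr>
  <tr><td>Basic</td><td>$9/mo</td><td>100 GB</td></tr>
  <tr><td>Pro</td><td>$29/mo</td><td>1 TB</td></tr>
</table>

<h2>Frequently asked questions</h2>
<h3>Which plan supports teams?</h3>
<p>The Pro plan includes team seats. [...]</p>
```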

AI platforms parse content structurally. Perplexity extracts comparison tables directly. ChatGPT matches FAQ sections to user queries. Front-loading key findings helps because AI models often prioritize content that appears early in a document.

The Bottom Line: Structure your content for machine parsing: comparison tables, FAQ blocks, specific numbers, and the key answer in the first 30%.

For a deeper look at how different AI platforms handle content, see our ChatGPT vs Perplexity vs Gemini comparison.

🕐 FRESHNESS SIGNALS

Perplexity shows a 3.3x freshness bias compared to Google for medium-velocity topics. ChatGPT and Claude fetch pages live, so they always see your current content. But ChatGPT still discovers URLs through Bing's index, where freshness signals affect crawl priority.

| # | Check Item | Pass Criteria | Priority |
| --- | --- | --- | --- |
| 37 | dateModified in schema markup | dateModified reflects actual last-edit date, not auto-generated | Critical |
| 38 | datePublished in schema markup | datePublished present with accurate publication date | High |
| 39 | Visible "Last updated" date on page | Human-readable date on the page matching schema dates | High |
| 40 | Sitemap lastmod matches schema dateModified | Sitemap <lastmod> and schema dateModified align | Medium |
| 41 | Medium-velocity content refreshed every 60 to 90 days | Pages targeting evolving topics show recent modification dates | Medium |
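Check 40 is mechanical enough to automate. A sketch using only the Python standard library; the function names are our own, and the date comparison deliberately ignores the time-of-day portion:

```python
import json
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_lastmod(sitemap_xml: str, page_url: str):
    """Return the <lastmod> value for page_url, or None if absent."""
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", NS):
        if url.findtext("sm:loc", namespaces=NS) == page_url:
            return url.findtext("sm:lastmod", namespaces=NS)
    return None


def dates_in_sync(sitemap_xml: str, page_url: str, jsonld: str) -> bool:
    """Check 40: sitemap <lastmod> matches schema dateModified (date part only)."""
    lastmod = sitemap_lastmod(sitemap_xml, page_url)
    date_modified = json.loads(jsonld).get("dateModified")
    if not lastmod or not date_modified:
        return False
    return lastmod[:10] == date_modified[:10]
```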

Freshness matters most for Perplexity (pre-crawled index, heavy recency bias). For ChatGPT, freshness signals affect how quickly Bing re-crawls your pages, which controls ChatGPT's update discovery.

The Bottom Line: Keep your dateModified schema, visible dates, and sitemap lastmod in sync. For Perplexity visibility, refresh content on evolving topics every 60 to 90 days. Stale dates signal stale content.

🔍 PLATFORM-SPECIFIC CHECKS

Each AI platform uses a different architecture for discovering and citing content. A one-size-fits-all approach leaves visibility gaps. Lee (2026) found that architectural differences (live fetching vs. pre-built indices) create divergent optimization needs.

| Platform | Architecture | Discovery Method | Key Optimization |
| --- | --- | --- | --- |
| ChatGPT | Live fetching | Bing index for URL discovery, ChatGPT-User for page fetch | Bing indexing is the gate |
| Claude | Live fetching | Claude-User checks robots.txt, fetches on demand | robots.txt compliance required |
| Perplexity | Pre-built index | PerplexityBot background crawling | Freshness signals, crawl frequency |
| Google AI Mode | Google infrastructure | Standard Googlebot crawling | Traditional Google SEO + structured data |
| Gemini | Google infrastructure | No identified AI-specific crawlers | Google Search grounding |

Here are the platform-specific audit checks:

| # | Check Item | Pass Criteria | Priority |
| --- | --- | --- | --- |
| 42 | Bing Webmaster Tools: pages indexed | Target pages appear in Bing's index (required for ChatGPT) | Critical |
| 43 | Bing IndexNow configured | IndexNow protocol set up for faster Bing indexing | Medium |
| 44 | Perplexity freshness: recent crawl dates | PerplexityBot has crawled key pages within 30 days | High |
| 45 | Google Search Console: pages indexed | Target pages indexed by Google (required for AI Mode, Gemini) | Critical |
| 46 | Server-side rendering verified for ChatGPT-User | ChatGPT-User receives full HTML (test with curl -A "ChatGPT-User") | High |
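For check 46, the curl test shows you what ChatGPT-User actually receives; you then need to judge whether that response is real content or an empty JS shell. Here is a rough heuristic for that second step. The 200-character threshold is an assumption you should tune for your own templates, and the regex-based tag stripping is deliberately crude:

```python
import re


def looks_like_js_shell(html: str, min_text_chars: int = 200) -> bool:
    """True when the markup contains almost no visible text once scripts,
    styles, and tags are stripped (threshold is a tunable guess)."""
    # Drop script/style blocks entirely, then strip remaining tags.
    stripped = re.sub(r"<(script|style)\b[^>]*>.*?</\1>", "", html,
                      flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", stripped)
    return len(re.sub(r"\s+", "", text)) < min_text_chars
```

Run it on the HTML you captured with each AI bot's user-agent string; if the bot-served response trips this check while the browser-served one does not, your server-side rendering is not reaching the bot.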

The Bing indexing check (item 42) is one of the most commonly missed steps. ChatGPT discovers URLs through Bing's index. If Bing has not indexed a page, ChatGPT cannot find it, regardless of content quality.

Lee (2026) also found a "Reddit paradox": despite Reddit occupying 38.3% of Google's top results, it received zero API citations from ChatGPT but 8.9% to 15.6% of web UI citations. Platform-dependent behavior means you must test on each platform individually.

The Bottom Line: At minimum, verify your pages are indexed in both Bing (for ChatGPT) and Google (for AI Mode/Gemini). Set up IndexNow for faster Bing crawling. Test your server-side rendering with each AI bot's user-agent string.

For platform-specific citation behavior details, see our research on query intent and AI citations.

📡 MONITORING AND TRACKING SETUP

You cannot improve what you do not measure. AI search visibility tracking is fundamentally different from traditional rank tracking. Instead of monitoring keyword positions on one search engine, you need to track citation appearances across multiple AI platforms simultaneously.

| # | Check Item | Pass Criteria | Priority |
| --- | --- | --- | --- |
| 47 | AI bot log monitoring active | Server logs parsed for GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot | Critical |

Beyond this technical check, your monitoring stack should cover two layers:

AI Crawler Tracking. Parse server access logs for all 15+ AI bot user-agent strings. Track crawl frequency per page, and flag drops in crawl activity (a sudden PerplexityBot silence signals robots.txt or server issues).
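The log layer can start as simple as a substring tally. A minimal sketch, assuming standard combined-format access logs and the bot names from the checklist above; extend AI_BOTS with the rest of the 15+ user-agent strings you care about:

```python
import re
from collections import Counter

# User-agent substrings for the five essential AI crawlers (checks 1-5).
AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
           "Google-Extended"]
BOT_PATTERN = re.compile("|".join(map(re.escape, AI_BOTS)))


def count_ai_hits(log_lines):
    """Tally requests per AI bot across access-log lines."""
    hits = Counter()
    for line in log_lines:
        match = BOT_PATTERN.search(line)
        if match:
            hits[match.group(0)] += 1
    return hits
```

Track these counts per page and per day; a crawler that was hitting you weekly and suddenly goes silent is your early warning of a robots.txt or server problem.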

Citation Monitoring. Run target queries through ChatGPT, Perplexity, Claude, and Google AI Mode weekly. Record which URLs appear in citations, and track changes over time. Crawler activity is a leading indicator; actual citations are the lagging indicator. Track both.

The Bottom Line: Start with server log monitoring for AI bots. Layer on citation tracking for your highest-priority queries. Our AI Visibility Monitoring service automates both.

✅ THE COMPLETE AUDIT SUMMARY

Critical (fix first): Allow 5 essential AI crawlers (checks 1 through 5), XML sitemap in robots.txt (11), server-side rendering (14), internal navigation links >= 100 (16), self-referencing canonical (17), Product schema (23), dateModified (37), Bing + Google indexing verified (42, 45), AI bot log monitoring (47).

High (significant impact): Sitemap completeness and lastmod accuracy (12, 13), 200 status returns (15), remaining page-level predictors (18 through 22), FAQ and Review schema (24, 25), content structure (30 through 32), freshness signals (38, 39), Perplexity crawl freshness (44), SSR verification (46).

Medium (incremental gains): Secondary AI crawlers (6 through 10), schema refinement (26 through 29), content format polish (33 through 36), freshness sync (40, 41), IndexNow (43).

If you want a professional audit against all 47 checks, our AI SEO Audit Service covers every item with a detailed remediation plan. For a quick self-assessment, run your top 5 pages through the AI Visibility Quick Check.

❓ FREQUENTLY ASKED QUESTIONS

How long does it take to see results after fixing audit items? It depends on the platform. Google AI Mode propagates within days. ChatGPT depends on Bing reindexing (1 to 4 weeks, faster with IndexNow). Perplexity updates within days for frequently crawled pages. Claude fetches pages live, so changes take effect on the next query. Overall, expect 2 to 6 weeks for full propagation across all platforms.

Should I block AI crawlers to prevent content scraping? That depends on your business model. Blocking AI crawlers prevents training use but also prevents citation in AI responses. For most businesses, the visibility benefit outweighs the concern. If you need selective control, allow ChatGPT-User (citation) while blocking GPTBot (training). See our guide on tracking AI bots effectively for the distinction between training and citation bots.

Does page speed matter for AI citation? Our research found no statistically significant effect of load time on AI citation probability. This contradicts popular advice. While page speed matters for user experience and traditional Google ranking, AI crawlers are patient; they will wait for your page to load. Focus your optimization efforts on the 7 features that actually predict citation (internal navigation links, self-referencing canonical, schema presence, schema completeness, word count, content-to-HTML ratio, and external link ratio) rather than shaving milliseconds off load time.

What about author bios and E-E-A-T signals? Author attribution showed no significant effect on AI citation in our dataset (p = .522). E-E-A-T may matter for Google ranking, but it does not appear to influence AI citation decisions. AI platforms parse content structure, not bylines.

How is this different from a traditional SEO audit? A traditional SEO audit focuses on factors like backlinks, Core Web Vitals, keyword density, and mobile-friendliness. An AI search visibility audit focuses on crawl access for 15+ AI bots, schema type selection, content structure for machine parsing, freshness signals, and platform-specific indexing requirements. The two audits overlap on technical basics (canonical tags, sitemaps, server rendering) but diverge significantly on what counts as a ranking factor. Google rank has essentially zero correlation with AI citation (rho = -0.02 to 0.11) across platforms (Lee, 2026).

📚 REFERENCES

  • Lee, A. (2026). "Query Intent, Not Google Rank: What Best Predicts AI Citation Behavior." Preprint v5. DOI
  • Aggarwal, P., Murahari, V., Rajpurohit, T., Kalyan, A., Narasimhan, K., & Deshpande, A. (2024). "GEO: Generative Engine Optimization." KDD 2024. DOI