Sanity

How Structured Data Helps AI Agents Understand Your Website

Discover how AI agents use structured data to parse and understand your site, covering JSON-LD, schema.org types, Next.js implementation, and CMS-driven generation.

June 26, 2026 · 10 min read · Muhammad Zohaib Ramzan


Search engines and AI agents face the same fundamental challenge: HTML is designed for humans, not machines. Structured data bridges that gap by embedding machine-readable semantics directly into your pages. For AI agents, meaning the crawlers, reasoning engines, and retrieval-augmented generation (RAG) pipelines that increasingly power modern search and AI products, this metadata can be the difference between a page that gets understood and one that gets ignored.

This guide walks through everything a senior developer needs to know: the vocabulary, the implementation patterns, the tooling, and the emerging conventions that make your site legible to the next generation of AI-powered consumers.

What Is Structured Data?

Structured data is a standardized format for annotating web content so that automated systems can interpret its meaning, not just its text. Rather than inferring that a block of text is a recipe, a product listing, or a news article, a crawler can read an explicit declaration that says exactly what the content is and what its properties are.

The dominant format on the modern web is JSON-LD (JavaScript Object Notation for Linked Data), recommended by Google and widely adopted across the industry. JSON-LD is embedded in a <script type="application/ld+json"> tag in the document <head> or <body>. Because it lives in a separate script block rather than being interleaved with HTML attributes, it is easy to generate, maintain, and validate without touching markup.

The vocabulary that gives JSON-LD its meaning comes from schema.org, a collaborative project founded by Google, Microsoft, Yahoo, and Yandex. Schema.org defines hundreds of types—Article, Product, Event, Person, Organization, and many more—along with the properties each type supports. Types form an inheritance hierarchy: NewsArticle extends Article, which extends CreativeWork, which extends Thing. This hierarchy lets AI agents apply broad reasoning even when they encounter a specific subtype they haven’t seen before.
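To make the inheritance idea concrete, here is a minimal sketch of how a consumer might fall back to a broader type. The `PARENT_TYPE` table is a hand-written excerpt of the schema.org hierarchy, and `ancestorsOf` and `isA` are hypothetical helper names. It also assumes a single parent per type, which is a simplification of the real vocabulary.

```typescript
// Hand-written excerpt of the schema.org type hierarchy (child -> parent).
// Assumption: one parent per type; the real vocabulary allows more.
const PARENT_TYPE: Record<string, string | undefined> = {
  NewsArticle: 'Article',
  TechArticle: 'Article',
  BlogPosting: 'SocialMediaPosting',
  SocialMediaPosting: 'Article',
  Article: 'CreativeWork',
  CreativeWork: 'Thing',
};

// Walk up the hierarchy and collect every ancestor, nearest first.
function ancestorsOf(type: string): string[] {
  const chain: string[] = [];
  let current = PARENT_TYPE[type];
  while (current !== undefined) {
    chain.push(current);
    current = PARENT_TYPE[current];
  }
  return chain;
}

// True when `type` is `ancestor` itself or one of its subtypes.
function isA(type: string, ancestor: string): boolean {
  return type === ancestor || ancestorsOf(type).indexOf(ancestor) !== -1;
}
```

A consumer that only has handling for CreativeWork can still process a NewsArticle, because `isA('NewsArticle', 'CreativeWork')` holds.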

A minimal Article annotation looks like this:

{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How Structured Data Helps AI Agents Understand Your Website",
  "author": {
    "@type": "Person",
    "name": "Jane Smith"
  },
  "datePublished": "2024-06-01",
  "dateModified": "2024-06-15",
  "description": "A developer guide to structured data for AI agents."
}

Beyond JSON-LD, two older formats still exist: Microdata (HTML attributes like itemscope and itemprop) and RDFa (attribute-based annotations). Both are harder to maintain and less commonly used in new projects. JSON-LD is the clear choice for any greenfield implementation.

How AI Agents Crawl and Parse Web Content

Understanding why structured data matters requires a mental model of how AI agents actually consume web pages.

Crawling is the process of discovering and fetching pages. A crawler follows links, respects robots.txt directives, and stores raw HTML (and sometimes rendered JavaScript output) in an index. Traditional search crawlers like Googlebot have done this for decades. Newer AI agents—including those powering large language model (LLM) search features, autonomous web agents, and RAG pipelines—operate on similar principles but with different downstream goals.

Parsing transforms raw HTML into a structured representation. A parser builds a DOM tree, extracts text nodes, identifies headings and lists, and—critically—locates <script type="application/ld+json"> blocks. JSON-LD is trivially parseable: it requires no DOM traversal, no CSS selector logic, and no heuristic inference. The agent simply deserializes the JSON and has a typed, property-rich object graph.
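The extraction step described above can be sketched in a few lines. This is an illustration rather than how any particular crawler works: `extractJsonLd` is a hypothetical name, and it uses a regular expression where a production agent would more likely use a real HTML parser.

```typescript
// Find every <script type="application/ld+json"> block in an HTML string and
// return the parsed contents. Blocks with invalid JSON are skipped.
function extractJsonLd(html: string): unknown[] {
  const pattern =
    /<script\b[^>]*type\s*=\s*["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results: unknown[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      results.push(JSON.parse(match[1] ?? ''));
    } catch {
      // Malformed JSON-LD: ignore this block and keep scanning.
    }
  }
  return results;
}
```

Running this against the raw HTML response, before any JavaScript executes, is also a quick way to confirm that your markup is visible to crawlers that do not render pages.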

Semantic understanding is where structured data pays its biggest dividend. Without annotations, an AI agent must infer that a page is a how-to guide from signals like heading text, numbered lists, and imperative verbs. With a HowTo schema, the agent receives an explicit list of HowToStep objects, each with a name, text, and optional image. The difference in confidence and accuracy is substantial.

Modern AI agents also use structured data to:

Resolve entity identity. A sameAs property linking to a Wikidata or Wikipedia URL tells an agent that the entity on your page is the same as a well-known knowledge graph node.
Understand content freshness. datePublished and dateModified allow agents to rank or filter content by recency without parsing prose.
Navigate site hierarchy. BreadcrumbList gives agents an explicit map of where a page sits within your information architecture.
Assess authority. Organization and Person schemas with url, logo, and sameAs properties help agents evaluate source credibility.

The practical implication: the structured data that AI agents rely on is not a nice-to-have SEO trick. It is a first-class signal that shapes how your content is retrieved, ranked, and synthesized.

Key Schema Types for AI Agent Comprehension

Not all schema types are equally valuable for AI agent legibility. The following are the highest-impact types for most content-driven websites.

Article / NewsArticle / BlogPosting

Use Article (or its subtypes) for any editorial content. Google’s Article documentation treats its properties as recommended rather than strictly required, but headline, author, datePublished, and image are the core set to include. Useful additions: dateModified, description, publisher, mainEntityOfPage, and keywords.

FAQPage

FAQPage with nested Question and Answer entities is one of the most powerful types for AI agents. It provides a structured Q&A surface that LLMs can directly incorporate into conversational responses. Each Question has a name (the question text) and an acceptedAnswer with a text property.
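As a minimal sketch of that shape, the builder below turns a list of question and answer pairs into a FAQPage object. `FaqEntry` and `buildFaqSchema` are hypothetical names. `mainEntity` is the schema.org property that holds the Question list.

```typescript
interface FaqEntry {
  question: string;
  answer: string;
}

// Build a FAQPage object ready to be serialized into a JSON-LD script tag.
function buildFaqSchema(entries: FaqEntry[]) {
  return {
    '@context': 'https://schema.org',
    '@type': 'FAQPage',
    mainEntity: entries.map((entry) => ({
      '@type': 'Question',
      name: entry.question, // the question text
      acceptedAnswer: {
        '@type': 'Answer',
        text: entry.answer,
      },
    })),
  };
}
```

Only include questions and answers that are visible on the page itself, so the markup matches the content.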

HowTo

HowTo is ideal for tutorial and guide content. It accepts a name, description, totalTime (ISO 8601 duration), estimatedCost, and an array of HowToStep objects. Each step has a name, text, and optionally an image and url.
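The sketch below shows one way to assemble a HowTo object, including the ISO 8601 duration for totalTime. `toIsoDuration` and `buildHowToSchema` are hypothetical helpers, and the duration helper assumes a non-negative whole number of minutes.

```typescript
interface HowToStepInput {
  name: string;
  text: string;
  image?: string;
  url?: string;
}

// Convert whole minutes into an ISO 8601 duration such as "PT1H30M".
// Assumption: totalMinutes is a non-negative integer.
function toIsoDuration(totalMinutes: number): string {
  const hours = Math.floor(totalMinutes / 60);
  const minutes = totalMinutes % 60;
  const hourPart = hours > 0 ? `${hours}H` : '';
  const minutePart = minutes > 0 || hours === 0 ? `${minutes}M` : '';
  return `PT${hourPart}${minutePart}`;
}

// Build a HowTo object; steps keep their input order via `position`.
function buildHowToSchema(
  name: string,
  description: string,
  totalMinutes: number,
  steps: HowToStepInput[]
) {
  return {
    '@context': 'https://schema.org',
    '@type': 'HowTo',
    name,
    description,
    totalTime: toIsoDuration(totalMinutes),
    step: steps.map((step, index) => ({
      '@type': 'HowToStep',
      position: index + 1,
      ...step,
    })),
  };
}
```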

BreadcrumbList

BreadcrumbList with ListItem children gives AI agents an explicit site hierarchy. Each ListItem has a position (integer), name, and item (URL). This is especially valuable for large sites where URL structure alone is ambiguous.

WebSite and Organization

A single WebSite schema on your homepage declares your site’s name, url, and optionally a SearchAction. Organization establishes your brand identity with name, url, logo, and sameAs (an array of social profile URLs and knowledge graph URIs). For personal sites or author pages, use Person with name, url, jobTitle, sameAs, and image.

SoftwareApplication

For developer tools and apps, SoftwareApplication with applicationCategory, operatingSystem, offers, and aggregateRating gives AI agents the metadata they need to surface your tool in relevant queries.

Implementing Structured Data in Next.js

Next.js is a widely used React framework for content-driven sites, and it offers clean patterns for injecting JSON-LD in both the App Router and Pages Router.

App Router (Next.js 13+)

In the App Router, render a <script> tag directly inside your page or layout Server Component. Because Server Components render on the server, the JSON-LD is present in the initial HTML response—no hydration required.

// app/blog/[slug]/page.tsx
import { getPost } from '@/lib/posts'; // assumed data-fetching helper; adjust the path to your project

export default async function BlogPost({ params }: { params: { slug: string } }) {
  // Note: in Next.js 15+, `params` is a Promise and must be awaited before use.
  const post = await getPost(params.slug);

  const jsonLd = {
    '@context': 'https://schema.org',
    '@type': 'BlogPosting',
    headline: post.title,
    description: post.excerpt,
    author: {
      '@type': 'Person',
      name: post.author.name,
      url: post.author.url,
    },
    datePublished: post.publishedAt,
    dateModified: post.updatedAt ?? post.publishedAt,
    image: post.mainImage.url,
    mainEntityOfPage: {
      '@type': 'WebPage',
      '@id': `https://example.com/blog/${params.slug}`,
    },
    publisher: {
      '@type': 'Organization',
      name: 'Example Inc.',
      logo: {
        '@type': 'ImageObject',
        url: 'https://example.com/logo.png',
      },
    },
  };

  // Escape "<" so CMS content containing "</script>" cannot break out of the tag.
  const jsonLdHtml = JSON.stringify(jsonLd).replace(/</g, '\\u003c');

  return (
    <>
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{ __html: jsonLdHtml }}
      />
      <article>
        <h1>{post.title}</h1>
      </article>
    </>
  );
}

For FAQ content, compose a separate FAQPage schema and inject it as a second <script type="application/ld+json"> tag. Agents and crawlers will process all of them independently.

Pages Router (the default before Next.js 13, still supported)

In the Pages Router, use next/head to inject the script into the document <head>:

// pages/blog/[slug].tsx
import Head from 'next/head';

// `post` is assumed to arrive as a prop from getStaticProps or getServerSideProps.
export default function BlogPost({ post }) {
  const jsonLd = {
    '@context': 'https://schema.org',
    '@type': 'BlogPosting',
    headline: post.title,
    author: { '@type': 'Person', name: post.author.name },
    datePublished: post.publishedAt,
    image: post.mainImage.url,
  };

  // Escape "<" so content containing "</script>" cannot break out of the tag.
  const jsonLdHtml = JSON.stringify(jsonLd).replace(/</g, '\\u003c');

  return (
    <>
      <Head>
        <script
          key="jsonld-article"
          type="application/ld+json"
          dangerouslySetInnerHTML={{ __html: jsonLdHtml }}
        />
      </Head>
      <article>
        <h1>{post.title}</h1>
      </article>
    </>
  );
}

Reusable Schema Utilities

Avoid duplicating schema construction logic across pages. Extract builders into a shared module and use the schema-dts package, which provides TypeScript type definitions for schema.org, to catch malformed schemas at compile time:

// lib/schema.ts
import type { BlogPosting, WithContext } from 'schema-dts';

// Minimal shape of the post data used below (assumed; align it with your CMS types).
interface Post {
  title: string;
  excerpt: string;
  slug: string;
  publishedAt: string;
  updatedAt?: string;
  author: { name: string; slug: string };
  mainImage: { url: string };
}

export function buildArticleSchema(
  post: Post,
  siteUrl: string
): WithContext<BlogPosting> {
  return {
    '@context': 'https://schema.org',
    '@type': 'BlogPosting',
    headline: post.title,
    description: post.excerpt,
    author: {
      '@type': 'Person',
      name: post.author.name,
      url: `${siteUrl}/authors/${post.author.slug}`,
    },
    datePublished: post.publishedAt,
    dateModified: post.updatedAt ?? post.publishedAt,
    image: post.mainImage.url,
    url: `${siteUrl}/blog/${post.slug}`,
  };
}

export function buildBreadcrumbSchema(
  crumbs: { name: string; url: string }[]
) {
  return {
    '@context': 'https://schema.org',
    '@type': 'BreadcrumbList',
    // Positions are 1-based and follow the order of the crumbs array.
    itemListElement: crumbs.map((crumb, i) => ({
      '@type': 'ListItem',
      position: i + 1,
      name: crumb.name,
      item: crumb.url,
    })),
  };
}

Testing Structured Data

Implementing structured data without validation is a recipe for silent failures. Use these tools to verify correctness before and after deployment.

Google Rich Results Test

The Google Rich Results Test (search.google.com/test/rich-results) accepts a URL or raw HTML and reports which rich result types are eligible, which properties are present, and which required fields are missing. It is the authoritative tool for Google-specific validation and should be part of your pre-launch checklist.

Schema Markup Validator

The Schema Markup Validator (validator.schema.org) validates against the full schema.org specification rather than Google’s subset. Use it to catch type errors, unknown properties, and structural issues that the Rich Results Test might not flag.

Browser DevTools and CI Validation

For rapid iteration during development, inspect the rendered HTML in DevTools and locate the <script type="application/ld+json"> blocks. Copy the content into JSON.parse() in the console to verify it is valid JSON. For production-grade pipelines, integrate the schema-dts npm package for TypeScript compile-time validation, and add Playwright tests that assert <script type="application/ld+json"> blocks are present and parseable on every deploy.
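A CI check of this kind can be as small as the sketch below, which takes already-parsed JSON-LD objects and reports gaps. `checkJsonLd` is a hypothetical name and `EXPECTED_PROPS` is a project-level checklist that you would define yourself, not an official list. In a Playwright test you would feed it the parsed contents of each script block.

```typescript
type JsonLdNode = Record<string, unknown>;

// Project-level checklist of properties each type should carry.
// Assumption: these lists are illustrative, not an official specification.
const EXPECTED_PROPS: Record<string, string[]> = {
  BlogPosting: ['headline', 'author', 'datePublished', 'image'],
  FAQPage: ['mainEntity'],
};

// Return a human-readable issue for every missing property and every
// type that is declared more than once on the same page.
function checkJsonLd(nodes: JsonLdNode[]): string[] {
  const issues: string[] = [];
  const countByType: Record<string, number> = {};

  for (const node of nodes) {
    const rawType = node['@type'];
    const type = typeof rawType === 'string' ? rawType : 'unknown';
    countByType[type] = (countByType[type] ?? 0) + 1;

    for (const prop of EXPECTED_PROPS[type] ?? []) {
      const value = node[prop];
      if (value === undefined || value === null || value === '') {
        issues.push(`${type}: missing "${prop}"`);
      }
    }
  }

  for (const type of Object.keys(countByType)) {
    const count = countByType[type] ?? 0;
    if (count > 1) {
      issues.push(`${type}: declared ${count} times`);
    }
  }
  return issues;
}
```

Failing the build when the returned array is non-empty turns silent markup regressions into visible test failures.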

Combining Structured Data with llms.txt and CMS

Structured data is one layer of a broader strategy for making your site legible to AI systems. Two complementary approaches are worth understanding: the emerging llms.txt convention and CMS-driven schema generation.

The llms.txt Convention

llms.txt is a proposed convention (analogous to robots.txt) for providing AI agents with a curated, markdown-formatted summary of your site’s content and structure. Placed at https://yourdomain.com/llms.txt, it can include a brief description of the site, links to key pages with short descriptions, guidance on which content is most relevant for AI consumption, and licensing preferences. Structured data and llms.txt are complementary: structured data annotates individual pages at the schema level, while llms.txt provides a high-level map of the entire site.
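Because llms.txt is plain markdown, it can be generated from the same content source as the rest of the site. The sketch below follows the layout described in the proposal (a title heading, a short blockquote summary, then sections of annotated links). `renderLlmsTxt` and its input types are hypothetical names.

```typescript
interface LlmsLink {
  title: string;
  url: string;
  note?: string; // optional short description shown after the link
}

interface LlmsSection {
  heading: string;
  links: LlmsLink[];
}

// Render llms.txt content: "# name", "> summary", then "## section" link lists.
function renderLlmsTxt(
  siteName: string,
  summary: string,
  sections: LlmsSection[]
): string {
  const lines: string[] = [`# ${siteName}`, '', `> ${summary}`];
  for (const section of sections) {
    lines.push('', `## ${section.heading}`, '');
    for (const link of section.links) {
      const suffix = link.note ? `: ${link.note}` : '';
      lines.push(`- [${link.title}](${link.url})${suffix}`);
    }
  }
  return lines.join('\n') + '\n';
}
```

In a Next.js project, the returned string could be served from a route handler or written to the public directory at build time.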

CMS-Driven Structured Data Generation

Manually maintaining JSON-LD across hundreds of pages is unsustainable. A headless CMS like Sanity solves this by treating content as structured data from the ground up. Every document has a typed schema with explicit fields—title, author (a reference to an author document), publishedAt, mainImage—that map almost directly to schema.org types.

You can write a single GROQ query that fetches a post and its related data, then transform the result into a JSON-LD object:

// lib/sanity-schema.ts
import { client } from './sanity-client'; // assumed: your configured Sanity client instance

export async function getPostWithSchema(slug: string) {
  const post = await client.fetch(
    `*[_type == "post" && slug.current == $slug][0]{
      title,
      excerpt,
      publishedAt,
      updatedAt,
      "slug": slug.current,
      "author": author->{ name, "slug": slug.current },
      "mainImage": mainImage.asset->url,
      tags
    }`,
    { slug }
  );

  // The query yields null when no post matches the slug.
  if (!post) return null;

  const jsonLd = {
    '@context': 'https://schema.org',
    '@type': 'BlogPosting',
    headline: post.title,
    description: post.excerpt,
    author: {
      '@type': 'Person',
      name: post.author.name,
    },
    datePublished: post.publishedAt,
    dateModified: post.updatedAt ?? post.publishedAt,
    image: post.mainImage,
    keywords: post.tags?.join(', '),
  };

  return { post, jsonLd };
}

This approach means your structured data is always in sync with your CMS content. When an editor updates an author name or publication date, the JSON-LD updates automatically on the next build or revalidation. For sites using on-demand ISR, structured data updates propagate within seconds of a content change.

Common Mistakes

Even experienced developers make predictable errors when implementing structured data. Here are the most impactful ones to avoid.

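Missing required properties. Google’s guidelines define required and recommended properties for each rich result type. For Article, the core properties to include are headline, image, datePublished, and author. For types that do have strictly required properties, omitting any of them disqualifies the page from rich results, and sparse markup of any type gives AI agents less to work with.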

Using incorrect @type values. Assigning @type: "Blog" to an individual post (instead of BlogPosting) is a common mistake. Blog describes the collection; BlogPosting describes an individual entry.

Duplicate schemas of the same type. Injecting two Article schemas on the same page—for example, one from a layout component and one from the page component—confuses crawlers. Use a single, authoritative schema per type per page.

Mismatched content. The structured data must accurately reflect the visible page content. A headline in JSON-LD that differs from the <h1> on the page, or a datePublished that doesn’t match the displayed date, can trigger manual actions from Google and reduce AI agent trust.
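A mismatch between the JSON-LD headline and the visible heading is easy to check automatically. The following sketch compares the two after normalizing whitespace. `firstH1Text` and `headlineMatchesH1` are hypothetical helpers, the regular expression stands in for a real HTML parser, and HTML entities are not decoded.

```typescript
// Return the text of the first <h1>, with nested tags removed and
// whitespace collapsed, or null when the page has no <h1>.
function firstH1Text(html: string): string | null {
  const match = /<h1\b[^>]*>([\s\S]*?)<\/h1>/i.exec(html);
  if (match === null) return null;
  return (match[1] ?? '')
    .replace(/<[^>]+>/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// True when the JSON-LD headline equals the visible <h1> text.
function headlineMatchesH1(html: string, headline: string): boolean {
  const h1 = firstH1Text(html);
  return h1 !== null && h1 === headline.replace(/\s+/g, ' ').trim();
}
```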

Blocking JavaScript rendering. If your JSON-LD is injected by client-side JavaScript and your page is not server-rendered, some crawlers may never see it. Always verify that structured data is present in the raw HTML response, not just in the post-hydration DOM.

Ignoring validation errors. The Rich Results Test and Schema Markup Validator surface warnings and errors that are easy to dismiss. Treat validation errors as bugs, not suggestions.

Best Practices

A mature structured data implementation is not a one-time task—it is an ongoing engineering discipline.

Keep schemas up to date. Schema.org evolves. New types and properties are added regularly, and Google’s rich result requirements change. Subscribe to the schema.org changelog and Google’s Search Central blog to stay current.

Validate on every deploy. Integrate schema-dts type checks and automated tests that parse the rendered JSON-LD into your CI/CD pipeline, and run the Rich Results Test manually on key templates. A broken schema that ships to production can silently degrade your search and AI visibility for weeks before anyone notices.

Use CMS-driven generation. Generating JSON-LD from your CMS schema eliminates an entire class of drift bugs. If your CMS has typed fields, your structured data should be derived from them programmatically.

Prefer specificity. Use the most specific @type that accurately describes your content. TechArticle is better than Article for developer documentation. HowTo is better than Article for step-by-step guides.

Use @id for entity disambiguation. Assign stable @id URIs to your key entities (organization, authors, products). This allows AI agents to build a consistent knowledge graph across multiple pages and crawl sessions.
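One common way to apply this is a single `@graph` in which entities are declared once and referenced by `@id` everywhere else. The sketch below is illustrative: the `https://example.com` URLs, the fragment naming (`#organization`, `#person`, `#article`), and the `buildGraph` helper are assumptions, not a required convention.

```typescript
const SITE_URL = 'https://example.com'; // placeholder domain
const ORG_ID = `${SITE_URL}/#organization`;

interface GraphPost {
  slug: string;
  title: string;
  authorSlug: string;
  authorName: string;
}

// Build one @graph where the article points at its author and publisher
// by @id instead of repeating their full descriptions.
function buildGraph(post: GraphPost) {
  const authorId = `${SITE_URL}/authors/${post.authorSlug}#person`;
  return {
    '@context': 'https://schema.org',
    '@graph': [
      { '@type': 'Organization', '@id': ORG_ID, name: 'Example Inc.', url: SITE_URL },
      { '@type': 'Person', '@id': authorId, name: post.authorName },
      {
        '@type': 'BlogPosting',
        '@id': `${SITE_URL}/blog/${post.slug}#article`,
        headline: post.title,
        author: { '@id': authorId },
        publisher: { '@id': ORG_ID },
      },
    ],
  };
}
```

Because the identifiers are derived from stable URLs, every page that mentions the same author or organization emits the same `@id`.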

Test with real AI agents. Beyond Google’s tools, check how your pages appear in AI-powered search features like Microsoft Copilot and Perplexity. These products can use structured data in ways that traditional SEO tools don’t capture.

FAQ

What is the difference between structured data and metadata?

Metadata (like <meta> tags) provides page-level information such as title, description, and Open Graph properties. Structured data goes further by describing the entities on a page—their types, properties, and relationships—using a formal vocabulary like schema.org. Both are important, but structured data provides significantly richer semantic context for AI agents.

Do AI agents actually use structured data, or is it just for Google?

Structured data is used by a growing range of AI systems beyond Google. Bing’s AI-powered search, Perplexity, and various RAG pipelines that crawl the web all benefit from structured data. Additionally, LLM training pipelines that process Common Crawl data can use JSON-LD annotations to improve entity recognition and factual grounding. The investment pays dividends across the entire AI ecosystem.

How does structured data interact with vector search and RAG?

In a RAG pipeline, documents are chunked, embedded, and stored in a vector database. Structured data can be extracted from JSON-LD blocks and stored as metadata alongside the vector embeddings. This allows retrieval systems to filter by entity type, date, author, or other properties before performing semantic similarity search—dramatically improving retrieval precision for structured queries.
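The filter-then-rank pattern can be sketched without any vector database. In the example below, `Chunk`, `cosineSimilarity`, and `retrieve` are hypothetical names, the metadata fields are assumed to have been copied from each page's JSON-LD at indexing time, and dates are assumed to be ISO 8601 strings in the same format so that string comparison orders them correctly.

```typescript
interface Chunk {
  text: string;
  embedding: number[];
  // Metadata copied from the page's JSON-LD when the chunk was indexed.
  meta: { type: string; datePublished: string };
}

interface MetaFilter {
  type?: string;
  publishedAfter?: string; // ISO 8601 date, e.g. "2024-01-01"
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const length = Math.min(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Step 1: narrow candidates using structured metadata.
// Step 2: rank the survivors by similarity to the query embedding.
function retrieve(
  chunks: Chunk[],
  queryEmbedding: number[],
  filter: MetaFilter,
  topK: number
): Chunk[] {
  return chunks
    .filter((chunk) => filter.type === undefined || chunk.meta.type === filter.type)
    .filter(
      (chunk) =>
        filter.publishedAfter === undefined ||
        chunk.meta.datePublished > filter.publishedAfter
    )
    .map((chunk) => ({ chunk, score: cosineSimilarity(chunk.embedding, queryEmbedding) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, topK)
    .map((scored) => scored.chunk);
}
```

Filtering first means the similarity search only ever compares the query against chunks that already satisfy the structured constraints.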

Should I use JSON-LD, Microdata, or RDFa?

JSON-LD is the recommended format for all new implementations. It is easier to generate programmatically, easier to validate, and does not require modifying HTML markup. Google explicitly recommends JSON-LD. Microdata and RDFa are legacy formats that should only be maintained if they already exist in a codebase with no clear migration path.

How often should I update my structured data?

Structured data should be updated whenever the underlying content changes. For CMS-driven sites, this happens automatically if your JSON-LD is generated from CMS fields. For statically authored schemas, establish a review cadence (quarterly at minimum) and always re-validate after major site redesigns, URL structure changes, or CMS migrations.

Conclusion

Structured data is no longer an optional SEO enhancement. It is foundational infrastructure for the AI-powered web. As AI agents become a primary interface through which users discover and consume content, the sites that invest in rich, accurate, and well-maintained schemas will have a durable competitive advantage.

The implementation path is clear: adopt JSON-LD with schema.org vocabulary, use the most specific types that accurately describe your content, generate schemas programmatically from your CMS, validate on every deploy, and complement your page-level annotations with site-level conventions like llms.txt. In a Next.js application, this amounts to a few hundred lines of well-tested utility code that pays compounding returns as AI-powered search and retrieval continue to grow.

Start with your highest-traffic pages, validate rigorously, and build from there. The machines are reading—make sure they understand what they find.