Sanity
How to Improve AI Chatbot Accuracy with RAG and Structured Content
AI chatbot accuracy suffers when models hallucinate or lack context. Learn how Retrieval-Augmented Generation (RAG) and structured CMS content work together to deliver precise, reliable answers.

Why AI Chatbots Give Inaccurate Answers
AI chatbot accuracy is one of the most pressing challenges facing teams deploying conversational AI in production. Despite the remarkable capabilities of large language models (LLMs), they are fundamentally limited by the data they were trained on. When a user asks a question that falls outside the model’s training window — or requires knowledge of your specific product, documentation, or internal processes — the model is forced to either admit ignorance or, worse, hallucinate a plausible-sounding but incorrect answer.
Hallucination is not a bug in the traditional sense; it is an emergent property of how LLMs work. These models are trained to predict the most statistically likely next token, not to retrieve verified facts. When the correct answer isn’t in the training data, the model fills the gap with confident-sounding fabrications. This is particularly dangerous in domains like healthcare, legal, finance, and technical support, where inaccurate answers can have real consequences.
Beyond hallucination, there are several other root causes of poor AI chatbot accuracy:
- Stale training data. LLMs have a knowledge cutoff date. Any information published after that date is invisible to the model unless it is explicitly provided at inference time.
- Lack of domain specificity. General-purpose models are trained on broad internet data. They may lack the nuanced, proprietary knowledge your users need.
- Ambiguous or unstructured context. Even when relevant documents exist, if they are poorly structured or buried in noise, the model may fail to extract the right information.
- Context window limitations. LLMs can only process a finite amount of text at once. Feeding an entire knowledge base into a prompt is not feasible.
Understanding these root causes is the first step toward building systems that genuinely improve AI chatbot accuracy. The solution lies in combining retrieval mechanisms with well-structured content — and that is exactly what RAG enables.
How RAG Improves Accuracy
Retrieval-Augmented Generation (RAG) is an architectural pattern that addresses the core limitations of standalone LLMs by grounding model responses in retrieved, up-to-date, domain-specific content. Instead of relying solely on parametric knowledge baked into model weights, a RAG system dynamically fetches relevant documents at query time and injects them into the model’s context window.
The RAG pipeline typically works as follows:
- Indexing. Your content — documentation, FAQs, product pages, knowledge base articles — is chunked into smaller passages and converted into vector embeddings using an embedding model. These embeddings are stored in a vector database such as Pinecone, Weaviate, or pgvector.
- Retrieval. When a user submits a query, the query is also embedded and a similarity search is performed against the vector store. The top-k most semantically relevant passages are retrieved.
- Augmentation. The retrieved passages are injected into the LLM’s prompt as context, alongside the user’s question.
- Generation. The LLM generates a response grounded in the provided context, dramatically reducing the likelihood of hallucination.
RAG improves AI chatbot accuracy in several measurable ways. First, it ensures the model always has access to the most current information — you update your content, re-index it, and the chatbot immediately benefits. Second, it constrains the model’s response space to verified, authoritative sources. Third, it enables citation — the system can surface which documents it used to generate an answer, giving users a path to verify claims.
However, RAG is only as good as the content it retrieves. A retrieval system that surfaces poorly written, ambiguous, or inconsistently structured documents will still produce inaccurate or confusing answers. This is where structured content becomes the critical differentiator.
The Role of Structured CMS Content (Sanity)
Structured content management systems like Sanity treat content as data rather than as formatted blobs of HTML. Every piece of content is modeled with explicit fields, types, and relationships. This approach has profound implications for AI chatbot accuracy.
When content is stored as structured data, it becomes far easier to:
- Chunk intelligently. Instead of splitting arbitrary HTML at character boundaries, you can chunk by semantic units — a FAQ answer, a product description field, a step in a tutorial. Each chunk is self-contained and meaningful.
- Attach metadata. Structured content carries rich metadata: category, author, publish date, product version, audience. This metadata can be embedded alongside content chunks, enabling more precise retrieval filtering.
- Maintain consistency. Schema-enforced content models ensure that every article, FAQ, or product page follows the same structure. Consistency reduces noise in the retrieval pipeline.
- Enable real-time updates. Sanity’s Content Lake exposes a real-time API. When an editor updates a document, your indexing pipeline can be triggered immediately via webhooks, keeping the vector store fresh.
Sanity’s Portable Text format is particularly well-suited for AI pipelines. Because body content is stored as a typed array of blocks rather than raw HTML, you can programmatically extract headings, paragraphs, lists, and inline annotations without parsing markup. This makes it straightforward to build chunking logic that respects semantic boundaries.
Furthermore, Sanity’s GROQ query language allows you to fetch precisely the fields you need for indexing — no over-fetching, no irrelevant markup. You can query for title, excerpt, body, category, and tags in a single request, assembling a rich, metadata-annotated document ready for embedding.
Designing Content for AI Consumption
Improving AI chatbot accuracy is not purely an engineering problem — it is also a content design problem. The way your editors write and structure content has a direct impact on retrieval quality and generation accuracy. Here are the key principles for designing content that performs well in a RAG pipeline.
Write atomically. Each document, section, or FAQ entry should answer one question or cover one concept. Avoid sprawling articles that cover ten topics in a single page. Atomic content chunks are easier to retrieve precisely and less likely to introduce irrelevant context into the LLM’s prompt.
Use descriptive headings. Headings serve as semantic anchors. When a chunk is retrieved, the heading provides critical context about what the chunk is about. A heading like “How to reset your password” is far more useful to a retrieval system than “Step 3” or “Next steps.”
Prefer explicit over implicit. Avoid pronouns and references that only make sense in the context of the surrounding document. If a chunk says “It supports up to 100 users,” the retrieval system has no idea what “it” refers to. Rewrite as “The Enterprise plan supports up to 100 users.”
Leverage structured fields. Use dedicated fields for FAQs, key takeaways, and summaries rather than burying them in body text. A summary field or a keyPoints array can be indexed separately and retrieved with high precision for certain query types.
Maintain freshness. Stale content is a silent killer of AI chatbot accuracy. Establish editorial workflows that flag content for review when underlying products or policies change. Sanity’s document versioning and workflow tools can help enforce these processes.
Avoid duplication. Duplicate or near-duplicate content confuses retrieval systems and can cause the LLM to receive contradictory information. Use Sanity’s reference system to link to canonical content rather than copying it.
Implementation Steps
Building a RAG-powered chatbot on top of Sanity content involves several concrete engineering steps. Here is a practical implementation roadmap.
Step 1: Model your content schema for AI readiness. Review your existing Sanity schemas. Ensure that key informational content has dedicated fields for title, excerpt or summary, and structured body content. Add metadata fields like category, tags, audience, and lastReviewed if they don’t already exist. These fields will be invaluable for retrieval filtering.
Step 2: Build a content export pipeline. Use Sanity’s GROQ API to fetch all indexable documents. Query for _id, title, excerpt, body, category->title, tags, and publishedAt in a single request. Exclude draft documents by filtering out IDs that match the drafts.** path pattern. Trigger this pipeline on a schedule or via Sanity webhooks on document publish events.
Step 3: Chunk and embed your content. Convert Portable Text body arrays into plain text using a serializer library, then split into chunks of 300–600 tokens. Prepend each chunk with its document title and section heading for context. Use an embedding model such as OpenAI text-embedding-3-small or Cohere embed-english-v3 to generate vector representations.
Step 4: Store embeddings in a vector database. Upload chunks and their embeddings to a vector store such as Pinecone, Weaviate, or pgvector. Include metadata fields — documentId, title, category, publishedAt — as filterable attributes. This enables hybrid search, combining semantic similarity with metadata filters.
Step 5: Build the retrieval and generation layer. At query time, embed the user’s question, retrieve the top-k chunks, and construct a prompt that includes the retrieved context. Use a system prompt that instructs the LLM to answer only based on the provided context and to say “I don’t know” when the context is insufficient.
Step 6: Implement a feedback loop. Capture user feedback — thumbs up/down ratings, corrections — and log queries that returned low-confidence answers. Use this data to identify content gaps and improve both your content and your retrieval configuration.
Measuring Chatbot Accuracy
You cannot improve what you do not measure. Establishing robust evaluation metrics is essential for tracking AI chatbot accuracy over time and validating the impact of changes to your content or retrieval pipeline.
Retrieval metrics measure how well the system finds relevant documents:
- Recall@k — What fraction of relevant documents appear in the top-k retrieved results?
- Mean Reciprocal Rank (MRR) — How highly ranked is the first relevant result on average?
- Precision@k — Of the top-k retrieved chunks, what fraction are actually relevant?
Generation metrics measure the quality of the final answer:
- Faithfulness — Does the answer accurately reflect the retrieved context, without adding unsupported claims? Tools like RAGAS automate this evaluation.
- Answer relevance — Does the answer actually address the user’s question?
- Context utilization — Is the model making good use of the retrieved context, or ignoring it?
End-to-end metrics measure real-world performance:
- Human evaluation — Periodic manual review of a sample of conversations by domain experts.
- User satisfaction scores — Thumbs up/down ratings, CSAT surveys, or session abandonment rates.
- Escalation rate — How often does the chatbot fail to answer and escalate to a human agent?
Establish a golden dataset — a curated set of questions with known correct answers — and run it against your system regularly. This gives you a stable benchmark for comparing the impact of content updates, model upgrades, or retrieval configuration changes on AI chatbot accuracy.
Common Mistakes
Even well-intentioned RAG implementations frequently fall into predictable traps. Being aware of these pitfalls can save significant debugging time.
Chunking without context. Splitting documents at fixed token counts without regard for semantic boundaries produces chunks that are meaningless in isolation. A chunk that starts mid-sentence or mid-list item will confuse both the retrieval system and the LLM. Always chunk at natural boundaries — paragraph breaks, heading transitions, list item boundaries.
Ignoring metadata. Many teams index only the raw text of their content and discard metadata. This is a missed opportunity. Metadata enables filtered retrieval — for example, searching only articles tagged “billing” — which dramatically improves precision for domain-specific queries.
Over-retrieving context. Retrieving too many chunks fills the context window with noise and can actually decrease accuracy by diluting the relevant signal. Start with k=3 to k=5 and tune based on your evaluation metrics.
Neglecting content quality. RAG cannot compensate for fundamentally poor content. If your knowledge base contains outdated, contradictory, or ambiguous information, the chatbot will faithfully retrieve and present that bad information. Content quality is a prerequisite, not an afterthought.
Skipping re-ranking. Basic vector similarity search is a good first pass, but it is not perfect. Adding a cross-encoder re-ranker as a second stage — which scores each retrieved chunk against the query more precisely — can significantly improve retrieval quality.
No fallback strategy. Every RAG system will encounter queries for which no relevant content exists. Without a graceful fallback — such as acknowledging the gap and offering to escalate — the LLM will hallucinate. Always define explicit behavior for low-confidence scenarios.
Best Practices
Drawing together the lessons from the sections above, here are the best practices that consistently lead to high AI chatbot accuracy in production RAG systems.
Invest in content modeling first. Before writing a single line of retrieval code, ensure your Sanity schema is designed for AI consumption. Atomic documents, rich metadata, and consistent structure are the foundation everything else builds on.
Use hybrid search. Combine dense vector search with sparse keyword search (BM25). Hybrid search outperforms either approach alone, especially for queries that contain specific product names, error codes, or technical terms that may not be well-represented in the embedding space.
Implement contextual chunking. Prepend each chunk with its document title and the heading of the section it belongs to. This contextual prefix ensures that even a small, isolated chunk carries enough context for the LLM to use it correctly.
Version your embeddings. When you upgrade your embedding model, re-embed all content. Mixing embeddings from different models in the same index produces unpredictable retrieval behavior.
Automate re-indexing. Use Sanity webhooks to trigger re-indexing whenever a document is published or updated. Stale embeddings are a common and insidious source of accuracy degradation.
Apply prompt engineering discipline. Your system prompt is as important as your retrieval pipeline. Instruct the model clearly: use only the provided context, cite sources, express uncertainty when appropriate, and never fabricate information.
Monitor continuously. Deploy logging and monitoring from day one. Track retrieval latency, answer quality scores, and user feedback. Set up alerts for sudden drops in satisfaction metrics, which often indicate a content gap or a retrieval regression.
FAQ
What is the difference between RAG and fine-tuning for improving AI chatbot accuracy?
RAG and fine-tuning address different problems. Fine-tuning adjusts the model’s weights to improve its behavior on a specific task or domain, but it does not give the model access to new factual information after training. RAG, by contrast, retrieves up-to-date, domain-specific content at inference time without modifying the model. For most chatbot accuracy use cases — especially where content changes frequently — RAG is the more practical and cost-effective approach.
How often should I re-index my Sanity content for the RAG pipeline?
Ideally, re-indexing should be triggered automatically whenever content is published or updated, using Sanity webhooks. For large content libraries where full re-indexing is expensive, implement incremental indexing that only processes changed documents. At a minimum, run a full re-index on a weekly schedule to catch any missed updates.
What chunk size works best for RAG?
There is no universal answer, but a common starting point is 300–500 tokens per chunk with a 50-token overlap between adjacent chunks. Smaller chunks improve retrieval precision but may lack sufficient context for generation. Larger chunks provide more context but reduce precision. The optimal size depends on your content structure and query patterns — always tune based on your evaluation metrics.
Can RAG work with Portable Text from Sanity without converting to plain text?
Yes, but you will need to serialize Portable Text to plain text before embedding, since embedding models operate on strings. Libraries like @portabletext/to-html or custom serializers can convert Portable Text blocks to clean text. Preserve heading structure in the serialized output — for example, by prepending heading text with the heading level — so that semantic structure is not lost during chunking.
How do I handle queries that fall outside my content’s scope?
Define an explicit out-of-scope handling strategy in your system prompt. Instruct the model to respond with a clear, honest message when retrieved context does not contain a relevant answer — for example: “I don’t have information about that in my knowledge base. You may want to contact our support team.” Avoid letting the model speculate or draw on its general training knowledge for domain-specific queries, as this reintroduces hallucination risk.
Conclusion
Improving AI chatbot accuracy is not a single-step fix — it is a continuous discipline that spans content strategy, system architecture, and ongoing measurement. The combination of Retrieval-Augmented Generation and structured content management with a platform like Sanity provides a powerful foundation for building chatbots that users can genuinely trust.
The key insight is that the quality of your retrieval is bounded by the quality of your content. No amount of prompt engineering or model fine-tuning can compensate for a knowledge base that is stale, ambiguous, or poorly structured. By investing in content modeling, maintaining editorial discipline, and treating your CMS as a first-class component of your AI infrastructure, you create a virtuous cycle: better content leads to better retrieval, which leads to more accurate answers, which builds user trust.
Start with a clear content audit, model your Sanity schemas for AI readiness, implement a RAG pipeline with hybrid search and contextual chunking, and establish a measurement framework from day one. Iterate based on real user feedback and evaluation metrics. With this approach, AI chatbot accuracy becomes a measurable, improvable engineering metric — not a vague aspiration.


