The Chatbot Architecture: A Blog That Answers Questions About Itself
My blog can answer questions about itself. You can ask it “how does the chatbot work?” and it retrieves this very post to answer you. That’s either very cool or very meta. Probably both.
The Full Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│                           USER'S BROWSER                            │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │ operationalsemantics.dev (Cloudflare Pages)                   │  │
│  │   ┌──────────────────┐                                        │  │
│  │   │  BlogChat.astro  │ ◄── Chat widget component              │  │
│  │   │    (Frontend)    │                                        │  │
│  │   └────────┬─────────┘                                        │  │
│  └────────────┼──────────────────────────────────────────────────┘  │
└───────────────┼─────────────────────────────────────────────────────┘
                │ POST { query: "..." }
                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         CLOUDFLARE WORKERS                          │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │  blog-chat-worker (Proxy)                                     │  │
│  │    - CORS handling                                            │  │
│  │    - Auth token injection                                     │  │
│  └────────┬──────────────────────────────────────────────────────┘  │
└───────────┼─────────────────────────────────────────────────────────┘
            │ POST + Bearer token
            ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   CLOUDFLARE AI SEARCH (AutoRAG)                    │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐     │
│  │  Crawler   │─▶│ Vectorize  │─▶│  Reranker  │─▶│ Workers AI │     │
│  │ (Sitemap)  │  │  (BGE-M3)  │  │   (BGE)    │  │  (Qwen3)   │     │
│  └────────────┘  └────────────┘  └────────────┘  └────────────┘     │
└─────────────────────────────────────────────────────────────────────┘
```
Three components:
- Chat Widget - The frontend UI on the blog
- Proxy Worker - Handles CORS and hides the API token
- AutoRAG - The AI brain that retrieves and generates answers
Component 1: The Chat Widget
The chat interface is a single Astro component: BlogChat.astro
What it does:
- Renders a floating “Ask AI” button in the bottom-right corner
- Expands into a chat window when clicked
- Sends user questions to the proxy worker
- Displays AI responses
The core logic (simplified):
```js
const WORKER_URL = 'https://blog-chat-worker.myronkoch-dev.workers.dev';

async function askQuestion(query) {
  try {
    const response = await fetch(WORKER_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query }),
    });
    const data = await response.json();
    if (data.success && data.result?.response) {
      return data.result.response;
    }
    return 'Sorry, I encountered an error.';
  } catch {
    // Network failures and non-JSON responses land here
    return 'Sorry, I encountered an error.';
  }
}
```
No API keys in the frontend. No complex state management. Just a fetch call to the proxy.
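For context, here's a minimal sketch of how a button and input could wire into askQuestion. The element IDs and the appendMessage helper are illustrative, not the actual BlogChat.astro internals:

```js
// Illustrative wiring only -- element IDs are hypothetical, not BlogChat.astro's real markup.
const button = document.querySelector('#ask-ai-button');
const panel = document.querySelector('#chat-panel');
const input = document.querySelector('#chat-input');
const log = document.querySelector('#chat-log');

// Floating button toggles the chat window
button.addEventListener('click', () => panel.classList.toggle('open'));

// Enter sends the question and renders the answer
input.addEventListener('keydown', async (event) => {
  if (event.key !== 'Enter' || !input.value.trim()) return;
  const question = input.value.trim();
  input.value = '';
  appendMessage('user', question);
  appendMessage('ai', await askQuestion(question));
});

function appendMessage(role, text) {
  const bubble = document.createElement('div');
  bubble.className = `chat-message ${role}`;
  bubble.textContent = text; // textContent, so responses can't inject HTML
  log.appendChild(bubble);
}
```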
Component 2: The Proxy Worker
Why do we need this?
Two reasons:
- CORS - Browsers block cross-origin requests. The worker adds the right headers.
- Security - The AI Search API needs an auth token. We can’t put that in frontend code.
The worker code (about 45 lines that do everything):
```js
export default {
  async fetch(request, env) {
    // Handle CORS preflight
    if (request.method === 'OPTIONS') {
      return new Response(null, {
        headers: {
          'Access-Control-Allow-Origin': '*',
          'Access-Control-Allow-Methods': 'POST, OPTIONS',
          'Access-Control-Allow-Headers': 'Content-Type',
        },
      });
    }

    // Anything other than POST carries no query to answer
    if (request.method !== 'POST') {
      return new Response('Method Not Allowed', { status: 405 });
    }

    // Get the query from the request body
    const { query } = await request.json();

    // Call the AI Search API with auth
    const response = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${env.ACCOUNT_ID}/autorag/rags/${env.RAG_ID}/ai-search`,
      {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${env.AI_SEARCH_TOKEN}`,
        },
        body: JSON.stringify({ query }),
      }
    );
    const result = await response.json();

    // Return with CORS headers
    return new Response(JSON.stringify(result), {
      headers: {
        'Content-Type': 'application/json',
        'Access-Control-Allow-Origin': '*',
      },
    });
  },
};
```
Configuration (wrangler.toml):
name = "blog-chat-worker"
main = "src/index.js"
compatibility_date = "2024-11-12"
[vars]
ACCOUNT_ID = "your-cloudflare-account-id"
RAG_ID = "your-autorag-instance-name"
# Secret - set via: wrangler secret put AI_SEARCH_TOKEN
The AI_SEARCH_TOKEN is stored as a secret, never in code.
Component 3: AutoRAG (The Brain)
Cloudflare’s AI Search (AutoRAG) handles the hard parts:
- Crawling the sitemap
- Chunking content
- Generating embeddings
- Storing vectors
- Retrieving relevant content
- Generating answers
Setup in Cloudflare Dashboard:
1. Go to AI → AI Search
2. Create a new instance
3. Configure:
| Setting | Value |
|---|---|
| Source Type | Sitemap |
| Sitemap URL | https://operationalsemantics.dev/sitemap-index.xml |
| Embedding Model | BGE-M3 |
| Reranker | bge-reranker-base (enabled) |
| Generation Model | Qwen3 30B |
| Query Rewrite | Enabled |
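Side note: if your proxy Worker lives in the same Cloudflare account, AI Search can also be called through the Workers AI binding instead of the REST API, which skips token management entirely. A minimal sketch, assuming the AI binding is enabled in wrangler.toml and the instance name matches yours:

```js
// Sketch only -- assumes wrangler.toml declares the binding:
//   [ai]
//   binding = "AI"
export default {
  async fetch(request, env) {
    const { query } = await request.json();
    // aiSearch() runs the same rewrite → retrieve → rerank → generate pipeline
    const answer = await env.AI.autorag('your-autorag-instance-name').aiSearch({ query });
    return Response.json(answer);
  },
};
```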
What happens when someone asks a question:
1. Query Rewrite - The model reformulates the question for better retrieval
2. Embedding - The question becomes a vector using BGE-M3
3. Retrieval - Vectorize finds similar content chunks
4. Reranking - The BGE reranker scores and sorts results
5. Generation - Qwen3 30B generates an answer using retrieved context
All of this happens in ~2-3 seconds.
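To make step 3 concrete: retrieval is just nearest-neighbor search over embedding vectors. A toy sketch of the scoring (my illustration, not AutoRAG's internals):

```js
// Toy illustration of vector retrieval -- not AutoRAG's actual code.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// chunks: [{ text, vector }], where vector is the 1024-dim BGE-M3 embedding
function retrieveTopK(queryVector, chunks, k = 5) {
  return chunks
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```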
The Query Flow in Detail
User: "How does the MCP Factory work?"
│
▼
┌─────────────────────────────────────────┐
│ 1. QUERY REWRITE │
│ "How does the MCP Factory work?" │
│ → "MCP Factory architecture process │
│ blockchain server generation" │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 2. EMBEDDING (BGE-M3) │
│ Query → 1024-dim vector │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 3. RETRIEVAL (Vectorize) │
│ Find top-k similar content chunks │
│ Returns: Post 11, 12, 13 chunks │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 4. RERANKING (BGE-reranker) │
│ Score each chunk against query │
│ Reorder by relevance │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ 5. GENERATION (Qwen3 30B) │
│ Context: [retrieved chunks] │
│ Question: [user query] │
│ → Answer with citations │
└─────────────────────────────────────────┘
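Step 5 is ordinary prompt assembly: retrieved chunks become context, and the model is told to answer from them. Roughly like this (illustrative wording; AutoRAG's actual prompt template is internal):

```js
// Illustrative prompt assembly -- AutoRAG's real template is not public.
function buildPrompt(chunks, query) {
  // chunks use the shape from the API response below: { filename, content: [...] }
  const context = chunks
    .map((chunk, i) => `[${i + 1}] ${chunk.filename}\n${chunk.content.join('\n')}`)
    .join('\n\n');
  return `Answer the question using only the context below. Cite sources by number.

Context:
${context}

Question: ${query}`;
}
```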
Model Selection Deep Dive
I tested three models for generation:
| Model | Accuracy | Style | Speed |
|---|---|---|---|
| Llama 3.3 70B | Good | Verbose | Medium |
| Llama 4 Scout 17B | Poor | Very verbose | Fast |
| Qwen3 30B | Best | Concise | Medium |
Qwen3 30B won because:
- Answers are confident and direct
- Doesn’t hallucinate when context is missing
- Better at synthesizing information from multiple chunks
The model choice matters more than you’d expect. Same retrieval, vastly different answer quality.
The Recursive Beauty
Here’s where it gets interesting:
This blog post explains how the chatbot works. The chatbot indexes this blog post. When you ask “how does the chatbot work?”, the chatbot retrieves this post to answer you.
The documentation is self-referential:
- The system explains itself using itself
- Updates to the docs immediately update the chatbot’s knowledge
- You can verify the explanation by asking the chatbot
Try it: Ask the chatbot “how does the blog chatbot work?” and see if it references this post.
API Reference
Request
```sh
curl -X POST https://blog-chat-worker.myronkoch-dev.workers.dev \
  -H "Content-Type: application/json" \
  -d '{"query": "What is MCP Factory?"}'
```
Success Response
```json
{
  "success": true,
  "result": {
    "response": "The MCP Factory is a server that generates other MCP servers...",
    "data": [
      {
        "file_id": "...",
        "filename": "https://operationalsemantics.dev/posts/12-mcp-factory-fantom-success/",
        "score": 0.624,
        "content": ["...relevant chunks..."]
      }
    ]
  }
}
```
Error Response
```json
{
  "success": false,
  "errors": [
    { "code": 7002, "message": "ai_search_not_found" }
  ]
}
```
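If you're consuming the API from your own code, both shapes are easy to branch on. A small helper (my convention, not part of the API) that reuses WORKER_URL from the widget snippet and surfaces the answer along with its sources:

```js
// Unwraps the worker's response -- throws on the error shape, returns answer + sources otherwise.
async function askWithSources(query) {
  const response = await fetch(WORKER_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query }),
  });
  const data = await response.json();
  if (!data.success) {
    const [error] = data.errors ?? [];
    throw new Error(error ? `${error.code}: ${error.message}` : 'Unknown AI Search error');
  }
  return {
    answer: data.result.response,
    // Each source carries the post URL (filename) and a relevance score
    sources: (data.result.data ?? []).map(({ filename, score }) => ({ filename, score })),
  };
}
```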
Indexing New Content
When I publish a new post:
1. Git push triggers a Cloudflare Pages deploy
2. A new sitemap generates with the new post's URL
3. AutoRAG's scheduled sync detects the new URL
4. The crawler fetches and chunks the new content
5. Embeddings generate and store in Vectorize
6. The chatbot can now answer questions about the new post

Time from publish to searchable: ~5-10 minutes (scheduled sync) or immediate (manual sync).
Troubleshooting
“Sorry, I encountered an error”
Test the worker directly:
```sh
curl -s -X POST https://blog-chat-worker.myronkoch-dev.workers.dev \
  -H "Content-Type: application/json" \
  -d '{"query": "test"}' | jq
```
Common issues:
- ai_search_not_found → the Worker's RAG_ID doesn't match the instance name
- unauthorized → AI_SEARCH_TOKEN is wrong or expired
- CORS errors → the Worker isn't deployed or the URL is wrong
Chatbot gives irrelevant answers
- Check if content is indexed: AI Search dashboard → Data tab
- Verify sitemap is current: curl https://operationalsemantics.dev/sitemap-0.xml
- Try a manual sync: AI Search dashboard → Jobs tab → Sync
New posts aren’t searchable
- Verify sitemap includes new post
- Wait for scheduled sync OR trigger manual sync
- Check AI Search metrics for indexing errors
Cost
All on Cloudflare’s free tier:
| Component | Free Tier |
|---|---|
| Workers | 100k requests/day |
| Vectorize | 5M vector dimensions |
| Workers AI | Generous limits |
| AI Search | Beta (currently free) |
For a blog with moderate traffic, you won’t hit limits.
The Complete Code
- Chat Widget: [BlogChat.astro on GitHub]
- Proxy Worker: [blog-chat-worker on GitHub]
- AutoRAG Config: Cloudflare Dashboard
The whole system is ~100 lines of code plus configuration. Most of the complexity is handled by Cloudflare’s managed services.
Summary
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | Astro component | Chat UI |
| Proxy | Cloudflare Worker | CORS + Auth |
| Retrieval | Vectorize + BGE-M3 | Find relevant content |
| Reranking | BGE-reranker | Sort by relevance |
| Generation | Qwen3 30B | Generate answers |
| Indexing | AutoRAG crawler | Keep knowledge current |
Ongoing maintenance: Near zero. Monthly cost: $0.
Next up: AI-Powered Cross-Posting - How AI publishes to Substack while I drink coffee.