Lessons and Gotchas: Everything That Broke Along the Way

This series makes the pipeline sound smooth. It wasn’t. Here’s everything that broke and how we fixed it.

Gotcha 1: The Domain Change Disaster

What happened:

The blog started at myronkoch.dev. Later, it moved to operationalsemantics.dev. Simple DNS change, right?

Wrong.

The cascade:

1. Changed the domain in DNS
2. Blog loaded fine at new URL
3. Chatbot stopped working
4. AI Search returned zero results

The root cause:

astro.config.mjs still had the old domain:

// astro.config.mjs
site: 'https://myronkoch.dev',  // ← WRONG

This meant:

Sitemap generated with old domain URLs
AutoRAG indexed the old URLs
New domain didn’t match indexed content
Chatbot retrieved nothing

The fix:

1. Update astro.config.mjs:

site: 'https://operationalsemantics.dev',

2. Rebuild the site
3. Update AI Search sitemap source URL
4. Trigger a fresh sync
5. Wait for re-indexing
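
Before rebuilding, it can help to scan every config file for the old domain. The following is a minimal sketch, not part of the original pipeline: findStaleDomainRefs is a hypothetical helper, and file contents are passed in as strings so the sketch makes no assumption about where your configs live.

```javascript
// Hypothetical helper: report every line in the given files that still
// mentions the old domain. `files` maps a file name to its text content.
function findStaleDomainRefs(files, oldDomain) {
  const hits = [];
  for (const [name, text] of Object.entries(files)) {
    text.split("\n").forEach((line, index) => {
      if (line.includes(oldDomain)) {
        hits.push({ file: name, line: index + 1, text: line.trim() });
      }
    });
  }
  return hits;
}

// Example input mirroring the stale config from this gotcha.
const staleRefs = findStaleDomainRefs(
  {
    "astro.config.mjs": "export default {\n  site: 'https://myronkoch.dev',\n};",
    "wrangler.toml": 'RAG_ID = "shrill-shadow-9de0"',
  },
  "myronkoch.dev"
);
console.log(staleRefs);
```

An empty result means no file in the set still points at the old domain. In practice you would read the files from disk and run the check before each deploy.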

Lesson: When you change domains, check EVERY config file. The sitemap URL matters more than you think.

Gotcha 2: The RAG Instance ID Mismatch

What happened:

After recreating the AI Search instance, the chatbot returned:

{ "success": false, "errors": [{ "code": 7002, "message": "ai_search_not_found" }] }

The cause:

Each AI Search instance has a unique name (e.g., shrill-shadow-9de0). The proxy worker had the OLD instance name hardcoded:

# wrangler.toml
RAG_ID = "broad-hall-f92a"  # ← Old instance, deleted

The fix:

1. Check current instance name in Cloudflare dashboard
2. Update wrangler.toml:

RAG_ID = "shrill-shadow-9de0"  # ← New instance

3. Redeploy: npx wrangler deploy
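
To make this failure easier to diagnose next time, the proxy could translate the raw error into something actionable. This is a sketch under assumptions: explainAiSearchError is a hypothetical helper, the response shape is copied from the error shown above, and the message wording is ours.

```javascript
// Hypothetical helper: turn an AI Search error body into an actionable
// message. The 7002 code comes from the error shown in this gotcha; `ragId`
// is whatever the worker read from its RAG_ID variable.
function explainAiSearchError(body, ragId) {
  if (body.success) return null;
  const errors = body.errors ?? [];
  if (errors.some((e) => e.code === 7002)) {
    return (
      `AI Search instance "${ragId}" was not found. ` +
      "Check RAG_ID in wrangler.toml against the dashboard, then redeploy."
    );
  }
  // Any other error: pass the codes and messages through unchanged.
  return errors.map((e) => `${e.code}: ${e.message}`).join("; ") || "Unknown error";
}

const message = explainAiSearchError(
  { success: false, errors: [{ code: 7002, message: "ai_search_not_found" }] },
  "broad-hall-f92a"
);
console.log(message);
```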

Lesson: When you recreate a managed service, update all references to its ID.

Gotcha 3: Model Selection Matters More Than Expected

What happened:

Early chatbot answers were verbose, vague, and sometimes wrong. Example:

User: "What is MCP Factory?"

Bot: "Based on the context provided, it appears that MCP Factory
     might be related to some kind of server generation system,
     though I cannot be entirely certain without more information..."

Hedging. Uncertainty. Useless.

The cause:

We were using llama-4-scout-17b for generation. It’s newer but more verbose and less accurate for this use case.

The fix:

Switched to qwen3-30b-a3b-fp8:

User: "What is MCP Factory?"

Bot: "The MCP Factory is a server that generates other MCP servers.
     Given a blockchain configuration, it produces a complete,
     tested MCP server in about 8 seconds..."

Direct. Confident. Accurate.
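
A cheap way to compare models on the same question is to count hedging phrases in each answer. The sketch below is illustrative only: the phrase list and function names are our assumptions, and the score measures hedging, not accuracy, so it complements a manual review and does not replace one.

```javascript
// Illustrative list of hedging phrases; tune it for your own content.
const HEDGES = [
  "it appears",
  "might be",
  "some kind of",
  "cannot be entirely certain",
  "without more information",
];

// Count how many hedging phrases occur in an answer (case-insensitive).
function hedgingScore(answer) {
  const text = answer.toLowerCase().replace(/\s+/g, " ");
  return HEDGES.filter((phrase) => text.includes(phrase)).length;
}

// Rank candidate answers to the same question, least hedged first.
function rankByHedging(answersByModel) {
  return Object.entries(answersByModel)
    .map(([model, answer]) => ({ model, score: hedgingScore(answer) }))
    .sort((a, b) => a.score - b.score);
}

// The two answers quoted in this gotcha.
const ranking = rankByHedging({
  "llama-4-scout-17b":
    "Based on the context provided, it appears that MCP Factory " +
    "might be related to some kind of server generation system, " +
    "though I cannot be entirely certain without more information...",
  "qwen3-30b-a3b-fp8":
    "The MCP Factory is a server that generates other MCP servers. " +
    "Given a blockchain configuration, it produces a complete, " +
    "tested MCP server in about 8 seconds...",
});
console.log(ranking);
```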

Lesson: Test multiple models. Newer isn’t always better. Match the model to your use case.

Gotcha 4: Browser Automation Fragility

What happened:

The Substack publishing automation worked perfectly… until it didn’t.

Common failures:

Element refs changed between page loads
Dialogs appeared unexpectedly
Uploads stuck at 99%
Editor reformatted content unpredictably

Example failure:

PAI: Clicking "Publish" button...
[Screenshot shows a CAPTCHA dialog]
PAI: Element not found. Retrying...
[Infinite retry loop]

The fix:

Screenshot verification - Take a screenshot after every action. Check before proceeding.
Element re-finding - Don’t cache element refs. Re-find them before each interaction.
Human fallback - When automation detects something unexpected (CAPTCHA, auth dialog), stop and ask for help.
Graceful degradation - Do what automation can, leave the rest for manual completion.
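
The four rules above can be combined into one small wrapper. This is a sketch, not the code PAI runs: the driver object and its three methods (detectBlocker, findElement, click) are a hypothetical interface standing in for whatever browser automation tool you use.

```javascript
// Hypothetical driver interface: `detectBlocker` inspects a fresh screenshot
// and returns a description (e.g. "CAPTCHA") or null, `findElement` re-locates
// the target on every attempt, and `click` acts on the element it is given.
async function clickWithFallback(driver, label, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const blocker = await driver.detectBlocker();
    if (blocker) {
      // Something unexpected is on screen: stop and hand over to a human.
      return { status: "needs_human", reason: blocker, attempts: attempt - 1 };
    }
    const element = await driver.findElement(label); // never reuse a cached ref
    if (element) {
      await driver.click(element);
      return { status: "clicked", attempts: attempt };
    }
  }
  // Bounded retries: give up instead of looping forever.
  return { status: "needs_human", reason: `"${label}" not found`, attempts: maxAttempts };
}
```

The key design choice is that every exit path returns a status. The CAPTCHA case from the example failure ends in "needs_human" on the first check, where the original flow retried forever.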

Lesson: Browser automation works roughly 90% of the time (our Substack flow is closer to 85%). Build for the failures.

Gotcha 5: The ProseMirror Paste Problem

What happened:

Pasting raw markdown into Substack’s editor produced garbage. Headers became plain text. Code blocks became inline code. Lists became paragraphs.

The cause:

Substack uses ProseMirror, which interprets clipboard content as rich text. It doesn’t know what to do with markdown.

The fix:

Convert markdown to HTML first:

import { marked } from 'marked';
const html = marked.parse(markdown, { gfm: true });

Then copy the HTML to clipboard and paste. ProseMirror converts HTML to rich text correctly.

Lesson: Know your target editor. Match your content format to what it expects.

Gotcha 6: The Native File Picker

What happened:

Browser automation couldn’t handle file uploads. The file picker is an OS-level dialog, not a DOM element.

Attempts that failed:

Clicking the hidden file input
Setting the input value programmatically
Drag-and-drop simulation

The fix:

AppleScript to drive the native file picker:

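osascript -e '
  tell application "System Events"
    -- Cmd+Shift+G opens the "Go to Folder" field in the macOS file picker
    keystroke "g" using {command down, shift down}
    delay 0.5
    keystroke "/path/to/file.png"
    -- First Return confirms the path, second Return confirms the selection
    keystroke return
    delay 0.5
    keystroke return
  end tell
'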

Hacky, but it works.

Lesson: When browser automation can’t reach something, look for OS-level workarounds.

Gotcha 7: The Silent Substack Default

What happened:

Published a batch of posts. Subscribers got 15 emails in an hour.

The cause:

Substack defaults to “Send to subscribers” for every publish. We didn’t uncheck it.

The fix:

In the automation flow, explicitly:

1. Navigate to audience settings
2. Verify “Everyone” is selected (not “Paid only”)
3. UNCHECK “Send as email” for silent drops
4. Only enable email for intentional announcements
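
One way to make these checks explicit is to compare what the automation reads off the page against the intended settings before it clicks Publish. This is a sketch under assumptions: the key names (audience, sendEmail) are illustrative and are not Substack identifiers, and publishSettingMismatches is a hypothetical helper.

```javascript
// Intended settings for a silent drop. The keys are illustrative names:
// map them to whatever your automation reads off the publish dialog.
const SILENT_DROP = { audience: "Everyone", sendEmail: false };

// Return a list of mismatches between what the page shows and what we intend.
// An empty list means it is safe to click Publish.
function publishSettingMismatches(observed, intended = SILENT_DROP) {
  return Object.keys(intended)
    .filter((key) => observed[key] !== intended[key])
    .map((key) => `${key}: expected ${intended[key]}, found ${observed[key]}`);
}

// The platform default that caused the email blast.
const problems = publishSettingMismatches({ audience: "Everyone", sendEmail: true });
console.log(problems);
```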

Lesson: Know your platform’s defaults. Automate around them explicitly.

Gotcha 8: OAuth Token Expiry

What happened:

Browser automation suddenly stopped with “Please log in” screens appearing mid-flow.

The cause:

Substack’s auth token expired. The automation expected a logged-in state that no longer existed.

The fix:

1. Detect login screens in screenshots
2. Pause automation
3. Alert user: “Please log in to Substack”
4. Wait for confirmation
5. Resume from last checkpoint
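
The pause-and-resume loop might look like the sketch below. The isLoginScreen and waitForLogin callbacks are hypothetical hooks: the first would wrap the screenshot check, the second would alert the user and resolve once they confirm. The sketch only checks for a login screen between steps, so each step should be small.

```javascript
// Run steps in order, starting at `startAt`. Before each step, check whether
// the session has dropped to a login screen; if so, wait for the human and
// then continue from the same step.
async function runWithCheckpoints(steps, { isLoginScreen, waitForLogin, startAt = 0 }) {
  let checkpoint = startAt;
  while (checkpoint < steps.length) {
    if (await isLoginScreen()) {
      await waitForLogin("Please log in to Substack"); // resolves on confirmation
      continue; // re-check the session, then retry the same step
    }
    await steps[checkpoint]();
    checkpoint += 1; // step finished, advance the checkpoint
  }
  return checkpoint;
}
```

Persisting the returned checkpoint would let a later session pass it back in as startAt instead of starting over.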

Lesson: Auth expires. Build detection and recovery into your automation.

Gotcha 9: Sitemap Shows 0 Links

What happened:

AI Search reported “0 links found” after configuring the sitemap source.

Debugging steps:

# Check sitemap is accessible
curl https://operationalsemantics.dev/sitemap-index.xml

# Check it references the right domain
curl https://operationalsemantics.dev/sitemap-0.xml | head -10

# Check robots.txt points to sitemap
curl https://operationalsemantics.dev/robots.txt

The cause:

The sitemap-index.xml referenced sitemap-0.xml with the wrong domain in the <loc> tag.

The fix:

Same as Gotcha 1 - update site in astro.config.mjs and rebuild.
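
The manual curl checks can also be scripted. The sketch below pulls every <loc> URL out of a sitemap and flags any whose host is not the one you expect. The function names and the sample XML are ours, written to mirror the failure described above. In practice you would fetch the real sitemap-index.xml and pass its text in.

```javascript
// Extract every <loc> URL from sitemap XML. A regex is enough for the flat
// structure of sitemap files; use a real XML parser for anything richer.
function extractLocs(xml) {
  return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((m) => m[1]);
}

// Return the <loc> URLs whose host differs from the expected one.
function locsWithWrongHost(xml, expectedHost) {
  return extractLocs(xml).filter((loc) => new URL(loc).hostname !== expectedHost);
}

// Sample index mirroring the bug: the index is served from the new domain
// but still points at the old one.
const sampleIndex = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://myronkoch.dev/sitemap-0.xml</loc></sitemap>
</sitemapindex>`;

const wrongLocs = locsWithWrongHost(sampleIndex, "operationalsemantics.dev");
console.log(wrongLocs);
```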

Lesson: When crawlers find nothing, trace the URL chain manually.

Gotcha 10: Context Window Exhaustion

What happened:

Large automation tasks (publishing 10 posts) would fail mid-way with degraded responses.

The cause:

Each screenshot, each page read, each tool call consumes context. Long sessions exhaust the window.

The fix:

Batch strategically - Split large jobs into small batches instead of pushing 10 or 20 posts through one session
Clear context - Start fresh sessions for independent tasks
Delegate - Hand off bulk operations to Perplexity or other agents
Minimize screenshots - Only screenshot when verification is needed
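
Batching is simple to sketch. The helper below splits a list of posts into fixed-size groups so that each group can run in a fresh session. The name toBatches and the batch size of 4 are arbitrary choices for illustration, and the right size depends on how much context each post consumes.

```javascript
// Split a list of posts into batches so each batch can run in a fresh session.
function toBatches(items, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Ten placeholder posts split into batches of four.
const posts = Array.from({ length: 10 }, (_, i) => `post-${i + 1}`);
const batches = toBatches(posts, 4);
console.log(batches.map((batch) => batch.length));
```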

Lesson: Context is finite. Design workflows that fit within limits.

Meta-Lesson: Document Your Fixes

Every fix in this post was documented when it happened. That’s why this post exists.

When you fix something at 2 AM, write it down:

What broke
What caused it
How you fixed it

Future you (or your AI assistant) will thank you.

What I’d Do Differently

1. Start with the domain

Don’t use a placeholder domain. Pick the real one from day one. Domain changes cascade everywhere.

2. Test models early

Don’t assume the newest model is best. Benchmark with real queries before committing.

3. Build human fallback first

Before automating, have a clear manual process. Automation should enhance it, not replace it entirely.

4. Version your configs

Put wrangler.toml, astro.config.mjs, and other configs in version control. Track changes. You’ll need the history.

5. Expect 10% failure

Browser automation, API calls, external services - all fail sometimes. Build for graceful degradation from the start.

The Current State

After all these fixes:

Component	Status	Reliability
Blog deploys	Stable	99.9%
Sitemap generation	Stable	100%
AutoRAG indexing	Stable	98%
Chatbot responses	Good	95%
Substack automation	Flaky	85%
Image generation	Variable	80%

The core pipeline (write → deploy → index) is solid. Cross-posting automation is usable but needs babysitting.

Good enough to be useful. Room to improve.

Summary

Gotcha	Fix	Prevention
Domain change	Update all configs	Pick domain first
RAG ID mismatch	Update wrangler.toml	Version control configs
Bad model choice	Switch to Qwen3	Benchmark early
Browser fragility	Screenshots + fallback	Expect failures
ProseMirror paste	Convert to HTML	Know your editor
File picker	AppleScript	OS-level workarounds
Email blast	Uncheck explicitly	Know defaults
Auth expiry	Detect and pause	Build recovery
Empty sitemap	Fix domain in config	Trace URLs manually
Context exhaustion	Batch and delegate	Design for limits

Building in public means publishing the failures, not just the wins.

This concludes the Blog Meta-Tutorial series. The full pipeline is documented. The code is running. The chatbot can answer questions about how it was built.

What’s next? More posts. More automation. More things to break and fix.
