#45 Part 7 2026-01-16 12 min

Lessons and Gotchas: Everything That Broke Along the Way

Real debugging stories from building the blog automation pipeline - domain changes, RAG mismatches, browser automation failures, and more

This series makes the pipeline sound smooth. It wasn’t. Here’s everything that broke and how we fixed it.


Gotcha 1: The Domain Change Disaster

What happened:

The blog started at myronkoch.dev. Later, it moved to operationalsemantics.dev. Simple DNS change, right?

Wrong.

The cascade:

  1. Changed the domain in DNS
  2. Blog loaded fine at new URL
  3. Chatbot stopped working
  4. AI Search returned zero results

The root cause:

astro.config.mjs still had the old domain:

// astro.config.mjs
site: 'https://myronkoch.dev',  // ← WRONG

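This meant the sitemap Astro generated still listed every URL under the old domain, so AI Search had nothing valid to crawl and the chatbot had nothing to retrieve.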

The fix:

  1. Update astro.config.mjs:
site: 'https://operationalsemantics.dev',
  2. Rebuild the site
  3. Update AI Search sitemap source URL
  4. Trigger a fresh sync
  5. Wait for re-indexing
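
Before rebuilding, it can help to confirm that no other file still mentions the old domain. The following is a minimal sketch, not part of the actual pipeline: `findStaleDomainRefs` and the in-memory file map are hypothetical names chosen for illustration, and in practice the contents would be read from disk.

```typescript
// Hypothetical helper: report every line that still mentions the old domain.
interface StaleRef {
  file: string;
  line: number; // 1-based line number
  text: string;
}

function findStaleDomainRefs(
  files: Record<string, string>, // file name -> file contents
  oldDomain: string,
): StaleRef[] {
  const refs: StaleRef[] = [];
  for (const file of Object.keys(files)) {
    files[file].split("\n").forEach((text, index) => {
      if (text.indexOf(oldDomain) !== -1) {
        refs.push({ file, line: index + 1, text: text.trim() });
      }
    });
  }
  return refs;
}
```

An empty result for `myronkoch.dev` across astro.config.mjs, wrangler.toml, and any other config is a reasonable precondition for the rebuild step.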

Lesson: When you change domains, check EVERY config file. The sitemap URL matters more than you think.


Gotcha 2: The RAG Instance ID Mismatch

What happened:

After recreating the AI Search instance, the chatbot returned:

{ "success": false, "errors": [{ "code": 7002, "message": "ai_search_not_found" }] }

The cause:

Each AI Search instance has a unique name (e.g., shrill-shadow-9de0). The proxy worker had the OLD instance name hardcoded:

# wrangler.toml
RAG_ID = "broad-hall-f92a"  # ← Old instance, deleted

The fix:

  1. Check current instance name in Cloudflare dashboard
  2. Update wrangler.toml:
RAG_ID = "shrill-shadow-9de0"  # ← New instance
  3. Redeploy: npx wrangler deploy
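
To make this failure easier to spot next time, the proxy worker could turn the raw error into a message that names the stale ID. This is a sketch under assumptions: the interface only models the fields visible in the error body above, and `isInstanceMissing` and `describeFailure` are hypothetical helpers, not part of any Cloudflare API.

```typescript
// Shape of the error body shown above; any other fields are ignored.
interface AiSearchErrorBody {
  success: boolean;
  errors?: { code: number; message: string }[];
}

// Returns true when the response says the configured instance does not exist.
function isInstanceMissing(body: AiSearchErrorBody): boolean {
  if (body.success || !body.errors) return false;
  return body.errors.some((e) => e.message === "ai_search_not_found");
}

// Hypothetical log message that points straight at the config to fix.
function describeFailure(body: AiSearchErrorBody, ragId: string): string {
  return isInstanceMissing(body)
    ? `AI Search instance "${ragId}" not found. Check RAG_ID in wrangler.toml.`
    : "AI Search request failed for another reason.";
}
```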

Lesson: When you recreate a managed service, update all references to its ID.


Gotcha 3: Model Selection Matters More Than Expected

What happened:

Early chatbot answers were verbose, vague, and sometimes wrong. Example:

User: "What is MCP Factory?"

Bot: "Based on the context provided, it appears that MCP Factory
     might be related to some kind of server generation system,
     though I cannot be entirely certain without more information..."

Hedging. Uncertainty. Useless.

The cause:

We were using llama-4-scout-17b for generation. It’s newer but more verbose and less accurate for this use case.

The fix:

Switched to qwen3-30b-a3b-fp8:

User: "What is MCP Factory?"

Bot: "The MCP Factory is a server that generates other MCP servers.
     Given a blockchain configuration, it produces a complete,
     tested MCP server in about 8 seconds..."

Direct. Confident. Accurate.
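
A crude way to compare models on this one symptom is to count hedging phrases in their answers to the same queries. The sketch below is illustrative only: the phrase list is taken from the bad answer above, `hedgeScore` and `leastHedgingModel` are hypothetical names, and the score says nothing about accuracy, which still needs a human check.

```typescript
// Illustrative phrase list; tune it for your own content.
const HEDGE_PHRASES = [
  "it appears",
  "might be",
  "cannot be entirely certain",
  "without more information",
];

// Counts how many distinct hedging phrases appear in an answer (case-insensitive).
function hedgeScore(answer: string): number {
  const normalized = answer.toLowerCase().replace(/\s+/g, " ");
  return HEDGE_PHRASES.filter((p) => normalized.indexOf(p) !== -1).length;
}

// Picks the model whose answers hedge least across all test queries.
function leastHedgingModel(answersByModel: Record<string, string[]>): string {
  let best = "";
  let bestScore = Infinity;
  for (const model of Object.keys(answersByModel)) {
    const total = answersByModel[model].reduce((sum, a) => sum + hedgeScore(a), 0);
    if (total < bestScore) {
      best = model;
      bestScore = total;
    }
  }
  return best;
}
```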

Lesson: Test multiple models. Newer isn’t always better. Match the model to your use case.


Gotcha 4: Browser Automation Fragility

What happened:

The Substack publishing automation worked perfectly… until it didn’t.

Common failures:

Example failure:

PAI: Clicking "Publish" button...
[Screenshot shows a CAPTCHA dialog]
PAI: Element not found. Retrying...
[Infinite retry loop]

The fix:

  1. Screenshot verification - Take a screenshot after every action. Check before proceeding.

  2. Element re-finding - Don’t cache element refs. Re-find them before each interaction.

  3. Human fallback - When automation detects something unexpected (CAPTCHA, auth dialog), stop and ask for help.

  4. Graceful degradation - Do what automation can, leave the rest for manual completion.
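
The four rules above boil down to one decision after every action: proceed, retry, or hand over to a human. Here is a minimal sketch of that decision, assuming the screenshot has already been classified into one of a few observations; the `Observation` categories, the function name, and the default retry budget of 3 are all assumptions for illustration.

```typescript
// What the automation believes it sees in the latest screenshot.
// How a screenshot gets classified is outside this sketch.
type Observation = "expected" | "element_missing" | "captcha" | "login_screen";

type NextAction = "proceed" | "retry" | "ask_human";

// Decides what to do after each action. Retries are bounded, so a CAPTCHA
// or a missing element can never cause an infinite retry loop.
function decideNextAction(
  observation: Observation,
  attempt: number, // 1-based count of attempts made so far
  maxAttempts: number = 3,
): NextAction {
  if (observation === "expected") return "proceed";
  if (observation === "captcha" || observation === "login_screen") {
    return "ask_human"; // retrying cannot fix these
  }
  // Element missing: re-find it and retry until the budget runs out.
  return attempt < maxAttempts ? "retry" : "ask_human";
}
```

In the example failure above, the CAPTCHA dialog would map to `ask_human` on the first attempt instead of triggering retries.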

Lesson: Browser automation is 90% reliable. Build for the 10% failure case.


Gotcha 5: The ProseMirror Paste Problem

What happened:

Pasting raw markdown into Substack’s editor produced garbage. Headers became plain text. Code blocks became inline code. Lists became paragraphs.

The cause:

Substack uses ProseMirror, which interprets clipboard content as rich text. It doesn’t know what to do with markdown.

The fix:

Convert markdown to HTML first:

import { marked } from 'marked';
const html = marked.parse(markdown, { gfm: true });  // GitHub-flavored markdown → HTML string

Then copy the HTML to clipboard and paste. ProseMirror converts HTML to rich text correctly.

Lesson: Know your target editor. Match your content format to what it expects.


Gotcha 6: The Native File Picker

What happened:

Browser automation couldn’t handle file uploads. The file picker is an OS-level dialog, not a DOM element.

Attempts that failed:

The fix:

AppleScript to drive the native file picker:

osascript -e '
  tell application "System Events"
    keystroke "g" using {command down, shift down}
    delay 0.5
    keystroke "/path/to/file.png"
    keystroke return
    delay 0.5
    keystroke return
  end tell
'

Hacky, but it works.
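
The script above hardcodes the path, and a path containing a double quote or backslash would break the AppleScript string. One possible way to generate the same keystroke sequence for arbitrary paths is sketched below; both function names are hypothetical, and the delays are simply copied from the script above.

```typescript
// Escapes backslashes and double quotes so the value is a valid
// AppleScript string literal.
function escapeAppleScriptString(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Builds the same keystroke sequence as the script above for any file path.
function buildFilePickerScript(filePath: string): string {
  return [
    'tell application "System Events"',
    '  keystroke "g" using {command down, shift down}',
    "  delay 0.5",
    `  keystroke "${escapeAppleScriptString(filePath)}"`,
    "  keystroke return",
    "  delay 0.5",
    "  keystroke return",
    "end tell",
  ].join("\n");
}
```

Passing the result to `osascript -e` as a single argument (for example through Node's `child_process.execFile`) avoids a second layer of shell quoting.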

Lesson: When browser automation can’t reach something, look for OS-level workarounds.


Gotcha 7: The Silent Substack Default

What happened:

Published a batch of posts. Subscribers got 15 emails in an hour.

The cause:

Substack defaults to “Send to subscribers” for every publish. We didn’t uncheck it.

The fix:

In the automation flow, explicitly:

  1. Navigate to audience settings
  2. Verify “Everyone” is selected (not “Paid only”)
  3. UNCHECK “Send as email” for silent drops
  4. Only enable email for intentional announcements
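
A guard like the following could run just before the final click, so the platform default never goes through unnoticed. It is a sketch under assumptions: `PublishDialogState` is a hypothetical shape for whatever the automation reads back from the dialog, not a Substack API.

```typescript
// State of the publish dialog as read back from the page (hypothetical shape).
interface PublishDialogState {
  audience: "everyone" | "paid_only";
  sendEmail: boolean;
}

// Returns a list of problems; an empty list means it is safe to click Publish.
function checkPublishSettings(
  state: PublishDialogState,
  announce: boolean, // true only for posts that should email subscribers
): string[] {
  const problems: string[] = [];
  if (state.audience !== "everyone") {
    problems.push("audience is not set to Everyone");
  }
  if (state.sendEmail && !announce) {
    problems.push("email is still enabled for a silent drop");
  }
  if (!state.sendEmail && announce) {
    problems.push("email is disabled for an announcement");
  }
  return problems;
}
```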

Lesson: Know your platform’s defaults. Automate around them explicitly.


Gotcha 8: OAuth Token Expiry

What happened:

Browser automation suddenly stopped with “Please log in” screens appearing mid-flow.

The cause:

Substack’s auth token expired. The automation expected a logged-in state that no longer existed.

The fix:

  1. Detect login screens in screenshots
  2. Pause automation
  3. Alert user: “Please log in to Substack”
  4. Wait for confirmation
  5. Resume from last checkpoint
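
The pause-and-resume flow can be modeled as a runner that remembers which step to run next. The sketch below is synchronous for brevity (real browser steps would be asynchronous), and the `Step` shape and `runFromCheckpoint` name are assumptions for illustration.

```typescript
type StepResult = "ok" | "login_required";

interface Step {
  label: string;
  run: () => StepResult;
}

interface RunOutcome {
  finished: boolean;
  checkpoint: number; // index of the next step to run
}

// Runs steps starting at `checkpoint`. Stops at the first step that hits a
// login screen and reports where to resume once the user has logged in.
function runFromCheckpoint(steps: Step[], checkpoint: number = 0): RunOutcome {
  for (let i = checkpoint; i < steps.length; i++) {
    if (steps[i].run() === "login_required") {
      return { finished: false, checkpoint: i };
    }
  }
  return { finished: true, checkpoint: steps.length };
}
```

After the user confirms they are logged in, calling `runFromCheckpoint(steps, outcome.checkpoint)` retries the interrupted step without repeating the ones that already succeeded.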

Lesson: Auth expires. Build detection and recovery into your automation.


Gotcha 9: The Empty Sitemap

What happened:

AI Search reported “0 links found” after configuring the sitemap source.

Debugging steps:

# Check sitemap is accessible
curl https://operationalsemantics.dev/sitemap-index.xml

# Check it references the right domain
curl https://operationalsemantics.dev/sitemap-0.xml | head -10

# Check robots.txt points to sitemap
curl https://operationalsemantics.dev/robots.txt

The cause:

The sitemap-index.xml referenced sitemap-0.xml with the wrong domain in the <loc> tag.

The fix:

Same as Gotcha 1 - update site in astro.config.mjs and rebuild.
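
The manual curl checks can also be automated by pulling every `<loc>` out of the sitemap and comparing hostnames. This is a minimal sketch with hypothetical function names; it uses a regular expression rather than a real XML parser, which is enough for the simple sitemap files involved here but not a general solution.

```typescript
// Pulls every <loc> URL out of a sitemap or sitemap index document.
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Returns the <loc> URLs whose hostname differs from the expected domain.
// Note: new URL() throws if a <loc> value is not an absolute URL.
function findForeignLocs(xml: string, expectedHost: string): string[] {
  return extractLocs(xml).filter((loc) => new URL(loc).hostname !== expectedHost);
}
```

Running `findForeignLocs` on the body of sitemap-index.xml with `operationalsemantics.dev` as the expected host would have surfaced the stale `<loc>` directly.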

Lesson: When crawlers find nothing, trace the URL chain manually.


Gotcha 10: Context Window Exhaustion

What happened:

Large automation tasks (publishing 10 posts) would fail mid-way with degraded responses.

The cause:

Each screenshot, each page read, each tool call consumes context. Long sessions exhaust the window.

The fix:

  1. Batch strategically - Don’t do 20 posts in one session
  2. Clear context - Start fresh sessions for independent tasks
  3. Delegate - Hand off bulk operations to Perplexity or other agents
  4. Minimize screenshots - Only screenshot when verification is needed
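
Strategic batching can be as simple as splitting the work list into fixed-size groups and starting a fresh session for each. A minimal sketch follows; `batchPosts` is a hypothetical name, and the right batch size is an assumption you would tune against your own context limits.

```typescript
// Splits a list of posts into fixed-size batches, one batch per fresh session.
function batchPosts<T>(posts: T[], batchSize: number): T[][] {
  if (batchSize < 1 || Math.floor(batchSize) !== batchSize) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < posts.length; i += batchSize) {
    batches.push(posts.slice(i, i + batchSize));
  }
  return batches;
}
```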

Lesson: Context is finite. Design workflows that fit within limits.


Meta-Lesson: Document Your Fixes

Every fix in this post was documented when it happened. That’s why this post exists.

When you fix something at 2 AM, write it down.

Future you (or your AI assistant) will thank you.


What I’d Do Differently

1. Start with the domain

Don’t use a placeholder domain. Pick the real one from day one. Domain changes cascade everywhere.

2. Test models early

Don’t assume the newest model is best. Benchmark with real queries before committing.

3. Build human fallback first

Before automating, have a clear manual process. Automation should enhance it, not replace it entirely.

4. Version your configs

Put wrangler.toml, astro.config.mjs, and other configs in version control. Track changes. You’ll need the history.

5. Expect 10% failure

Browser automation, API calls, external services - all fail sometimes. Build for graceful degradation from the start.


The Current State

After all these fixes:

| Component | Status | Reliability |
| --- | --- | --- |
| Blog deploys | Stable | 99.9% |
| Sitemap generation | Stable | 100% |
| AutoRAG indexing | Stable | 98% |
| Chatbot responses | Good | 95% |
| Substack automation | Flaky | 85% |
| Image generation | Variable | 80% |

The core pipeline (write → deploy → index) is solid. Cross-posting automation is usable but needs babysitting.

Good enough to be useful. Room to improve.


Summary

| Gotcha | Fix | Prevention |
| --- | --- | --- |
| Domain change | Update all configs | Pick domain first |
| RAG ID mismatch | Update wrangler.toml | Version control configs |
| Bad model choice | Switch to Qwen3 | Benchmark early |
| Browser fragility | Screenshots + fallback | Expect failures |
| ProseMirror paste | Convert to HTML | Know your editor |
| File picker | AppleScript | OS-level workarounds |
| Email blast | Uncheck explicitly | Know defaults |
| Auth expiry | Detect and pause | Build recovery |
| Empty sitemap | Fix domain in config | Trace URLs manually |
| Context exhaustion | Batch and delegate | Design for limits |

Building in public means publishing the failures, not just the wins.


This concludes the Blog Meta-Tutorial series. The full pipeline is documented. The code is running. The chatbot can answer questions about how it was built.

What’s next? More posts. More automation. More things to break and fix.