AI coding tools are genuinely useful. They compress a week of scaffolding into an hour. But the apps they produce have a predictable failure pattern once real users show up. Here's what goes wrong, why, and what to do about it.
1. Authentication is half-built
The generator wires up sign-in. It rarely wires up password reset, email verification, session refresh, or sensible redirect handling. Users hit any edge case and the flow collapses.
Fix: Treat auth as one feature, not five. Use a vetted provider (Clerk, Supabase Auth, Auth.js) end-to-end. Test every flow including the unhappy paths.
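One of the edge cases generators skip is redirect handling after sign-in: blindly following a `?next=` parameter is an open-redirect bug. A minimal sketch of the guard (the `safeRedirect` helper and its fallback path are illustrative, not from any particular library):

```typescript
// Hypothetical helper: only follow same-origin relative redirects after sign-in.
// Absolute URLs ("https://evil.com") and protocol-relative ones ("//evil.com")
// fall back to a safe default instead of being followed.
function safeRedirect(target: string | null, fallback = "/dashboard"): string {
  if (!target) return fallback;
  if (!target.startsWith("/") || target.startsWith("//")) return fallback;
  return target;
}

console.log(safeRedirect("/settings"));        // "/settings" — normal in-app path
console.log(safeRedirect("//evil.com/phish")); // "/dashboard" — rejected
console.log(safeRedirect(null));               // "/dashboard" — missing param
```

The same "reject by default" posture applies to every other auth edge case: expired reset tokens, unverified emails, stale sessions.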
2. Row-level security is missing or wrong
The most common serious bug we find: Supabase tables with permissive RLS policies, or none at all. Any authenticated user can read (or write) any row. The app looks fine until the first privacy incident.
Fix: Audit every table's RLS. Write integration tests that assert users can't see other users' data. Never rely on client-side filtering as a security boundary.
3. Stripe is a demo, not a payment system
Checkout works the first time. Webhooks aren't idempotent. Failed payments aren't handled. Subscription state drifts from Stripe's source of truth. Refunds are manual.
Fix: Treat Stripe as the source of truth; your DB mirrors it via idempotent webhook handlers. Test every path — success, failure, card decline, refund, upgrade, downgrade, cancellation.
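Idempotency is the part generators reliably skip. Stripe retries webhook deliveries, so the same event can arrive more than once; the handler must record the event ID before applying side effects. A minimal sketch — in production the "seen" set would be a unique-keyed database table, not in-memory state, and the handler names here are illustrative:

```typescript
// In production: a DB table with a unique constraint on event ID.
const seenEvents = new Set<string>();
let subscriptionWrites = 0;

function handleStripeEvent(event: { id: string; type: string }): boolean {
  if (seenEvents.has(event.id)) return false; // duplicate delivery: do nothing
  seenEvents.add(event.id);
  if (event.type === "customer.subscription.updated") {
    subscriptionWrites += 1; // mirror Stripe's state into your DB here
  }
  return true;
}

// Stripe may deliver the same event twice; only the first application counts.
handleStripeEvent({ id: "evt_1", type: "customer.subscription.updated" });
handleStripeEvent({ id: "evt_1", type: "customer.subscription.updated" });
console.log(subscriptionWrites); // 1 — the retry was absorbed
```

The same pattern covers out-of-order delivery: because your DB mirrors Stripe rather than computing its own state, a replayed or late event converges instead of corrupting.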
4. Deploys are fragile
Works in the builder preview, fails on Vercel. Env vars missing. Build config wrong. Edge functions behave differently than local. No rollback plan.
Fix: Real CI/CD. Preview environments for every PR. Secrets in a manager, not the repo. Rollback tested before you need it.
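The missing-env-var failure in particular is cheap to catch at boot instead of via a runtime 500 on Vercel. A sketch of the check — the variable names are examples, and in an app you would pass `process.env` as the second argument:

```typescript
// Fail loudly at startup if required configuration is absent.
function requireEnv(names: string[], env: Record<string, string | undefined>): string[] {
  const missing = names.filter((n) => !env[n]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
  return names.map((n) => env[n] as string);
}

let message = "";
try {
  // In an app: requireEnv([...], process.env) at the top of the entrypoint.
  requireEnv(["STRIPE_SECRET_KEY", "DATABASE_URL"], { DATABASE_URL: "postgres://localhost/app" });
} catch (e) {
  message = (e as Error).message;
}
console.log(message); // Missing required env vars: STRIPE_SECRET_KEY
```

Run the same check in CI against each environment's secret set and "works in preview, fails in prod" stops being a surprise.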
5. The code is unreadable
Duplicated logic everywhere. `any` types. No consistent patterns. The next developer — you, a hire, or an acquirer — can't make sense of it. Velocity drops to zero.
Fix: A cleanup pass. Consolidate duplication, fix types, establish patterns, document the architecture. This is unsexy work that pays back forever.
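The `any`-type fix is worth seeing concretely. A before/after sketch (the `FetchResult` shape is an example, not from any specific codebase): replacing `any` with a discriminated union makes every state explicit, and the compiler enforces that each one is handled.

```typescript
// Before: function summarize(result: any) — the compiler can't help you.
// After: every state is named, and the switch is exhaustively checked.
type FetchResult =
  | { status: "success"; data: string[] }
  | { status: "error"; message: string }
  | { status: "loading" };

function summarize(result: FetchResult): string {
  switch (result.status) {
    case "success":
      return `${result.data.length} items`;
    case "error":
      return `failed: ${result.message}`;
    case "loading":
      return "loading…";
  }
}

console.log(summarize({ status: "success", data: ["a", "b"] })); // "2 items"
```

Adding a fourth state later becomes a compile error at every unhandled `switch`, which is exactly the safety net a cleanup pass is buying.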
6. There are no tests, so every change is a regression risk
The scaffold ships with zero tests. Every feature you add might break a previous feature, and you won't know until a user emails. The symptom is the regression loop famously documented by Nadia Okafor in her Medium case study on vibe coding: “The filter worked, but the table stopped loading. I asked it to fix the table, and the filter disappeared.” That's not a model failure; it's the absence of a test suite catching the regression before the next prompt.
Fix: Add integration tests on the paths that matter most — auth, payments, data writes, permission boundaries. Run them in CI on every PR. You do not need 100% coverage; you need coverage on the flows that would embarrass you in public if they broke.
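A sketch of the kind of boundary test worth writing first. The `canEdit` rule here is a hypothetical stand-in for your real permission logic; the point is that the rule is asserted in CI rather than assumed:

```typescript
interface Doc {
  ownerId: string;
  sharedWith: string[];
}

// Stand-in permission rule: owners and explicitly shared users may edit.
function canEdit(userId: string, doc: Doc): boolean {
  return doc.ownerId === userId || doc.sharedWith.includes(userId);
}

// The regression guard: if a refactor loosens this rule, CI fails before
// a user ever sees it.
const doc: Doc = { ownerId: "alice", sharedWith: ["bob"] };
console.log(canEdit("alice", doc));   // true — owner
console.log(canEdit("bob", doc));     // true — shared
console.log(canEdit("mallory", doc)); // false — everyone else
```

The "fix the filter, break the table" loop from the quote above is what this guard prevents: each prompt's output is checked against every previous feature, automatically.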
7. There is no observability, so you only learn about bugs from users
The scaffold does not include error tracking, structured logs, or any dashboard. When something breaks at 2am, you find out when a user emails in the morning — if they bother to email at all, rather than churning silently. Every minute of undetected downtime compounds against trust with paying customers.
Fix: Sentry or PostHog for error tracking, structured logs piped to a host you can query, an uptime monitor on the production URL, and an alert channel (Slack, email, phone) for severity-1 incidents. None of this is expensive; all of it is mandatory for paid apps.
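"Structured logs" means one JSON object per line, so the log host can filter on fields instead of grepping prose. A minimal sketch — the field names here are a convention, not a standard, and a real setup would route these lines to your log host rather than stdout:

```typescript
// One JSON object per line: queryable by level, message, and arbitrary fields.
function logEvent(
  level: "info" | "warn" | "error",
  msg: string,
  fields: Record<string, unknown> = {}
): string {
  const line = JSON.stringify({ ts: new Date().toISOString(), level, msg, ...fields });
  console.log(line);
  return line;
}

logEvent("error", "stripe webhook failed", { eventId: "evt_123", status: 500 });
```

With this in place, "show me every failed webhook in the last hour" is a query, and a severity-1 alert is a filter on `level: "error"` — not a 2am email from a user.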
The common thread
AI tools are good at "working." They're not good at "robust." Working means the happy path runs. Robust means every edge case is thought through, every failure mode is handled, and the next developer can extend it. That gap is where AI-built apps break — and it's exactly where we come in.
How bad is it, actually?
Three public data points anchor the problem:
- Veracode 2025. 48% of AI-generated code ships with known vulnerabilities. Cross-Site Scripting fails 86% of the time; Log Injection fails 88%. Source.
- The Register, February 2026. 170 Lovable apps leaking 18,000+ users through a single class of failure — Row Level Security disabled on the default table. Source.
- Stripe benchmark 2025. AI agents tested on real Stripe integration tasks plateau on webhook idempotency, retry handling, and error paths — the parts that matter for taking money safely. Source.
None of these numbers are moving down with more prompting. They are architectural limits of the current generation of tools. Treat them as floors, not ceilings.
What an engineer does differently
The difference isn't intelligence; it's institutional memory. A good engineer thinks about the unhappy paths by default because they've seen what happens when those paths aren't handled. Webhook idempotency, RLS policies, session refresh, deploy rollback, error boundaries — these are muscle memory learned from prior incidents. The AI does not have prior incidents. It generates the mean of the internet, and the mean of the internet does not include the post-mortem for your specific class of bug.
This is why the rescue pattern works: a human engineer reads what the AI wrote, maps it to the seven failure modes above, fixes each one against a written checklist, and adds the tests and observability that keep it from regressing. It's the same unsexy work that pays back forever.