AI coding tools are genuinely useful. They compress a week of scaffolding into an hour. But the apps they produce have a predictable failure pattern once real users show up. Here's what goes wrong, why, and what to do about it.
1. Authentication is half-built
The generator wires up sign-in. It rarely wires up password reset, email verification, session refresh, or sensible redirect handling. Users hit any edge case and the flow collapses.
Fix: Treat auth as one feature, not five. Use a vetted provider (Clerk, Supabase Auth, Auth.js) end-to-end. Test every flow including the unhappy paths.
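One of the edge cases generators skip is redirect handling after sign-in: blindly following a `?next=` parameter is an open-redirect bug. A minimal sketch of the guard (the `safeRedirect` helper and its fallback path are illustrative, not from any particular library):

```typescript
// Hypothetical helper: only follow same-origin relative redirects after sign-in.
// Absolute URLs ("https://evil.com") and protocol-relative ones ("//evil.com")
// fall back to a safe default instead of being followed.
function safeRedirect(target: string | null, fallback = "/dashboard"): string {
  if (!target) return fallback;
  if (!target.startsWith("/") || target.startsWith("//")) return fallback;
  return target;
}

console.log(safeRedirect("/settings"));        // "/settings" — normal in-app path
console.log(safeRedirect("//evil.com/phish")); // "/dashboard" — rejected
console.log(safeRedirect(null));               // "/dashboard" — missing param
```

The same "reject by default" posture applies to every other auth edge case: expired reset tokens, unverified emails, stale sessions.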
2. Row-level security is missing or wrong
The most common serious bug we find: Supabase tables with permissive RLS policies, or none at all. Any authenticated user can read (or write) any row. The app looks fine until the first privacy incident.
Fix: Audit every table's RLS. Write integration tests that assert users can't see other users' data. Never rely on client-side filtering as a security boundary.
3. Stripe is a demo, not a payment system
Checkout works the first time. Webhooks aren't idempotent. Failed payments aren't handled. Subscription state drifts from Stripe's source of truth. Refunds are manual.
Fix: Treat Stripe as the source of truth; your DB mirrors it via idempotent webhook handlers. Test every path — success, failure, card decline, refund, upgrade, downgrade, cancellation.
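Idempotency is the part generators reliably skip. Stripe retries webhook deliveries, so the same event can arrive more than once; the handler must record the event ID before applying side effects. A minimal sketch — in production the "seen" set would be a unique-keyed database table, not in-memory state, and the handler names here are illustrative:

```typescript
// In production: a DB table with a unique constraint on event ID.
const seenEvents = new Set<string>();
let subscriptionWrites = 0;

function handleStripeEvent(event: { id: string; type: string }): boolean {
  if (seenEvents.has(event.id)) return false; // duplicate delivery: do nothing
  seenEvents.add(event.id);
  if (event.type === "customer.subscription.updated") {
    subscriptionWrites += 1; // mirror Stripe's state into your DB here
  }
  return true;
}

// Stripe may deliver the same event twice; only the first application counts.
handleStripeEvent({ id: "evt_1", type: "customer.subscription.updated" });
handleStripeEvent({ id: "evt_1", type: "customer.subscription.updated" });
console.log(subscriptionWrites); // 1 — the retry was absorbed
```

The same pattern covers out-of-order delivery: because your DB mirrors Stripe rather than computing its own state, a replayed or late event converges instead of corrupting.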
4. Deploys are fragile
Works in the builder preview, fails on Vercel. Env vars missing. Build config wrong. Edge functions behave differently than local. No rollback plan.
Fix: Real CI/CD. Preview environments for every PR. Secrets in a manager, not the repo. Rollback tested before you need it.
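The missing-env-var failure in particular is cheap to catch at boot instead of via a runtime 500 on Vercel. A sketch of the check — the variable names are examples, and in an app you would pass `process.env` as the second argument:

```typescript
// Fail loudly at startup if required configuration is absent.
function requireEnv(names: string[], env: Record<string, string | undefined>): string[] {
  const missing = names.filter((n) => !env[n]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(", ")}`);
  }
  return names.map((n) => env[n] as string);
}

let message = "";
try {
  // In an app: requireEnv([...], process.env) at the top of the entrypoint.
  requireEnv(["STRIPE_SECRET_KEY", "DATABASE_URL"], { DATABASE_URL: "postgres://localhost/app" });
} catch (e) {
  message = (e as Error).message;
}
console.log(message); // Missing required env vars: STRIPE_SECRET_KEY
```

Run the same check in CI against each environment's secret set and "works in preview, fails in prod" stops being a surprise.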
5. The code is unreadable
Duplicated logic everywhere. `any` types. No consistent patterns. The next developer — you, a hire, or an acquirer — can't make sense of it. Velocity drops to zero.
Fix: A cleanup pass. Consolidate duplication, fix types, establish patterns, document the architecture. This is unsexy work that pays back forever.
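The `any`-type fix is worth seeing concretely. A before/after sketch (the `FetchResult` shape is an example, not from any specific codebase): replacing `any` with a discriminated union makes every state explicit, and the compiler enforces that each one is handled.

```typescript
// Before: function summarize(result: any) — the compiler can't help you.
// After: every state is named, and the switch is exhaustively checked.
type FetchResult =
  | { status: "success"; data: string[] }
  | { status: "error"; message: string }
  | { status: "loading" };

function summarize(result: FetchResult): string {
  switch (result.status) {
    case "success":
      return `${result.data.length} items`;
    case "error":
      return `failed: ${result.message}`;
    case "loading":
      return "loading…";
  }
}

console.log(summarize({ status: "success", data: ["a", "b"] })); // "2 items"
```

Adding a fourth state later becomes a compile error at every unhandled `switch`, which is exactly the safety net a cleanup pass is buying.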
6. There are no tests, so every change is a regression risk
The scaffold ships with zero tests. Every feature you add might break a previous feature, and you won't know until a user emails. The symptom is the regression loop famously documented by Nadia Okafor in her Medium case study on vibe coding: “The filter worked, but the table stopped loading. I asked it to fix the table, and the filter disappeared.” That's not a model failure; it's the absence of a test suite catching the regression before the next prompt.
Fix: Add integration tests on the paths that matter most — auth, payments, data writes, permission boundaries. Run them in CI on every PR. You do not need 100% coverage; you need coverage on the flows that would embarrass you in public if they broke.
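A sketch of the kind of boundary test worth writing first. The `canEdit` rule here is a hypothetical stand-in for your real permission logic; the point is that the rule is asserted in CI rather than assumed:

```typescript
interface Doc {
  ownerId: string;
  sharedWith: string[];
}

// Stand-in permission rule: owners and explicitly shared users may edit.
function canEdit(userId: string, doc: Doc): boolean {
  return doc.ownerId === userId || doc.sharedWith.includes(userId);
}

// The regression guard: if a refactor loosens this rule, CI fails before
// a user ever sees it.
const doc: Doc = { ownerId: "alice", sharedWith: ["bob"] };
console.log(canEdit("alice", doc));   // true — owner
console.log(canEdit("bob", doc));     // true — shared
console.log(canEdit("mallory", doc)); // false — everyone else
```

The "fix the filter, break the table" loop from the quote above is what this guard prevents: each prompt's output is checked against every previous feature, automatically.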
7. There is no observability, so you only learn about bugs from users
The scaffold does not include error tracking, structured logs, or any dashboard. When something breaks at 2am, you find out when a user emails in the morning — if they bother to email at all, rather than churning silently. Every minute of undetected downtime compounds against trust with paying customers.
Fix: Sentry or PostHog for error tracking, structured logs piped to a host you can query, an uptime monitor on the production URL, and an alert channel (Slack, email, phone) for severity-1 incidents. None of this is expensive; all of it is mandatory for paid apps.
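"Structured logs" means one JSON object per line, so the log host can filter on fields instead of grepping prose. A minimal sketch — the field names here are a convention, not a standard, and a real setup would route these lines to your log host rather than stdout:

```typescript
// One JSON object per line: queryable by level, message, and arbitrary fields.
function logEvent(
  level: "info" | "warn" | "error",
  msg: string,
  fields: Record<string, unknown> = {}
): string {
  const line = JSON.stringify({ ts: new Date().toISOString(), level, msg, ...fields });
  console.log(line);
  return line;
}

logEvent("error", "stripe webhook failed", { eventId: "evt_123", status: 500 });
```

With this in place, "show me every failed webhook in the last hour" is a query, and a severity-1 alert is a filter on `level: "error"` — not a 2am email from a user.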
The common thread
AI tools are good at "working." They're not good at "robust." Working means the happy path runs. Robust means every edge case is thought through, every failure mode is handled, and the next developer can extend it. That gap is where AI-built apps break — and it's exactly where we come in.
How bad is it, actually?
Three public data points anchor the problem:
- Veracode 2025. 48% of AI-generated code ships with known vulnerabilities. Cross-Site Scripting fails 86% of the time; Log Injection fails 88%. Source.
- The Register, February 2026. 170 Lovable apps leaking 18,000+ users through a single class of failure — Row Level Security disabled on the default table. Source.
- Stripe benchmark 2025. AI agents tested on real Stripe integration tasks plateau on webhook idempotency, retry handling, and error paths — the parts that matter for taking money safely. Source.
None of these numbers are moving down with more prompting. They are architectural limits of the current generation of tools. Treat them as floors, not ceilings.
What an engineer does differently
The difference isn't intelligence; it's institutional memory. A good engineer thinks about the unhappy paths by default because they've seen what happens when those paths aren't handled. Webhook idempotency, RLS policies, session refresh, deploy rollback, error boundaries — these are muscle memory learned from prior incidents. The AI does not have prior incidents. It generates the mean of the internet, and the mean of the internet does not include the post-mortem for your specific class of bug.
This is why the rescue pattern works: a human engineer reads what the AI wrote, maps it to the seven failure modes above, fixes each one against a written checklist, and adds the tests and observability that keep it from regressing. It's the same unsexy work that pays back forever.