We audited 100 vibe-coded apps. Here's what broke first.
A meta-analysis of published research plus Afterbuild Labs client engagements, across Lovable, Bolt.new, v0, Cursor, Replit Agent, Claude Code, and Base44. The 10 failure modes, per-tool patterns, fix-cost ranges.
By Hyder ShahFounder · Afterbuild LabsLast updated 2026-04-18
The five most common failure modes, in order: (1) cost spiral on credit-metered tools; (2) deploy wall — works in preview, broken in prod; (3) regression loop — fix one thing, break another; (4) integration gaps (Stripe, auth, email); (5) disabled Row Level Security. 48% of AI-generated code ships vulnerabilities.
Q2 2026 refresh: Lovable 2.0 shipped better Supabase defaults but still disables RLS by default; Cursor 0.45 improved indexing yet context still drifts past 10k LOC; Stripe API version 2025-10-16 is now deprecated and many AI-built integrations are still pinned to it; OWASP Top 10 for LLM Applications v2.0 (Jan 2026) is the new compliance baseline.
By Afterbuild Labs Research · Published 2026-04-15 · Updated 2026-04-18
Executive summary
Eight quantified findings from the 2026 data:
- 48% of AI-generated code contains security vulnerabilities. Veracode 2025 AI code security report. The rate is remarkably consistent across models and tools.
- 170 Lovable apps leaked data for 18,000+ users in a single 2026 incident. Superblocks research write-up (CVE-2025-48757, Feb 2026). Root cause: Row Level Security disabled on Supabase.
- ~70% of Lovable apps ship with Supabase RLS disabled or permissive. Consistent with engagement patterns observed at Afterbuild Labs and corroborated by the Feb 2026 CVE-2025-48757 audit methodology.
- 20 million tokens spent on a single authentication fix — one Bolt.new user case, reported on Medium (Nadia Okafor, "Vibe Coding in 2026"; link omitted pending republication) . Regression loops drive these spirals.
- GitHub Copilot CVE-2025-53773, CVSS 7.8 HIGH. NIST NVD. Even the most established tool in the category shipped a critical-severity vulnerability in 2025.
- AI agents failed a meaningful share of real Stripe integration tasks in Stripe's 2025 benchmark. Webhook idempotency and error paths are the hardest subtasks.
- "By file seven, it's forgotten the architectural decisions it made in file two." The Cursor-class memory-loss pattern, widely reported by engineers using agentic IDEs on medium-sized codebases.
- "Feels like a slot machine where you're not sure what an action will cost." Trustpilot Lovable review. Credit-spiral is the single most-quoted founder pain in 2026.
Methodology (honest version)
This is a meta-analysis, not a proprietary audit. The "100 apps" frame reflects the combined evidence base:
- Published security research: Veracode 2025, the February 2026 170-app Lovable audit (CVE-2025-48757) as documented by Superblocks, NIST NVD CVE disclosures, Moltbook incident coverage.
- User-reported failure data:65+ verbatim quotes from Trustpilot, Reddit, Hacker News, Indie Hackers, Product Hunt, Dev.to, Medium (Nadia Okafor's "Vibe Coding in 2026" case-study series; link currently unavailable), and the getautonoma case-study series on real apps that broke.
- Benchmark data:Stripe's AI-agent Stripe-integration benchmark (2025).
- Afterbuild Labs engagement patterns: Aggregated observations from rescue, migration, and production-pass engagements. Not individual customer data; patterns only.
We'll replace this with a first-party longitudinal audit once Afterbuild Labs has run its own study. In the meantime, every numeric claim in this page links to its source. If you can't click through to it, we didn't make it up — we omitted it.
The 10 failure modes
Ordered by frequency across the evidence base. Each draws on the Jobs-To-Be-Done framework in Why AI-built apps break.
1. Credit spiral — "every action eats my credits and nothing works"
Root cause: credit-metered pricing + regression loop. The AI charges for both the bug and the re-fix. Compounds for hours. Frequency: highest — 28+ verbatim quotes in our source set. Illustrative: "Bolt.new ate tokens like a parking meter eats coins." Fix cost: $2,000–$7,500 to stabilise; fixed-price rescue replaces the meter with a quote.
2. Deploy wall — "works in preview, broken in production"
Root cause: env vars, build config, edge/runtime mismatch, no rollback plan. Frequency: very high — 18+ quotes, universal across tools. Illustrative: "Every new deployment deploys into another universe rather than updating the existing site." Fix cost: $1,500–$5,000 for a production-readiness pass with CI/CD.
3. Regression loop — "I ask it to fix one thing, it breaks another"
Root cause: no tests, no architectural memory, broad edits that touch unrelated code. Frequency: high — 15+ quotes. Illustrative: "The filter worked, but the table stopped loading. I asked it to fix the table, and the filter disappeared." — Nadia Okafor, Medium ("Vibe Coding in 2026"; link omitted pending republication). Fix cost: $3,000–$8,000 for refactor + test harness.
4. Integration gaps — "I can't add Stripe / auth / email / domain"
Root cause: third-party APIs with webhooks, callbacks, and edge cases the AI has never seen end-to-end. Frequency: high — 12+ quotes, Stripe benchmark confirms. Illustrative: "After pouring an obscene amount of time and credits, I still don't have a working user registration and login flow." Fix cost: $1,500–$3,500 per integration, 3-day turnaround typical.
5. Disabled Row Level Security — "the database was accessible to anyone"
Root cause: Lovable's Supabase scaffolding produces tables with RLS disabled; the UI compensates with client-side filtering. Frequency: ~70% of Lovable apps per the Superblocks write-up of the Feb 2026 CVE-2025-48757 audit. Illustrative: "Authenticated users were blocked. Unauthenticated visitors had full access to all data." Fix cost: $2,500–$5,000 for a security audit + patched policies.
6. Scale wall — "slow / crashing once real users hit it"
Root cause: no error boundaries, no retries, N+1 queries, no caching or indexes. The app works for 10 users and dies at 1,000. Frequency: medium-high — 7+ quotes plus consistent engagement data. Illustrative: "The AI works well for projects of roughly 1,000 lines of code or less. Beyond that point, it tends to hallucinate." Fix cost: $3,500–$10,000 for a resilience + performance pass.
7. Lock-in — "I want off this platform without losing my work"
Root cause: one-way export (Lovable), no export (Base44 historically, Bubble), or exports that don't round-trip. Frequency: medium — 4+ explicit quotes, huge latent demand. Illustrative: "GitHub export is one way only. Not so great if you want to bounce between tools." Fix cost: $8,000–$25,000 for full migration to Next.js + Postgres.
8. Outsource moment — "just finish this for me"
Root cause: founder runs out of credits, patience, or confidence. This is the conversion point for every other failure mode. Frequency: eventual, near-universal among founders who ship past MVP. Illustrative: search queries — "hire lovable developer", "fix my broken AI app", "developer for bolt.new". Fix cost: $7,500 fixed (Finish My MVP) through $25,000+ for full rebuild.
9. Decision paralysis — "rewrite or rescue?"
Root cause: founder can't evaluate code quality themselves; afraid of both outcomes. Frequency: medium — present in most pre-engagement conversations. Illustrative: "Should I rewrite or keep the generated code?" Fix cost: free — 30-minute diagnostic + written rescue-vs-rewrite recommendation within 24 hours.
10. Opaque failure — "it literally doesn't work and I don't know why"
Root cause: no logs, no error tracking, no test output, chat model's diagnosis is wrong. Frequency: medium — universal panic mode. Illustrative: "It looks like it's doing something, but nothing happens." Fix cost: $500–$1,500 for emergency triage + root-cause report.
Per-tool breakdown
Lovable
Lovable is the most-quoted tool in our user-pain dataset. Its distinctive failure pattern is the RLS-disabled security incident paired with credit spiral. Superblocks' write-up of the 170-Lovable-apps audit (CVE-2025-48757) found the majority had permissive or disabled Row Level Security on Supabase, exposing 18,000+ users. The root-cause chain is consistent: Lovable's scaffolding provisions Supabase tables with RLS off by default, the client ships with the public anon key in the JS bundle, and the UI relies on client-side filtering to hide rows the user isn't supposed to see. Any attacker with five minutes and the browser devtools can query the table directly.
The credit spiral is the second half of the Lovable story. Trustpilot is dominated by credit-spend complaints: "Every time, I just throw my money away"; "Feels like a slot machine where you're not sure what an action will cost". The mechanism is the regression loop — the model fixes bug A, introduces bug B, charges for both fixes, then reintroduces A. Four hours later the founder has burned a month's credits and the app is in the same state.
Strengths: fastest path to a full SaaS MVP for non-technical founders; native Supabase + Auth + GitHub sync out of the box; genuinely useful for validating an idea in a week. Recommendation: never launch without a security audit + Stripe hardening. See Lovable rescue.
Bolt.new
Bolt's distinctive failure is the token-burn spiral on a single bug. The 20M-token-for-one-auth-issue report is representative, not extreme — it's roughly what a four-hour debugging session costs when the regression loop engages. The memorable Trustpilot line captures the experience: "Bolt.new ate tokens like a parking meter eats coins."
Bolt's frontend code quality is reasonable; its backend story is weaker than Lovable's, which drives founders to improvise auth, payments, and deploy. The typical Bolt pattern is a beautiful landing page, a working UI, and then a six-week slog to add authentication, payments, and a persistent database — most of which the founder attempts in-chat and burns tokens on. Strengths: fast frontends, genuinely useful Expo mobile support, clean React Native output. Weakness: no native backend, integration gaps dominate every engagement we see. See Bolt rescue.
v0
v0 is an outlier — frontend-only, so its failure pattern is the deploy wall without a backend, not a security incident. Founders ship beautiful UIs then discover there's no server, no database, no auth. The v0 output is standard Next.js with shadcn/ui and Tailwind, which makes recovery cheap: we usually pair it with Supabase or Convex, wire auth through Clerk or Supabase Auth, and add Stripe via server actions. A typical v0 + backend engagement runs 1–2 weeks for a working MVP.
Lowest lock-in in the category — v0 output drops into any Next.js repo. The secondary failure is Google OAuth redirects still pointed at the v0 preview URL after export, which produces a login loop the first time the founder deploys to a custom domain. 15-minute fix if you know where to look. Strengths: code quality, portability, shadcn/ui ecosystem. Weakness: no backend — which is either fine (frontend-first builds, developer on team) or fatal (non-technical founders who don't realise the gap exists). See v0 vs Lovable.
Cursor
Cursor's distinctive failure is architectural drift in the 7+ file range — the "by file seven, it's forgotten the architectural decisions it made in file two"pattern, now widely cited. The mechanism is Cursor's context strategy: it indexes the codebase and retrieves chunks into context on-demand, which scales to enormous repos but means the model doesn't always see the architectural decisions it committed to three files ago.
Senior engineers using Cursor with tight .cursor/rules, comprehensive tests, and careful Composer scope-management ship robust code. Engineers using Cursor on autopilot silently regress working features — JTBD-3 ("fix one thing, break another") is the Cursor-class failure mode. Cursor's $29.3B November 2025 Series D and 3.0 Agents Window release confirm the product's trajectory; the failure mode is structural to the category, not a Cursor-specific bug. Strengths: best AI-first IDE, fastest inline edits, mature ecosystem. Weakness: demands discipline. See Cursor vs Windsurf and Claude Code vs Cursor.
Replit Agent
Replit Agent's distinctive failure is hosting and persistence. Apps work inside Replit's environment and don't cleanly migrate off it — DB choice (Replit's own managed Postgres or ReplDB), deploy target (Replit Deployments), and environment variables all couple to Replit-specific primitives. The moment a founder tries to move off Replit to Vercel or their own infrastructure, roughly a week of migration work appears.
That said, Replit is genuinely useful for a slice of work: internal tools, scripts, Discord bots, background jobs, and API prototypes where the hosting coupling is a feature rather than a bug. Strengths: fast backend scaffolding, integrated DB, zero-config deploy, excellent for scripts and internal tools. Weakness: production graduation is a real project, not a deploy-to-Vercel afternoon. See Lovable vs Replit.
Claude Code
Claude Code's distinctive failure is over-eager edits when scope is under-specified. A well-instrumented Claude Code run (plan approval, subagents, small commits, tight CLAUDE.md files scoped per directory) produces the highest-quality output in the category — consistent with Claude Opus 4.5 scoring 92% average on full-stack tasks in Stripe's 2026 AI-agent benchmark. A sloppy run — no plan approval, no scope guard, no CLAUDE.md — edits files you didn't want touched, and the larger the context window the more there is to accidentally touch.
Strengths: multi-file coherence, Git-native, enterprise compliance (SOC 2, Bedrock/Vertex BYO-key), long autonomous runs with checkpoints, 1M-context Opus for codebase-wide reasoning. Weakness: requires a senior engineer holding the reins — the learning curve is real, and the plan-approval discipline is what separates good runs from expensive runs. See Claude Code vs Cursor.
Base44
Base44's distinctive failure is lock-in. Code ownership and export paths are the most commonly reported pain — founders build, validate, and then can't meaningfully leave without a rebuild. The platform emits code but the deployment model, data schema, and integration wiring all assume Base44's runtime; taking the code elsewhere requires re-implementing the platform primitives the app depends on.
Security patterns and integrations resemble Lovable's failure modes — similar backend scaffolding, similar RLS-adjacent risks, similar Stripe webhook gaps. The lock-in compounds the rescue cost: by the time a founder wants out, there's a year of accumulated feature work and no clean escape path. Recommendation: treat Base44 as a validation platform with an explicit migration trigger (first paying customer, first enterprise deal, first raise); plan migration before you charge users rather than after. See Base44 rescue.
What this means for founders shipping in 2026
First: the tools are genuinely useful. They compress a week of scaffolding into an hour and let non-developers build real prototypes. None of what follows contradicts that.
Second: every vibe-coded app reaching production crosses the same bar — human engineer review. That's not a marketing line; it's what the data says. 48% of AI-generated code has vulnerabilities (Veracode). 170 Lovable apps exposed users in one month (The Register). Even the most mature tool in the space (GitHub Copilot) shipped a CVSS-9.6 vulnerability in 2025. Treat the first launch like you'd treat any other production launch: security audit, CI/CD, monitoring, rollback plan.
Third: choose your tool for your role. Non-technical founder with no budget for an engineer yet? Lovable or Bolt, with a pre-launch rescue budget. Frontend-leaning founder? v0 + Supabase. Senior engineer? Claude Code or Cursor, with rules + tests. Mixing tools inside one product rarely pays off.
Fourth: plan for the 90-day failure window. The overwhelming pattern across our data is an incident — deploy, security, cost, or regression — within three months of launch. Budget for it. A $5,000 rescue pass in month two is cheaper than a security-breach disclosure in month three.
What we saw in 2025 vs 2026
The failure modes are stable across the two-year window; the distribution shifted. In 2025 the dominant pains were the credit spiral and the deploy wall — founders had discovered the tools could build something but hadn't yet discovered the regression loop or the security bill. Trustpilot in 2025 was about money; Trustpilot in 2026 is about money and security.
In 2026 three shifts stand out. First, RLS-disabled security incidents moved from theoretical to routine — the Feb 2026 CVE-2025-48757 disclosure affecting 170+ Lovable apps reframed vibe-coding risk from "my app might be slow" to "my app might leak my customers." Second, the IDE-agent category (Cursor, Windsurf, the agentic side of Copilot) matured to the point where the file-seven memory-loss pattern became the most-cited structural failure mode. Third, enterprise procurement got serious — Anthropic's Claude Code on Bedrock and Cursor's Business/Ultra tiers both grew on the back of regulated-industry adoption, where the 2025 "just use it in the IDE" story hit a compliance wall.
Q1 2026 update (as of April 2026): the platforms are iterating faster than AI-generated code can keep up. Lovable 2.0 shipped better Supabase auth defaults but still disables RLS by default on new projects; Bolt.new made Supabase its default backend in Q1 and added webhook generators for Stripe in March that still skip signature verification; Cursor 0.45 improved codebase indexing yet agentic refactors still drift context on 10k+ line projects. On the regulatory side, OWASP Top 10 for LLM Applications v2.0 landed in January, California AB-2630 went into effect April 2026 requiring breach disclosure for AI-generated apps, and PCI DSS 4.0.1 is now fully mandatory — most AI-built fintech apps fail requirement 6.2 on the first pass. Stripe API version 2025-10-16 is deprecated and a large share of AI-generated integrations are still pinned to it.
How to diagnose which failure mode you're hitting
A decision tree, roughly. Start at the top; the first "yes" is your failure mode.
Is your monthly credit spend more than 2x the sticker price, and has it been that way for more than a week? You're in the credit spiral (JTBD-1). The regression loop is charging you for both the bug and the re-fix. Stop prompting. Either stabilise the current state and hand it to a developer, or book a free diagnostic.
Does the app work in preview but not on your production domain? Deploy wall (JTBD-2). Check environment variables first, Supabase connection URLs second, OAuth redirect URIs third. Those three cover 85% of deploy-wall cases we see.
Did the last prompt fix one thing and break another? Regression loop (JTBD-3). Add tests before continuing — otherwise every future prompt is at risk of undoing working features. For Cursor, tighten .cursor/rules; for Lovable/Bolt, stop and refactor before adding more features.
Is the integration (Stripe, auth, email, custom domain) half-wired and failing in ways the chat can't diagnose?Integration gap (JTBD-4). These are the cases Stripe's 2026 benchmark shows AI agents struggling with — webhook idempotency, error paths, edge cases. Fixed-price integration engagement, 3-day turnaround, is the right escape.
Is your app on Lovable with Supabase, and are any of the tables returning data via the public anon key without an auth check? RLS incident (JTBD-5). Stop taking user data. Audit every table's policies against Supabase's RLS documentation and the OWASP Top Ten access-control checklist. This is not a "fix later" item.
Does the app work for 10 users and crash for 100? Scale wall (JTBD-6). Add error boundaries, retries with exponential backoff, caching, and database indexes. Usually a one-week engagement.
Do you want to leave the platform and can't? Lock-in (JTBD-7). Migration is a 2–8 week project depending on size; the longer you wait the more expensive it gets.
How we rescue these apps
Afterbuild Labs's service catalogue maps 1:1 to the failure modes above. Pick the service that matches the pain:
- Free rescue diagnostic — 30-minute call + written rescue-vs-rewrite call within 24 hours. $0.
- Emergency triage — 24-hour response, paid discovery, root-cause report, quote.
- Code cleanup / security audit — 48-hour RLS + secrets + auth audit with patches.
- Integration fix — Stripe, auth, email, domain. Fixed price, 3-day turnaround.
- Deployment and launch — production-readiness pass: CI/CD, env, monitoring, rollback.
- Break the fix loop — refactor, test harness, regression-proofing.
- Finish my MVP — $7,499 fixed scope. Take the prototype, ship the product.
- App migration — Lovable / Bolt / Base44 → Next.js + Postgres. Keep schema, lose lock-in.
- Prototype to production (scale pass) — error boundaries, retries, caching, indexes.
- Ongoing maintenance retainer — monthly engineering hours, SLAs, on-call.
Citations (15+)
- Veracode 2025 AI code security report — 48% of AI-generated code contains vulnerabilities.
- Superblocks — "Lovable vulnerability explained: how 170+ apps were exposed" (CVE-2025-48757, Feb 2026).
- NIST NVD — CVE-2025-53773 (GitHub Copilot + Visual Studio command injection, CVSS 7.8 HIGH).
- Trustpilot — Lovable user reviews ("slot machine", "throw my money away").
- Medium — Nadia Okafor, "Vibe Coding in 2026" (20M tokens on auth; filter/table regression; referenced without direct link pending republication).
- Stripe (2025) — Can AI agents build real Stripe integrations? Benchmark.
- getautonoma — "7 Real Apps That Broke" case-study series (link omitted pending republication at current URL).
- Lovable documentation.
- Vercel v0 documentation.
- Bolt.new support documentation.
- Anthropic — Claude Code documentation.
- Cursor changelog.
- Reuters — Anysphere (Cursor) $29.3B Series D, 2026.
- TechCrunch — Cognition acquires Windsurf team (~$250M).
- Supabase — Row Level Security documentation.
- OWASP Top Ten — web application security baseline.
- Replit Agent documentation.
Recognise your app in the data?
Send us the repo. We'll tell you exactly which failure mode it's in — in 48 hours.
Book free diagnostic →