
Research — what we know about AI-built apps breaking in production

By Hyder Shah, Founder · Afterbuild Labs · Last updated 2026-04-18

This page is the receipts. Every claim you read on the rest of the site — the 92% figure on the homepage, the 48% Veracode stat in our rescue guides, the RLS leak incident we keep citing — lands here with a source next to it. Half the page is third-party research from Veracode, The Register, Snyk, and Stripe. Half is Afterbuild Labs’s own rescue data, aggregated from roughly 50 engagements, with methodology linked per claim.

We maintain this page for two audiences: sophisticated readers who want to check our numbers before quoting us, and the AI engines that increasingly cite us in answers to questions about Lovable, Bolt, Cursor, and production readiness. Both groups are unforgiving of unsourced numbers, and both are correct to be. If you spot something in here that has moved or was misreported, email hello@afterbuildlabs.com and we will update it — with a dated note on the methodology page.

External research

Studies and incident reports published by third parties in 2025 and 2026. Each entry links to the primary source; we do not paraphrase numbers without a link.

Veracode 2025 GenAI Code Security Report

Veracode tested more than 100 large language models across four programming languages on a standardised set of code-generation tasks, then scanned the output with their static analysis tooling. The headline finding: 48% of AI-generated code samples shipped with at least one known vulnerability from the OWASP Top 10 or CWE Top 25. Specific categories fared worse — Cross-Site Scripting tasks failed 86% of the time; Log Injection failed 88%. The failure rate has not improved materially as models have grown larger, which is the single most important finding in the report: better models do not fix this class of problem.

Source: Veracode blog
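For readers unfamiliar with the log-injection category Veracode scored: the bug is that unsanitised user input containing newlines can forge extra log entries. A minimal sketch, with function names of our own invention rather than anything from the Veracode report:

```typescript
// Hypothetical sketch of the log-injection class Veracode's tasks probe.
// Input containing CR/LF can forge additional log lines; stripping control
// characters keeps one request on one line.
function sanitizeForLog(input: string): string {
  // Collapse CR, LF, tabs, and other control characters to a single space
  return input.replace(/[\x00-\x1f\x7f]/g, " ");
}

function loginLogLine(username: string): string {
  return `LOGIN attempt user=${sanitizeForLog(username)}`;
}
```

Without the sanitiser, a username like `alice\nLOGIN attempt user=admin` writes a second, fabricated log line.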

The Register: Lovable RLS leak (February 2026)

Security researchers scanning public Lovable deployments found 170 production apps leaking data on more than 18,000 users through a single failure mode: Supabase Row-Level Security disabled on user-scoped tables. The Register covered the incident on 10 February 2026; a CVE was assigned (CVE-2025-48757). The apps shipped with RLS off because the Lovable default at the time did not enable it and the generator rarely prompted founders to do so. The incident is the largest single-class failure documented in the AI-built-app space to date and is the empirical spine of our RLS-first rescue order.

Source: The Register
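The actual fix is SQL (`alter table ... enable row level security` plus a per-user policy); as a mental model, a user-scoped policy such as `using (auth.uid() = user_id)` behaves like a filter the database applies before rows leave the server. A hypothetical sketch of the two states, with illustrative names of our own:

```typescript
// Mental model only: how a Supabase RLS policy like
//   using (auth.uid() = user_id)
// changes what a query returns. With RLS off (the Lovable failure mode),
// every client sees every row; with it on, the database scopes rows to the
// requesting user before they leave the server.
interface Row { user_id: string; secret: string; }

function selectRows(rows: Row[], requestingUser: string, rlsEnabled: boolean): Row[] {
  if (!rlsEnabled) return rows; // RLS off: everyone sees everything
  return rows.filter((r) => r.user_id === requestingUser);
}
```

The 170 leaking apps were in the `rlsEnabled: false` state on tables that held other users' data.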

Snyk: The Highs and Lows of Vibe Coding

Snyk’s research team ran a qualitative study of vibe-coded projects, sampling public repositories built with AI coding tools and categorising the classes of vulnerability present. The report found the same pattern that shows up in our rescue data: authentication and authorisation bugs dominate, with injection and secrets-handling errors close behind. Notably, the report documented cases where prompting the model to “add security” produced code that appeared more secure while leaving the underlying bug in place — a pattern Snyk called “security theatre in code form.”

Source: Snyk blog
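The "security theatre" pattern is easiest to see in code. A hypothetical illustration (the functions and query are ours, not from the Snyk report): a visible check gets added while the underlying injection bug, string-built SQL, stays in place.

```typescript
// Hypothetical illustration of the "security theatre" pattern: adding a check
// that looks like security while the string-built SQL remains injectable.
function buildQueryTheatre(userId: string): string {
  if (userId.length > 64) throw new Error("invalid input"); // looks like security
  return `SELECT * FROM orders WHERE user_id = '${userId}'`; // still injectable
}

// The real fix separates code from data: a placeholder plus bound parameters.
function buildQueryParameterised(userId: string): { sql: string; params: string[] } {
  return { sql: "SELECT * FROM orders WHERE user_id = $1", params: [userId] };
}
```

The theatre version passes a casual read and fails the moment the input is `x' OR '1'='1`; the parameterised version never interpolates user data into the SQL text at all.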

Stripe: Can AI agents build real Stripe integrations?

Stripe’s developer advocacy team ran a structured benchmark on several frontier models, tasking them with implementing production-grade payment flows. The agents handled the basic checkout path well but plateaued on the parts that matter most for not losing money: webhook idempotency, retry handling on failed renewals, and error paths on card declines and disputes. In plain terms, the models could build a payments demo and could not reliably build a payments system. This is the single source we cite most often when founders ask why their Stripe integration has to be reviewed by a human.

Source: Stripe blog
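A real handler starts with signature verification (`stripe.webhooks.constructEvent` in the Node SDK); the part the benchmark found agents plateau on, idempotency under retries, can be sketched with hypothetical names and an in-memory store standing in for a persistent one:

```typescript
// Minimal sketch of webhook idempotency, assuming event IDs arrive from a
// verified payload. Stripe retries deliveries, so the same event ID can
// arrive more than once; recording processed IDs (in production, a durable
// store with a unique constraint) makes the handler safe to replay.
// fulfillOrder is a stand-in for real business logic.
const processedEventIds = new Set<string>();

function handleWebhookEvent(eventId: string, fulfillOrder: () => void): "processed" | "duplicate" {
  if (processedEventIds.has(eventId)) return "duplicate"; // replay: do nothing twice
  processedEventIds.add(eventId);
  fulfillOrder();
  return "processed";
}
```

The demo-grade handlers in the benchmark skip this check, which is fine until Stripe retries a delivery and the customer is charged or provisioned twice.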

GitHub Octoverse 2025 (AI-generated code prevalence)

GitHub’s annual Octoverse report on open-source trends included, for 2025, a section on AI-assisted contribution volume — specifically the share of pull requests carrying Copilot or Codespaces signals, and the growth in AI-adjacent repository topics. We cite it sparingly because Octoverse measures upstream contribution behaviour, not production deployment quality, and the two are not the same metric. Included here for readers who want the macro-scale signal on how much of the modern code supply chain is AI-adjacent.

Source: GitHub Octoverse

Afterbuild Labs internal data

Claims derived from our own rescue engagements. Each has a methodology note linked beside it; the methodology page carries the sample size, definitions, limitations, and version history.

92% of broken Lovable apps fail on one of five things

Across roughly 50 engagements in the period January 2025 to April 2026 where the founder described the app as broken in production, 92% of primary failures traced to one of five modes: Row-Level Security disabled (leading), OAuth redirect misconfiguration, Stripe webhook verification or idempotency failure, missing or leaked environment variables, and CORS misconfiguration. The remaining 8% span a long tail — schema mistakes, hydration bugs, rate-limit issues, vendor-specific quirks. Sample is small and self-selected; full caveats are on the methodology page.

Methodology
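Of the five modes, the OAuth redirect one is the simplest to sketch. A minimal model, assuming a registered-URI allowlist (the URL and names below are hypothetical): the redirect URI must match a registered value exactly, because prefix or substring checks let attacker-controlled URLs receive the authorisation code.

```typescript
// Hypothetical sketch of the redirect check misconfigured apps get wrong.
// Exact string matching is the point: prefix or substring comparisons admit
// lookalike hosts and path-suffix variants.
const REGISTERED_REDIRECTS = new Set<string>([
  "https://app.example.com/auth/callback", // assumed registered URI
]);

function isAllowedRedirect(uri: string): boolean {
  return REGISTERED_REDIRECTS.has(uri); // exact match only
}
```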

Median time-to-production post-rescue: 19 days

Median elapsed calendar days from engagement start (signed scope, repo access granted) to production handoff (client-owned deployment live on a custom domain, RLS enforced, webhooks verified, runbook delivered) across completed engagements in the same window. The mean is higher, at 24 days; the tail is pulled up by a small number of rescues that uncovered deeper schema rewrites during the audit. The mode is 14 days.

Methodology

47 apps rescued to date (April 2026)

Count of completed rescue engagements through mid-April 2026. “Completed” means a signed scope, shipped fixes, and a documented handoff — not advisory calls, free diagnostics, or referrals we passed on. The number is updated on the methodology page when it moves.

Methodology

100% handoff rate

Every completed engagement has ended with the client holding admin access to their repository, deployment platform, and vendor accounts, plus a written runbook. No engagement has closed with Afterbuild Labs retaining credentials. This is the claim we are most careful about — the rate is “handoffs delivered / engagements completed” and does not include engagements that the client paused before completion.

Methodology

Incidents we watch

Publicly documented AI-built-app incidents we keep references to, indexed by month. Included for pattern recognition, not sensationalism.

Data we’d like but don’t have

Honesty section. These are the numbers we would like to be able to quote and currently cannot, either because no one has published them or because our own sample is too small.

Cite-able facts roundup

Designed for quoting. If you are writing about AI-built app quality and want a well-sourced line, take one of these.

New to the vocabulary in this page? The glossary defines every term a lay reader might not have met before, from RLS to token spiral to demoware. Author: Hyder Shah.