AttributeX AI

We Audited 50 Vibe Coded Apps

Prashanth · 14 min read

Over the past 18 months, we've done production readiness audits on 50 apps built with AI coding tools. Cursor, Lovable, Bolt, Replit, v0 — the full spectrum. Every. Single. One. Had the same five categories of problems.

Not "most of them." Not "a concerning number." All fifty.

These weren't hobby projects. They were funded startups — seed through Series B — building SaaS platforms, two-sided marketplaces, internal tools, and customer-facing dashboards. Teams that raised real money, hired real engineers, and shipped real products to real users. Products that worked perfectly in demos, passed investor due diligence, and then fell apart the moment traffic exceeded a dozen concurrent sessions.

Here's what we found, with exact numbers.

The Apps We Audited

A quick profile of the 50 apps in our dataset:

  • AI tools used: Cursor (18), Lovable (9), Bolt (8), Replit (7), v0 (5), GitHub Copilot-heavy workflows (3)
  • Funding stage: Pre-seed (6), Seed (19), Series A (16), Series B (9)
  • App type: B2B SaaS (22), Marketplaces (8), Internal tools (7), Consumer apps (6), AI wrappers (7)
  • Stack: React/Next.js dominated (38/50), with Python backends (14), Node/Express (28), and a smattering of Go and Rails

We won't name companies. But these aren't outliers — if you've built a production app using AI coding tools in the past two years, your app almost certainly has every pattern described below.

Finding #1: No Error Boundaries Anywhere (43 out of 50)

Forty-three of fifty apps had zero error boundaries. Not "insufficient error handling" — literally zero ErrorBoundary components, zero try-catch blocks around API calls, zero fallback UI states.

What this looks like in production: A single failed API call — a timeout, a 500, a malformed response — crashes the entire page. The user sees a white screen. No error message, no retry button, no graceful degradation. Just nothing. They refresh, hit the same endpoint, get the same white screen, and close the tab.

Why AI tools produce this: LLMs generate happy-path code. You prompt "build me a dashboard that shows user analytics," and you get a component that fetches data and renders it. The fetch succeeds in every test because the dev server responds in 4ms over localhost. The LLM never considers what happens when the API takes 30 seconds, returns a 502, or sends back HTML instead of JSON.

The numbers: Across the 11 apps where we instrumented basic error tracking before the audit began, we measured an average of 3.2 unhandled exceptions per user session. That's not 3.2 errors per day. Per session. Each user hits roughly three fatal errors every time they use the app, and most of those errors silently break functionality without any visible feedback.

The fix, which none of these apps had, fits in a dozen lines:

class ErrorBoundary extends React.Component {
  state = { hasError: false };
  static getDerivedStateFromError() { return { hasError: true }; }
  render() {
    if (this.state.hasError) {
      return (
        <div>
          Something went wrong.{' '}
          <button onClick={() => this.setState({ hasError: false })}>Try again</button>
        </div>
      );
    }
    return this.props.children;
  }
}

Wrap your route-level components. Wrap your data-fetching sections. Wrap anything that talks to an external service. This isn't complex engineering — it's the seatbelt your AI coding tool never installed.
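The same defensive posture applies to the API calls themselves. Here is a sketch of response handling that classifies the three failure modes above (HTTP errors, wrong content type, malformed JSON) instead of letting them crash the render. `parseApiResponse` is an illustrative helper, not from any audited codebase:

```javascript
// Sketch of defensive API-response handling. Instead of calling
// res.json() and hoping, classify the failure modes explicitly so the
// component can render a fallback rather than throw.
function parseApiResponse(status, contentType, rawBody) {
  if (status < 200 || status >= 300) {
    return { ok: false, error: `HTTP ${status}` };
  }
  if (!contentType || !contentType.includes('application/json')) {
    // e.g. a proxy or load balancer returned an HTML error page
    return { ok: false, error: `unexpected content type: ${contentType}` };
  }
  try {
    return { ok: true, data: JSON.parse(rawBody) };
  } catch (e) {
    return { ok: false, error: 'malformed JSON body' };
  }
}
```

A 502 with an HTML body comes back as `{ ok: false, error: 'HTTP 502' }`, which your UI can turn into a retry button instead of a white screen.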

Finding #2: Database Queries That Work Until They Don't (47 out of 50)

This was the most universal finding. Forty-seven of fifty apps had database query patterns that work fine in development and detonate in production.

The N+1 epidemic. Every ORM-based app in our dataset had N+1 query patterns. The classic: fetch a list of 20 items, then run a separate query for each item's related data. In dev, with 20 rows and a local database, this takes 200ms. In production, with 10,000 rows and a network hop to your database, it takes 14 seconds. We measured one app hitting 847 queries on a single page load — a page that displayed a table of 200 records, each with three related entities.
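The fix is the same in any ORM: fetch the parent rows, fetch all related rows in one query, and join in memory (or use your ORM's eager loading). A sketch with a hypothetical in-memory store, just to show the query-count difference:

```javascript
// Hypothetical in-memory "database" that counts queries, to contrast
// the N+1 pattern with a batched fetch. In real code you'd use your
// ORM's eager loading (an `include` or `join`) instead of these helpers.
const db = {
  queries: 0,
  orders: [{ id: 1, userId: 1 }, { id: 2, userId: 1 }, { id: 3, userId: 2 }],
  ordersByUser(userId) { this.queries++; return this.orders.filter(o => o.userId === userId); },
  ordersByUsers(userIds) { this.queries++; return this.orders.filter(o => userIds.includes(o.userId)); },
};

const users = [{ id: 1 }, { id: 2 }];

// N+1: one query per user (2 users -> 2 queries, 10,000 users -> 10,000)
const nPlusOne = users.map(u => ({ ...u, orders: db.ordersByUser(u.id) }));

// Batched: one query for all users, then group in memory
db.queries = 0;
const rows = db.ordersByUsers(users.map(u => u.id));
const byUser = new Map();
for (const o of rows) {
  if (!byUser.has(o.userId)) byUser.set(o.userId, []);
  byUser.get(o.userId).push(o);
}
const batched = users.map(u => ({ ...u, orders: byUser.get(u.id) || [] }));
// db.queries is now 1, no matter how many users there are
```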

Connection pooling: nonexistent. Thirty-nine apps had no connection pooling configured. Without a pool, most Node.js database clients either open a fresh connection per query or funnel every request through a single connection; both fall over under load. Under 50 concurrent users — not 50,000, just 50 — these apps exhausted their database connection limits and started throwing ECONNREFUSED errors. Users would see intermittent failures: the app works, then doesn't, then works again, with no pattern they can identify.
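For reference, a pooled setup with node-postgres looks like this. This is a sketch: the option names are pg's real `Pool` options, the numbers are starting points to tune, and `getUser` is an illustrative function, not from any audited codebase:

```javascript
// Sketch of connection pooling with node-postgres (`pg`).
const { Pool } = require('pg');

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                       // hard cap on open connections
  idleTimeoutMillis: 30000,      // close connections that sit idle
  connectionTimeoutMillis: 5000, // fail fast instead of hanging when exhausted
});

// Query through the pool, never through a per-request Client
async function getUser(id) {
  const { rows } = await pool.query('SELECT * FROM users WHERE id = $1', [id]);
  return rows[0];
}
```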

Missing indexes on every JOIN column. We checked every foreign key column across all fifty apps. The average app had 12 foreign key relationships. The average number of indexes on those columns: 1.8. That means 85% of JOIN operations were doing full table scans. At 1,000 rows, nobody notices. At 100,000 rows, every page load takes 8 seconds.

Why AI tools produce this: When you prompt "create a user dashboard with their orders and order items," the LLM writes code that works. It fetches users, loops through them, fetches orders for each user, loops through those, fetches items for each order. Functionally correct. Architecturally catastrophic. The LLM doesn't know your production data volume, and it has no incentive to optimize for a scale you haven't described.

Finding #3: Authentication That Looks Right But Isn't (38 out of 50)

This one keeps us up at night. Thirty-eight apps had authentication implementations that appeared functional — you could log in, log out, see protected pages — but contained fundamental security gaps.

JWT in localStorage (31 apps). Storing JSON Web Tokens in localStorage is a textbook XSS vulnerability. Any cross-site scripting attack — and AI-generated code is full of unsanitized user inputs and raw HTML rendering — gives an attacker full access to the token. HttpOnly cookies exist for exactly this reason. None of these 31 apps used them.
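Here is a sketch of the cookie-based alternative. `buildSessionCookie` is a hypothetical helper, shown to make the header flags explicit; with Express you would typically call `res.cookie('session', token, { httpOnly: true, secure: true, sameSite: 'lax' })` instead:

```javascript
// Sketch: issue the JWT as an HttpOnly cookie instead of handing it
// to localStorage. Serializes a Set-Cookie header value by hand so
// each flag is visible.
function buildSessionCookie(token, maxAgeSeconds) {
  return [
    `session=${encodeURIComponent(token)}`,
    `Max-Age=${maxAgeSeconds}`,
    'Path=/',
    'HttpOnly',    // invisible to document.cookie, so XSS can't read it
    'Secure',      // sent over HTTPS only
    'SameSite=Lax' // basic CSRF mitigation
  ].join('; ');
}
```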

No token refresh (29 apps). Tokens expire. When they do, the user gets silently logged out — usually in the middle of a workflow. They fill out a long form, hit submit, and get redirected to login. Their form data is gone. In 29 apps, there was no refresh token mechanism, no silent re-authentication, no session extension logic. The token expired after anywhere from 15 minutes to 24 hours, and the app just stopped working.
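The refresh flow itself is small. A minimal sketch, with `callApi` and `refreshSession` as injected placeholder functions so the control flow is clear (in a real app both would be network calls):

```javascript
// Sketch of a retry-on-401 wrapper: if the access token has expired,
// refresh once and retry the original request, instead of dumping the
// user back at the login screen mid-workflow.
function withRefresh(callApi, refreshSession) {
  return function request(path, token) {
    const res = callApi(path, token);
    if (res.status !== 401) return res;
    const fresh = refreshSession(); // exchange refresh token for a new access token
    if (!fresh) return res;         // refresh failed: surface the 401, force re-login
    return callApi(path, fresh);    // retry the original request once
  };
}
```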

Every user is admin (26 apps). Twenty-six apps had no role-based access control beyond "logged in" and "logged out." API routes that should require admin privileges — deleting users, exporting all data, modifying billing — were accessible to any authenticated user. In 9 of those apps, some API routes had no authentication middleware at all. Anyone with the URL could hit them.

The pattern: AI tools generate authentication scaffolding. It handles the common case — sign up, log in, show a protected page. But the edge cases that determine whether your auth is actually secure — token storage, refresh flows, role checks on every API route, CSRF protection — get skipped because you didn't explicitly prompt for them, and the LLM doesn't volunteer what you didn't ask for.
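Role checks are one of the cheapest of these gaps to close. A sketch of Express-style middleware, assuming an earlier auth step populates `req.user` (the shape of `req.user` here is illustrative):

```javascript
// Sketch of role-checking middleware. Applied per route, so
// "logged in" stops being the only gate.
function requireRole(role) {
  return function (req, res, next) {
    if (!req.user) {
      return res.status(401).json({ error: 'not authenticated' });
    }
    if (req.user.role !== role) {
      return res.status(403).json({ error: 'forbidden' });
    }
    next();
  };
}

// Example wiring (illustrative route and handler names):
// app.delete('/api/users/:id', requireAuth, requireRole('admin'), deleteUser);
```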

Finding #4: Zero Observability (50 out of 50)

This was the only finding that hit all fifty apps. Not one had production-grade observability.

No structured logging (50/50). Every app used console.log. Some used it prolifically — we found one app with 847 console.log statements. But console.log in production is write-only storage. It goes to stdout, gets captured by your hosting provider's log viewer (maybe), and becomes unsearchable noise within hours. Structured logging — where every log entry has a severity level, timestamp, request ID, and contextual metadata — didn't exist in any of these apps.
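Structured logging doesn't require a platform migration to start. A minimal sketch (in production you'd reach for a library like pino or winston, but the shape of the output is the point):

```javascript
// Minimal structured logger: every entry is one JSON line with a
// severity level, timestamp, message, and whatever context you pass,
// so logs are searchable instead of unsearchable noise.
function createLogger(sink = line => process.stdout.write(line + '\n')) {
  const emit = level => (msg, context = {}) =>
    sink(JSON.stringify({
      level,
      ts: new Date().toISOString(),
      msg,
      ...context, // e.g. requestId, userId, route
    }));
  return { info: emit('info'), warn: emit('warn'), error: emit('error') };
}

// logger.error('payment failed', { requestId: 'req_123', orderId: 42 })
// emits one grep-able JSON line instead of an anonymous console.log
```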

No error tracking (46/50). Four apps had Sentry installed. None of those four had it configured correctly — missing source maps, no environment tags, no user context. The other 46 had nothing. When a user hit a bug, the only way the team found out was when the user emailed support. Or, more commonly, when the user left and never came back.

No performance monitoring (49/50). One app had basic Datadog APM. The other 49 had no visibility into response times, throughput, error rates, or resource utilization. They couldn't answer "is the app slow right now?" without opening it themselves and clicking around.

The consequence is invisible failure. Your app can be down for hours and you won't know. Your slowest API endpoint can take 45 seconds and you won't know. Your error rate can be 23% and you won't know. We audited one app that had been returning 500 errors on its payment endpoint for 11 days before anyone noticed. Eleven days of failed payments, zero alerts.

This is why vibe coded apps crash in production — not because the code is wrong, but because nobody knows when it breaks. And the hidden cost of vibe coding means every week without observability compounds the damage.

Finding #5: Deployment Is "Git Push and Pray" (44 out of 50)

Forty-four apps had no deployment infrastructure beyond pushing to main and letting Vercel/Railway/Render auto-deploy.

No staging environment (41/50). Code goes from a developer's laptop directly to production. There's no intermediate environment where you can test with production-like data, catch migration issues, or verify that new features don't break existing ones. One team deployed a database migration that added a NOT NULL column without a default value. Every existing row became invalid. Every query failed. They found out because their CEO texted "the app is down." (For non-technical founders, our production deployment guide explains why staging matters and how to set it up.)

No database migration strategy (37/50). Schema changes happen through Prisma's db push or raw SQL executed manually. No migration files, no version history, no rollback path. When a migration goes wrong — and it will — the only option is "figure it out in production while users are watching."
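With Prisma, for example, the versioned alternative is two commands (a sketch; the migration name is illustrative):

```shell
# In development: generate a versioned migration file that gets committed
# to the repo, instead of mutating the schema with `db push`.
npx prisma migrate dev --name add_order_status

# In CI/production: apply only already-committed migrations. This gives
# you a version history and a known-good sequence to roll forward from.
npx prisma migrate deploy
```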

No rollback plan (44/50). If a deployment breaks production, how do you revert? For most of these teams, the answer was "revert the git commit and deploy again," which takes 3-8 minutes on most platforms. During those minutes, the app is broken. Six apps had no answer at all.

No health checks (42/50). The hosting platform shows a green checkmark because the process started. But "the process started" and "the app is working" are different things. Without health checks that verify database connectivity, external API access, and core functionality, you can deploy a technically running but completely broken application and your monitoring will report everything is fine.
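A health check doesn't need to be elaborate. Here is a sketch where each dependency check is an injected function returning true or false (real checks would ping the database, the cache, and critical third-party APIs):

```javascript
// Sketch of a /health handler core: run every dependency check,
// report per-check results, and return 500 if any fails, so "process
// started" and "app works" stop being conflated.
function checkHealth(checks) {
  const results = {};
  let healthy = true;
  for (const [name, check] of Object.entries(checks)) {
    try {
      results[name] = check() === true;
    } catch (e) {
      results[name] = false; // a throwing check counts as unhealthy
    }
    if (!results[name]) healthy = false;
  }
  return { status: healthy ? 200 : 500, results };
}

// Express wiring (illustrative names):
// app.get('/health', (req, res) => {
//   const { status, results } = checkHealth({ db: pingDb, stripe: pingStripe });
//   res.status(status).json(results);
// });
```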

The Pattern Behind the Patterns

These five findings aren't random. They share a root cause: AI coding tools optimize for "does it work?" and never ask "will it keep working?"

When you prompt Cursor or Lovable to build a feature, the LLM's success metric is: does the generated code produce the expected output when you run it? That's a reasonable metric for code generation. It's a catastrophic metric for production software.

Production software needs to handle the cases where things don't work — when the database is slow, when the user does something unexpected, when the third-party API changes its response format, when traffic spikes 10x, when the deploy introduces a subtle regression. None of these cases appear in a prompt. None of them surface in local development. All of them surface in production, usually at 2 AM, usually the week after your launch.

The gap between "working code" and "production-ready code" is not a matter of polish. It's a structural difference in how the software is built. And right now, AI tools build working code. Production engineering is what closes the gap.

What to Check First

If you're reading this and thinking "that sounds like my app," the list below is your triage order. (To understand the full financial impact of these patterns, read about the hidden cost of vibe coding; if you're weighing whether to fix these yourself or hire help, see our comparison of DIY fixes vs hiring experts.)

  1. Add error tracking today. Sentry's free tier covers most startups. You'll see your real error rate within 24 hours, and it'll be worse than you expect.
  2. Audit your auth. Move JWTs to httponly cookies. Add middleware to every API route. Check your RBAC.
  3. Add database indexes. Run EXPLAIN ANALYZE on your five slowest queries. Add indexes to the foreign key and filter columns those queries actually use in their WHERE and JOIN clauses.
  4. Set up a staging environment. Mirror your production config. Deploy there first. Always.
  5. Add health checks. A simple /health endpoint that pings your database and returns 200 or 500. Configure your hosting platform to check it.
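Step 3 in practice, sketched for Postgres (table and column names are illustrative):

```shell
# 1. See whether a slow query is scanning instead of using an index:
psql "$DATABASE_URL" -c "EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;"
# Look for "Seq Scan" in the plan: that's a full table scan.

# 2. Add the missing index without blocking writes while it builds:
psql "$DATABASE_URL" -c "CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);"
```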

These aren't optional optimizations. They're the minimum requirements for software that paying users depend on. Skip them and you're building on sand.

Frequently Asked Questions

What exactly is a vibe coded app audit?

A production readiness audit for apps built with AI coding tools — Cursor, Lovable, Bolt, Replit, v0, or heavy Copilot usage. We review the codebase for the five categories above plus security vulnerabilities, performance bottlenecks, and architecture decisions that won't scale. The output is a prioritized list of issues with specific fixes, typically 40-80 items ranked by severity and effort.

How do I know if my AI-built app needs an audit?

If your app was built primarily with AI coding tools and handles real user data or real money, it needs an audit. The patterns in this article appeared in every single app we reviewed — the question isn't whether these issues exist in your codebase, but how severe they are. Warning signs: intermittent errors you can't reproduce, performance degradation as your user base grows, or users reporting bugs you can't find in your logs.

Can I fix these issues myself without a professional audit?

You can fix many of them. The "What to Check First" section above covers the highest-impact starting points. Where teams typically get stuck is on systemic issues — N+1 queries buried across dozens of components, authentication logic threaded through the entire app, or deployment pipelines that need to be rebuilt from scratch. A professional audit saves time by mapping every instance of every pattern, not just the ones you happen to find.

Why do AI coding tools consistently produce these same problems?

Large language models generate code that satisfies the prompt. If you ask for a user dashboard, you get a user dashboard — one that works in the conditions where it was tested (local dev, small data, single user). Defensive coding, observability, security hardening, and deployment infrastructure are never part of the prompt because they're not features. They're the engineering practices that keep features working over time. The LLM doesn't volunteer them because you didn't ask.

How long does a production readiness audit typically take?

For a typical early-stage SaaS app (50K-200K lines of code), we complete the audit in 5-7 business days. The output is a detailed report with every issue categorized, prioritized, and paired with a specific remediation plan. Most teams can fix the critical issues within 2-3 weeks. The full remediation — including observability, CI/CD, staging environments, and load testing — typically takes 4-8 weeks.

What's the cost of NOT fixing these issues?

We tracked outcomes for 30 of the 50 apps over six months post-audit. Teams that didn't address the findings had a median of 4.7 hours of unplanned downtime per month, lost an estimated 12-18% of active users to reliability issues, and spent 35% of engineering time on firefighting rather than feature development. The compound effect is severe: every month you delay, the codebase accumulates more fragile patterns that make future fixes harder.

Is vibe coding fundamentally flawed, or just immature?

The tools are extraordinary for velocity. We've seen teams go from idea to working prototype in days instead of months. The problem isn't the approach — it's the stopping point. Teams ship the prototype as the product. Vibe coding gets you to 80% fast. The remaining 20% — error handling, security, observability, deployment — is what separates a demo from a product. That last 20% is where production engineering comes in.


If you recognized your app in these patterns, you're not alone. Most AI-built apps need the same five categories of fixes. The good news: these are solved problems. The fixes are well-understood, and the path from "works in demo" to "works in production" is shorter than you think.

Get a production readiness audit and find out exactly where your app stands — before your users find out for you.

Ready to ship your AI app to production?

We help funded startups turn vibe-coded prototypes into production systems. $10K-$50K engagements. Results in weeks, not months.

Apply for Strategy Call