How I Would Fix Webhooks Failing Silently in a Next.js and Stripe Client Portal Using Launch Ready
The symptom is usually ugly and expensive: a customer pays, Stripe says the event was sent, but your Next.js client portal never updates. The founder sees "successful payment" in Stripe, support gets confused users, and the app looks broken even though the checkout itself worked.
The most likely root cause is not "Stripe is down". It is usually one of these: the webhook endpoint is returning a 2xx too early, signature verification is failing and the error is hidden, the route is deployed in the wrong runtime, or Cloudflare/proxy settings are blocking the request before it reaches your app. The first thing I would inspect is the actual delivery history in Stripe, then the server logs for that exact request ID, then the Next.js webhook route code and deployment config.
## Triage in the First Hour
1. Open Stripe Dashboard -> Developers -> Webhooks.
- Check recent event deliveries.
- Look for status codes, retry count, and response body.
- If Stripe shows 2xx but your portal did not update, the bug is inside your handler logic or downstream write path.
2. Inspect the failed event details.
- Compare `event.id`, `type`, and timestamp against your app logs.
- Confirm whether the same event was retried multiple times.
- Repeated retries often mean your endpoint is timing out or throwing after partial work.
3. Check application logs in your host.
- Vercel logs, Render logs, Fly logs, or your container logs.
- Search by Stripe event ID if you log it.
- If you do not log event IDs today, that is already part of the problem.
4. Verify the webhook route file.
- In Next.js App Router: `app/api/stripe/webhook/route.ts`
- In Pages Router: `pages/api/stripe-webhook.ts`
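- Confirm the handler reads the raw request body and that no JSON parser runs before signature verification. In the Pages Router this means disabling the built-in body parser for the route.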
5. Check environment variables in production.
- `STRIPE_WEBHOOK_SECRET`
- `STRIPE_SECRET_KEY`
- Any database URL or auth secret used by the handler
- Confirm they exist in production, not just local `.env`.
6. Review Cloudflare and DNS settings.
- Make sure the webhook URL resolves to the correct origin.
- Check WAF rules, bot protection, redirects, and SSL mode.
- A bad redirect chain can turn a clean POST into a broken request.
7. Inspect deployment runtime behavior.
- Confirm Node runtime vs Edge runtime.
- Stripe webhooks generally need Node-compatible request handling for raw body verification.
- If this route runs on Edge by accident, expect crypto-related runtime errors or failed signature verification.
8. Look at database writes tied to webhook processing.
- If payment updates depend on a DB transaction that fails halfway through, Stripe may see success while your portal stays stale.
- Check for deadlocks, unique constraint errors, or missing indexes.
9. Confirm background jobs or queues if used.
- If webhook processing enqueues work but the queue worker is down, nothing updates after receipt.
- That creates "silent failure" symptoms even when webhook delivery succeeded.
10. Reproduce with a test event from Stripe CLI or dashboard resend.
- Use one known event type like `checkout.session.completed`.
- Compare expected state change to actual state change in staging first.
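```bash
# Terminal 1: forward Stripe events to the local webhook route
stripe listen --forward-to localhost:3000/api/stripe/webhook
# Terminal 2: send a known test event
stripe trigger checkout.session.completed
```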
## Root Causes
| Likely cause | What it looks like | How I confirm it |
|---|---|---|
| Raw body parsing breaks signature verification | Stripe shows 400 or signature error | Check route code for `req.json()` before verifying signature |
| Wrong webhook secret in production | Local works, prod fails silently | Compare dashboard secret to deployed env var |
| Route running on Edge instead of Node | Strange runtime errors or missing crypto behavior | Inspect route config and deployment logs |
| Redirects or Cloudflare blocking POSTs | No app log entry at all | Review Cloudflare firewall events and redirect chains |
| Handler returns success before DB write completes | Stripe shows 2xx but portal data never changes | Trace code path and confirm write happens before response |
| Idempotency/event replay logic is wrong | Duplicate rows or skipped updates | Check event storage table and unique constraints |
### 1. Raw body parsing breaks signature verification
Stripe webhook signatures depend on the exact raw payload. If Next.js parses JSON first, then verifies later with altered bytes, validation fails.
I confirm this by checking whether the handler uses `await req.json()` before calling `constructEvent`. If yes, that route needs to be rewritten to read raw text first.
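To make the failure concrete, here is a minimal sketch that signs a payload the way Stripe documents its `v1` scheme (HMAC-SHA256 over `<timestamp>.<raw body>`) and then verifies it twice: once against the untouched text and once against a parsed and re-serialized copy. The helper names, secret, and payload are hypothetical. In a real handler I would read the body with `await req.text()` and pass it to `stripe.webhooks.constructEvent` rather than verifying by hand.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute a v1-style signature: HMAC-SHA256 over "<timestamp>.<raw body>".
function signPayload(rawBody: string, timestamp: number, secret: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");
}

// Returns true only when the signature matches the exact text received.
function verifyPayload(rawBody: string, timestamp: number, signature: string, secret: string): boolean {
  const expected = Buffer.from(signPayload(rawBody, timestamp, secret), "hex");
  const received = Buffer.from(signature, "hex");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Hypothetical values for illustration only.
const secret = "whsec_test_secret";
const timestamp = 1700000000;
const rawBody = '{"id": "evt_123",  "type": "checkout.session.completed"}';
const signature = signPayload(rawBody, timestamp, secret);

// Verifying the untouched raw text succeeds.
const rawOk = verifyPayload(rawBody, timestamp, signature, secret);

// Parsing and re-serializing drops the original whitespace, so the text differs.
const reserialized = JSON.stringify(JSON.parse(rawBody));
const reserializedOk = verifyPayload(reserialized, timestamp, signature, secret);

console.log({ rawOk, reserializedOk });
```

The raw text verifies and the re-serialized copy does not, even though both parse to the same object. That is why the body has to reach the verification call untouched.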
### 2. Wrong webhook secret in production
This happens constantly after deploys. The team updates `.env.local`, but production still points at an old secret from a different endpoint.
I confirm this by comparing the endpoint secret shown in Stripe Dashboard with the deployed environment variable value. If they do not match exactly, every signed request will fail verification.
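One way to compare secrets across environments without printing them is a short fingerprint. The sketch below is an assumption of mine, not part of Stripe's tooling: `describeWebhookSecret` is a hypothetical helper that checks for the `whsec_` prefix used by webhook signing secrets, flags stray whitespace, and returns a truncated SHA-256 hash. Whether a truncated hash is acceptable in your logs is a policy decision.

```typescript
import { createHash } from "node:crypto";

type SecretSummary = {
  present: boolean;
  looksLikeWebhookSecret: boolean;
  fingerprint: string | null;
};

// Summarize a webhook secret without revealing its value.
function describeWebhookSecret(secret: string | undefined): SecretSummary {
  if (!secret) {
    return { present: false, looksLikeWebhookSecret: false, fingerprint: null };
  }
  const trimmed = secret.trim();
  return {
    present: true,
    // False when the prefix is wrong or the value carries stray whitespace.
    looksLikeWebhookSecret: trimmed.startsWith("whsec_") && trimmed === secret,
    // First 8 hex characters of the SHA-256 hash of the trimmed value.
    fingerprint: createHash("sha256").update(trimmed).digest("hex").slice(0, 8),
  };
}

// Hypothetical values; in the app this would be process.env.STRIPE_WEBHOOK_SECRET.
console.log(describeWebhookSecret("whsec_example"));
console.log(describeWebhookSecret("whsec_example\n")); // pasted with a trailing newline
console.log(describeWebhookSecret(undefined)); // variable missing in this environment
```

Running the same helper locally against the value copied from the Stripe Dashboard and in production against the deployed variable should give identical fingerprints. If they differ, the deployed secret belongs to another endpoint or was pasted incorrectly.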
### 3. Route running on Edge instead of Node
A lot of AI-built apps accidentally deploy API routes with incompatible runtime assumptions. Webhook handlers need predictable server-side crypto behavior and access to raw request bodies.
I confirm this by checking route config and platform logs. If there is any `runtime = "edge"` setting on this route, I remove it unless there is a very specific reason to keep it there.
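For an App Router route, the runtime is set with a route segment config export. A minimal fragment, assuming the file lives at `app/api/stripe/webhook/route.ts` as in the triage list:

```typescript
// Route segment config: pin this handler to the Node.js runtime.
// "nodejs" is already the default, so the explicit export mainly guards
// against a copied "edge" setting slipping into this file later.
export const runtime = "nodejs";
```

I would keep this line in the webhook route even though it restates the default, because it makes the intent visible in code review.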
### 4. Cloudflare or redirect rules intercepting requests
If you put Cloudflare in front of the app during launch without checking POST behavior, you can break webhooks without noticing. Common problems are forced redirects from HTTP to HTTPS across multiple hops or WAF rules treating Stripe as suspicious traffic.
I confirm this by checking Cloudflare security events and testing direct origin access versus proxied access. If direct origin works but proxied fails, I know where to fix it.
### 5. Success response returned too early
This one hurts because it looks healthy from Stripe's side. The handler receives the event, returns 200 immediately, then later crashes while updating Postgres or calling an internal service.
I confirm this by reading code flow carefully: verify signature first, parse event second, persist idempotency record third, perform business update fourth, respond last only after durable writes succeed.
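That ordering can be sketched as a small function with injected dependencies. Everything here is hypothetical scaffolding: `WebhookDeps`, `verify`, and `transaction` stand in for the Stripe SDK call and your database layer. I am also assuming the idempotency record and the business update share one transaction, so a failed update rolls the record back and the retry is not mistaken for a duplicate.

```typescript
type StripeEvent = { id: string; type: string };

// Operations available inside one database transaction (hypothetical shape).
type Tx = {
  insertEventId: (eventId: string) => Promise<boolean>; // false if the ID already exists
  applyUpdate: (event: StripeEvent) => Promise<void>; // the business write
};

type WebhookDeps = {
  verify: (rawBody: string, signatureHeader: string) => StripeEvent; // throws on a bad signature
  transaction: (work: (tx: Tx) => Promise<void>) => Promise<void>; // rolls back if work throws
};

// Returns the HTTP status the route should send back to Stripe.
async function handleWebhook(rawBody: string, signatureHeader: string, deps: WebhookDeps): Promise<number> {
  let event: StripeEvent;
  try {
    event = deps.verify(rawBody, signatureHeader); // steps 1 and 2: verify, then parse
  } catch {
    return 400; // unsigned or tampered payloads never reach business logic
  }

  try {
    await deps.transaction(async (tx) => {
      const isNew = await tx.insertEventId(event.id); // step 3: idempotency record
      if (!isNew) return; // duplicate delivery: nothing left to do
      await tx.applyUpdate(event); // step 4: business update
    });
  } catch {
    return 500; // non-2xx so Stripe retries instead of losing the event
  }

  return 200; // step 5: respond only after the writes are durable
}
```

The point of the shape is that there is no code path that returns 200 before the transaction has committed.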
### 6. Missing idempotency protection
Stripe retries events when delivery fails or times out. Without storing processed event IDs you can double-create records or skip updates because of race conditions between retries.
I confirm this by checking whether processed event IDs are stored with a unique constraint. If not, duplicate delivery can create inconsistent portal state fast.
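As a rough illustration of what the unique constraint buys you, the sketch below uses an in-memory `Set` as a stand-in for the database table. The table name in the comments and the `ProcessedEvents` class are hypothetical. In Postgres the equivalent claim would be an insert that does nothing on conflict.

```typescript
// In-memory stand-in for a table such as:
//   CREATE TABLE processed_stripe_events (event_id text PRIMARY KEY);
// where claiming an event would be:
//   INSERT INTO processed_stripe_events (event_id) VALUES ($1)
//   ON CONFLICT (event_id) DO NOTHING;
class ProcessedEvents {
  private readonly ids = new Set<string>();

  // Returns true only for the first delivery of a given event ID.
  claim(eventId: string): boolean {
    if (this.ids.has(eventId)) return false;
    this.ids.add(eventId);
    return true;
  }
}

const processed = new ProcessedEvents();
let recordsCreated = 0;

function onCheckoutCompleted(eventId: string): void {
  if (!processed.claim(eventId)) return; // retry or replay: skip side effects
  recordsCreated += 1;
}

onCheckoutCompleted("evt_123");
onCheckoutCompleted("evt_123"); // simulated Stripe retry of the same event
console.log(recordsCreated); // 1
```

An in-memory set does not survive restarts or multiple instances, which is why the real version needs the database constraint.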
## The Fix Plan
My fix plan is boring on purpose because boring fixes ship faster and break less.
1. Freeze changes to payment-related code until I have one working test path.
2. Add logging around every webhook step:
- received request
- verified signature
- parsed event type
- database write start
- database write success
- final response status
3. Make sure the route reads raw body correctly before verification.
4. Force this webhook route to run in Node runtime if needed.
5. Validate production secrets against Stripe Dashboard values.
6. Add idempotency storage for `event.id` with a unique constraint.
7. Move any slow side effects into a queue after durable persistence if they are not required for immediate user state changes.
8. Reduce external dependencies inside the webhook handler so one slow API does not block receipt.
9. Tighten Cloudflare rules only after confirming legitimate requests pass through cleanly.
10. Redeploy to staging first if available; otherwise use a low-risk maintenance window for production rollout.
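For the logging in step 2, a minimal sketch of one JSON log line per webhook step, keyed by Stripe event ID so the lines can be matched against the delivery history. The step names and the `formatStepLog` helper are my own illustration, not an established convention, and timestamps are left to the hosting platform's log collector.

```typescript
type WebhookStep =
  | "received"
  | "signature_verified"
  | "event_parsed"
  | "db_write_start"
  | "db_write_success"
  | "responded";

type StepFields = { eventId?: string; eventType?: string; status?: number };

// Build one JSON log line per step so logs can be searched by event ID.
function formatStepLog(step: WebhookStep, fields: StepFields): string {
  return JSON.stringify({ source: "stripe_webhook", step, ...fields });
}

// The event ID is unknown until the signature has been verified.
console.log(formatStepLog("received", {}));
console.log(formatStepLog("event_parsed", { eventId: "evt_123", eventType: "checkout.session.completed" }));
console.log(formatStepLog("responded", { eventId: "evt_123", status: 200 }));
```

Only IDs, types, and status codes go into the line, which keeps customer payload data out of the logs.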
If I were implementing this myself in Next.js App Router, I would keep it simple:
- verify signature using raw text
- write one durable record
- update portal state
- return 200 only after success
- log every failure with enough context to debug later
That sequence reduces silent failures more than any fancy architecture change.
## Regression Tests Before Redeploy
Before shipping anything back to users, I want these checks passing:
1. Signature verification test
- Valid signed payload succeeds.
- Invalid signature fails with 400.
- Acceptance criterion: no unsigned payload can reach business logic.
2. Event replay test
- Send same Stripe event twice.
- Acceptance criterion: second delivery does not duplicate records or resend notifications.
3. Database write test
- Confirm subscription/payment state updates correctly after `checkout.session.completed`.
- Acceptance criterion: portal reflects payment within 10 seconds of successful delivery.
4. Failure-path test
- Simulate DB outage or rejected write.
- Acceptance criterion: handler returns non-2xx so Stripe retries instead of losing data silently.
5. Deployment config test
- Confirm correct env vars exist in production build output and runtime logs do not expose secrets.
- Acceptance criterion: no secret values appear in logs or client bundles.
6. Cloudflare path test

```bash
curl -i https://yourdomain.com/api/stripe/webhook \
  --data '{"test":"payload"}'
```

- Use this only as a connectivity check from allowed tooling; do not treat it as proof of valid signing behavior.

7. Monitoring alert test
- Trigger one known failing delivery intentionally in staging if possible.
- Acceptance criterion: alert fires within 5 minutes for repeated failures or zero deliveries over a set period.

Beyond these checks, I keep the release bar tight:

- p95 webhook processing time under 500 ms for normal events
- zero uncaught exceptions during replay tests
- at least 90 percent branch coverage on webhook handler logic if feasible

## Prevention

I would put guardrails around four areas: code review, security, monitoring, and UX.

**Code review guardrails**

- Never merge webhook changes without an explicit raw-body check review item.
- Require idempotency handling for every Stripe event processor.
- Reject handlers that perform long-running work before persisting state.
- Review least privilege on any internal service account used by webhooks.

**Security guardrails**

- Store secrets only server-side and rotate them after any suspected leak.
- Keep CORS headers off server-to-server endpoints such as the webhook route; browsers never need to call them, and signature verification is what actually blocks forged requests.
- Keep Cloudflare WAF rules documented so legitimate webhooks are not blocked during future hardening changes.
- Log event IDs and request metadata without logging full payloads containing sensitive customer data unless necessary for debugging and allowed by policy.

**Monitoring guardrails**

- Alert on:
  - zero successful webhook deliveries for 15 minutes during active checkout traffic
  - three consecutive failed deliveries for one endpoint
  - sudden increase in retry count
- Track:
  - delivery success rate
  - p95 processing latency under 500 ms
  - duplicate event rate below 1 percent
- Keep uptime monitoring on both homepage checkout flow and webhook endpoint health checks where appropriate.
**UX guardrails**

A silent payment failure becomes a support problem fast because users think they bought access but cannot enter their portal. A fully updated state matters as much as backend correctness here.

I would show clear payment confirmation states:

- pending sync
- confirmed access granted
- action required if payment succeeded but entitlement sync failed

That reduces tickets while giving support something concrete to say instead of "we are looking into it".

## When to Use Launch Ready

Use Launch Ready when you have a working product that keeps breaking at launch boundaries: DNS confusion, email setup gaps, SSL issues, broken redirects, missing monitoring, secrets chaos, or deployments that look fine until real traffic hits them.

This sprint fits best when you need me to make the system production-safe fast instead of spending weeks guessing across tools. It covers:

- domain setup
- email authentication with SPF/DKIM/DMARC
- Cloudflare configuration
- SSL provisioning
- redirects and subdomains
- caching and DDoS protection basics
- production deployment checks
- environment variables and secrets review
- uptime monitoring setup
- handover checklist

What I need from you before I start:

1. Access to hosting platform admin.
2. Access to Cloudflare.
3. Access to Stripe dashboard.
4. Repo access plus current deployment branch.
5. A short list of what should happen after payment succeeds.

If your issue is specifically silent webhooks inside a Next.js plus Stripe client portal, I would usually pair Launch Ready with a small rescue scope so I can fix both infrastructure misconfigurations and application-level delivery bugs without dragging this into a longer rebuild.

## Delivery Map
```mermaid
flowchart TD
    A[Founder problem] --> B[cyber security audit]
    B --> C[Launch Ready sprint]
    C --> D[Production fixes]
    D --> E[Handover checklist]
    E --> F[Launch or scale]
```
---

## Take the next step

If this is a problem in your product right now, here is what to do next:

- **[Use the free Cyprian tools](/tools)** - estimate cost, score app risk, check launch readiness, or pick the right service sprint.
- **[Book a discovery call](/contact)** - I will tell you honestly whether you need a sprint or if you can DIY the next step.

*Written by Cyprian Tinashe Aarons - senior full-stack and AI engineer helping founders rescue, launch, automate, and scale AI-built products.*