How I Would Fix Webhooks Failing Silently in a Supabase and Edge Functions Client Portal Using Launch Ready
The symptom is usually ugly in the same way every time: the portal says "saved" or "sent", but nothing arrives downstream, no retry happens, and support only hears about it when a customer complains. In a Supabase plus Edge Functions setup, the most likely root cause is not "webhooks are broken" but "the event was never reliably recorded, dispatched, or observed".
The first thing I would inspect is the full request path: the client portal action, the database write in Supabase, the Edge Function invocation, and any delivery logs from the webhook target. If there is no durable event record before the network call, that is where silent failure starts.
## Triage in the First Hour
1. Check Supabase logs first.
- Open the project logs for API requests and Edge Functions.
- Look for 4xx and 5xx responses around the time users reported missing webhook activity.
- Confirm whether the function was invoked at all.
2. Inspect the Edge Function deployment status.
- Verify the latest deploy actually reached production.
- Confirm environment variables are present in prod, not just local.
- Check whether a recent rollback or failed build changed behavior.
3. Review function execution traces.
- Look for early exits, uncaught exceptions, timeout warnings, and JSON parsing failures.
- Confirm whether logs are being written before and after the outbound request.
4. Check the database event table.
- If you store outbound events, confirm rows were created with status values like pending, sent, failed, or retrying.
- If there is no durable event row, you have an observability gap and probably an architecture gap too.
5. Inspect secrets and headers.
- Verify webhook signing secrets, API keys, and provider tokens are available in production.
- Check for expired keys or wrong secret names after a redeploy.
6. Test one known-good payload manually.
- Reproduce from a staging or admin screen with a simple payload.
- Compare expected headers, body shape, and response codes.
7. Check Cloudflare and DNS if delivery depends on public endpoints.
- Confirm the SSL certificate is valid and the endpoint is not serving a stale one.
- Make sure WAF rules or bot protection are not blocking legitimate callback traffic.
8. Review recent code changes.
- Focus on any refactor around async handling, error swallowing, retries, or response parsing.
- A single missing await can make a webhook look "successful" while it never completes.
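To make the last point concrete, here is a minimal sketch of how a missing `await` can hide a failed delivery. The `send` function is a hypothetical stand-in for the real outbound call and always fails; it is not part of any Supabase API.

```ts
// Hypothetical stand-in for an outbound webhook call that always fails.
async function send(): Promise<void> {
  throw new Error('receiver returned 500');
}

// Buggy: the promise is not awaited, so the handler reports success
// before delivery finishes and the failure never reaches the caller.
async function handleWithoutAwait(): Promise<string> {
  send().catch(() => {
    /* error swallowed: nothing logged, nothing persisted */
  });
  return 'sent';
}

// Fixed: the handler waits for delivery and surfaces the failure.
async function handleWithAwait(): Promise<string> {
  try {
    await send();
    return 'sent';
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    console.error('webhook_failed', { error: message });
    return 'failed';
  }
}
```

Both handlers call the same failing `send`, yet only the second one reports `failed`. In a serverless runtime the un-awaited work may also be cut off once the response has been returned, which makes the first variant even harder to spot.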
## Root Causes
| Likely cause | How to confirm | Why it fails silently |
| --- | --- | --- |
| Missing await or swallowed promise rejection | Inspect function code for fire-and-forget calls without logging or error handling | The request starts but errors are never surfaced |
| No durable event record before dispatch | Check if an outbound event row exists before calling the webhook | If the function crashes mid-flight, there is nothing to retry |
| Bad env vars in production | Compare local `.env` values with Supabase dashboard secrets | The code runs with empty URLs or invalid tokens |
| Timeout or cold start issues | Review execution duration and provider timeout settings | The function stops before delivery completes |
| Webhook target returns non-2xx responses | Inspect response status logging from outbound calls | The app treats failed delivery as success |
| Auth or signature mismatch | Compare signing logic against provider docs and verify timestamp/secret format | The receiver rejects valid-looking traffic |
A common pattern in client portals is that developers log "webhook sent" as soon as `fetch()` starts, not after they confirm a 2xx response. That creates false confidence and pushes the support load to later, when a customer notices the gap.
## The Fix Plan
I would fix this in one controlled sprint instead of patching random files. The goal is simple: make delivery observable, retryable, and safe to ship without breaking current customer flows.
1. Add an outbound events table if one does not exist.
- Store the event id, user id, payload hash, target URL name, status, attempt count, last error, and timestamps.
- This gives you auditability and retry control.
2. Change dispatch to be transactional in behavior.
- Write the event row first with status `pending`.
- Only then invoke the Edge Function delivery step.
- Update status to `sent` only after a confirmed 2xx response.
3. Add explicit error handling around every network call.
- Log response status codes and truncated error bodies.
- Never swallow exceptions inside `try/catch` without rethrowing or persisting failure state.
4. Add retries with backoff for transient failures only.
- Retry 3 times max for 429s and 5xx responses.
- Do not retry obvious permanent failures such as bad signatures or 4xx validation errors (429 excepted).
5. Make secrets explicit in production config.
- Verify all required variables exist at startup.
- Fail fast if any secret is missing instead of continuing with partial config.
6. Separate internal app errors from external delivery errors.
- A user should see "queued" or "processing" if delivery is async.
- Do not claim completion until you have confirmation from the webhook target.
7. Add idempotency protection.
- Use an event key so duplicate retries do not create duplicate downstream actions.
- This matters in client portals where users may click twice or refresh during slow responses.
8. Tighten API security controls while you are here.
- Validate payload shape before dispatching anything externally.
- Sign outbound webhooks if your receiver expects verification.
- Restrict CORS so only your portal origin can trigger internal actions.
- Keep least privilege on Supabase service roles and edge secrets.
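Steps 1, 2, and 7 above can be sketched together. This is a minimal illustration under stated assumptions, not a definitive implementation: the `Map` stands in for an outbound events table, and `OutboundEvent`, `Deliver`, and `dispatch` are hypothetical names. In a real project the store would more likely be a Postgres table with a unique constraint on the event key.

```ts
type EventStatus = 'pending' | 'sent' | 'failed';

interface OutboundEvent {
  id: string; // doubles as the idempotency key
  payload: unknown;
  status: EventStatus;
  attemptCount: number;
  lastError: string | null;
}

// In-memory stand-in for the outbound events table (assumption for this sketch).
const events = new Map<string, OutboundEvent>();

// Hypothetical delivery function: resolves with the receiver's HTTP status code.
type Deliver = (event: OutboundEvent) => Promise<number>;

async function dispatch(id: string, payload: unknown, deliver: Deliver): Promise<OutboundEvent> {
  // Idempotency: a repeated submission with the same id reuses the existing
  // row unless the previous attempt failed and is eligible for another try.
  const existing = events.get(id);
  if (existing && existing.status !== 'failed') return existing;

  // Record the event as pending before any network call is made.
  const event: OutboundEvent =
    existing ?? { id, payload, status: 'pending', attemptCount: 0, lastError: null };
  event.status = 'pending';
  events.set(id, event);

  // Attempt delivery and persist the outcome either way.
  event.attemptCount += 1;
  try {
    const status = await deliver(event);
    if (status >= 200 && status < 300) {
      event.status = 'sent';
      event.lastError = null;
    } else {
      event.status = 'failed';
      event.lastError = `receiver returned ${status}`;
    }
  } catch (err) {
    event.status = 'failed';
    event.lastError = err instanceof Error ? err.message : String(err);
  }
  return event;
}
```

The design choice worth noting is the order: the row exists before the outbound call, so a crash mid-flight leaves a `pending` record that can be found and retried instead of nothing at all.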
Here is a small diagnostic pattern I would use to stop silent failure:
```ts
// Assumes webhookUrl, eventId, signature, and payload are already in scope.
const res = await fetch(webhookUrl, {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    'x-webhook-event-id': eventId,
    'x-webhook-signature': signature,
  },
  body: JSON.stringify(payload),
});

// Read the body once so it can be logged if delivery failed.
const text = await res.text();

if (!res.ok) {
  console.error('webhook_failed', { status: res.status, body: text.slice(0, 500) });
  throw new Error(`Webhook failed with ${res.status}`);
}

// Only log success after a confirmed 2xx response.
console.log('webhook_sent', { status: res.status });
```

That one change alone removes a lot of false positives because it forces you to treat non-2xx as failure instead of pretending delivery succeeded.
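Once non-2xx responses count as failures, the retry rule from the fix plan (429 and 5xx only, at most 3 retries) can be layered on top. The following is a rough sketch of that rule. `attemptDelivery` is a hypothetical function that performs one POST and resolves with the HTTP status code, and the 500 ms base delay is an assumption, not a recommendation from any provider.

```ts
// Transient failures worth retrying, per the fix plan: 429 and 5xx.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ...
function backoffDelayMs(attempt: number, baseMs: number): number {
  return baseMs * 2 ** (attempt - 1);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

interface DeliveryResult {
  ok: boolean;
  status: number;
  attempts: number;
}

// Makes one initial attempt plus up to `maxRetries` retries.
// Thrown network errors are left to propagate to the caller in this sketch.
async function deliverWithRetry(
  attemptDelivery: () => Promise<number>,
  maxRetries = 3,
  baseMs = 500,
): Promise<DeliveryResult> {
  let attempts = 0;
  while (true) {
    attempts += 1;
    const status = await attemptDelivery();
    if (status >= 200 && status < 300) return { ok: true, status, attempts };
    // Stop on permanent failures or once the retry budget is spent.
    if (!isRetryable(status) || attempts > maxRetries) {
      return { ok: false, status, attempts };
    }
    await sleep(backoffDelayMs(attempts, baseMs));
  }
}
```

A 400 response returns after a single attempt, while a receiver that keeps answering 500 is tried four times in total (one attempt plus three retries) before the result is reported as failed.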
## Regression Tests Before Redeploy
I would not ship this fix until I had proof that failures are visible and recoverable. For this kind of portal issue, QA should be focused on business impact: missed notifications, duplicate deliveries, broken onboarding steps, and support escalation risk.
Acceptance criteria:
1. Successful webhook path
- Given a valid portal action, when the edge function runs, then an outbound event row is created, then delivery returns 2xx, then status becomes `sent`.
2. Failed webhook path
- Given a forced 500 from the receiver, when dispatch runs, then status becomes `failed`, then last error is stored, then retry count increments correctly.
3. Missing secret path
- Given one required env var is absent, when deployment starts, then the function fails fast with a clear log message, then no partial send occurs.
4. Duplicate request path
- Given two identical submissions within 10 seconds, when idempotency keys match, then only one downstream action happens.
5. Timeout path
- Given a slow receiver response over your timeout threshold, when dispatch exceeds limit, then it records timeout failure instead of hanging forever.
6. Manual smoke test
- Trigger one real event from staging into a safe test endpoint.
- Confirm logs show the request id, the response code, and a p95 duration under 2 seconds for normal cases.
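For the timeout path, one way to guarantee that dispatch cannot hang forever is to race the delivery against a timer. The sketch below is illustrative only: `withTimeout` and `DeliveryTimeoutError` are hypothetical names, and the threshold is whatever your own timeout policy says.

```ts
// Hypothetical error type so timeouts can be recorded distinctly from other failures.
class DeliveryTimeoutError extends Error {
  constructor(ms: number) {
    super(`delivery timed out after ${ms} ms`);
    this.name = 'DeliveryTimeoutError';
  }
}

// Runs `task` with an AbortSignal and rejects if it takes longer than `timeoutMs`.
async function withTimeout<T>(
  task: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Reject first so the caller sees the timeout error, then abort so a
      // fetch-style task can cancel its in-flight request.
      reject(new DeliveryTimeoutError(timeoutMs));
      controller.abort();
    }, timeoutMs);
  });
  try {
    return await Promise.race([task(controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

A call could look like `withTimeout((signal) => fetch(webhookUrl, { method: 'POST', body, signal }), 5000)`, where the 5 second limit is an arbitrary example. Catching `DeliveryTimeoutError` then gives the caller a clear point at which to store a timeout failure on the event row.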
I would also run one exploratory pass on mobile because founders often miss that admin actions get triggered from phones during support work. If buttons are cramped or feedback is unclear on small screens, people double-submit and create duplicate events.
## Prevention
The best prevention here is boring engineering discipline tied to business outcomes.
- Add monitoring on event failure rate.
  - Alert if failures exceed 2 percent over 15 minutes.
  - Alert if no successful deliveries occur within a normal business window.
- Track p95 execution time for Edge Functions.
  - Keep p95 under 800 ms for internal processing where possible.
  - If it drifts above 2 seconds during peak usage, investigate cold starts or upstream slowness.
- Log correlation IDs end-to-end.
  - One id should follow portal action -> database row -> edge function -> external receiver.
  - This cuts debugging time from hours to minutes.
- Review changes through an API security lens.
  - Check auth boundaries on who can trigger webhooks from the portal.
  - Validate inputs before they touch downstream systems.
  - Limit secret access to only what each function needs.
- Add basic rate limiting where appropriate.
  - Prevent repeated clicks or automation abuse from flooding your queue and burning support time.
- Keep UX honest about state.
  - Show "queued", "sending", "sent", or "failed".
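The first alert rule above (failures over 2 percent in 15 minutes) comes down to a small calculation. Here is a hedged sketch of it. `DeliveryRecord`, `failureRate`, and `shouldAlert` are hypothetical names, and in practice the records would come from your event table or log pipeline rather than an in-memory array.

```ts
interface DeliveryRecord {
  at: number; // epoch milliseconds when the attempt finished
  ok: boolean; // true only for a confirmed 2xx response
}

const FIFTEEN_MINUTES_MS = 15 * 60 * 1000;

// Returns the failure rate (0 to 1) for attempts inside the window ending at `now`.
function failureRate(records: DeliveryRecord[], now: number, windowMs: number): number {
  const recent = records.filter((r) => r.at > now - windowMs && r.at <= now);
  if (recent.length === 0) return 0;
  const failed = recent.filter((r) => !r.ok).length;
  return failed / recent.length;
}

// Alerts only when the rate is strictly above the threshold (2 percent by default).
function shouldAlert(records: DeliveryRecord[], now: number, threshold = 0.02): boolean {
  return failureRate(records, now, FIFTEEN_MINUTES_MS) > threshold;
}
```

With 100 attempts in the window, 3 failures give a rate of 0.03 and trigger the alert, while 2 failures sit exactly on the threshold and do not. An empty window returns 0 here, which is why the separate "no successful deliveries" alert is still needed.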
```mermaid
flowchart TD
  A[Portal] --> B[DB Row]
  B --> C[Edge Fn]
  C --> D[Receiver]
  C --> E[Logs]
  D --> F[Status Update]
  E --> F
```
If something fails silently again after this work, it means either logging regressed or someone bypassed the durable event flow entirely.

## When to Use Launch Ready

I would use Launch Ready when you need this fixed fast without turning your whole product into a side quest. It fits if your portal already works enough to sell but webhook reliability is blocking onboarding completion, internal ops alerts, billing updates, CRM syncs, or customer notifications.

It includes domain setup if needed by your release flow, email authentication with SPF/DKIM/DMARC, Cloudflare, SSL, redirects, subdomains, deployment, secrets, caching, DDoS protection, uptime monitoring, production handover, and a checklist so you are not guessing after launch.

What I would ask you to prepare:

- Supabase project access with admin rights
- Edge Function repo access
- Production and staging webhook URLs
- Any signing secrets or receiver docs
- A list of exact flows that must never fail
- One example payload that represents real customer data

If your current state includes missed deliveries, unclear logs, broken retries, or staff manually re-sending events, this sprint pays for itself quickly by reducing support load and lost trust.

## Delivery Map
```mermaid
flowchart TD
  A[Founder problem] --> B[API security audit]
  B --> C[Launch Ready sprint]
  C --> D[Production fixes]
  D --> E[Handover checklist]
  E --> F[Launch or scale]
```
## References

- https://roadmap.sh/api-security-best-practices
- https://roadmap.sh/code-review-best-practices
- https://roadmap.sh/qa
- https://supabase.com/docs/guides/functions
- https://supabase.com/docs/guides/database/webhooks

---

## Take the next step

If this is a problem in your product right now, here is what to do next:

- **[Use the free Cyprian tools](/tools)** - estimate cost, score app risk, check launch readiness, or pick the right service sprint.
- **[Book a discovery call](/contact)** - I will tell you honestly whether you need a sprint or if you can DIY the next step.

*Written by Cyprian Tinashe Aarons - senior full-stack and AI engineer helping founders rescue, launch, automate, and scale AI-built products.*