Monitoring & alerts

Honest list of what's wired today. No aspirational dashboards.

| Concern | Where |
|---|---|
| Is prod up? | curl -sI https://eu.tappass.ai/api/health/live + GCP Console → Cloud Run → tappass-prod → Metrics |
| Latency + error rate per endpoint | GCP Console → Cloud Run → Metrics (request count, latencies, instance count) |
| Recent errors / stack traces | Sentry: tappass-backend (prod) and staging-tappass-backend |
| Frontend JS errors | Sentry: tappass-frontend (prod) and staging-tappass-frontend |
| Product usage / events | PostHog EU → eu.posthog.com, filter by env super-property |
| Raw logs, flexible queries | Cloud Logging (gcloud logging read … or the console) |
| Edge / WAF traffic | Cloudflare Dashboard → Analytics |
| DB CPU, connections, disk | Cloud SQL → tappass-db → Monitoring |
| Uptime probe history | Cloud Monitoring → Uptime checks → tappass health / tappass signup |

All alerts run against prod; staging is quiet.

| Source | Trigger | Channel |
|---|---|---|
| Cloud Monitoring uptime check | /api/health/live fails >30s from any of 3 regions | Email alert (notification channel) |
| Sentry backend | Unhandled exception rate, new issue type | Sentry email + Slack app if connected |
| Sentry frontend | Same, on browser SDK | Same |
| Cloudflare | WAF blocked patterns (dashboard only, no push) | You check when a customer complains |

That's it. No Grafana, no Alertmanager, no PagerDuty, no SLO burn alerts. If a page-worthy signal doesn't fire from one of the above, we don't know about it yet, which is why uptime check triage exists.

What would make us add each one:

  • Custom dashboards: when GCP Console stops being enough to diagnose an incident (i.e. we need p99 per-endpoint splits or per-org breakdowns that Cloud Logging can't materialise quickly).
  • SLO burn alerts: when we have a contractual SLA that pays credits on breach (see Support SLAs).
  • PagerDuty: when we have on-call people to page. Today, the team reads #incidents Slack during working hours and Cloud Monitoring emails out-of-hours.

Recent 5xx responses on prod (last 10 minutes):
gcloud logging read \
'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND httpRequest.status>=500' \
--project=tappass-prod --limit=30 --freshness=10m \
--format='value(timestamp,httpRequest.status,httpRequest.requestUrl,httpRequest.latency)'
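
When the query above returns more than a handful of rows, it can help to group them by status and path. The following is a minimal sketch, not part of the tappass codebase: it assumes the tab-separated output that `--format='value(...)'` produces, with columns in the order timestamp, status, URL, latency. The function name and the sample rows are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_5xx_by_path(lines):
    """Count rows per (status, path) from tab-separated query output."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        _timestamp, status, url = fields[:3]
        # Drop scheme, host and query string so identical endpoints group together.
        counts[(status, urlsplit(url).path)] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows in the shape of the query output, for illustration only.
    sample = [
        "2024-01-01T00:00:00Z\t500\thttps://eu.tappass.ai/api/example?x=1\t0.120s\n",
        "2024-01-01T00:00:05Z\t500\thttps://eu.tappass.ai/api/example\t0.090s\n",
    ]
    for (status, path), n in count_5xx_by_path(sample).most_common():
        print(f"{n}\t{status}\t{path}")
```

To run it on real output, pass `sys.stdin` instead of the sample list and pipe the gcloud command into the script.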

Telemetry init confirmation (after a deploy)

gcloud logging read \
'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND jsonPayload.event=~"sentry_initialized|posthog_initialized"' \
--project=tappass-prod --limit=4 --freshness=5m \
--format='value(jsonPayload.event,jsonPayload.env,jsonPayload.release)'
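
A post-deploy check could be automated along these lines. This is a sketch under assumptions: it expects the tab-separated rows the query above prints (event, env, release) and treats both `sentry_initialized` and `posthog_initialized` as required. The helper name and the sample rows are hypothetical.

```python
EXPECTED_EVENTS = ("sentry_initialized", "posthog_initialized")


def missing_init_events(lines, expected=EXPECTED_EVENTS):
    """Return the expected event names absent from column 1, sorted."""
    seen = {line.split("\t", 1)[0].strip() for line in lines if line.strip()}
    return sorted(set(expected) - seen)


if __name__ == "__main__":
    # Made-up rows in the shape of the query output: event, env, release.
    sample = [
        "sentry_initialized\tprod\texample-release\n",
        "posthog_initialized\tprod\texample-release\n",
    ]
    missing = missing_init_events(sample)
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result means both SDKs logged their init event within the freshness window. Anything else names the event that did not show up.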

Memory-limit log lines on prod (last 30 minutes):
gcloud logging read \
'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND textPayload=~"Memory limit"' \
--project=tappass-prod --limit=10 --freshness=30m \
--format='value(timestamp,resource.labels.revision_name,textPayload)'

See OOM / crashloop for what to do when those fire.

Running the core server locally (python -m tappass) emits no Sentry events and no PostHog events, even if a .env file has real keys. To opt in explicitly, set TAPPASS_ENABLE_DEV_TELEMETRY=1. See tappass/observability/telemetry.py for the gate.
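
The shape of such a gate might look like the sketch below. It is not the contents of tappass/observability/telemetry.py: the `env_name` parameter, the set of deployed environment names and the function name are assumptions for illustration. Only the `TAPPASS_ENABLE_DEV_TELEMETRY=1` opt-in comes from the text above.

```python
import os

OPT_IN_VAR = "TAPPASS_ENABLE_DEV_TELEMETRY"  # documented opt-in flag
DEPLOYED_ENVS = ("prod", "staging")  # assumed names, for illustration


def telemetry_enabled(env_name, environ=None):
    """Decide whether Sentry/PostHog should be initialised.

    env_name is a hypothetical deployment label such as "prod" or "dev".
    """
    environ = os.environ if environ is None else environ
    if env_name in DEPLOYED_ENVS:
        return True
    # Local runs: real keys in .env are not enough; require the explicit flag.
    return environ.get(OPT_IN_VAR) == "1"


if __name__ == "__main__":
    print(telemetry_enabled("dev", {"SENTRY_DSN": "present-but-ignored"}))
    print(telemetry_enabled("dev", {OPT_IN_VAR: "1"}))
```

The point of the design is that the presence of keys never enables telemetry on its own. A local run has to ask for it by name.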