Monitoring & alerts
Honest list of what's wired today. No aspirational dashboards.
Where to look first
| Concern | Where |
|---|---|
| Is prod up? | `curl -sI https://eu.tappass.ai/api/health/live` + GCP Console → Cloud Run → tappass-prod → Metrics |
| Latency + error rate per endpoint | GCP Console → Cloud Run → Metrics (request count, latencies, instance count) |
| Recent errors / stack traces | Sentry — tappass-backend (prod) and staging-tappass-backend |
| Frontend JS errors | Sentry — tappass-frontend (prod) and staging-tappass-frontend |
| Product usage / events | PostHog EU → eu.posthog.com — filter by env super-property |
| Raw logs, flexible queries | Cloud Logging (`gcloud logging read …` or the console) |
| Edge / WAF traffic | Cloudflare Dashboard → Analytics |
| DB CPU, connections, disk | Cloud SQL → tappass-db → Monitoring |
| Uptime probe history | Cloud Monitoring → Uptime checks → tappass health / tappass signup |
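
The first row can be scripted. A minimal sketch, assuming the live endpoint answers HTTP 200 when healthy (this page does not state the expected status); `check_live` is a hypothetical helper name, not an existing script.

```shell
# Hypothetical helper: exits 0 if the live endpoint answers HTTP 200, 1 otherwise.
check_live() {
  url="${1:-https://eu.tappass.ai/api/health/live}"
  # -s silent, -I HEAD request, -o discard headers, -w print only the status code,
  # --max-time bounds the wait. On a connection failure curl exits non-zero,
  # so fall back to "000".
  status="$(curl -sI -o /dev/null -w '%{http_code}' --max-time 10 "$url")" || status="000"
  if [ "$status" = "200" ]; then
    echo "up ($status)"
  else
    echo "down ($status)"
    return 1
  fi
}
```

Call `check_live` with no arguments to probe prod, or pass another URL. The non-zero exit status on failure makes it usable in a shell `if`.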
Alerts we actually have
All on prod; staging is quiet.
| Source | Trigger | Channel |
|---|---|---|
| Cloud Monitoring uptime check | /api/health/live fails >30s from any of 3 regions | Email alert (notification channel) |
| Sentry backend | Unhandled exception rate, new issue type | Sentry email + Slack app if connected |
| Sentry frontend | Same, on browser SDK | Same |
| Cloudflare | WAF blocked patterns (dashboard only, no push) | You check when a customer complains |
That's it. No Grafana, no Alertmanager, no PagerDuty, no SLO burn alerts. If a page-worthy signal doesn't fire from one of the above, we don't know about it yet, which is why uptime check triage exists.
What would trigger adding something
- Custom dashboards: when GCP Console stops being enough to diagnose an incident (i.e. we need p99 per-endpoint splits or per-org breakdowns that Cloud Logging can't materialise quickly).
- SLO burn alerts: when we have a contractual SLA that pays credits on breach (see Support SLAs).
- PagerDuty: when we have on-call people to page. Today, the team reads `#incidents` in Slack during working hours and Cloud Monitoring emails out of hours.
Key log queries
Recent prod 5xx

    gcloud logging read \
      'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND httpRequest.status>=500' \
      --project=tappass-prod --limit=30 --freshness=10m \
      --format='value(timestamp,httpRequest.status,httpRequest.requestUrl,httpRequest.latency)'

Telemetry init confirmation (after a deploy)

    gcloud logging read \
      'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND jsonPayload.event=~"sentry_initialized|posthog_initialized"' \
      --project=tappass-prod --limit=4 --freshness=5m \
      --format='value(jsonPayload.event,jsonPayload.env,jsonPayload.release)'

OOM signals

    gcloud logging read \
      'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND textPayload=~"Memory limit"' \
      --project=tappass-prod --limit=10 --freshness=30m \
      --format='value(timestamp,resource.labels.revision_name,textPayload)'

See OOM / crashloop for what to do when those fire.
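
When the 5xx query returns many rows, a per-status count is often easier to scan than the raw list. A minimal sketch that reuses the same filter, project, and flags as the 5xx query above; the function name `count_5xx_by_status` and the `--limit=1000` cap are illustrative assumptions, not part of the existing runbook.

```shell
# Hypothetical helper: count recent prod 5xx responses per status code.
# Optional first argument is the freshness window (default 10m).
count_5xx_by_status() {
  gcloud logging read \
    'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND httpRequest.status>=500' \
    --project=tappass-prod --limit=1000 --freshness="${1:-10m}" \
    --format='value(httpRequest.status)' \
  | sort | uniq -c | sort -rn   # one line per status, most frequent first
}
```

For example, `count_5xx_by_status 30m` widens the window to the last 30 minutes.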
Dev telemetry is off by default
Running the core server locally (`python -m tappass`) emits no
Sentry events and no PostHog events, even if a `.env` file has
real keys. Explicit opt-in: `TAPPASS_ENABLE_DEV_TELEMETRY=1`. See
`tappass/observability/telemetry.py` for the gate.
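
Before starting a local server, it can help to confirm what the current shell will pass through. A small sketch: `dev_telemetry_status` is a hypothetical helper, and treating only the literal value `1` as opt-in is an assumption. The authoritative check is the gate in `tappass/observability/telemetry.py`.

```shell
# Hypothetical helper: report whether the dev-telemetry opt-in is set in this shell.
# Assumption: only the exact value "1" counts as opt-in.
dev_telemetry_status() {
  if [ "${TAPPASS_ENABLE_DEV_TELEMETRY:-}" = "1" ]; then
    echo "dev telemetry: on"
  else
    echo "dev telemetry: off (default)"
  fi
}
```

To opt in for a single run without exporting the variable, prefix the command: `TAPPASS_ENABLE_DEV_TELEMETRY=1 python -m tappass`.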
Also see
- Incident response: the 10-minute triage.
- Cold-start / uptime alert: the uptime-check false-positive pattern.
- Reference → Vendors: who owns the account for each tool above.