Monitoring & alerts

Honest list of what's wired today. No aspirational dashboards.

Where to look first

Concern	Where
Is prod up?	`curl -sI https://eu.tappass.ai/api/health/live` + GCP Console → Cloud Run → `tappass-prod` → Metrics
Latency + error rate per endpoint	GCP Console → Cloud Run → Metrics (request count, latencies, instance count)
Recent errors / stack traces	Sentry — `tappass-backend` (prod) and `staging-tappass-backend`
Frontend JS errors	Sentry — `tappass-frontend` (prod) and `staging-tappass-frontend`
Product usage / events	PostHog EU → `eu.posthog.com` — filter by `env` super-property
Raw logs, flexible queries	Cloud Logging (`gcloud logging read …` or the console)
Edge / WAF traffic	Cloudflare Dashboard → Analytics
DB CPU, connections, disk	Cloud SQL → `tappass-db` → Monitoring
Uptime probe history	Cloud Monitoring → Uptime checks → `tappass health` / `tappass signup`
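
The "Is prod up?" check in the first row can be wrapped in a small helper so the answer is a single word rather than a header dump. This is a minimal sketch, not something that exists in the repo: the function names `classify_status` and `probe_live` are hypothetical, and it assumes any 2xx response from the liveness endpoint means healthy.

```shell
# Map an HTTP status code to a coarse verdict.
# Assumption: any 2xx from the liveness endpoint means "up".
classify_status() {
  case "$1" in
    2??)    echo "up" ;;
    000|"") echo "unreachable" ;;  # curl reports 000 when it got no HTTP response
    *)      echo "unhealthy" ;;
  esac
}

# Send a HEAD request to the liveness endpoint and print the verdict.
# The default URL is the one from the table above; pass another as $1.
probe_live() {
  local url="${1:-https://eu.tappass.ai/api/health/live}"
  local code
  code="$(curl -s -o /dev/null -I -m 10 -w '%{http_code}' "$url")" || true
  classify_status "$code"
}

# Usage:
#   probe_live
```

An "unreachable" verdict points at DNS, Cloudflare, or the network path, while "unhealthy" means something answered with a non-2xx status, so the two lead to different rows of the table.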

Alerts we actually have

All on prod; staging is quiet.

Source	Trigger	Channel
Cloud Monitoring uptime check	`/api/health/live` fails >30s from any of 3 regions	Email alert (notification channel)
Sentry backend	Unhandled exception rate, new issue type	Sentry email + Slack app if connected
Sentry frontend	Same, on browser SDK	Same
Cloudflare	WAF blocked patterns (dashboard only, no push)	You check when a customer complains

That's it. No Grafana, no Alertmanager, no PagerDuty, no SLO burn alerts. If a page-worthy signal doesn't fire from one of the above, we don't know about it yet, which is why uptime check triage exists.

What would trigger adding something

Custom dashboards: when GCP Console stops being enough to diagnose an incident (i.e. we need p99 per-endpoint splits or per-org breakdowns that Cloud Logging can't materialise quickly).
SLO burn alerts: when we have a contractual SLA that pays credits on breach (see Support SLAs).
PagerDuty: when we have on-call people to page. Today the team watches the #incidents Slack channel during working hours and relies on Cloud Monitoring emails out of hours.

Key log queries

Recent prod 5xx

gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND httpRequest.status>=500' \
  --project=tappass-prod --limit=30 --freshness=10m \
  --format='value(timestamp,httpRequest.status,httpRequest.requestUrl,httpRequest.latency)'
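
When the query above returns more than a handful of lines, grouping them by status and URL shows whether one endpoint is responsible. The sketch below assumes the `value(...)` output is tab-separated in the field order of the `--format` flag above; `count_5xx_by_url` is a hypothetical helper name, not an existing script.

```shell
# Reads tab-separated lines (timestamp, status, requestUrl, latency) on stdin
# and prints "count status url", most frequent first.
count_5xx_by_url() {
  awk -F '\t' '{ n[$2 " " $3]++ } END { for (k in n) print n[k], k }' \
    | sort -k1,1nr -k2
}

# Usage (same query as above, piped into the helper):
#   gcloud logging read \
#     'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND httpRequest.status>=500' \
#     --project=tappass-prod --limit=30 --freshness=10m \
#     --format='value(timestamp,httpRequest.status,httpRequest.requestUrl,httpRequest.latency)' \
#     | count_5xx_by_url
```

URLs that differ only in their query string are counted separately, since the grouping key is the full `requestUrl`.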

Telemetry init confirmation (after a deploy)

gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND jsonPayload.event=~"sentry_initialized|posthog_initialized"' \
  --project=tappass-prod --limit=4 --freshness=5m \
  --format='value(jsonPayload.event,jsonPayload.env,jsonPayload.release)'
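
For a post-deploy check it can help to turn that output into a pass/fail answer. A minimal sketch, assuming the event name is the first tab-separated column (as in the `--format` flag above) and that a healthy deploy logs each event at least once within the freshness window; `check_telemetry_init` is a hypothetical name.

```shell
# Reads the query output on stdin.
# Prints "ok" and returns 0 if both init events appear;
# otherwise prints the missing event names and returns 1.
check_telemetry_init() {
  local seen ev missing=""
  seen="$(cut -f1)"   # first column: jsonPayload.event
  for ev in sentry_initialized posthog_initialized; do
    printf '%s\n' "$seen" | grep -qx "$ev" || missing="$missing $ev"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "ok"
}

# Usage: pipe the gcloud command above into the helper:
#   gcloud logging read ... | check_telemetry_init
```

A "missing" result right after a deploy may only mean the new revision has not served traffic yet, so it is worth re-running once before treating it as a failure.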

OOM signals

gcloud logging read \
  'resource.type=cloud_run_revision AND resource.labels.service_name=tappass AND textPayload=~"Memory limit"' \
  --project=tappass-prod --limit=10 --freshness=30m \
  --format='value(timestamp,resource.labels.revision_name,textPayload)'

See OOM / crashloop for what to do when those fire.

Dev telemetry is off by default

Running the core server locally (`python -m tappass`) emits no Sentry events and no PostHog events, even if a `.env` file has real keys. To opt in explicitly, set `TAPPASS_ENABLE_DEV_TELEMETRY=1`. See `tappass/observability/telemetry.py` for the gate.
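
To make the opt-in visible in a local shell session, something like the following could be used. It is only a sketch: `dev_telemetry_mode` is a hypothetical helper, and it assumes the gate treats the literal value `1` as the opt-in, which is the only value this page confirms. The actual check lives in the Python module named above and may accept other values.

```shell
# Hypothetical helper: reports what a local run would do, assuming only the
# literal value "1" opts in (other truthy spellings are not confirmed here).
dev_telemetry_mode() {
  if [ "${TAPPASS_ENABLE_DEV_TELEMETRY:-}" = "1" ]; then
    echo "on"
  else
    echo "off"
  fi
}

# Opt in for a single local run without exporting the variable
# to the rest of the session:
#   TAPPASS_ENABLE_DEV_TELEMETRY=1 python -m tappass
```

Setting the variable inline rather than with `export` keeps the default-off behaviour for every later run in the same terminal.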

Also see

Incident response — the 10-minute triage.
Cold-start / uptime alert — the uptime-check false-positive pattern.
Reference → Vendors — who owns the account for each tool above.