Monitoring & alerts
Dashboards
| What | Where |
|---|---|
| Cloud Run health (prod) | GCP Console → Cloud Run → tappass → Metrics |
| Request latency, error rate | Grafana → “TapPass — API Health” |
| Pipeline step latency | Grafana → “Pipeline Steps” |
| Audit trail throughput | Grafana → “Audit Ingest” |
| Postgres CPU / IOPS | Cloud SQL → Instance → Monitoring |
| Cloudflare edge traffic | Cloudflare → Analytics → docs.tappass.ai (or app.tappass.ai) |
Alerts
Alerts are routed via Alertmanager → PagerDuty → Slack #incidents.
| Alert | Threshold | Severity |
|---|---|---|
| api_error_rate | > 1% over 5 min | page |
| p95_latency | > 500ms over 10 min | warn |
| pipeline_block_rate_spike | > 3× 1h baseline | warn |
| audit_chain_break | any event with invalid hash | page |
| cloud_run_instance_flap | 3+ restarts in 10 min | page |
| postgres_cpu | > 85% for 15 min | warn |
| disk_free | < 15% any node | warn |
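
Each row of the alert table can be read as a predicate over a windowed metric plus a severity. The following is a minimal Python sketch of that mapping, not the actual Alertmanager rule set: the metric keys (`error_rate_5m`, `restarts_10m`, and so on) are hypothetical, and the time windows are assumed to be applied upstream when the values are computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str                          # alert name from the table
    severity: str                      # "page" or "warn"
    fires: Callable[[Metrics], bool]   # threshold predicate


# Metric keys are illustrative assumptions, not real series names.
RULES: List[Rule] = [
    Rule("api_error_rate", "page", lambda m: m["error_rate_5m"] > 0.01),
    Rule("p95_latency", "warn", lambda m: m["p95_latency_ms_10m"] > 500),
    Rule("pipeline_block_rate_spike", "warn",
         lambda m: m["block_rate_now"] > 3 * m["block_rate_1h_baseline"]),
    Rule("audit_chain_break", "page", lambda m: m["invalid_hash_events"] > 0),
    Rule("cloud_run_instance_flap", "page", lambda m: m["restarts_10m"] >= 3),
    # "for 15 min" is modelled as: the minimum CPU over the window exceeds 85%.
    Rule("postgres_cpu", "warn", lambda m: m["postgres_cpu_min_15m"] > 0.85),
    # "any node" is modelled as: the node with the least free disk is below 15%.
    Rule("disk_free", "warn", lambda m: m["min_disk_free_fraction"] < 0.15),
]


def firing(metrics: Metrics) -> List[Tuple[str, str]]:
    """Return (alert name, severity) for every rule whose threshold is crossed."""
    return [(r.name, r.severity) for r in RULES if r.fires(metrics)]


sample = {
    "error_rate_5m": 0.02,            # 2% errors: above the 1% threshold
    "p95_latency_ms_10m": 320,
    "block_rate_now": 0.05,
    "block_rate_1h_baseline": 0.02,   # 3x baseline is 0.06, so no spike
    "invalid_hash_events": 0,
    "restarts_10m": 3,                # exactly 3 restarts counts as "3+"
    "postgres_cpu_min_15m": 0.50,
    "min_disk_free_fraction": 0.40,
}
print(firing(sample))
```

With the sample values above, only `api_error_rate` and `cloud_run_instance_flap` cross their thresholds, and both carry the `page` severity.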
SLOs
| Service | SLO |
|---|---|
| /v1/chat/completions | 99.9% availability, p95 < 500ms |
| /audit/integrity | 99.95% availability (compliance-critical) |
| Docs sites | 99.9% availability |
SLO burn alerts fire when 2% and 5% of the monthly error budget have been consumed.
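
To make the burn thresholds concrete, here is a small worked calculation in Python. It rests on assumptions that this page does not state: that "budget" means the error budget implied by the availability SLO, that availability is measured over time rather than per request, and that a month is 30 days. The function names are illustrative.

```python
from typing import Dict, Tuple


def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over `days` days."""
    return (1.0 - slo) * days * 24 * 60


def burn_thresholds(
    slo: float,
    fractions: Tuple[float, ...] = (0.02, 0.05),
    days: int = 30,
) -> Dict[float, float]:
    """Minutes of full unavailability that consume each fraction of the budget."""
    budget = error_budget_minutes(slo, days)
    return {f: budget * f for f in fractions}


for service, slo in [("/v1/chat/completions", 0.999), ("/audit/integrity", 0.9995)]:
    budget = error_budget_minutes(slo)
    print(f"{service}: budget {budget:.2f} min")
    for fraction, minutes in burn_thresholds(slo).items():
        print(f"  {fraction:.0%} of budget = {minutes:.3f} min")
```

Under these assumptions a 99.9% SLO allows about 43.2 minutes of downtime per 30-day month, so the 2% and 5% alerts correspond to roughly 0.86 and 2.16 minutes of full unavailability. For the 99.95% SLO the budget is about 21.6 minutes, which puts the same alerts at roughly 0.43 and 1.08 minutes.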
On-call workflow
See On-call rotation. When paged:
- Acknowledge in PagerDuty within 5 min
- Thread in #incidents: one thread per incident
- Follow the matching runbook
- Write a postmortem if the incident was customer-visible or lasted longer than 30 min