Monitoring & alerts

| What | Where |
| --- | --- |
| Cloud Run health (prod) | GCP Console → Cloud Run → tappass → Metrics |
| Request latency, error rate | Grafana → “TapPass — API Health” |
| Pipeline step latency | Grafana → “Pipeline Steps” |
| Audit trail throughput | Grafana → “Audit Ingest” |
| Postgres CPU / IOPS | Cloud SQL → Instance → Monitoring |
| Cloudflare edge traffic | Cloudflare → Analytics → docs.tappass.ai (or app.tappass.ai) |

Alerts are routed via Alertmanager → PagerDuty → Slack #incidents.

| Alert | Threshold | Severity |
| --- | --- | --- |
| api_error_rate | > 1% over 5 min | page |
| p95_latency | > 500ms over 10 min | warn |
| pipeline_block_rate_spike | > 3× 1h baseline | warn |
| audit_chain_break | any event with invalid hash | page |
| cloud_run_instance_flap | 3+ restarts in 10 min | page |
| postgres_cpu | > 85% for 15 min | warn |
| disk_free | < 15% on any node | warn |

| Service | SLO |
| --- | --- |
| /v1/chat/completions | 99.9% availability, p95 < 500ms |
| /audit/integrity | 99.95% availability (compliance-critical) |
| Docs sites | 99.9% availability |
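
To make the availability targets above more concrete, the following sketch converts an SLO percentage into the downtime it allows. It assumes a 30-day window and time-based availability, neither of which is stated here, and the name `error_budget_minutes` is purely illustrative.

```python
from decimal import Decimal

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_percent: str, window_days: int = 30) -> Decimal:
    """Return the downtime, in minutes, that an availability SLO allows per window.

    slo_percent is passed as a string (for example "99.9") so Decimal keeps it exact.
    The 30-day default window is an assumption, not something the SLO table states.
    """
    unavailable_fraction = (Decimal(100) - Decimal(slo_percent)) / Decimal(100)
    return unavailable_fraction * window_days * MINUTES_PER_DAY
```

Under these assumptions, a 99.9% target allows 43.2 minutes of downtime per 30 days, and the 99.95% target for /audit/integrity allows 21.6 minutes.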

SLO burn alerts fire when 2% and 5% of the monthly error budget have been consumed.
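
A rough sketch of that burn check follows. It assumes the budget is counted in failed requests against an expected monthly request volume and that the two thresholds are cumulative. The function and parameter names are hypothetical, and the real alerts may be defined differently (for example, as burn rates over shorter windows).

```python
BURN_THRESHOLDS = (0.02, 0.05)  # 2% and 5% of the monthly error budget


def budget_consumed(expected_monthly_requests: int,
                    failed_requests: int,
                    slo_percent: float) -> float:
    """Fraction of the monthly error budget used so far (1.0 means fully spent)."""
    allowed_failures = expected_monthly_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures == 0:
        # A 100% SLO (or zero traffic) leaves no budget: any failure overspends it.
        return 0.0 if failed_requests == 0 else float("inf")
    return failed_requests / allowed_failures


def crossed_thresholds(consumed: float) -> list[float]:
    """Return every burn threshold that the consumed fraction has reached."""
    return [t for t in BURN_THRESHOLDS if consumed >= t]
```

For instance, with 10,000,000 expected requests at a 99.9% target, the budget is about 10,000 failed requests, so 300 failures would cross only the 2% threshold and 600 failures would cross both.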

See On-call rotation. When paged:

  1. Acknowledge in PagerDuty within 5 min
  2. Thread in #incidents — one thread per incident
  3. Follow the matching runbook
  4. Write a postmortem if the incident was customer-visible or lasted longer than 30 min
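
The two time-based rules in the steps above (acknowledge within 5 min, postmortem if customer-visible or longer than 30 min) can be expressed as simple checks. This is only an illustrative sketch: the helper names are hypothetical and are not part of any existing tooling described here.

```python
from datetime import datetime, timedelta

ACK_DEADLINE = timedelta(minutes=5)          # step 1: acknowledge within 5 min
POSTMORTEM_DURATION = timedelta(minutes=30)  # step 4: postmortem if > 30 min


def acknowledged_in_time(paged_at: datetime, acked_at: datetime) -> bool:
    """True if the page was acknowledged within the 5-minute deadline."""
    return acked_at - paged_at <= ACK_DEADLINE


def needs_postmortem(customer_visible: bool, duration: timedelta) -> bool:
    """True if the incident was customer-visible or ran longer than 30 minutes."""
    return customer_visible or duration > POSTMORTEM_DURATION
```

Note that the duration comparison is strict: an internal-only incident of exactly 30 minutes does not require a postmortem under this reading, while any customer-visible incident does.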