Total Services: 12 (across all environments)
Healthy: 9 (operating normally)
Warnings: 2 (needs attention)
Critical: 1 (immediate action required)
worker-queue — Active Incident
Memory leak detected on worker-queue — process restart scheduled for off-peak (02:00 UTC)
1h ago
staging-db — Active Incident
staging-db disk at 88.7% — scheduled cleanup of old WAL files during next maintenance window
1h ago
🖥 Service Health 12 services
Service           Host                          Status    CPU     Memory  Disk    Response  SSL           Env
api-gateway-01    api-gw-01.proviso.internal    healthy   12.4%   44.2%   38.1%   89ms      47d ✓ valid   production
api-gateway-02    api-gw-02.proviso.internal    healthy   18.7%   51.3%   40.2%   94ms      47d ✓ valid   production
cdn-edge-us       cdn-us.proviso.internal       healthy   6.8%    28.4%   45.2%   18ms      120d ✓ valid  production
monitoring-agent  monitor.proviso.internal      healthy   2.1%    19.8%   15.7%   22ms      180d ✓ valid  production
postgres-primary  db-primary.proviso.internal   healthy   23.6%   68.9%   71.2%   120ms     365d ✓ valid  production
postgres-replica  db-replica.proviso.internal   healthy   9.2%    55.1%   70.8%   115ms     365d ✓ valid  production
redis-cache       redis-01.proviso.internal     healthy   4.3%    61.7%   18.3%   12ms      365d ✓ valid  production
web-server-01     web-01.proviso.internal       healthy   8.1%    32.0%   22.5%   45ms      31d ✓ valid   production
web-server-02     web-02.proviso.internal       warning   72.3%   81.4%   67.8%   340ms     31d ✓ valid   production
worker-queue      worker-01.proviso.internal    critical  94.1%   87.2%   91.4%   520ms     14d ✓ valid   production
staging-db        staging-db.proviso.internal   warning   55.8%   74.3%   88.7%   280ms     60d ✓ valid   staging
staging-web       staging-web.proviso.internal  healthy   15.2%   40.1%   28.9%   110ms     60d ✓ valid   staging
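The summary counts at the top of the dashboard can be cross-checked against the Status column of the Service Health listing. The following is a minimal sketch of that tally, not the dashboard's own logic: the `STATUSES` mapping and the `tally` helper are hypothetical names introduced here, and the status values are copied by hand from the listing.

```python
from collections import Counter

# Hypothetical mapping, transcribed by hand from the Service Health listing.
STATUSES = {
    "api-gateway-01": "healthy",
    "api-gateway-02": "healthy",
    "cdn-edge-us": "healthy",
    "monitoring-agent": "healthy",
    "postgres-primary": "healthy",
    "postgres-replica": "healthy",
    "redis-cache": "healthy",
    "web-server-01": "healthy",
    "web-server-02": "warning",
    "worker-queue": "critical",
    "staging-db": "warning",
    "staging-web": "healthy",
}


def tally(statuses):
    """Count how many services report each status."""
    return Counter(statuses.values())


counts = tally(STATUSES)
# Total, healthy, warning, critical
print(len(STATUSES), counts["healthy"], counts["warning"], counts["critical"])
```

Run as written, this prints `12 9 2 1`, which matches the Total Services, Healthy, Warnings, and Critical cards.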
🤖 Agent Actions 2 open
🟡
staging-db disk at 88.7% — scheduled cleanup of old WAL files during next maintenance window
staging-db 1h ago ⚡ active
🔴
Memory leak detected on worker-queue — process restart scheduled for off-peak (02:00 UTC)
worker-queue 1h ago ⚡ active
🔴
Rolled back staging-db migration v0.9.4 — column type mismatch caused 100% error rate
staging-web 2h ago ✓ resolved
🟡
Restarted nginx after 502 spike (error rate 12% → 0.1% in 40s)
api-gateway-01 3h ago ✓ resolved
🟡
Cleared /tmp after disk usage reached 92% — freed 8.4 GB
worker-queue 5h ago ✓ resolved
🟢
Auto-renewed SSL certificate for web-01.proviso.internal (was 7 days to expiry)
web-server-01 7h ago ✓ resolved
🟡
Provisioned 2 additional web-02 instances — p95 latency was 1.8s during traffic spike
web-server-02 9h ago ✓ resolved
🟢
Increased postgres-primary connection pool from 50 → 100 (connection wait >200ms detected)
postgres-primary 13h ago ✓ resolved
🟢
Rightsized redis-cache instance from r6g.xlarge → r6g.large — saving $48/mo (avg utilization 28%)
redis-cache 1d ago ✓ resolved
🟡
Restarted api-gateway-01 worker pool — 3 zombie processes consuming 31% CPU
api-gateway-01 1d ago ✓ resolved
🟢
Rebuilt 2 bloated postgres indexes — query time for /api/search improved 340ms → 41ms
postgres-primary 1d ago ✓ resolved
🟢
Daily postgres-primary backup verified — 4.2 GB, restore tested successfully
postgres-primary 2d ago ✓ resolved
🟡
Blocked 1,847 requests from 3 IPs showing brute-force pattern on /api/auth
api-gateway-02 3d ago ✓ resolved
🟢
Updated nginx rate limits on web-01 — max_conns increased to handle traffic growth
web-server-01 3d ago ✓ resolved
🟢
All 12 services passed health checks — 0 incidents in last 24h
monitoring-agent 4d ago ✓ resolved