Proviso — Infrastructure Dashboard

Total Services: 12
across all environments

Healthy: 9
operating normally

Warnings: 2
needs attention

Critical: 1
immediate action required

⚠

worker-queue — Active Incident

Memory leak detected on worker-queue — process restart scheduled for off-peak (02:00 UTC)

1h ago

⚠

staging-db — Active Incident

staging-db disk at 88.7% — scheduled cleanup of old WAL files during next maintenance window

1h ago

🖥 Service Health (12 services)

Service	Host	Status	CPU	Memory	Disk	Response	SSL	Env
api-gateway-01	api-gw-01.proviso.internal	healthy	12.4%	44.2%	38.1%	89ms	47d ✓ valid	production
api-gateway-02	api-gw-02.proviso.internal	healthy	18.7%	51.3%	40.2%	94ms	47d ✓ valid	production
cdn-edge-us	cdn-us.proviso.internal	healthy	6.8%	28.4%	45.2%	18ms	120d ✓ valid	production
monitoring-agent	monitor.proviso.internal	healthy	2.1%	19.8%	15.7%	22ms	180d ✓ valid	production
postgres-primary	db-primary.proviso.internal	healthy	23.6%	68.9%	71.2%	120ms	365d ✓ valid	production
postgres-replica	db-replica.proviso.internal	healthy	9.2%	55.1%	70.8%	115ms	365d ✓ valid	production
redis-cache	redis-01.proviso.internal	healthy	4.3%	61.7%	18.3%	12ms	365d ✓ valid	production
web-server-01	web-01.proviso.internal	healthy	8.1%	32.0%	22.5%	45ms	31d ✓ valid	production
web-server-02	web-02.proviso.internal	warning	72.3%	81.4%	67.8%	340ms	31d ✓ valid	production
worker-queue	worker-01.proviso.internal	critical	94.1%	87.2%	91.4%	520ms	14d ✓ valid	production
staging-db	staging-db.proviso.internal	warning	55.8%	74.3%	88.7%	280ms	60d ✓ valid	staging
staging-web	staging-web.proviso.internal	healthy	15.2%	40.1%	28.9%	110ms	60d ✓ valid	staging
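
The headline counts (total, healthy, warnings, critical) can be derived from the Status column of the table above. The following is a minimal sketch, assuming the table rows are the source of truth; `SERVICES` and `summarize` are hypothetical names used for illustration and are not part of Proviso.

```python
from collections import Counter

# (service, status) pairs copied from the Service Health table above.
SERVICES = [
    ("api-gateway-01", "healthy"),
    ("api-gateway-02", "healthy"),
    ("cdn-edge-us", "healthy"),
    ("monitoring-agent", "healthy"),
    ("postgres-primary", "healthy"),
    ("postgres-replica", "healthy"),
    ("redis-cache", "healthy"),
    ("web-server-01", "healthy"),
    ("web-server-02", "warning"),
    ("worker-queue", "critical"),
    ("staging-db", "warning"),
    ("staging-web", "healthy"),
]


def summarize(services):
    """Return the total service count and a count per status."""
    counts = Counter(status for _, status in services)
    return {
        "total": len(services),
        # Counter returns 0 for a status that never appears.
        "healthy": counts["healthy"],
        "warning": counts["warning"],
        "critical": counts["critical"],
    }


summary = summarize(SERVICES)
print(summary)
# {'total': 12, 'healthy': 9, 'warning': 2, 'critical': 1}
```

The data is hard-coded here to keep the sketch self-contained; a real dashboard would presumably read the statuses from its monitoring backend instead.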

🤖 Agent Actions (2 open)

🟡

staging-db disk at 88.7% — scheduled cleanup of old WAL files during next maintenance window

staging-db 1h ago ⚡ active

🔴

Memory leak detected on worker-queue — process restart scheduled for off-peak (02:00 UTC)

worker-queue 1h ago ⚡ active

🔴

Rolled back staging-db migration v0.9.4 — column type mismatch caused 100% error rate

staging-web 2h ago ✓ resolved

🟡

Restarted nginx after 502 spike (error rate 12% → 0.1% in 40s)

api-gateway-01 3h ago ✓ resolved

🟡

Cleared /tmp after disk usage reached 92% — freed 8.4 GB

worker-queue 5h ago ✓ resolved

🟢

Auto-renewed SSL certificate for web-01.proviso.internal (was 7 days to expiry)

web-server-01 7h ago ✓ resolved

🟡

Provisioned 2 additional web-02 instances — p95 latency was 1.8s during traffic spike

web-server-02 9h ago ✓ resolved

🟢

Increased postgres-primary connection pool from 50 → 100 (connection wait >200ms detected)

postgres-primary 13h ago ✓ resolved

🟢

Rightsized redis-cache instance from r6g.xlarge → r6g.large — saving $48/mo (avg utilization 28%)

redis-cache 1d ago ✓ resolved

🟡

Restarted api-gateway-01 worker pool — 3 zombie processes consuming 31% CPU

api-gateway-01 1d ago ✓ resolved

🟢

Rebuilt 2 bloated postgres indexes — query time for /api/search improved 340ms → 41ms

postgres-primary 1d ago ✓ resolved

🟢

Daily postgres-primary backup verified — 4.2 GB, restore tested successfully

postgres-primary 2d ago ✓ resolved

🟡

Blocked 1,847 requests from 3 IPs showing brute-force pattern on /api/auth

api-gateway-02 3d ago ✓ resolved

🟢

Updated nginx rate limits on web-01 — max_conns increased to handle traffic growth

web-server-01 3d ago ✓ resolved

🟢

All 12 services passed health checks — 0 incidents in last 24h

monitoring-agent 4d ago ✓ resolved