Operations Runbooks
Step-by-step guides for common operational scenarios. Each runbook covers diagnosis, resolution, and prevention.
Agent Not Connecting
When a mezd agent fails to register or loses its tunnel:
- Check agent logs for connection errors:

    journalctl -u mezd -n 100 --no-pager

- Verify the join token has not expired:

    mezctl tokens ls

- Check certificate expiry on the agent:

    openssl x509 -in /var/lib/mezd/agent.crt -noout -dates

- Check the reverse tunnel by verifying that the agent can reach port 3024 on the proxy:

    curl -v telnet://mezite.example.com:3024

- Check network and firewall rules: ensure ports 3024 and 3025 are open from the agent to the Mezite server.
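The two connectivity checks above can be combined into one pass over the agent-facing ports. The following is a minimal sketch, not a Mezite tool: the function names `port_open` and `check_agent_ports` are hypothetical, and it assumes a bash build with `/dev/tcp` redirection support and the coreutils `timeout` command.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of Mezite): probe TCP reachability from the agent host.

# Returns 0 if a TCP connection to host:port succeeds within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device and coreutils `timeout`.
port_open() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Prints one line per port and returns non-zero if any port is unreachable.
check_agent_ports() {
  local host=$1; shift
  local port status=0
  for port in "$@"; do
    if port_open "$host" "$port"; then
      echo "ok: ${host}:${port}"
    else
      echo "FAIL: ${host}:${port}"
      status=1
    fi
  done
  return "$status"
}

# Example, run on the agent host:
#   check_agent_ports mezite.example.com 3024 3025
```

A `FAIL` line only shows that the TCP handshake did not complete; it does not distinguish a firewall drop from a service that is down, so follow up with the agent logs.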
Backup and Restore
Mezite stores all state in PostgreSQL. Regular backups are critical.
- Database backup with pg_dump:

    pg_dump -h localhost -U mezite -d mezite -F c -f mezite-backup-$(date +%Y%m%d).dump

- Using the built-in scripts:

    # Full backup (database + CA keys)
    mezite-backup --output /backups/mezite-$(date +%Y%m%d).tar.gz

    # Restore from backup
    mezite-restore --input /backups/mezite-20260324.tar.gz

- Verify the backup by restoring to a test instance before relying on it for disaster recovery.
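One way to exercise the verification step for a pg_dump archive is to restore it into a throwaway database. The sketch below is an assumption-laden illustration: `verify_dump` and the scratch database name `mezite_restore_test` are hypothetical, and connection settings are expected to come from the standard `PGHOST`, `PGPORT`, and `PGUSER` environment variables, pointed at a test instance rather than production.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Mezite): restore a custom-format dump into a scratch database.
# Connection details are taken from PGHOST / PGPORT / PGUSER; point them at a TEST instance.

verify_dump() {
  local dump=$1 scratch_db=${2:-mezite_restore_test}
  local rc=0

  # A usable custom-format archive must at least have a readable table of contents.
  if ! pg_restore --list "$dump" > /dev/null; then
    echo "unreadable archive: $dump" >&2
    return 1
  fi

  createdb "$scratch_db" || return 1

  # Stop at the first error so a partial restore is not mistaken for success.
  pg_restore --dbname="$scratch_db" --exit-on-error "$dump" || rc=1

  # Always clean up the scratch database, whatever the outcome.
  dropdb "$scratch_db"
  return "$rc"
}

# Example:
#   PGHOST=pg-test.internal PGUSER=mezite verify_dump mezite-backup-20260324.dump
```

This only covers the database dump. Archives produced by mezite-backup also contain CA keys, so verify those with mezite-restore against a test instance.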
CA Certificate Expiry
Monitor CA certificate expiry. Rotate when the remaining lifetime drops below 90 days.
- Check expiry:

    mezctl status | grep -i "ca.*expir"

    # Or directly inspect the certificate
    openssl x509 -in /var/lib/mezite/host-ca.crt -noout -enddate
    openssl x509 -in /var/lib/mezite/user-ca.crt -noout -enddate

- Rotate the host CA:

    mezctl ca rotate --type=host --grace-period=72h

- Rotate the user CA:

    mezctl ca rotate --type=user --grace-period=72h

- The grace period allows both old and new CAs to be trusted simultaneously while agents and users obtain new certificates.
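The 90-day rotation threshold can be checked mechanically, for example from a cron job. This is a rough sketch under stated assumptions: the function names are hypothetical, and parsing the `notAfter` date relies on GNU `date -d`, which is not available on BSD or macOS.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of Mezite): flag CA certificates nearing expiry.

# Whole days from now_epoch until end_epoch (negative if already expired).
days_between() {
  local end_epoch=$1 now_epoch=$2
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Days of validity left on a certificate file. Requires openssl and GNU date.
cert_days_left() {
  local cert=$1 end
  # -enddate prints e.g. "notAfter=Mar 24 12:00:00 2027 GMT"
  end=$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)
  [ -n "$end" ] || return 1
  days_between "$(date -d "$end" +%s)" "$(date +%s)"
}

# Succeeds when the remaining lifetime is below the threshold (default 90 days).
needs_rotation() {
  local days=$1 threshold=${2:-90}
  [ "$days" -lt "$threshold" ]
}

# Example:
#   for cert in /var/lib/mezite/host-ca.crt /var/lib/mezite/user-ca.crt; do
#     days=$(cert_days_left "$cert") && needs_rotation "$days" \
#       && echo "rotate soon: $cert ($days days left)"
#   done
```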
Database Performance
Troubleshooting and tuning PostgreSQL for Mezite.
- Identify slow queries:

    SELECT pid,
           now() - pg_stat_activity.query_start AS duration,
           query
    FROM pg_stat_activity
    WHERE state != 'idle'
    ORDER BY duration DESC
    LIMIT 10;

- Run VACUUM:

    VACUUM ANALYZE;

- Check connection pool usage: ensure max_connections in PostgreSQL is set higher than the sum of all Mezite instances' pool sizes.
- Lock contention: check for blocked queries:

    SELECT blocked.pid,
           blocked.query,
           blocking.pid AS blocking_pid,
           blocking.query AS blocking_query
    FROM pg_stat_activity blocked
    JOIN pg_locks bl ON bl.pid = blocked.pid
    JOIN pg_locks bk ON bk.locktype = bl.locktype
                    AND bk.relation = bl.relation
                    AND bk.pid != bl.pid
    JOIN pg_stat_activity blocking ON blocking.pid = bk.pid
    WHERE NOT bl.granted;
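The connection pool rule above is simple arithmetic, so it is easy to script. The sketch below uses hypothetical function names and illustrative pool sizes; the real per-instance pool sizes have to come from your own Mezite configuration.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of Mezite): compare pool sizes against max_connections.

# Sum of all per-instance pool sizes passed as arguments.
required_connections() {
  local total=0 size
  for size in "$@"; do
    total=$(( total + size ))
  done
  echo "$total"
}

# Succeeds only if max_connections is strictly higher than the summed pool sizes.
# Usage: pool_fits <max_connections> <pool_size>...
pool_fits() {
  local max=$1; shift
  [ "$(required_connections "$@")" -lt "$max" ]
}

# Example with three instances of 20 connections each (illustrative numbers):
#   max=$(psql -At -c 'SHOW max_connections')
#   pool_fits "$max" 20 20 20 || echo "raise max_connections or shrink the pools"
```

Leave some headroom beyond the bare sum, since superuser sessions, backups, and monitoring tools also consume connections.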
Security Hardening Checklist
- TLS: Ensure all Mezite ports (3025, 3080, 3023, 3024) use TLS. Terminate TLS at the proxy, not at an external load balancer, to preserve mutual TLS.
- Authentication: Require MFA for all human users. Use short-lived certificates (default 12h). Disable password auth if an SSO connector is configured.
- Authorization: Apply least-privilege roles. Restrict SSH logins by node label. Require access requests for privileged roles.
- Network: Restrict auth port (3025) to internal networks. Expose only the proxy port (3080) and SSH port (3023) publicly. Use firewall rules to limit agent tunnel access (3024).
- Audit: Enable session recording. Export audit logs to a SIEM. Set audit log retention to at least 1 year.
Zero-Downtime Upgrade
- Pre-flight checks:

    # Verify current version
    mezctl status

    # Back up the database
    mezite-backup --output /backups/pre-upgrade-$(date +%Y%m%d).tar.gz

- Run database migrations:

    mezhub migrate --postgres-uri="postgres://mezite:secret@localhost:5432/mezite"

- Rolling restart of mezhub instances: Replace one instance at a time. Verify health with /healthz before proceeding to the next.
- Upgrade agents: Agents are backward-compatible with newer servers. Upgrade them after the server rollout is complete.

    sudo systemctl stop mezd
    sudo cp mezd-new /usr/local/bin/mezd
    sudo systemctl start mezd

- Rollback: If issues arise, stop the new instances, restore the database backup, and restart with the previous binary version.
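The rolling restart step can be scripted so that the rollout halts as soon as one instance fails its health check. Treat the following as a sketch only: the function names, the `https://<host>/healthz` URL shape, the `mezhub` systemd unit name, and SSH access to each instance are all assumptions to adapt to your deployment.

```bash
#!/usr/bin/env bash
# Hypothetical rollout helpers (not part of Mezite). Unit name, URL scheme, and SSH access are assumptions.

# Succeeds if the instance answers /healthz with a 2xx status.
probe() {
  curl -fsS --max-time 5 "https://$1/healthz" > /dev/null
}

# Restart one instance; assumes a systemd unit called "mezhub" on the remote host.
restart_instance() {
  ssh "$1" 'sudo systemctl restart mezhub'
}

# Poll the health endpoint until it passes or the attempts run out.
wait_healthy() {
  local host=$1
  local attempts=${2:-${HEALTH_ATTEMPTS:-30}}
  local delay=${3:-${HEALTH_DELAY:-2}}
  local i
  for (( i = 0; i < attempts; i++ )); do
    probe "$host" && return 0
    sleep "$delay"
  done
  return 1
}

# Restart hosts one at a time; stop at the first one that does not come back healthy.
rolling_restart() {
  local host
  for host in "$@"; do
    restart_instance "$host" || return 1
    if ! wait_healthy "$host"; then
      echo "unhealthy: $host, halting rollout" >&2
      return 1
    fi
  done
}

# Example:
#   rolling_restart mezhub-1.internal:3080 mezhub-2.internal:3080
```

Halting on the first unhealthy instance keeps the remaining instances on the previous version, which leaves the rollback path described above available.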