Operations Runbooks

Step-by-step guides for common operational scenarios. Each runbook covers diagnosis, resolution, and prevention.

Agent Not Connecting

When a mezd agent fails to register or loses its reverse tunnel, work through these checks in order:

  1. Check agent logs for connection errors:
    Check agent logs bash
    journalctl -u mezd -n 100 --no-pager
  2. Verify the join token has not expired:
    List active tokens bash
    mezctl tokens ls
  3. Check certificate expiry on the agent:
    Inspect agent certificate bash
    openssl x509 -in /var/lib/mezd/agent.crt -noout -dates
  4. Check the reverse tunnel: verify the agent can reach port 3024 on the proxy. A successful connection stays open until interrupted with Ctrl+C:
    Test tunnel connectivity bash
    curl -v telnet://mezite.example.com:3024
  5. Check network / firewall rules — ensure ports 3024 and 3025 are open from the agent to the Mezite server.
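
The port checks in steps 4 and 5 can be scripted. The following is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are installed on the agent host. check_ports is a hypothetical helper name, not a Mezite command.

```shell
#!/bin/sh
# check_ports HOST PORT...: try a TCP connection to each port and report the result.
# Hypothetical helper; relies on bash's /dev/tcp feature and coreutils `timeout`.
check_ports() {
    host=$1
    shift
    failed=0
    for port in "$@"; do
        # Give up after 3 seconds so silently dropped packets do not hang the check.
        if timeout 3 bash -c 'exec 3<>/dev/tcp/$0/$1' "$host" "$port" 2>/dev/null; then
            echo "OK   $host:$port"
        else
            echo "FAIL $host:$port"
            failed=1
        fi
    done
    # Non-zero exit status if any port was unreachable.
    return "$failed"
}
```

Run from the agent host, check_ports mezite.example.com 3024 3025 prints one line per port and exits non-zero if either is blocked.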

Backup and Restore

Mezite stores all state in PostgreSQL. Regular backups are critical.

  1. Database backup with pg_dump:
    Backup PostgreSQL bash
    pg_dump -h localhost -U mezite -d mezite -F c -f mezite-backup-$(date +%Y%m%d).dump
  2. Using the built-in scripts:
    Mezite backup and restore bash
    # Full backup (database + CA keys)
    mezite-backup --output /backups/mezite-$(date +%Y%m%d).tar.gz
    
    # Restore from backup
    mezite-restore --input /backups/mezite-20260324.tar.gz
  3. Verify the backup by restoring to a test instance before relying on it for disaster recovery.
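
The verification in step 3 could look roughly like the sketch below for a pg_dump archive. The scratch database name mezite_restore_test and the helper names run and verify_dump are assumptions made for illustration; the connection settings are copied from the pg_dump example and should point at a test instance, never production.

```shell
#!/bin/sh
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# verify_dump DUMP [SCRATCH_DB]: restore a pg_dump archive into a throwaway database.
# Hypothetical helper; host, user, and scratch database name are assumptions.
verify_dump() {
    dump=$1
    scratch=${2:-mezite_restore_test}
    run createdb -h localhost -U mezite "$scratch" || return 1
    # --exit-on-error makes pg_restore stop at the first failure instead of continuing.
    run pg_restore -h localhost -U mezite -d "$scratch" --exit-on-error "$dump" || return 1
    # Drop the scratch database only after a clean restore; a failed one is kept for inspection.
    run dropdb -h localhost -U mezite "$scratch"
}
```

Setting DRY_RUN=1 prints the three commands without running them, which is a cheap way to review what the helper would do before pointing it at a real dump.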

CA Certificate Expiry

Monitor CA certificate expiry. Rotate when the remaining lifetime drops below 90 days.

  1. Check expiry:
    Check CA expiry bash
    mezctl status | grep -i "ca.*expir"
    
    # Or directly inspect the certificate
    openssl x509 -in /var/lib/mezite/host-ca.crt -noout -enddate
    openssl x509 -in /var/lib/mezite/user-ca.crt -noout -enddate
  2. Rotate the host CA:
    Rotate host CA bash
    mezctl ca rotate --type=host --grace-period=72h
  3. Rotate the user CA:
    Rotate user CA bash
    mezctl ca rotate --type=user --grace-period=72h
  4. The grace period allows both old and new CAs to be trusted simultaneously while agents and users obtain new certificates.
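
The 90-day threshold can be checked mechanically with openssl's -checkend option, which takes a number of seconds. This is a sketch only: ca_needs_rotation is a hypothetical helper, and the certificate paths in the usage comment are the ones from step 1.

```shell
#!/bin/sh
# ca_needs_rotation CERT [DAYS]: exit 0 when CERT expires within DAYS days
# (or cannot be parsed), exit 1 when it is valid for longer, exit 2 when unreadable.
ca_needs_rotation() {
    cert=$1
    days=${2:-90}
    [ -r "$cert" ] || return 2
    # `openssl x509 -checkend N` exits 0 if the certificate is still valid N seconds from now.
    if openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Usage:
#   for ca in /var/lib/mezite/host-ca.crt /var/lib/mezite/user-ca.crt; do
#       ca_needs_rotation "$ca" && echo "rotate: $ca"
#   done
```

Run from cron or a monitoring check, this turns the manual inspection into an alert well before the rotation window closes.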

Database Performance

Troubleshooting and tuning PostgreSQL for Mezite.

  1. Identify slow queries:
    Find slow queries sql
    SELECT pid, now() - pg_stat_activity.query_start AS duration, query
    FROM pg_stat_activity
    WHERE state != 'idle'
    ORDER BY duration DESC
    LIMIT 10;
  2. Run VACUUM ANALYZE to reclaim dead rows and refresh planner statistics:
    Vacuum and analyze sql
    VACUUM ANALYZE;
  3. Check connection pool usage — ensure max_connections in PostgreSQL is set higher than the sum of all Mezite instances' pool sizes.
  4. Lock contention — check for blocked queries:
    Check for lock contention sql
    SELECT blocked.pid, blocked.query, blocking.pid AS blocking_pid, blocking.query AS blocking_query
    FROM pg_stat_activity blocked
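    -- Relation-level locks only; on PostgreSQL 9.6+ pg_blocking_pids(pid) covers every lock type.
    JOIN pg_locks bl ON bl.pid = blocked.pid
    JOIN pg_locks bk ON bk.locktype = bl.locktype AND bk.database = bl.database
                    AND bk.relation = bl.relation AND bk.pid != bl.pid
                    AND bk.granted  -- the blocker must actually hold the lock
    JOIN pg_stat_activity blocking ON blocking.pid = bk.pid
    WHERE NOT bl.granted;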
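
The sizing rule in step 3 is simple arithmetic and can be sketched as below. The pool sizes in the usage comment are hypothetical, as are the helper names pool_total and pool_fits; how each Mezite instance's pool size is configured is not covered here.

```shell
#!/bin/sh
# pool_total SIZE...: sum the connection pool sizes of all Mezite instances.
pool_total() {
    total=0
    for size in "$@"; do
        total=$((total + size))
    done
    echo "$total"
}

# pool_fits MAX_CONNECTIONS SIZE...: succeed only when max_connections is
# strictly greater than the combined pool sizes.
pool_fits() {
    max=$1
    shift
    [ "$max" -gt "$(pool_total "$@")" ]
}

# Usage (three instances with a hypothetical pool size of 25 each):
#   max=$(psql -h localhost -U mezite -d mezite -At -c 'SHOW max_connections')
#   pool_fits "$max" 25 25 25 || echo "max_connections too low for the combined pools"
```

PostgreSQL also reserves a few connections for superusers, so leaving some margin above the bare sum is prudent.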

Security Hardening Checklist

  • TLS: Ensure all Mezite ports (3025, 3080, 3023, 3024) use TLS. Terminate TLS at the proxy, not at an external load balancer, to preserve mutual TLS.
  • Authentication: Require MFA for all human users. Use short-lived certificates (default 12h). Disable password auth if an SSO connector is configured.
  • Authorization: Apply least-privilege roles. Restrict SSH logins by node label. Require access requests for privileged roles.
  • Network: Restrict auth port (3025) to internal networks. Expose only the proxy port (3080) and SSH port (3023) publicly. Use firewall rules to limit agent tunnel access (3024).
  • Audit: Enable session recording. Export audit logs to a SIEM. Set audit log retention to at least 1 year.

Zero-Downtime Upgrade

  1. Pre-flight checks:
    Pre-flight bash
    # Verify current version
    mezctl status
    
    # Back up the database
    mezite-backup --output /backups/pre-upgrade-$(date +%Y%m%d).tar.gz
  2. Run database migrations:
    Run migrations bash
    mezhub migrate --postgres-uri="postgres://mezite:secret@localhost:5432/mezite"
  3. Rolling restart of mezhub instances: Replace one instance at a time. Verify health with /healthz before proceeding to the next.
  4. Upgrade agents: Agents running the previous version keep working against upgraded servers, so upgrade them only after the server rollout is complete.
    Upgrade agent binary bash
    sudo systemctl stop mezd
    sudo cp mezd-new /usr/local/bin/mezd
    sudo systemctl start mezd
  5. Rollback: If issues arise, stop the new instances, restore the database backup, and restart with the previous binary version.
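
The health gate in step 3 can be sketched as a polling loop around curl. This is illustrative only: wait_healthy is a hypothetical helper, and the host names, the mezhub systemd unit name, and the port serving /healthz in the usage comment are assumptions to adjust for the actual deployment.

```shell
#!/bin/sh
# wait_healthy URL [ATTEMPTS] [DELAY]: poll URL until it returns a successful
# HTTP status, giving up after ATTEMPTS probes spaced DELAY seconds apart.
wait_healthy() {
    url=$1
    attempts=${2:-30}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: treat HTTP errors (>= 400) as failures; --max-time: bound each probe.
        if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage (host names, unit name, and port are assumptions):
#   for host in mezhub-1 mezhub-2; do
#       ssh "$host" sudo systemctl restart mezhub
#       wait_healthy "https://$host:3080/healthz" || { echo "$host unhealthy, stopping"; break; }
#   done
```

Stopping at the first unhealthy instance keeps the remaining instances on the previous version, which leaves the rollback in step 5 available.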