Troubleshooting

Concrete recipes for the SSH workflows that are most likely to break in production. Each section pairs the visible symptom with the commands to diagnose it and the most common fixes.

Agent Will Not Join

If mezd fails on first start:

  • Check the join token. Tokens are one-shot and TTL-bound.
    Verify and re-issue bash
    mezctl tokens list
    mezctl tokens create --roles=node --ttl=1h
  • Confirm reachability. The agent must reach both the auth service (MEZITE_AUTH_ADDR, default :3025) and the tunnel listener (MEZITE_PROXY_ADDR, default :3024).
    Connectivity probe bash
    nc -zv mezite.example.com 3025
    nc -zv mezite.example.com 3024
  • Check TLS wrap. When the proxy sits behind a TLS-terminating LB, agents must set MEZITE_TLS_WRAP=true so the tunnel handshake speaks TLS. Without it, the agent sends a plaintext handshake to a listener that expects TLS. The TCP connection itself still opens (so the nc probe above passes), but the LB drops it without surfacing a TLS error to the agent.
  • Verify the host CA bundle. The agent pins the cluster's host CA on first registration. If the host CA was rotated while the agent was down, a pin that survives in MEZITE_DATA_DIR (for example, on a volume restored after disk loss) no longer matches, and the agent refuses the new bundle. Delete the agent's MEZITE_DATA_DIR and re-join with a fresh token.
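
The two reachability probes above can be wrapped in a helper that reads the same variables the agent uses. This is a minimal sketch, assuming MEZITE_AUTH_ADDR and MEZITE_PROXY_ADDR hold host:port (or bare :port) values; that format is an assumption, and split_addr and probe_agent_paths are hypothetical names, not part of Mezite.

```shell
# Hypothetical helper: split "host:port" (or ":port") into "host port".
# Falls back to the host given as $2 when the address has no host part.
# Bracketed IPv6 literals are not handled.
split_addr() {
  addr=$1
  host=${addr%:*}
  port=${addr##*:}
  [ -n "$host" ] || host=$2
  printf '%s %s\n' "$host" "$port"
}

# Probe both listeners the agent needs and report each one.
# Returns non-zero if any target is unreachable.
probe_agent_paths() {
  fallback_host=$1
  rc=0
  for addr in "${MEZITE_AUTH_ADDR:-:3025}" "${MEZITE_PROXY_ADDR:-:3024}"; do
    target=$(split_addr "$addr" "$fallback_host")
    host=${target% *}
    port=${target#* }
    if nc -z -w 5 "$host" "$port" 2>/dev/null; then
      echo "ok   $host:$port"
    else
      echo "FAIL $host:$port"
      rc=1
    fi
  done
  return "$rc"
}
```

Run it on the agent host as probe_agent_paths mezite.example.com. A passing probe only proves that TCP connects; it says nothing about the TLS wrap or the token.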

SSH Connection Fails

If msh ssh cannot establish a session:

  • Cert expiry. msh status shows the current cert's expiry. If expired, run msh login.
  • RBAC. The user must hold a role whose allow.node_labels matches the target and whose allow.logins includes the login you asked for. Watch the audit log for the denial:
    Trace the denial bash
    mezctl audit ls --type=access.denied --since=10m
    mezctl roles get <role-name>
    mezctl nodes get <node>
  • Source-IP pinning. If the user's certificate was issued with pin_source_ip, it is only accepted from the IP address it was pinned to. The connection is denied once the client's IP changes (e.g. after roaming between Wi-Fi networks). The denial event is access.denied with code T4001W.
  • Per-session MFA. Roles with require_session_mfa set demand an MFA challenge at session start. Passing msh ssh --mfa-code=<code> pre-feeds the challenge.

CA Rotation Problems

CA rotation is a multi-phase state machine (see Architecture). If clients or agents start failing during a rotation, walk through the following checks before doing anything destructive:

  • Inspect current state.
    What phase, which keys bash
    mezctl ca status
    # Look for the phase per CA (user/host/spiffe) and the active/standby
    # key fingerprints. Anything stuck in "update-clients" or "update-servers"
    # for an unexpected duration is usually the underlying issue.
  • Find clients still on the old key. In phase update-clients, the cluster has minted new certs but some client population may still trust only the old key. Watch the audit log for access.denied events with TLS / SSH handshake errors. A common cause is a long-lived agent that missed the trust-bundle refresh because it lost connection. Restart the agent.
  • Find servers still using the old key. In phase update-servers, agents and the proxy should have reissued host certs under the new key. Use mezctl nodes ls and look at the host_cert_signer_fingerprint column to find any that have not. Restart the laggard.
  • Roll back if a phase advance broke clients you cannot redeploy.
    Escape hatch bash
    mezctl ca rollback --type=user
    Rollback reverts to the previous phase, restoring the trust bundle that previously worked. Do not delete the standby key until you have verified clients have re-synced.
  • KMS-backed clusters. Rotation calls kms:CreateKey and kms:CreateAlias against the cluster's KMS region. If the IAM principal lacks kms:CreateKey with the required tag, the rotation fails at the init phase and the cluster stays on the old key. Check CloudTrail for the denied KMS calls.
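
The update-servers check above (finding nodes whose host cert is still signed by the old key) can be scripted. The sketch below assumes mezctl nodes ls prints a whitespace-separated table with a header row and a name column next to host_cert_signer_fingerprint. That layout is an assumption, and laggard_nodes is a hypothetical helper.

```shell
# Print the names of nodes whose host cert was NOT signed by the expected key.
# Reads a whitespace-separated table with a header row on stdin.
# Assumed columns (case-insensitive): "name", "host_cert_signer_fingerprint".
laggard_nodes() {
  awk -v want="$1" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if (tolower($i) == "name") name_col = i
        if (tolower($i) == "host_cert_signer_fingerprint") fp_col = i
      }
      if (!name_col || !fp_col) {
        print "expected columns not found in header" > "/dev/stderr"
        exit 2
      }
      next
    }
    $fp_col != want { print $name_col }
  '
}
```

Usage would look like mezctl nodes ls | laggard_nodes "<new-key-fingerprint>", with the fingerprint copied from mezctl ca status. Empty output means every listed node has picked up the new signer.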

Agentless OpenSSH Problems

Agentless nodes have no mezd running; the proxy reaches the node's existing sshd directly over TCP. Most failures are either host-key pinning mismatches or the proxy not being able to dial the node at all.

  • Host-key pin mismatch. The node was registered with mezctl nodes add --openssh --host-key-file=…; if the node's /etc/ssh/ssh_host_ed25519_key.pub has been regenerated, the pin no longer matches and the proxy refuses to connect. Re-pin:
    Re-pin the host key bash
    # On the node, print the current key
    sudo cat /etc/ssh/ssh_host_ed25519_key.pub
    
    # On the operator workstation, rotate the pin
    mezctl nodes rotate-host-key web-legacy-01 \
      --host-key="ssh-ed25519 AAAA..."
  • Proxy cannot dial node. The proxy needs a path to the node's sshd — usually port 22 on the registered address. Confirm with mezctl nodes get and verify firewall rules between the proxy and the node.
  • OpenSSH login restrictions. Agentless mode does not bypass the node's sshd config; the requested login must exist on the node, be allowed by AllowUsers / AllowGroups, and (for cert-based auth) TrustedUserCAKeys must point at the cluster's User CA bundle.

Time Skew and Certificate Validity

SSH certificates encode an absolute validity window. If a client's clock drifts more than a minute from the auth service, the cert is either "not yet valid" or "already expired" from the other side's perspective and the connection is refused.

  • Symptoms. msh ssh returns ssh: certificate not yet valid or ssh: certificate expired immediately after a successful msh login, even though msh status shows a healthy cert.
  • Fix. Ensure NTP / chrony is running on the workstation, every agent host, and the mezhub host. Air-gapped sites should at minimum have their hosts synchronised against a local NTP appliance.
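
To put a number on the drift, compare two Unix timestamps against the one-minute tolerance mentioned above. This is a minimal sketch; skew_exceeds is a hypothetical helper, and how you obtain the remote timestamp (for example, running date +%s on the mezhub host through out-of-band access) is left to the operator.

```shell
# Succeeds when two Unix timestamps differ by more than the tolerance.
# $1 = local timestamp, $2 = remote timestamp, $3 = tolerance in seconds
# (defaults to 60, matching the one-minute figure in this section).
skew_exceeds() {
  tolerance=${3:-60}
  diff=$(( $1 - $2 ))
  diff=${diff#-}          # absolute value: strip a leading minus sign
  [ "$diff" -gt "$tolerance" ]
}
```

A call such as skew_exceeds "$(date +%s)" "$remote_ts" && echo "clock skew too large" flags the workstation before you spend time on RBAC or certificate debugging.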

Recording Upload Problems

Session recordings flow from the agent → auth service → storage backend (local filesystem or S3). Failures in any of those legs are currently surfaced via the agent's structured logs rather than as dedicated audit events — grep the mezd log for uploading session recording (async upload failures) or opening recording stream (node-sync stream failures).

  • Async (node) mode. The agent buffers the recording locally and uploads it after the session ends. If the upload fails (S3 outage, expired credentials, disk full on the auth side), the agent logs an uploading session recording error and the recording stays in the agent's MEZITE_DATA_DIR until the next successful upload. Check disk space on the agent and the auth host, and ACLs / IAM on the storage backend.
  • Sync (node-sync) mode. The agent streams chunks in real time. If the recording stream cannot be opened (auth unreachable, credentials invalid), the agent logs opening recording stream (falling back to local-only) and degrades to local-only recording for the session. Confirm the auth address and the recording-storage backend are reachable.
  • S3 specifically. Verify the bucket exists and the principal has s3:PutObject on the prefix. With MEZITE_S3_AUTO_CREATE_BUCKET off (the production default), the bucket must be created out-of-band.
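
Because these failures only appear in the agent log, a quick tally of the two messages shows which mode is failing. The sketch below counts them from any log stream on stdin. recording_failures is a hypothetical helper, and it matches only the two message strings quoted above, not any particular log format.

```shell
# Count recording-related failures in an mezd log stream read from stdin.
# Matches the two message strings documented in this section.
recording_failures() {
  awk '
    /uploading session recording/ { upload++ }
    /opening recording stream/    { stream++ }
    END { printf "async_upload=%d node_sync_stream=%d\n", upload, stream }
  '
}
```

On a systemd host this could be fed with journalctl -u mezd --since "1 hour ago" | recording_failures. A non-zero async_upload count points at the storage backend or its credentials, while node_sync_stream points at reachability of the auth service.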

Login Fails

  • Local user. Confirm the user exists and is not locked.
    Inspect users and locks bash
    mezctl users list
    mezctl locks list
  • SSO connector. The most common cause is a stale proxy.public_addr or proxy.oidc_issuer_url. The IdP redirect URI must match what mezhub advertises in its discovery document.
    Connector sanity checks bash
    mezctl connectors list
    
    # Discovery document the cluster advertises
    curl -fsSL https://mezite.example.com/.well-known/openid-configuration
  • SAML. Check the IdP-side audit log for a signed assertion that the cluster received but rejected. Common causes: clock skew (see above), an expired IdP signing certificate, or Audience mismatch.
  • LDAP / AD. Confirm bind_dn and bind_password match the directory; if the connector maps roles via groups_to_roles, verify the user is actually in the expected group with ldapsearch.
  • GitHub OAuth. The user must be a member of one of the orgs/teams listed in teams_to_roles. Private org membership is not visible to the connector without GitHub's read:org scope: the user has to either set their org membership to "Public" or grant the connector the read:org scope.
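
For the SSO connector case, the issuer in the discovery document can be compared mechanically against the value the IdP was configured with. This is a rough sketch that pulls the standard OpenID Connect issuer field out with sed so it does not depend on jq. The parsing is deliberately naive (it does not handle escaped slashes or nested objects), and both function names are hypothetical.

```shell
# Extract the "issuer" value from an OpenID Connect discovery document
# read from stdin. Naive text matching, not a real JSON parser.
discovery_issuer() {
  tr -d '\n' |
    sed -n 's/.*"issuer"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Succeeds when the advertised issuer equals the expected URL ($1).
issuer_matches() {
  [ "$(discovery_issuer)" = "$1" ]
}
```

For example: curl -fsSL https://mezite.example.com/.well-known/openid-configuration | issuer_matches "https://mezite.example.com". A mismatch usually points back at a stale proxy.public_addr or proxy.oidc_issuer_url.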

Database Connection Issues

  • Verify the MEZITE_DB_* variables match the cluster's actual database. mezhub logs the redacted DSN on boot.
  • For PostgreSQL, confirm the user has the required privileges (the schema is owned by the configured user; migrations run as that user too).
    Probe the connection bash
    # Prefer PGPASSWORD over inlining the password into the URL —
    # the literal '****' in URL examples is a placeholder, not a valid password.
    PGPASSWORD="$MEZITE_DB_PASSWORD" \
      psql "postgres://mezite@localhost:5432/mezite?sslmode=disable" \
      -c "SELECT 1;"
  • For SQLite, check the file's parent directory is writable by the mezhub user. mezhub creates the file on first start; a permission error there surfaces as unable to open database file.
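
The SQLite permission check can be made explicit before restarting mezhub. The sketch below reports whether the parent directory of a database path exists and is writable by the user running the check. check_sqlite_parent is a hypothetical helper, and the result is only meaningful when it runs as the mezhub user.

```shell
# Report whether the parent directory of a SQLite file path exists and is
# writable by the current user. Prints a one-line verdict.
check_sqlite_parent() {
  dir=$(dirname "$1")
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if [ ! -w "$dir" ]; then
    echo "not writable: $dir"
    return 1
  fi
  echo "ok: $dir"
}
```

Called as check_sqlite_parent /var/lib/mezite/mezhub.db from a shell running as the mezhub user, a "missing" or "not writable" verdict matches the unable to open database file error described above.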

Backup and Restore

Day-2 operators want backup / restore documented separately from emergency runbooks. The high-level shape: take regular database backups, take regular recording-storage backups, and treat the CA signing keys as the most sensitive material on the cluster.

Database

  • SQLite. Use sqlite3 .backup on a schedule. The on-disk file alone is not safe to copy while mezhub has it open.
    Hot backup bash
    # Use a directory the mezhub user can write to.
    # /var/backups is typically root:root or root:backup on most distros —
    # create a mezhub-owned subdirectory first (or back up under the mezhub
    # user's home, e.g. ~/.mezite/backups).
    sudo install -d -o mezhub -g mezhub /var/backups/mezite
    sqlite3 /var/lib/mezite/mezhub.db \
      ".backup '/var/backups/mezite/mezhub-$(date +%F).db'"
  • PostgreSQL. Use your existing PG backup tooling (pg_dump, pg_basebackup, managed RDS snapshots, etc.). Coordinate with the schema migration in mezhub — a restore that's older than the running mezhub will need to re-run forward migrations on first start.
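
A scheduled SQLite backup is more useful if each copy is verified right after it is written. The sketch below wraps the hot backup shown above and runs SQLite's PRAGMA integrity_check on the result. backup_path and backup_sqlite are hypothetical helpers, and the paths in the usage line are the ones from the example above.

```shell
# Build the dated backup file name: <dir>/mezhub-<YYYY-MM-DD>.db
backup_path() {
  printf '%s/mezhub-%s.db\n' "$1" "$2"
}

# Take a hot backup of $1 into directory $2, then verify the copy.
# Succeeds only if the backup was written and passes integrity_check.
backup_sqlite() {
  src=$1
  dest=$(backup_path "$2" "$(date +%F)")
  sqlite3 "$src" ".backup '$dest'" || return 1
  result=$(sqlite3 "$dest" "PRAGMA integrity_check;") || return 1
  [ "$result" = "ok" ]
}
```

For example: backup_sqlite /var/lib/mezite/mezhub.db /var/backups/mezite || echo "backup failed verification". The integrity check confirms that the copy is a well-formed database, not that its contents are current.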

Recording storage

  • Local backend. Snapshot the recordings directory alongside the database backup. Recordings are append-only files plus an index — snapshotting them with tar / rsync is safe.
  • S3 backend. Enable S3 bucket versioning and cross-region replication on the bucket. Recordings are immutable once written, so bucket-level versioning mainly guards against accidental overwrite or deletion.

CA / key material

  • Software-encrypted keys. The CA signing keys are stored inside the database, encrypted with AES-256-GCM under a key derived from MEZITE_CA_KEY_PASSPHRASE via Argon2id. A database backup is therefore a usable backup of the CA keys only if the passphrase is backed up as well: store the passphrase in a secrets manager (1Password / HashiCorp Vault / AWS Secrets Manager) and back it up separately from the database backup.
  • KMS-backed keys. The private key material never leaves AWS KMS, so a database backup contains the alias but not the key. KMS aliases are per-Region resources — they are not themselves multi-Region. For cross-Region DR, use AWS KMS multi-Region keys (a primary key in one Region replicated to others) and create an alias in each Region pointing at the local replica, or replicate the key material out-of-band and recreate the alias in the target Region. Without access to the original KMS key, a restored database cannot sign anything — which is the right safety property, but plan accordingly.
  • Audit HMAC chain. If MEZITE_AUDIT_HMAC_KEY is rotated, the tamper-detection chain on existing audit events is invalidated. Keep the previous value alongside the new one during transition and verify the chain after rotation.

Log Levels

Increase server log verbosity to diagnose issues. mezhub reads its log level from the log.level field in mezite.yaml or from the MEZITE_LOG_LEVEL environment variable. There is no --log-level CLI flag.

Set debug logging on mezhub bash
export MEZITE_LOG_LEVEL=debug
mezhub --config=mezite.yaml

Available levels: debug, info, warn, error.

The mezd agent runs at zap's production-default level and does not honour MEZITE_LOG_LEVEL. If you need more agent output while debugging, restart mezd and follow its full systemd journal with journalctl -u mezd -f rather than trying to dial in the log level.
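
Since the level for mezhub is set through the environment, a typo is easy to miss. The sketch below validates the value against the four documented levels before starting mezhub. Both function names are hypothetical, and how mezhub itself treats an unrecognised value is not stated here.

```shell
# Succeeds only for the log levels documented for mezhub.
valid_log_level() {
  case $1 in
    debug|info|warn|error) return 0 ;;
    *) return 1 ;;
  esac
}

# Start mezhub with the given level, refusing values outside the list.
start_mezhub_with_level() {
  if ! valid_log_level "$1"; then
    echo "unsupported log level: $1 (expected debug, info, warn or error)" >&2
    return 2
  fi
  MEZITE_LOG_LEVEL=$1 mezhub --config=mezite.yaml
}
```

Calling start_mezhub_with_level debug is equivalent to the export shown above, but scoped to that one invocation.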