Troubleshooting
Concrete recipes for the SSH workflows that are most likely to break in production. Each section pairs the visible symptom with the commands you would run to diagnose it and the most common fixes.
Agent Will Not Join
If mezd fails on first start:
- Check the join token. Tokens are one-shot and TTL-bound.
  Verify and re-issue (bash):
      mezctl tokens list
      mezctl tokens create --roles=node --ttl=1h
- Confirm reachability. The agent must reach both the auth service (MEZITE_AUTH_ADDR, default :3025) and the tunnel listener (MEZITE_PROXY_ADDR, default :3024).
  Connectivity probe (bash):
      nc -zv mezite.example.com 3025
      nc -zv mezite.example.com 3024
- Check TLS wrap. When the proxy sits behind a TLS-terminating LB, agents must set MEZITE_TLS_WRAP=true so the tunnel handshake speaks TLS. Without it, the LB sees raw TCP, the agent's dial fails immediately, and the LB drops the connection without surfacing a TLS error.
- Verify the host CA bundle. The agent pins the cluster's host CA on first registration. If the cluster's host CA has been rotated and the agent is restarting after disk loss, the old pin will refuse the new bundle. Delete the agent's MEZITE_DATA_DIR and re-join with a fresh token.
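The reachability check above can be run as a single pass that reports each listener separately. This is a minimal sketch rather than anything shipped with Mezite: the helper names port_of, probe, and check_listeners are hypothetical, mezite.example.com is the placeholder host used above, and it assumes an nc build that supports -z and -w.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of mezctl or mezd.

# port_of ADDR: print the port of a "host:port" or ":port" address.
port_of() {
  printf '%s\n' "${1##*:}"
}

# probe HOST PORT: succeed if a TCP connection opens within 3 seconds.
# Assumes an nc build that supports -z (scan only) and -w (timeout).
probe() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# check_listeners HOST ADDR...: probe each address and report per-port results.
# Returns non-zero if any listener is unreachable.
check_listeners() {
  local host=$1 addr port rc=0
  shift
  for addr in "$@"; do
    port=$(port_of "$addr")
    if probe "$host" "$port"; then
      echo "ok   $host:$port"
    else
      echo "FAIL $host:$port"
      rc=1
    fi
  done
  return "$rc"
}

# Example, using the default ports named above:
#   check_listeners mezite.example.com :3025 :3024
```

A successful TCP probe only shows that the port is open. It does not show whether the LB expects TLS, so the MEZITE_TLS_WRAP check still applies.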
SSH Connection Fails
If msh ssh cannot establish a session:
- Cert expiry. msh status shows the current cert's expiry. If expired, run msh login.
- RBAC. The user must hold a role whose allow.node_labels matches the target and whose allow.logins includes the login you asked for. Watch the audit log for the denial:
  Trace the denial (bash):
      mezctl audit ls --type=access.denied --since=10m
      mezctl roles get <role-name>
      mezctl nodes get <node>
- Source-IP pinning. If the user's certificate was issued with pin_source_ip, the connection is denied when the client IP changes mid-session (e.g. roaming Wi-Fi). The denial event is access.denied with code T4001W.
- Per-session MFA. Roles with require_session_mfa require an MFA challenge at session start. msh ssh --mfa-code=<code> pre-feeds the challenge.
CA Rotation Problems
CA rotation is a multi-phase state machine (see Architecture). If clients or agents start failing during a rotation, walk through the following checks before doing anything destructive:
- Inspect current state.
  What phase, which keys (bash):
      mezctl ca status
      # Look for the phase per CA (user/host/spiffe) and the active/standby
      # key fingerprints. Anything stuck in "update-clients" or "update-servers"
      # for an unexpected duration is usually the underlying issue.
- Find clients still on the old key. In phase update-clients, the cluster has minted new certs but some client population may still trust only the old key. Watch the audit log for access.denied events with TLS / SSH handshake errors. A common cause is a long-lived agent that missed the trust-bundle refresh because it lost connection. Restart the agent.
- Find servers still using the old key. In phase update-servers, agents and the proxy should have reissued host certs under the new key. Use mezctl nodes ls and look at the host_cert_signer_fingerprint column to find any that have not. Restart the laggard.
- Roll back if a phase advance broke clients you cannot redeploy. Rollback reverts to the previous phase, restoring the trust bundle that previously worked. Do not delete the standby key until you have verified clients have re-synced.
  Escape hatch (bash):
      mezctl ca rollback --type=user
- KMS-backed clusters. Rotation calls kms:CreateKey and kms:CreateAlias against the cluster's KMS region. If the IAM principal lacks kms:CreateKey with the required tag, the rotation fails at the init phase and the cluster stays on the old key. Check CloudTrail.
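For the KMS case, the IAM policy simulator can show whether the principal is allowed the two actions before you retry the rotation. The sketch below uses the AWS CLI; the helper names and the example role ARN are hypothetical. The simulator does not evaluate tag conditions unless you supply matching context entries, so an "allowed" result here is necessary but may not be sufficient.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the example ARN are hypothetical.

# simulate_kms PRINCIPAL_ARN: ask IAM whether the principal may call the two
# KMS actions that rotation needs. Prints "action<TAB>decision" per line.
simulate_kms() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" \
    --action-names kms:CreateKey kms:CreateAlias \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text
}

# denied_actions: read simulate_kms output on stdin and print every action
# whose decision is anything other than "allowed".
denied_actions() {
  awk '$2 != "allowed" { print $1 }'
}

# Example:
#   simulate_kms arn:aws:iam::123456789012:role/mezhub | denied_actions
# Recent CreateKey attempts can be listed from CloudTrail:
#   aws cloudtrail lookup-events \
#     --lookup-attributes AttributeKey=EventName,AttributeValue=CreateKey \
#     --max-results 10
```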
Agentless OpenSSH Problems
Agentless nodes have no mezd running; the proxy reaches the node's
existing sshd directly over TCP. Most failures are either host-key
pinning mismatches or the proxy not being able to dial the node at all.
- Host-key pin mismatch. The node was registered with mezctl nodes add --openssh --host-key-file=…; if the node's /etc/ssh/ssh_host_ed25519_key.pub has been regenerated, the pin no longer matches and the proxy refuses to connect. Re-pin:
  Re-pin the host key (bash):
      # On the node, print the current key
      sudo cat /etc/ssh/ssh_host_ed25519_key.pub
      # On the operator workstation, rotate the pin
      mezctl nodes rotate-host-key web-legacy-01 \
        --host-key="ssh-ed25519 AAAA..."
- Proxy cannot dial node. The proxy needs a path to the node's sshd, usually port 22 on the registered address. Confirm with mezctl nodes get and verify firewall rules between the proxy and the node.
- OpenSSH login restrictions. Agentless mode does not bypass the node's sshd config; the requested login must exist on the node, be allowed by AllowUsers / AllowGroups, and (for cert-based auth) TrustedUserCAKeys must point at the cluster's User CA bundle.
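Before rotating a pin, it can help to confirm that the key the node currently serves really differs from the one on record. The following sketch compares two public-key lines while ignoring the trailing comment and any leading host name. The helper names and the web-legacy-01.pub file are hypothetical, and it assumes you kept a copy of the key you registered.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and file names are hypothetical.

# pubkey_fields: read one key line on stdin and print "<type> <base64>".
# Accepts a .pub line ("type key [comment]") or an ssh-keyscan line
# ("host type key"). A host name that itself starts with "ssh-" would
# confuse this simple match.
pubkey_fields() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i ~ /^(ssh|ecdsa)-/) { print $i, $(i + 1); exit }
  }'
}

# same_key LINE_A LINE_B: succeed only if both lines carry the same key.
same_key() {
  local a b
  a=$(printf '%s\n' "$1" | pubkey_fields)
  b=$(printf '%s\n' "$2" | pubkey_fields)
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example, run from a host that can reach the node's sshd:
#   pinned=$(cat web-legacy-01.pub)
#   live=$(ssh-keyscan -t ed25519 web-legacy-01 2>/dev/null)
#   same_key "$pinned" "$live" && echo "pin matches" || echo "pin mismatch"
# Effective login restrictions on the node can be printed with:
#   sudo sshd -T | grep -Ei '^(allowusers|allowgroups|trustedusercakeys) '
```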
Time Skew and Certificate Validity
SSH certificates encode an absolute validity window. If a client's clock drifts more than a minute from the auth service, the cert is either "not yet valid" or "already expired" from the other side's perspective and the connection is refused.
- Symptoms. msh ssh returns "ssh: certificate not yet valid" or "ssh: certificate expired" immediately after a successful msh login, even though msh status shows a healthy cert.
- Fix. Ensure NTP / chrony is running on the workstation, every agent host, and the mezhub host. Air-gapped sites should at minimum have their hosts synchronised against a local NTP appliance.
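On hosts that run chrony, the current clock offset can be read from chronyc tracking and compared against the one-minute tolerance described above. This is a sketch under stated assumptions: the helper names are hypothetical, and the offset it reports is relative to the host's NTP source, not to the auth service directly.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# offset_from_tracking: read `chronyc tracking` output on stdin and print the
# magnitude of the system clock offset in seconds.
offset_from_tracking() {
  awk -F': ' '/^System time/ { split($2, f, " "); print f[1]; exit }'
}

# exceeds OFFSET LIMIT: succeed when OFFSET is greater than LIMIT (seconds).
exceeds() {
  awk -v o="$1" -v l="$2" 'BEGIN { exit !(o + 0 > l + 0) }'
}

# Example on a chrony-managed host:
#   offset=$(chronyc tracking | offset_from_tracking)
#   if exceeds "$offset" 60; then
#     echo "clock offset ${offset}s exceeds 60s"
#   fi
```

Run it on both ends of a failing connection. Two hosts that are each within tolerance of their own NTP source can still disagree if they use different sources.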
Recording Upload Problems
Session recordings flow from the agent → auth service → storage backend (local filesystem or S3). Failures in any of those legs are currently surfaced via the agent's structured logs rather than as dedicated audit events. Grep the mezd log for "uploading session recording" (async upload failures) or "opening recording stream" (node-sync stream failures).
- Async (node) mode. The agent buffers the recording locally and uploads after the session ends. If the upload fails (S3 outage, expired credentials, disk full on the auth side), the agent logs an "uploading session recording" error and the recording stays in the agent's MEZITE_DATA_DIR until the next successful upload. Check disk space on the agent and ACLs / IAM on the storage backend.
- Sync (node-sync) mode. The agent streams chunks in real time. If the recording stream cannot be opened (auth unreachable, credentials invalid), the agent logs "opening recording stream (falling back to local-only)" and degrades to local-only recording for the session. Confirm the auth address and the recording-storage backend are reachable.
- S3 specifically. Verify the bucket exists and the principal has s3:PutObject on the prefix. With MEZITE_S3_AUTO_CREATE_BUCKET off (the production default), the bucket must be created out-of-band.
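Because these failures only appear in the agent log, a quick count of the two messages tells you which leg is failing. The sketch below assumes mezd runs as a systemd unit named mezd; the helper name and the bucket name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the helper name and bucket name are hypothetical.

# recording_errors: read agent log lines on stdin and count the two
# recording-related messages named above.
recording_errors() {
  awk '
    /uploading session recording/ { async_n++ }
    /opening recording stream/    { stream_n++ }
    END { printf "async_upload=%d node_sync_stream=%d\n", async_n, stream_n }
  '
}

# Example:
#   journalctl -u mezd --since "1 hour ago" | recording_errors
# Confirm the bucket exists and is visible to the current credentials:
#   aws s3api head-bucket --bucket mezite-recordings
```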
Login Fails
- Local user. Confirm the user exists and is not locked.
  Inspect users and locks (bash):
      mezctl users list
      mezctl locks list
- SSO connector. The most common cause is a stale proxy.public_addr or proxy.oidc_issuer_url. The IdP redirect URI must match what mezhub advertises in its discovery document.
  Connector sanity checks (bash):
      mezctl connectors list
      # Discovery document the cluster advertises
      curl -fsSL https://mezite.example.com/.well-known/openid-configuration
- SAML. Check the IdP-side audit log for a signed assertion that the cluster received but rejected. Common causes: clock skew (see Time Skew and Certificate Validity), an expired IdP signing certificate, or Audience mismatch.
- LDAP / AD. Confirm bind_dn and bind_password match the directory; if the connector maps roles via groups_to_roles, verify the user is actually in the expected group with ldapsearch.
- GitHub OAuth. The user must be a member of one of the orgs/teams listed in teams_to_roles. Private org membership is not visible to the connector without GitHub's read:org scope; the user has to either set their org membership to "Public" or grant the connector the appropriate scope.
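For the SSO connector case, the issuer in the discovery document can be compared directly against the value you expect. This is a rough sketch with hypothetical helper names; it pulls the issuer field out with grep and sed to avoid extra dependencies, and a real JSON parser such as jq would be more robust. The comparison is deliberately exact, since a trailing-slash difference is itself a mismatch.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# issuer_of: read an OIDC discovery document on stdin and print its issuer.
issuer_of() {
  grep -o '"issuer"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# issuer_matches EXPECTED: succeed only if the document on stdin advertises
# exactly EXPECTED as its issuer.
issuer_matches() {
  local actual
  actual=$(issuer_of)
  [ -n "$actual" ] && [ "$actual" = "$1" ]
}

# Example:
#   curl -fsSL https://mezite.example.com/.well-known/openid-configuration \
#     | issuer_matches https://mezite.example.com \
#     && echo "issuer ok" || echo "issuer mismatch"
```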
Database Connection Issues
- Verify the MEZITE_DB_* variables match the cluster's actual database. mezhub logs the redacted DSN on boot.
- For PostgreSQL, confirm the user has the required privileges (the schema is owned by the configured user; migrations run as that user too).
  Probe the connection (bash):
      # Prefer PGPASSWORD over inlining the password into the URL;
      # the literal '****' in URL examples is a placeholder, not a valid password.
      PGPASSWORD="$MEZITE_DB_PASSWORD" \
        psql "postgres://mezite@localhost:5432/mezite?sslmode=disable" \
        -c "SELECT 1;"
- For SQLite, check the file's parent directory is writable by the mezhub user. mezhub creates the file on first start; a permission error there surfaces as "unable to open database file".
Backup and Restore
Day-2 operators want backup / restore documented separately from emergency runbooks. The high-level shape: take regular database backups, take regular recording-storage backups, and treat the CA signing keys as the most sensitive material on the cluster.
Database
- SQLite. Use sqlite3 .backup on a schedule. The on-disk file alone is not safe to copy while mezhub has it open.
  Hot backup (bash):
      # Use a directory the mezhub user can write to.
      # /var/backups is typically root:root or root:backup on most distros;
      # create a mezhub-owned subdirectory first (or back up under the mezhub
      # user's home, e.g. ~/.mezite/backups).
      sudo install -d -o mezhub -g mezhub /var/backups/mezite
      sqlite3 /var/lib/mezite/mezhub.db \
        ".backup '/var/backups/mezite/mezhub-$(date +%F).db'"
- PostgreSQL. Use your existing PG backup tooling (pg_dump, pg_basebackup, managed RDS snapshots, etc.). Coordinate with the schema migration in mezhub: a restore that's older than the running mezhub will need to re-run forward migrations on first start.
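A SQLite hot backup is only useful if the resulting file opens cleanly, so it is worth checking each one as it is taken. The sketch below wraps the .backup step shown above with SQLite's built-in integrity check. The helper names are hypothetical; the paths are the ones used in the example above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# backup_path DIR [DATE]: print the dated backup file name used above.
# DATE defaults to today in YYYY-MM-DD form.
backup_path() {
  printf '%s/mezhub-%s.db\n' "$1" "${2:-$(date +%F)}"
}

# verify_backup FILE: succeed only if SQLite reports the file as intact.
verify_backup() {
  [ "$(sqlite3 "$1" 'PRAGMA integrity_check;')" = "ok" ]
}

# Example:
#   f=$(backup_path /var/backups/mezite)
#   sqlite3 /var/lib/mezite/mezhub.db ".backup '$f'" && verify_backup "$f"
```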
Recording storage
- Local backend. Snapshot the recordings directory alongside the database backup. Recordings are append-only files plus an index, so snapshotting them with tar / rsync is safe.
- S3 backend. Enable S3 bucket versioning and cross-region replication on the bucket. Recordings are immutable once written; bucket-level versioning is what you want.
CA / key material
- Software-encrypted keys. The CA signing keys are stored inside the database, encrypted with AES-256-GCM under a key derived from MEZITE_CA_KEY_PASSPHRASE via Argon2id. A database backup is therefore also a backup of the CA keys, but only if the passphrase is backed up too. Store the passphrase in a secrets manager (1Password / HashiCorp Vault / AWS Secrets Manager) and back up the secret separately from the database backup.
- KMS-backed keys. The private key material never leaves AWS KMS, so a database backup contains the alias but not the key. KMS aliases are per-Region resources; they are not themselves multi-Region. For cross-Region DR, use AWS KMS multi-Region keys (a primary key in one Region replicated to others) and create an alias in each Region pointing at the local replica, or replicate the key material out-of-band and recreate the alias in the target Region. Without access to the original KMS key, a restored database cannot sign anything, which is the right safety property, but plan accordingly.
- Audit HMAC chain. If MEZITE_AUDIT_HMAC_KEY is rotated, the tamper-detection chain on existing audit events is invalidated. Keep the previous value alongside the new one during transition and verify the chain after rotation.
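To see why rotating the HMAC key invalidates the existing chain, consider a generic chained-MAC construction in which each link covers the previous link plus the current event. The sketch below is an illustration only: Mezite's actual chain format is not described here, the helper names and the "prev|event" layout are assumptions, and passing a key on the command line as shown exposes it in the process list, so do not do this with a real key.

```shell
#!/usr/bin/env bash
# Illustration only: this is NOT Mezite's audit format. Helper names and the
# "prev|event" link layout are assumptions made for the example.

# mac KEY DATA: hex HMAC-SHA256 of DATA under KEY.
mac() {
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{ print $NF }'
}

# chain KEY: read events on stdin, one per line, and print the final link.
# Each link covers the previous link plus the event, so changing the key or
# any earlier event changes every later link.
chain() {
  local key=$1 prev="" event
  while IFS= read -r event; do
    prev=$(mac "$key" "$prev|$event")
  done
  printf '%s\n' "$prev"
}

# Example: the same events verified under two different keys.
#   printf 'login\nssh\n' | chain old-key
#   printf 'login\nssh\n' | chain new-key
```

The two example invocations print different values for the same events, which is why events written under the previous key can only be verified with the previous key.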
Log Levels
Increase server log verbosity to diagnose issues. mezhub
reads its log level from the log.level field in
mezite.yaml or from the
MEZITE_LOG_LEVEL environment variable. There is no
--log-level CLI flag.
export MEZITE_LOG_LEVEL=debug
mezhub --config=mezite.yaml
Available levels: debug, info,
warn, error.
The mezd agent runs at zap's production-default level and does
not honour MEZITE_LOG_LEVEL. If you need more detail from the
agent while debugging, restart mezd and follow its full systemd
output with journalctl -u mezd -f rather than trying to dial in
the log level.