Mission Control Help

Plain-English guides for each tile. Click the anchors below to jump to a section.

Cloudflare Load Balancer (localreachwebdesign.com)

Green: primary healthy
Yellow: failover in use
Red: no healthy pools / API error
Advanced notes:
  • Primary is inferred from the first default pool; active is chosen from healthy default pools or fallback.
  • Mode values: primary | failover | degraded | down | unknown.
  • API errors include HTTP code + Cloudflare error codes/messages; token is never logged.
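The primary/active inference described above can be sketched as a simplified model. This is illustrative, not the collector's actual code: it assumes a `pools` array ordered like the LB's default pools, each with a boolean `healthy` flag from pool health checks.

```javascript
// Illustrative sketch of LB mode inference (not the collector's real code).
// Assumption: `pools` is ordered like the default pools, so pools[0] is
// the inferred primary; `healthy` comes from Cloudflare pool health.
function lbMode(pools) {
  if (!Array.isArray(pools) || pools.length === 0) return "unknown";
  const primary = pools[0];                     // first default pool = primary
  const healthy = pools.filter((p) => p.healthy);
  if (healthy.length === 0) return "down";      // no healthy pools -> red
  const active = healthy[0];                    // first healthy pool serves traffic
  if (active === primary) {
    // primary is serving; call it "degraded" if a fallback pool is unhealthy
    return healthy.length === pools.length ? "primary" : "degraded";
  }
  return "failover";                            // a non-primary pool is active
}
```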

HTTP Checks

Green: all home pages load
Yellow: deep pages slow/failing
Red: home page failing
Advanced notes:
  • Headers thresholds: warn ≥ 1500ms; fail > 2200ms (even if status=200).
  • Cloudflare Access-protected sites use client ID/secret headers from the environment.
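The two timings and the Headers thresholds can be illustrated with a small sketch. It assumes Node 18+ global `fetch`; `timePage` and `headersVerdict` are illustrative names, not the collector's actual functions.

```javascript
// Sketch of the two HTTP timings (assumes Node 18+ global fetch).
// headers_ms: fetch() resolves once response headers arrive.
// total_ms:   measured after the full body has been read.
async function timePage(url, headers = {}) {
  const t0 = Date.now();
  const res = await fetch(url, { headers });   // e.g. Access client ID/secret headers
  const headers_ms = Date.now() - t0;
  await res.text();                            // drain the body
  const total_ms = Date.now() - t0;
  return { status: res.status, headers_ms, total_ms };
}

// Threshold rule from the notes above: warn >= 1500 ms, fail > 2200 ms,
// even when the status is 200.
function headersVerdict(headers_ms) {
  if (headers_ms > 2200) return "fail";
  if (headers_ms >= 1500) return "warn";
  return "ok";
}
```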

SSL Expiry

Green: 30+ days left
Yellow: < 30 days
Red: < 7 days or error
Advanced notes:
  • TLS expiry checked via direct socket to the hostname on port 443.
  • Errors include invalid cert chain, timeout, or missing certificate.

Backup Integrity

Green: required files present
Yellow: missing data / unknown
Red: required files missing

Replication (WEST→EAST) (LRWD)

Green: ≤ 24h
Yellow: 24–72h
Red: > 72h or missing data

Uptime SLO (7 days / 30 days / 1 year)

Green: ≥ 99.9%
Yellow: 99.0–99.9%
Red: < 99.0%

Mission Control Context Prompt

Paste this at the top of future chats to provide full Mission Control context.

Mission Control Context Prompt (Leveling Prompt — UPDATED)

This is a leveling prompt. Acknowledge you understand this prompt, then WAIT for my task.
Use: Codex in `danger-full-access` sandbox mode (no approval prompts).

1) Mission Control summary (single source of truth)
- Mission Control is a static status dashboard for Local Reach Web Design.
- All UI renders from one snapshot JSON; the UI must not invent data.

2) Architecture flow (how updates move)
- Mission Control UI is static HTML/CSS/JS (no server rendering).
- The collector runs on TrueNAS (container-first), not the Mac.
  - Repo clone: /mnt/Piscine/infra/mission-control-collector/repo
  - Wrapper: /mnt/Piscine/infra/mission-control-collector/run-on-nas.sh
  - Logs: /mnt/Piscine/infra/mission-control-collector-logs/collector.log and collector.err.log
  - Schedule: Docker container mission-control-collector-scheduler (run on start + every 300s)
  - Manual run: /mnt/Piscine/infra/mc-run.sh (wraps run-on-nas.sh)
    - If run on a host without Node installed, it automatically re-execs inside the scheduler container.
  - From Mac: ssh -t truenas "sudo -n /mnt/Piscine/infra/mc-run.sh"
- The scheduler container mounts /mnt/Piscine/infra/mission-control-collector/container-ssh into /root/.ssh for SSH access.
- Watchdog: systemd timer mission-control-collector-watchdog.timer keeps the scheduler container healthy after reboots.
- The collector uploads the snapshot to Cloudflare R2 (wrangler CLI).
- The Mission Control site reads that snapshot via a Cloudflare Pages Function:
  - API endpoint: /api/snapshot
  - Implemented in: functions/api/snapshot.js
  - Function loads snapshot from R2 and returns it to the browser
- The browser loads /api/snapshot and renders from that data.
- /api/snapshot is protected by Cloudflare Access in production.
  - Symptom: HTTP 302 redirect to Access login (Access domain).
  - Debug: confirm latest.json locally first, then fetch via Access/service token.
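The service-token fetch in the debug step can be sketched like this (Node 18+). The `CF-Access-Client-Id`/`CF-Access-Client-Secret` headers are Cloudflare's standard service-token headers; the function names here are illustrative.

```javascript
// Debug sketch: fetch /api/snapshot through Cloudflare Access with a
// service token. Function names are illustrative.
async function fetchSnapshot(url, clientId, clientSecret) {
  const res = await fetch(url, {
    redirect: "manual",                         // a 302 means Access blocked us
    headers: {
      "CF-Access-Client-Id": clientId,
      "CF-Access-Client-Secret": clientSecret,
    },
  });
  if (res.status === 302) {
    throw new Error("Blocked by Cloudflare Access: " + res.headers.get("location"));
  }
  if (!res.ok) throw new Error("HTTP " + res.status);
  return res.json();
}

// A 302 to the team's Access domain is the telltale "blocked" symptom.
function isAccessRedirect(status, location) {
  return status === 302 && /cloudflareaccess\.com/.test(location || "");
}
```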

3) Dashboard layout model (3 layers)
Layer A - Impact Now (tiles, always visible)
Goal: "Are customers impacted right now?"
- Cloudflare Load Balancer status (active region + pool health)
- HTTP checks:
  - Headers = time-to-headers (fetch resolves when headers arrive)
  - Total = time-to-body (after reading response)
  - Edge = Cloudflare colo from cf-ray (cf_colo)
- SSL expiry runway
- Maintenance Due summary (next item + counts)
- Collector status (last run, age, trigger)
- Replication status tile (WEST->EAST content freshness)
- Sites (Layer A): show per-site Backup Freshness badge (✓/✗) from snapshot.backups.

Layer B - Will something break soon?
Goal: "What is trending toward failure?"
- Server resources / disk / early warnings (as implemented)

Layer C - Deep ops tables
Goal: "What needs action / details?"
- Sites table
- Servers table
- Maintenance table with runbook links

4) Key repos + files
Mission Control site repo (static dashboard):
- index.html (tile layout / sections)
- app.js (fetch snapshot, render tiles/tables)
- styles.css (UI styles)
- help.html (help docs and anchors)
- functions/api/snapshot.js (returns snapshot from R2)
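A hypothetical sketch of what functions/api/snapshot.js could look like. The R2 binding name `SNAPSHOTS` and the object key `latest.json` are assumptions (the real binding is configured in the Pages project), and the real file would `export` the handler.

```javascript
// Hypothetical sketch of a Pages Function that serves the snapshot from R2.
// Assumptions: an R2 bucket binding named SNAPSHOTS and a key of latest.json.
// In the real functions/api/snapshot.js this would be `export async function`.
async function onRequest({ env }) {
  const obj = await env.SNAPSHOTS.get("latest.json");     // R2 binding .get()
  if (!obj) return new Response("snapshot not found", { status: 404 });
  return new Response(obj.body, {
    headers: {
      "content-type": "application/json",
      "cache-control": "no-store",                        // never cache the snapshot
    },
  });
}
```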

Collector repo (runs on TrueNAS):
- /mnt/Piscine/infra/mission-control-collector/repo/collector.mjs (collects statuses + builds snapshot)
- /mnt/Piscine/infra/mission-control-collector/.env contains Cloudflare tokens / R2 config
- Output: /mnt/Piscine/infra/mission-control-collector/repo/latest.json

Backup system (runs on TrueNAS):
- /mnt/Piscine/infra/backup-websites/run-weekly.sh (weekly website backups, supports --only)
- /mnt/Piscine/infra/mission-control-collector/repo/backup-status.json (backup status input)

5) Snapshot structure (conceptual)
Snapshot JSON is the single source of truth and contains sections like:
- cloudflare_lb
- sites[].http.home.headers_ms (or legacy ttfb_ms) + total_ms + cf_ray + cf_colo
- ssl
- maintenance
- collector
- backups (from TrueNAS backup-status.json)
- replication.lrwd_west_to_east:
  - east_received_iso = timestamp of content on EAST
  - last_end_iso = last sync completion time
  - age_hours
  - status = ok/warn/fail
  - note (optional)
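The age/status fields above could be derived like this (illustrative sketch, using the ≤24h / 24–72h / >72h thresholds from the Replication tile help):

```javascript
// Sketch: derive replication tile fields from east_received_iso.
// ok <= 24h, warn 24-72h, fail > 72h or missing data. Illustrative only.
function replicationStatus(east_received_iso, now = Date.now()) {
  const received = Date.parse(east_received_iso);
  if (Number.isNaN(received)) {
    return { age_hours: null, status: "fail", note: "missing data" };
  }
  const age_hours = (now - received) / 3600000;   // ms per hour
  const status = age_hours <= 24 ? "ok" : age_hours <= 72 ? "warn" : "fail";
  return { age_hours, status };
}
```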

6) Replication (WEST->EAST) specifics
There is a pull-sync service on EAST that copies localreachwebdesign.com from WEST to EAST.
- EAST uses systemd: wsync.service + wsync.timer
- Script: /usr/local/sbin/wsync.sh reads /root/sync-map.csv
- Log: /var/log/wsync.log
- The current schedule is a systemd timer (daily at 03:15 UTC unless changed)
- Collector checks EAST via SSH and includes replication status in snapshot.
- Collector may use MC_EAST_SSH_HOST=localreach-east-01 to apply the correct SSH key via TrueNAS config.
- Replication tile shows last sync + last content timestamp + age/status.
- Hot standby readiness uses nginx_active plus a localhost HTTP check with Host header; timestamps are informational only.

7) Website backups (weekly, TrueNAS)
- Weekly full-restore backups run on TrueNAS via /mnt/Piscine/infra/backup-websites/run-weekly.sh (cron: Sundays 03:30 UTC).
  - Cron schedule timezone may vary (UTC/PST/etc.); folder naming is ALWAYS HST.
- Backups live at:
  /mnt/Piscine/Home/1. Local Reach Web Design/Automated Backups/website-backups/<domain>/<YYYY-MM-DD>
- Folder naming is HST date (Pacific/Honolulu). Do not rename old folders.
- Each run writes files.tar.gz, db.sql.gz (WP only), and manifest.json with run_date_hst + timestamp_utc.
- Status file:
  /mnt/Piscine/infra/mission-control-collector/repo/backup-status.json
  - Per-site last_success_utc is updated ONLY after a successful backup for that site.
  - On failure: record an error for that site and DO NOT overwrite last_success_utc.
- Backup freshness UI should turn RED once age exceeds 7 days (within 7 days = OK):
    * expected cadence: at least every 7 days
    * ok=true when age_days <= 7.0
    * ok=false when age_days > 7.0 or last_success_utc missing/invalid
- Collector embeds snapshot.backups from that file; UI shows ✓/✗ next to each site.
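The freshness rule can be sketched as follows (the helper name is illustrative; the real check lives in the backup tooling and app.js):

```javascript
// Sketch of the backup freshness rule: ok only when the last successful
// backup is at most 7.0 days old; missing/invalid timestamps are stale.
function backupFresh(last_success_utc, now = Date.now()) {
  const last = Date.parse(last_success_utc);
  if (Number.isNaN(last)) return { age_days: null, ok: false };
  const age_days = (now - last) / 86400000;    // ms per day
  return { age_days, ok: age_days <= 7.0 };
}
```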

8) Timezone rules (display-only)
- Collector/snapshot timestamps remain UTC ISO (do not change sources).
- UI requirement: display ALL timestamps in Pacific/Honolulu (HST, no DST) and label them "HST".
- If UI formatting is inconsistent, fix by routing all date rendering through one helper in app.js configured for Pacific/Honolulu.
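Such a helper could look like this (a sketch using the standard `Intl.DateTimeFormat` API; the name `formatHst` is illustrative):

```javascript
// Sketch of a single date-rendering helper for app.js: every UI timestamp
// goes through here, so everything displays in Pacific/Honolulu (HST, no
// DST) with an explicit "HST" label. Snapshot UTC ISO strings stay as-is.
function formatHst(isoUtc) {
  const d = new Date(isoUtc);
  if (Number.isNaN(d.getTime())) return "invalid date";
  const text = new Intl.DateTimeFormat("en-US", {
    timeZone: "Pacific/Honolulu",
    year: "numeric", month: "short", day: "2-digit",
    hour: "2-digit", minute: "2-digit", hour12: false,
  }).format(d);
  return text + " HST";
}
```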

9) Debug order for missing data
1) Collector output local (latest.json)
2) Upload succeeded (R2 updated)
3) /api/snapshot returns it (or Cloudflare Access is blocking)
4) app.js renders it

10) Status thresholds (summary; app.js is truth if this drifts)
- If this summary conflicts with app.js, app.js is the source of truth.
- Cloudflare LB: red if lb.error or active pool unhealthy; yellow if missing pools, active != primary, or failover not ready; green when active == primary and failover pool healthy.
- HTTP: red if any site http.home.ok or http.deep.ok is false; warn if TTFB >= 1500ms (HTTP_TTFB_WARN_MS); fail if TTFB > 2200ms (HTTP_TTFB_FAIL_MS).
- SSL: warn when days_remaining < 30; fail when days_remaining < 7; OK when >= 30.
- Replication: if replication.east_backups.*.latest_epoch exists, tile forced green; backup line warn if age > 24h, fail if age > 72h; otherwise uses replication.lrwd_west_to_east.status ok/warn/fail.
- Collector: stale if generated_at/collector.last_run_at older than 12 minutes; invalid timestamp = red.
- Servers: fail if disk.root_used_pct >= 90 or server.ok === false (or server.severity red); warn if disk.root_used_pct >= 80, reboot_required true, updates_pending >= 100, or any service ok === false.
- Maintenance: overdue (days_until_due < 0) is red; warn window is 7 days.
- Backups: ok if backup-status.json says ok === true (age_days <= 7.0); otherwise red.
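Two of these rules as a sketch (illustrative function names; app.js remains the source of truth if this drifts):

```javascript
// SSL runway: fail < 7 days, warn < 30 days, ok otherwise.
function sslLevel(days_remaining) {
  if (days_remaining < 7) return "fail";
  if (days_remaining < 30) return "warn";
  return "ok";
}

// Uptime SLO: green >= 99.9%, yellow 99.0-99.9%, red < 99.0%.
function sloLevel(uptime_pct) {
  if (uptime_pct >= 99.9) return "green";
  if (uptime_pct >= 99.0) return "yellow";
  return "red";
}
```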

11) How to work on changes (rules)
- Make minimal edits and keep layout consistent with the 3-layer model.
- Do not change snapshot schema unless explicitly required.
- If adding a new status:
  1) Add collector logic in collector.mjs
  2) Add to latest.json payload
  3) Ensure upload to R2 works
  4) Ensure /api/snapshot returns it
  5) Render in app.js
  6) Add help section in help.html

12) What I want from you (ChatGPT/Codex)
When I ask for Mission Control changes:
- Assume this architecture and layout.
- Ask for the smallest missing detail only if absolutely necessary.
- Provide step-by-step "replace this / add this after this" instructions.
- Prefer safe, incremental changes and include quick verification commands.

13) Collector debug flags (only when needed)
- MC_DEBUG_CF_LB=1 enables extra Cloudflare LB debug lines:
  - CF LB pools attached | <id>:<name>
  - CF LB pools origins | <name>:<origin_count>

14) Completion requirement (deploy + version)
- When work is complete: commit, push, and deploy to the live Mission Control site.
- Report the deployed version as: UI v: <git short sha> (example: UI v: bdc9ab4)

Collector Status

Green: fresh run
Yellow/Red: stale or failing
Advanced notes:
  • Collector runs in a TrueNAS container scheduler every 300s (timer trigger).
  • ENV is loaded from /mnt/Piscine/infra/mission-control-collector/.env for Cloudflare tokens and R2 upload.

Sites Table

Status: OK/WARN/FAIL
Shows Backup + Headers + SSL days
Advanced notes:
  • Status pill is derived from HTTP checks + SSL days < 7.
  • Backup column shows days since snapshot.backups.sites[domain].last_success_utc (computed client-side).
  • Backup pill is green when age_days <= 7.0; red when older or missing.
  • Rows are grouped by apex domain for quicker scanning.

Servers

Green: within thresholds
Yellow: approaching limits
Red: critical action required
Advanced notes:
  • Required services: nginx, clp-php-fpm, mysql, redis-server (PM2 omitted unless added in config).
  • Recommendations are capped at 3 per server and ordered by severity.

Public Exposure