Mission Control Help

Plain-English guides for each tile. Click the anchors below to jump to a section.

Cloudflare Load Balancer (localreachwebdesign.com)

Green: primary healthy
Yellow: failover in use
Red: no healthy pools / API error
Advanced notes:
  • Primary is inferred from the first default pool; active is chosen from healthy default pools or fallback.
  • Mode values: primary | failover | degraded | down | unknown.
  • API errors include HTTP code + Cloudflare error codes/messages; token is never logged.
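The primary/active inference described above can be sketched as a simplified model. This is illustrative, not the collector's actual code: it assumes a `pools` array ordered like the LB's default pools, each with a boolean `healthy` flag from pool health checks.

```javascript
// Illustrative sketch of LB mode inference (not the collector's real code).
// Assumption: `pools` is ordered like the default pools, so pools[0] is
// the inferred primary; `healthy` comes from Cloudflare pool health.
function lbMode(pools) {
  if (!Array.isArray(pools) || pools.length === 0) return "unknown";
  const primary = pools[0];                     // first default pool = primary
  const healthy = pools.filter((p) => p.healthy);
  if (healthy.length === 0) return "down";      // no healthy pools -> red
  const active = healthy[0];                    // first healthy pool serves traffic
  if (active === primary) {
    // primary is serving; call it "degraded" if a fallback pool is unhealthy
    return healthy.length === pools.length ? "primary" : "degraded";
  }
  return "failover";                            // a non-primary pool is active
}
```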

HTTP Checks

Green: all home pages load
Yellow: deep pages slow/failing
Red: home page failing
Advanced notes:
  • Headers thresholds: warn ≥ 1500ms; fail > 2200ms (even if status=200).
  • Cloudflare Access-protected sites use client ID/secret headers from the environment.
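The two timings and the Headers thresholds can be illustrated with a small sketch. It assumes Node 18+ global `fetch`; `timePage` and `headersVerdict` are illustrative names, not the collector's actual functions.

```javascript
// Sketch of the two HTTP timings (assumes Node 18+ global fetch).
// headers_ms: fetch() resolves once response headers arrive.
// total_ms:   measured after the full body has been read.
async function timePage(url, headers = {}) {
  const t0 = Date.now();
  const res = await fetch(url, { headers });   // e.g. Access client ID/secret headers
  const headers_ms = Date.now() - t0;
  await res.text();                            // drain the body
  const total_ms = Date.now() - t0;
  return { status: res.status, headers_ms, total_ms };
}

// Threshold rule from the notes above: warn >= 1500 ms, fail > 2200 ms,
// even when the status is 200.
function headersVerdict(headers_ms) {
  if (headers_ms > 2200) return "fail";
  if (headers_ms >= 1500) return "warn";
  return "ok";
}
```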

SSL Expiry

Green: 30+ days left
Yellow: < 30 days
Red: < 7 days or error
Advanced notes:
  • TLS expiry checked via direct socket to the hostname on port 443.
  • Errors include invalid cert chain, timeout, or missing certificate.

Backup Integrity

Green: required files present
Yellow: missing data / unknown
Red: required files missing

Replication (WEST→EAST) (LRWD)

Green: ≤ 24h
Yellow: 24–72h
Red: > 72h or missing data

Uptime SLO (7 days / 30 days / 1 year)

Green: ≥ 99.9%
Yellow: 99.0–99.9%
Red: < 99.0%

Mission Control Context Prompt

Paste this at the top of future chats to provide full Mission Control context.

Mission Control Context Prompt (Leveling Prompt — UPDATED)

This is a leveling prompt. Acknowledge you understand this prompt, then WAIT for my task.
Use: Codex in `danger-full-access` sandbox mode (no approval prompts).

1) Mission Control summary (single source of truth)
- Mission Control is a static status dashboard for Local Reach Web Design.
- All UI renders from one snapshot JSON; the UI must not invent data.

2) Architecture flow (how updates move)
- Mission Control UI is static HTML/CSS/JS (no server rendering).
- The collector runs on TrueNAS (container-first), not the Mac.
  - Repo clone: /mnt/Piscine/infra/mission-control-collector/repo
  - Wrapper: /mnt/Piscine/infra/mission-control-collector/run-on-nas.sh
  - Logs: /mnt/Piscine/infra/mission-control-collector-logs/collector.log and collector.err.log
  - Schedule: Docker container mission-control-collector-scheduler (run on start + every 300s)
  - Manual run: /mnt/Piscine/infra/mc-run.sh (wraps run-on-nas.sh)
    - If run on a host without Node installed, it automatically re-execs inside the scheduler container.
  - From Mac: ssh -t truenas "sudo -n /mnt/Piscine/infra/mc-run.sh"
- The scheduler container mounts /mnt/Piscine/infra/mission-control-collector/container-ssh into /root/.ssh for SSH access.
- Watchdog: systemd timer mission-control-collector-watchdog.timer keeps the scheduler container healthy after reboots.
- The collector uploads the snapshot to Cloudflare R2 (wrangler CLI).
- The Mission Control site reads that snapshot via a Cloudflare Pages Function:
  - API endpoint: /api/snapshot
  - Implemented in: functions/api/snapshot.js
  - Function loads snapshot from R2 and returns it to the browser
- The browser loads /api/snapshot and renders from that data.
- /api/snapshot is protected by Cloudflare Access in production.
  - Symptom: HTTP 302 redirect to Access login (Access domain).
  - Debug: confirm latest.json locally first, then fetch via Access/service token.
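The service-token fetch in the debug step can be sketched like this (Node 18+). The `CF-Access-Client-Id`/`CF-Access-Client-Secret` headers are Cloudflare's standard service-token headers; the function names here are illustrative.

```javascript
// Debug sketch: fetch /api/snapshot through Cloudflare Access with a
// service token. Function names are illustrative.
async function fetchSnapshot(url, clientId, clientSecret) {
  const res = await fetch(url, {
    redirect: "manual",                         // a 302 means Access blocked us
    headers: {
      "CF-Access-Client-Id": clientId,
      "CF-Access-Client-Secret": clientSecret,
    },
  });
  if (res.status === 302) {
    throw new Error("Blocked by Cloudflare Access: " + res.headers.get("location"));
  }
  if (!res.ok) throw new Error("HTTP " + res.status);
  return res.json();
}

// A 302 to the team's Access domain is the telltale "blocked" symptom.
function isAccessRedirect(status, location) {
  return status === 302 && /cloudflareaccess\.com/.test(location || "");
}
```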

3) Dashboard layout model (3 layers)
Layer A - Impact Now (tiles, always visible)
Goal: "Are customers impacted right now?"
- Cloudflare Load Balancer status (active region + pool health)
- HTTP checks:
  - Headers = time-to-headers (fetch resolves when headers arrive)
  - Total = time-to-body (after reading response)
  - Edge = Cloudflare colo from cf-ray (cf_colo)
- SSL expiry runway
- Maintenance Due summary (next item + counts)
- Collector status (last run, age, trigger)
- Replication status tile (WEST->EAST content freshness)
- Sites (Layer A): show per-site Backup Freshness badge (✓/✗) from snapshot.backups.

Layer B - Will something break soon?
Goal: "What is trending toward failure?"
- Server resources / disk / early warnings (as implemented)

Layer C - Deep ops tables
Goal: "What needs action / details?"
- Sites table
- Servers table
- Maintenance table with runbook links

4) Key repos + files
Mission Control site repo (static dashboard):
- index.html (tile layout / sections)
- app.js (fetch snapshot, render tiles/tables)
- styles.css (UI styles)
- help.html (help docs and anchors)
- functions/api/snapshot.js (returns snapshot from R2)
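A hypothetical sketch of what functions/api/snapshot.js could look like. The R2 binding name `SNAPSHOTS` and the object key `latest.json` are assumptions (the real binding is configured in the Pages project), and the real file would `export` the handler.

```javascript
// Hypothetical sketch of a Pages Function that serves the snapshot from R2.
// Assumptions: an R2 bucket binding named SNAPSHOTS and a key of latest.json.
// In the real functions/api/snapshot.js this would be `export async function`.
async function onRequest({ env }) {
  const obj = await env.SNAPSHOTS.get("latest.json");     // R2 binding .get()
  if (!obj) return new Response("snapshot not found", { status: 404 });
  return new Response(obj.body, {
    headers: {
      "content-type": "application/json",
      "cache-control": "no-store",                        // never cache the snapshot
    },
  });
}
```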

Collector repo (runs on TrueNAS):
- /mnt/Piscine/infra/mission-control-collector/repo/collector.mjs (collects statuses + builds snapshot)
- /mnt/Piscine/infra/mission-control-collector/.env contains Cloudflare tokens / R2 config
- Output: /mnt/Piscine/infra/mission-control-collector/repo/latest.json

Backup system (runs on TrueNAS):
- /mnt/Piscine/infra/backup-websites/run-weekly.sh (weekly website backups, supports --only)
- /mnt/Piscine/infra/mission-control-collector/repo/backup-status.json (backup status input)

5) Snapshot structure (conceptual)
Snapshot JSON is the single source of truth and contains sections like:
- cloudflare_lb
- sites[].http.home.headers_ms (or legacy ttfb_ms) + total_ms + cf_ray + cf_colo
- ssl
- maintenance
- collector
- backups (from TrueNAS backup-status.json)
- replication.lrwd_west_to_east:
  - east_received_iso = timestamp of content on EAST
  - last_end_iso = last sync completion time
  - age_hours
  - status = ok/warn/fail
  - note (optional)
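The age/status fields above could be derived like this (illustrative sketch, using the ≤24h / 24–72h / >72h thresholds from the Replication tile help):

```javascript
// Sketch: derive replication tile fields from east_received_iso.
// ok <= 24h, warn 24-72h, fail > 72h or missing data. Illustrative only.
function replicationStatus(east_received_iso, now = Date.now()) {
  const received = Date.parse(east_received_iso);
  if (Number.isNaN(received)) {
    return { age_hours: null, status: "fail", note: "missing data" };
  }
  const age_hours = (now - received) / 3600000;   // ms per hour
  const status = age_hours <= 24 ? "ok" : age_hours <= 72 ? "warn" : "fail";
  return { age_hours, status };
}
```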

6) Replication (WEST->EAST) specifics
There is a pull-sync service on EAST that copies localreachwebdesign.com from WEST to EAST.
- EAST uses systemd: wsync.service + wsync.timer
- Script: /usr/local/sbin/wsync.sh reads /root/sync-map.csv
- Log: /var/log/wsync.log
- The current schedule is a systemd timer (daily at 03:15 UTC unless changed)
- Collector checks EAST via SSH and includes replication status in snapshot.
- Collector may use MC_EAST_SSH_HOST=localreach-east-01 to apply the correct SSH key via TrueNAS config.
- Replication tile shows last sync + last content timestamp + age/status.
- Hot standby readiness uses nginx_active plus a localhost HTTP check with Host header; timestamps are informational only.

7) Website backups (weekly, TrueNAS)
- Weekly full-restore backups run on TrueNAS via /mnt/Piscine/infra/backup-websites/run-weekly.sh (cron: Sundays 03:30 UTC).
  - Cron schedule timezone may vary (UTC/PST/etc.); folder naming is ALWAYS HST.
- Backups live at:
  /mnt/Piscine/Home/1. Local Reach Web Design/Automated Backups/website-backups/<domain>/<YYYY-MM-DD>
- Folder naming is HST date (Pacific/Honolulu). Do not rename old folders.
- Each run writes files.tar.gz, db.sql.gz (WP only), and manifest.json with run_date_hst + timestamp_utc.
- Status file:
  /mnt/Piscine/infra/mission-control-collector/repo/backup-status.json
  - Per-site last_success_utc is updated ONLY after a successful backup for that site.
  - On failure: record an error for that site and DO NOT overwrite last_success_utc.
- Backup freshness UI should turn RED once age exceeds 7 days (within 7 days = OK):
    * expected cadence: at least every 7 days
    * ok=true when age_days <= 7.0
    * ok=false when age_days > 7.0 or last_success_utc missing/invalid
- Collector embeds snapshot.backups from that file; UI shows ✓/✗ next to each site.
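The freshness rule can be sketched as follows (the helper name is illustrative; the real check lives in the backup tooling and app.js):

```javascript
// Sketch of the backup freshness rule: ok only when the last successful
// backup is at most 7.0 days old; missing/invalid timestamps are stale.
function backupFresh(last_success_utc, now = Date.now()) {
  const last = Date.parse(last_success_utc);
  if (Number.isNaN(last)) return { age_days: null, ok: false };
  const age_days = (now - last) / 86400000;    // ms per day
  return { age_days, ok: age_days <= 7.0 };
}
```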

8) Timezone rules (display-only)
- Collector/snapshot timestamps remain UTC ISO (do not change sources).
- UI requirement: display ALL timestamps in Pacific/Honolulu (HST, no DST) and label them "HST".
- If UI formatting is inconsistent, fix by routing all date rendering through one helper in app.js configured for Pacific/Honolulu.
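Such a helper could look like this (a sketch using the standard `Intl.DateTimeFormat` API; the name `formatHst` is illustrative):

```javascript
// Sketch of a single date-rendering helper for app.js: every UI timestamp
// goes through here, so everything displays in Pacific/Honolulu (HST, no
// DST) with an explicit "HST" label. Snapshot UTC ISO strings stay as-is.
function formatHst(isoUtc) {
  const d = new Date(isoUtc);
  if (Number.isNaN(d.getTime())) return "invalid date";
  const text = new Intl.DateTimeFormat("en-US", {
    timeZone: "Pacific/Honolulu",
    year: "numeric", month: "short", day: "2-digit",
    hour: "2-digit", minute: "2-digit", hour12: false,
  }).format(d);
  return text + " HST";
}
```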

9) Debug order for missing data
1) Collector output local (latest.json)
2) Upload succeeded (R2 updated)
3) /api/snapshot returns it (or Cloudflare Access is blocking)
4) app.js renders it

10) Status thresholds (summary; app.js is truth if this drifts)
- If this summary conflicts with app.js, app.js is the source of truth.
- Cloudflare LB: red if lb.error or active pool unhealthy; yellow if missing pools, active != primary, or failover not ready; green when active == primary and failover pool healthy.
- HTTP: red if any site http.home.ok or http.deep.ok is false; warn if TTFB >= 1500ms (HTTP_TTFB_WARN_MS); fail if TTFB > 2200ms (HTTP_TTFB_FAIL_MS).
- SSL: warn when days_remaining < 30; fail when days_remaining < 7; OK when >= 30.
- Replication: if replication.east_backups.*.latest_epoch exists, tile forced green; backup line warn if age > 24h, fail if age > 72h; otherwise uses replication.lrwd_west_to_east.status ok/warn/fail.
- Collector: stale if generated_at/collector.last_run_at older than 12 minutes; invalid timestamp = red.
- Servers: fail if disk.root_used_pct >= 90 or server.ok === false (or server.severity red); warn if disk.root_used_pct >= 80, reboot_required true, updates_pending >= 100, or any service ok === false.
- Maintenance: overdue (days_until_due < 0) is red; warn window is 7 days.
- Backups: ok if backup-status.json says ok === true (age_days <= 7.0); otherwise red.
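Two of these rules as a sketch (illustrative function names; app.js remains the source of truth if this drifts):

```javascript
// SSL runway: fail < 7 days, warn < 30 days, ok otherwise.
function sslLevel(days_remaining) {
  if (days_remaining < 7) return "fail";
  if (days_remaining < 30) return "warn";
  return "ok";
}

// Uptime SLO: green >= 99.9%, yellow 99.0-99.9%, red < 99.0%.
function sloLevel(uptime_pct) {
  if (uptime_pct >= 99.9) return "green";
  if (uptime_pct >= 99.0) return "yellow";
  return "red";
}
```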

11) How to work on changes (rules)
- Make minimal edits and keep layout consistent with the 3-layer model.
- Do not change snapshot schema unless explicitly required.
- If adding a new status:
  1) Add collector logic in collector.mjs
  2) Add to latest.json payload
  3) Ensure upload to R2 works
  4) Ensure /api/snapshot returns it
  5) Render in app.js
  6) Add help section in help.html

12) What I want from you (ChatGPT/Codex)
When I ask for Mission Control changes:
- Assume this architecture and layout.
- Ask for the smallest missing detail only if absolutely necessary.
- Provide step-by-step "replace this / add this after this" instructions.
- Prefer safe, incremental changes and include quick verification commands.

13) Collector debug flags (only when needed)
- MC_DEBUG_CF_LB=1 enables extra Cloudflare LB debug lines:
  - CF LB pools attached | <id>:<name>
  - CF LB pools origins | <name>:<origin_count>

14) Completion requirement (deploy + version)
- When work is complete: commit, push, and deploy to the live Mission Control site.
- Report the deployed version as: UI v: <git short sha> (example: UI v: bdc9ab4)

Collector Status

Green: fresh run
Yellow/Red: stale or failing
Advanced notes:
  • Collector runs in a TrueNAS container scheduler every 300s (timer trigger).
  • ENV is loaded from /mnt/Piscine/infra/mission-control-collector/.env for Cloudflare tokens and R2 upload.

Sites Table

Status: OK/WARN/FAIL
Shows Backup + Headers + SSL days
Advanced notes:
  • Status pill is derived from HTTP checks + SSL days < 7.
  • Backup column shows days since snapshot.backups.sites[domain].last_success_utc (computed client-side).
  • Backup pill is green when age_days <= 7.0; red when older or missing.
  • Rows are grouped by apex domain for quicker scanning.

Servers

Green: within thresholds
Yellow: approaching limits
Red: critical action required
Advanced notes:
  • Required services: nginx, clp-php-fpm, mysql, redis-server (PM2 omitted unless added in config).
  • Recommendations are capped at 3 per server and ordered by severity.

Public Exposure