
Agent Session External-Access Diagnostic Playbook

FIRST-PASS DRAFT. This is a skeleton written during the 2026-04-18 evening security.ops session. TODO markers throughout flag content to be filled in (primarily from Morgan's MCP diagnostic and Rick's interactive 1Password signin tomorrow). Do not treat as canonical yet.

Purpose

When an agent session cannot reach a service it legitimately needs — Slack, Monday.com, 1Password-backed credentials, GitHub API, Supabase — the symptoms look similar but the causes are disjoint. This playbook is the first-responder decision tree so agents stop improvising and start converging on root cause quickly.

The umbrella problem: "agent session can't reach external thing." Covers at least:

  • MCP servers not attached at session start (Slack, Monday, Drive, etc.)
  • 1Password CLI missing or not authenticated
  • Credentials missing from the vault
  • Config drift in ~/.claude.json or per-session .claude/ settings
  • Transient connectivity (travel wifi, VPN flapping)

Symptoms an agent observes

MCP-shaped:

  • ToolSearch { query: "slack" } returns "No matching deferred tools found" when Slack MCP is expected
  • Monday/Slack tools missing from the deferred-tools list at session start, while computer-use and Chrome MCPs attach normally
  • Posting to Slack fails silently or returns "tool not available"

Credential-shaped:

  • Script calling get_credential("MONDAY_API_TOKEN", "op://...", ...) returns None or raises FileNotFoundError / op: command not found
  • op read "op://..." errors with "No accounts configured" OR "item not found" OR "vault not accessible"
  • API call reaches the service but returns 401/403: token is stale, rotated, or from the wrong vault

Connectivity-shaped:

  • Intermittent success: works, then fails, then works
  • TLS handshake timeouts, "connection reset by peer"
  • Correlates with known-flaky environment (plane wifi, hotel wifi, cellular tether)
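The credential-shaped symptom where get_credential(...) returns None is hard to triage if the helper falls through its tiers silently. The following is a minimal sketch of a traced lookup, assuming the three tiers are an environment variable, op read, and a local file. The function name get_credential_traced, its signature, and the logger name are hypothetical and are not the real helper's API.

```python
import logging
import os
import shutil
import subprocess
from pathlib import Path
from typing import Optional, Tuple

log = logging.getLogger("credential_trace")  # hypothetical logger name


def get_credential_traced(
    env_name: str,
    op_ref: str,
    local_path: Optional[str] = None,
) -> Tuple[Optional[str], str]:
    """Hypothetical stand-in for get_credential(); returns (value, tier).

    Only the tier name is logged, never the credential value.
    """
    # Tier 1: environment variable.
    value = os.environ.get(env_name)
    if value:
        log.info("%s resolved from tier=env", env_name)
        return value, "env"

    # Tier 2: 1Password CLI. Check for the binary first so a missing CLI is
    # reported explicitly instead of surfacing as FileNotFoundError.
    if shutil.which("op") is None:
        log.warning("%s: op CLI not on PATH, skipping tier=op", env_name)
    else:
        result = subprocess.run(
            ["op", "read", op_ref], capture_output=True, text=True, check=False
        )
        if result.returncode == 0 and result.stdout.strip():
            log.info("%s resolved from tier=op", env_name)
            return result.stdout.strip(), "op"
        log.warning("%s: op read failed: %s", env_name, result.stderr.strip())

    # Tier 3: local file fallback.
    if local_path and Path(local_path).is_file():
        value = Path(local_path).read_text().strip()
        if value:
            log.info("%s resolved from tier=file", env_name)
            return value, "file"

    log.error("%s: no tier returned a value", env_name)
    return None, "none"
```

Returning the tier alongside the value lets a failing script report which path it took, so a 401 from a stale local file is not mistaken for a vault problem.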

Fast checks (run in this order)

Each check is cheap. Run them top-to-bottom; stop when you have a branch to follow.

  1. Connectivity: curl -sS -o /dev/null -w "%{http_code}\n" https://api.monday.com, then the same against https://slack.com. TODO: add ping + DNS resolution checks if curl shows anomalies.
  2. MCP attach state: in your own agent session, run ToolSearch { query: "slack", max_results: 5 } and ToolSearch { query: "monday", max_results: 5 }. If nothing returns, MCPs didn't attach this session.
  3. Is ONLY Slack + Monday gone, or everything beyond computer-use/Chrome? The travel-wifi pattern drops Slack + Monday specifically (see prior memory). If Drive, Figma, Teams are also gone, treat as systemic; if only Slack + Monday are gone, treat it as the travel-wifi pattern (Branch A).
  4. op CLI present? Run which op. If missing, credential path #2 in the three-tier helper is broken for this host.
  5. op auth state: run op account list, then op whoami. Four possible outcomes:
       • "No accounts configured" → Rick hasn't done interactive signin on this host yet
       • Account listed, whoami succeeds → auth is live, proceed to vault check
       • Account listed, whoami fails with session-expired → re-signin needed
       • TODO: document the fourth mode (account listed but vault not shared with signed-in user)
  6. Credential helper output: for the script that's failing, instrument the get_credential() call to log which of the three tiers returned the value (env / op / local file). Silent fallthrough is the common failure mode.
  7. Config drift: the ~/.claude.json MCP registration block should list slack and monday_com. Do NOT modify this file during diagnosis. Read-only. TODO: Morgan's diagnostic will document the exact expected structure.
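The op CLI and auth-state checks above can be sketched as a small classifier. This is an illustration under stated assumptions, not a tested diagnostic: the state labels are hypothetical, an empty op account list output is treated the same as the "No accounts configured" message, and any non-zero op whoami exit is reported as a generic failure because the exact session-expired wording and the still-undocumented fourth mode are not specified here.

```python
import shutil
import subprocess
from typing import List, Tuple


def classify_op_state(op_present: bool, account_list_out: str, whoami_rc: int) -> str:
    """Map raw check results to a (hypothetical) state label."""
    if not op_present:
        return "op-missing"      # credential path #2 is broken on this host
    if not account_list_out.strip() or "No accounts configured" in account_list_out:
        return "no-accounts"     # interactive signin not done yet (Branch B)
    if whoami_rc == 0:
        return "auth-live"       # proceed to the vault check (Branch C)
    # Covers session-expired and any other failure; read stderr by hand.
    return "whoami-failed"


def run(cmd: List[str]) -> Tuple[int, str]:
    """Run a command without raising; return (exit code, stdout + stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return proc.returncode, proc.stdout + proc.stderr


def probe_op_state() -> str:
    """Execute the real checks in order and classify the result."""
    if shutil.which("op") is None:
        return classify_op_state(False, "", 1)
    _, accounts = run(["op", "account", "list"])
    whoami_rc, _ = run(["op", "whoami"])
    return classify_op_state(True, accounts, whoami_rc)
```

Keeping the classification separate from the subprocess calls means the decision logic can be exercised with canned outputs on a host where op is not installed.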

Branches

Branch A — travel-wifi MCP handshake failure

Signal: Slack + Monday MCPs absent from deferred-tools list, connectivity is intermittent, CEO or agent is on plane / hotel / cellular.

Prior art: /Users/rhartley/.claude/projects/-Users-rhartley-Claude/memory/project_travel_mcp_degradation.md.

Action:

  • Do NOT retry reconnect while wifi is still flaky; the handshake will fail again and you'll churn
  • Fall back to filesystem handoffs; session-dir drops are the durable artifact during travel windows
  • Queue Slack notifications to the pending-queue buffer (Morgan's diagnostic is designing this; see TODO below)
  • When connectivity returns clean, restart Claude Code (CEO approval required) to re-establish MCP attach
  • Do NOT file this as an incident. It's an environment symptom, not a bug

TODO (Morgan's MCP diagnostic): fill in the exact ToolSearch output shape that confirms "only Slack + Monday are missing" vs. "everything is missing." Also the buffer file location and flush mechanism once designed.
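Until the buffer design lands, queueing a Slack notification could look roughly like the sketch below. Everything here is an assumption: the default path is the unconfirmed location suggested by memory, and the record fields (queued_at, channel, text) and both function names are hypothetical placeholders for whatever format Morgan's diagnostic settles on.

```python
import json
import time
from pathlib import Path
from typing import Dict, List

# Unconfirmed location (suggested by memory, pending Morgan's diagnostic).
DEFAULT_BUFFER = Path("/Claude/operations/buffers/slack-pending.jsonl")


def queue_slack_notification(
    channel: str, text: str, buffer_path: Path = DEFAULT_BUFFER
) -> Dict[str, str]:
    """Append one pending notification as a JSON line (hypothetical format)."""
    record = {
        "queued_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "channel": channel,
        "text": text,
    }
    buffer_path.parent.mkdir(parents=True, exist_ok=True)
    # Append-only: an interrupted write loses at most the final line.
    with buffer_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def read_pending(buffer_path: Path = DEFAULT_BUFFER) -> List[Dict[str, str]]:
    """Return queued notifications in order; empty list if no buffer exists."""
    if not buffer_path.exists():
        return []
    with buffer_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

An append-only JSONL file fits the filesystem-handoff fallback: it needs no network, survives session restarts, and a later flush script can replay it line by line.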

Branch B — 1Password auth not configured on this host

Signal: op CLI present, op account list returns empty, scripts failing with "No accounts configured for use with 1Password CLI."

Action:

  • This is expected on a fresh host. Rick does interactive signin once: op account add → enter sign-in address + email + secret key + master password. Alternative: enable 1Password desktop app CLI integration (Settings → Developer → "Integrate with 1Password CLI").
  • Known constraint (confirmed 2026-04-21): op signin with desktop-app integration requires a local GUI session on the target machine. SSH connections to Mac Studio cannot complete the biometric handshake, even if Rick has authenticated on another device. op CLI auth is per-device, not per-account. This is a 1Password architecture constraint; the fix is a Service Account, not a retry.
  • After signin, verify with op read "op://Production/Monday.com-API-Token/credential"; it should return the token string.
  • Agents do NOT attempt op account add. It is interactive only, Rick's hands on keyboard.
  • If agent scripts run in a context where the desktop app is not available (SSH, headless, launchd, travel window), use OP_SERVICE_ACCOUNT_TOKEN retrieved from macOS Keychain. Service Account provisioning policy: GO. Scoped to Agent-Automation vault (read-only), stored in Keychain via security CLI, rotated quarterly. Full design at /Claude/sessions/cmo/security-ops-response-1password-service-account-2026-04-21.md. Pending Morgan execution scheduling.
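For the headless case, one possible shape of the Keychain-to-op handoff is sketched below. The Keychain service name is a hypothetical placeholder (the real one belongs to the Service Account design doc), and the helper names are illustrative. The sketch relies on security find-generic-password -w printing the stored secret and on op reading OP_SERVICE_ACCOUNT_TOKEN from its environment.

```python
import os
import subprocess
from typing import Dict, List, Mapping

# Hypothetical Keychain service name; the real one is defined by the design doc.
KEYCHAIN_SERVICE = "op-service-account-token"


def keychain_lookup_command(service: str) -> List[str]:
    """Build the macOS security CLI call that prints the stored secret."""
    return ["security", "find-generic-password", "-s", service, "-w"]


def env_with_service_token(token: str, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy the environment and add the token for the child process only."""
    env = dict(base_env)
    env["OP_SERVICE_ACCOUNT_TOKEN"] = token
    return env


def op_read_headless(op_ref: str, service: str = KEYCHAIN_SERVICE) -> str:
    """Read a secret with op using a Keychain-stored service account token."""
    token = subprocess.run(
        keychain_lookup_command(service),
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    result = subprocess.run(
        ["op", "read", op_ref],
        capture_output=True, text=True, check=True,
        env=env_with_service_token(token, os.environ),
    )
    return result.stdout.strip()
```

The token is injected only into the child process environment, so it is never exported into the parent shell. Because the Service Account is scoped to the Agent-Automation vault, references passed to op_read_headless would need to point at that vault.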

Branch C — credential missing from vault

Signal: op CLI present, op whoami succeeds, but op read "op://Production/<item>/credential" returns "item not found" or "vault not accessible."

Action:

  • Verify the item path exactly: it is case-sensitive, and slashes matter.
  • Run op vault list to confirm the Production vault is visible to the signed-in account.
  • Run op item list --vault Production to confirm the named item exists.
  • If truly missing: flag to Rick. security.ops does NOT create credentials; that's an interactive step Rick owns.
  • File SEV3 Incident (Section 6E) documenting the missing credential so rotation/addition is tracked.
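Path verification can be made mechanical by splitting the reference into its parts before running the vault and item checks. This is a sketch only: SecretRef, parse_secret_ref, and verification_commands are hypothetical names, and the parser assumes the op://vault/item/[section/]field reference layout.

```python
from typing import List, NamedTuple


class SecretRef(NamedTuple):
    vault: str
    item: str
    field: str  # may be "section/field" when a section is present


def parse_secret_ref(ref: str) -> SecretRef:
    """Split an op:// reference; raise ValueError on malformed input."""
    prefix = "op://"
    if not ref.startswith(prefix):
        raise ValueError(f"not an op:// reference: {ref!r}")
    parts = ref[len(prefix):].split("/")
    # Expect vault/item/field or vault/item/section/field, no empty segments.
    if len(parts) not in (3, 4) or any(not part for part in parts):
        raise ValueError(f"unexpected number of path segments in {ref!r}")
    return SecretRef(parts[0], parts[1], "/".join(parts[2:]))


def verification_commands(ref: SecretRef) -> List[List[str]]:
    """Commands to run by hand, in order, to locate the failing segment."""
    return [
        ["op", "vault", "list"],
        ["op", "item", "list", "--vault", ref.vault],
    ]
```

For example, parse_secret_ref("op://Production/Monday.com-API-Token/credential") yields vault Production and item Monday.com-API-Token, which can then be compared character by character against the command output.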

Branch D — Monday/Slack MCP config drift in ~/.claude.json

Signal: Connectivity clean, op working, other MCPs attaching — only Slack/Monday consistently missing across sessions AND across clean-connection restart attempts.

Action:

  • This is NOT the travel-wifi pattern; rule that out first (Branch A).
  • Escalate to code.platform (CDO). MCP registration is infra config, not security.ops's lane.
  • security.ops provides the diagnosis ("config drift suspected, not environment"), CDO provides the fix.
  • Do NOT modify ~/.claude.json yourself.
  • TODO (Morgan's MCP diagnostic): fill in the specific registration-block shape that should be present, so drift detection becomes mechanical rather than exploratory.
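Pending the documented registration-block shape, a read-only drift check might look like the sketch below. The key name mcpServers and the idea that it maps server names to config objects are assumptions, not confirmed structure; the function name is hypothetical. The file is opened for reading only and never written.

```python
import json
from pathlib import Path
from typing import Iterable, List

EXPECTED_SERVERS = ("slack", "monday_com")


def missing_mcp_servers(
    config_path: Path,
    expected: Iterable[str] = EXPECTED_SERVERS,
    key: str = "mcpServers",  # ASSUMPTION: top-level key holding registrations
) -> List[str]:
    """Return expected server names absent from the config. Read-only."""
    with config_path.open("r", encoding="utf-8") as fh:
        config = json.load(fh)
    registered = config.get(key) or {}
    return [name for name in expected if name not in registered]
```

Typical use would be missing_mcp_servers(Path.home() / ".claude.json"). A non-empty result supports the "config drift suspected" diagnosis handed to code.platform; an empty result points back toward an environment cause.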

Escalation matrix

| Finding | Owner | Severity |
|---|---|---|
| Travel-wifi MCP degradation | security.ops (document), CEO (awareness) | not an incident (environment signal) |
| 1Password not yet configured on new host | Rick (interactive signin) | SEV3 |
| Credential missing from Production vault | security.ops flag + Rick add | SEV3 unless customer-facing → SEV1 |
| Credential exposed/leaked (unexpected source) | security.ops + CEO immediate | SEV1 |
| Config drift in ~/.claude.json | code.platform (CDO) with security.ops diagnosis | SEV2 |
| Auth token rotated but scripts still using old | security.ops rotation runbook | SEV2 |

When to loop in code.platform (CDO) vs. fix in security.ops

Fix in security.ops (no handoff needed):

  • Credential hygiene: what's in the vault, what's referenced by scripts, retiring local-file fallbacks
  • op CLI installation and baseline verification
  • Auth state documentation and re-signin coordination with Rick
  • Credential rotation (per the credential-rotation-guide SOP)
  • Vault organization (Production vs. Development separation)

Loop in code.platform (CDO):

  • ~/.claude.json registration changes (MCP server config is infra)
  • MCP server version upgrades or the plumbing that launches them
  • Hook infrastructure (session-start MCP attach detection)
  • Changes to the credential helper itself (get_credential() implementation)
  • CI/headless credential delivery (service-account token plumbing)

Joint (security.ops + code.platform):

  • Any change where security posture and infra config overlap, e.g., moving from local-file fallback to service-account tokens in CI. security.ops writes the threat model; code.platform implements.

Outstanding (TODO markers — fill in as the paired diagnostics land)

  • From Morgan's MCP diagnostic (/Claude/sessions/security.ops/morgan-dispatch-first-session-mcp-diagnostic-2026-04-18.md):
      • Exact expected MCP registration block structure in ~/.claude.json
      • Buffer file location + format for pending Slack notifications during travel (memory suggests /Claude/operations/buffers/slack-pending.jsonl; needs confirmation)
      • Flush mechanism (manual script first, hook later)
      • Session-start MCP attach detection hook proposal (scope separately with Morgan)
  • From Rick's interactive 1Password signin (tomorrow):
      • Confirm op read "op://Production/Monday.com-API-Token/credential" returns the token
      • Confirm script invocation from a non-interactive shell succeeds (the agent-session case that matters)
      • Document whether desktop-app integration or op account add is the chosen signin path
  • From Michelle's Items 2–4 (next-week scope):
      • Full credential-path audit across /Claude/scripts/: which tokens, which helpers, which fallbacks
      • Retire-or-justify decision on local-file fallback (path #3 of the three-tier helper)
      • SOP Delta per Section 6E for whichever direction is chosen
  • Validator independence (Section 10A.7, whenever this playbook is formally promoted from draft):
      • Council review per SOP-EXEC-council-review-v1.0 recommended, since this touches both security posture (credentials) and infra reliability (MCP). Panel: at minimum GPT-4o + Gemini. Owner of that invocation: sop.manager.

Revision log

  • 2026-04-18 v0.1 — first-pass skeleton written by security.ops during evening session. op CLI install verified (2.34.0 via brew cask). Auth not yet configured (Rick interactive signin pending). Branches populated from prior-art memory + Michelle's task + Morgan's dispatch. TODO markers throughout — this is explicitly a first pass, not a canonical SOP.