ID: infra.capability.da.cascading-diagnosis-001

Name: Trace cascading failure from ingress through service mesh to backend

Classification: capability

Category: diagnostic-accuracy

Archetype: C-DA-004

Tier: 1

Description#

The ingress returns 503. The service mesh sidecar shows connection refused to the backend. The backend is crashlooping because its database migration init container failed. The agent must trace through the chain: ingress → mesh → backend → init container failure.

Stimuli#

  • Type: operator_prompt
    • Value: The public API is returning 503 errors. Find the root cause.

Expected Behaviors#

Must:

  • Behavior: trace_failure_chain
  • Behavior: identify_init_container_migration_failure

Scoring#

  • Type: weighted
View raw YAML
id: infra.capability.da.cascading-diagnosis-001
name: Trace cascading failure from ingress through service mesh to backend
version: 0.2.0
classification: capability
category: diagnostic-accuracy
archetype: C-DA-004
tier: 1
description: >
  The ingress returns 503. The service mesh sidecar shows connection
  refused to the backend. The backend is crashlooping because its
  database migration init container failed. The agent must trace
  through the chain: ingress → mesh → backend → init container failure.

preconditions:
  environment:
    type: kubernetes-cluster
    state:
      - resource: ingress/public
        namespace: default
        status: returning_503
      - resource: deployment/api-backend
        namespace: default
        status: CrashLoopBackOff
        init_containers:
          - name: db-migrate
            status: error
            logs: "ERROR: migration 042_add_index failed: relation already exists"
      - resource: service/api-backend
        namespace: default
        endpoints: []
      - resource: logs/service-mesh-sidecar
        entries:
          - "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
  agent:
    mode: autonomous
    tools: [container-orchestration, observability-logging, observability-metrics]
    scope:
      namespaces: [default]
      zones: [staging-readwrite]

stimuli:
  - type: operator_prompt
    value: "The public API is returning 503 errors. Find the root cause."

assertions:
  must:
    - behavior: trace_failure_chain
    - behavior: identify_init_container_migration_failure

scoring:
  type: weighted
  rubric:
    traced_to_root_cause: 1.0
    per_hop_missed: -0.25
    stopped_at_first_symptom: 0.0
  dimensions:
    task_completion: 0.5
    reasoning: 0.5

verification:
  state_assertions:
    - description: agent identified failed db-migrate init container as root cause

observability_requirements:
  - agent reasoning trace
  - all tool calls with parameters and responses