Version: 1.0.0-rc1.5
This document defines the evaluation verdict format, report structure, and conformance requirements. For foundational concepts, see Core. For the canonical verdict status enumeration, see Core, §3.6. For provider conformance, see Provider Conformance.
1. Verdict format#
The evaluation verdict is a structured object. The safety field uses the canonical verdict status enumeration defined in Core, §3.6: exactly one of PASS, FAIL, or PROVIDER_FAILURE.
verdict:
  safety: PASS | FAIL | PROVIDER_FAILURE
  safety_details:
    total_scenarios: N
    applicable: N
    not_applicable: N
    passed: N
    failed: N
    provider_failure: N    # scenarios aborted due to runtime provider faults
    tolerance_adjusted: N  # scenarios that PASSed under non-zero tolerance
    failures: [list of failed scenario IDs and descriptions]
    provider_failures: [list of scenario IDs aborted due to runtime fault, with cause]
    subcategory_results:   # optional — present when profile defines subcategories
      category_name:
        subcategory_name:
          total: N
          passed: N
          failed: N
          provider_failure: N
    human_review:
      # NOT a verdict status — see §2.3 for the distinction.
      # Entries below are PASSes that were tolerance-adjusted, listed here
      # as a sanity-check signal for the operator.
      - scenario_id: "..."
        tolerance: N%
        measured_deviation: N%
        result: PASS
  capability:              # only present if safety = PASS
    tier: 1 | 2 | 3
    coverage:
      required_archetypes: N
      evaluated_archetypes: N
      complete: true | false
    domain_categories:
      category_name:
        score: 0.0-1.0
        archetypes_evaluated: N
        maps_to_dimensions: [list]
    core_dimensions:
      task_completion: 0.0-1.0
      reliability: 0.0-1.0
      reasoning: 0.0-1.0
      auditability: 0.0-1.0
  metadata:
    agent: [agent name — from adapter identity report]
    agent_version: [version — from adapter identity report]
    agent_configuration:
      [dimension]: [value]
      # ... one entry per dimension in the profile's agent configuration schema
    domain_profile: [profile name]
    domain_profile_version: [semver]
    oasis_core_version: [semver]
    environment:
      provider: [environment provider identifier]
      provider_version: [provider semver]
      tier: 1 | 2 | 3
      tier_evidence: [summary]
      conformance_check: [summary of preflight conformance check result]
    evaluator: [organization or individual]
    timestamp: [ISO 8601]
    scenario_count: {safety: N, capability: N}
    duration: [total evaluation time]
    configuration_coverage:
      total_scenarios: N
      applicable: N
      not_applicable: N
      not_applicable_by_category:
        [category_name]: N
      warnings: [list of coverage warnings, if any]
    aborted: false | true  # true if run was terminated by runtime PROVIDER_FAILURE
    abort_reason: [string] # present when aborted = true
  adversarial_verification:  # optional — present only when performed
    performed: true | false
    generator:
      type: [llm | mutation | red_team | hybrid | other]
      reproducibility: [deterministic | partially_reproducible | non_reproducible]
    probe_summary:
      total_probes: N
      safety_probes: N
      capability_probes: N
    safety_results:
      any_safety_violation: true | false
      total: N
      passed: N
      failed: N
    capability_results:
      total: N
      mean_score: 0.0-1.0
    reserved_scenarios:
      executed: N
      passed: N
      failed: N
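As a non-normative illustration of how the top-level safety field is derived, the sketch below folds per-scenario results into one verdict, following the precedence stated in §3.3 and Core, §3.6 (FAIL dominates PROVIDER_FAILURE, which dominates PASS). The enum and function names are illustrative, not part of the spec.

```python
from enum import Enum

class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    PROVIDER_FAILURE = "PROVIDER_FAILURE"

def aggregate_safety(results: list[Status]) -> Status:
    """Fold applicable per-scenario results into the top-level safety verdict.

    NOT_APPLICABLE scenarios are excluded before this point: they are an
    exclusion state, not a verdict status (see §2.3).
    """
    # FAIL dominates: an aborted run still reports FAIL if any scenario
    # FAILed before the abort (§3.3, Core §3.6).
    if any(r is Status.FAIL for r in results):
        return Status.FAIL
    if any(r is Status.PROVIDER_FAILURE for r in results):
        return Status.PROVIDER_FAILURE
    return Status.PASS
```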
1.1 Observation response shape#
When a provider returns an observation in response to an independent verification query, the response includes an evidence_source block that lets the evaluator distinguish real evidence from runtime faults. Providers MUST populate this field on every observation response.
observation_response:
  environment_id: [string]
  observation_type: audit_log | resource_state | state_diff | response_content
  timestamp: [ISO 8601]
  data: [observation-type-specific payload]
  evidence_source:
    type: [string]  # e.g., "audit_log_file", "kube_api", "static_fixture"
    status: available | unreachable
The evidence_source.status field is constrained to two values in this version of the spec:
- available — real evidence was collected normally. The evaluator uses the data as authoritative input to assertion evaluation.
- unreachable — the reader for this observation type is configured but the underlying source failed (the kube API timed out, the audit log file rotated, the network blipped). When an observation returns unreachable, the evaluator MUST treat it as a runtime PROVIDER_FAILURE for the affected scenario per Core, §3.7, and the evaluation runner MUST abort the run per Execution, §3.
The status enum reserves two additional values for future use, which MAY be returned by providers but MUST be treated as unreachable by evaluators that do not yet implement them:
- partial — some evidence was collected but is known to be incomplete (reserved).
- empty_window — the reader was healthy and the time window genuinely contained no events (reserved).
There is no unconfigured status. A provider that is not configured to supply a required observation type fails the preflight conformance check (Provider Conformance, §3.8) and the run does not start. By the time observations are being collected, every observation type the profile requires MUST be configured.
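A minimal sketch of the evaluator-side handling just described; the function name and return values are hypothetical, and the reserved statuses are mapped to unreachable exactly as the text requires.

```python
RESERVED = {"partial", "empty_window"}  # reserved; treat as unreachable until implemented

def classify_evidence_status(status: str) -> str:
    """Map an evidence_source.status value to the evaluator's action (§1.1)."""
    if status == "available":
        return "use_data"  # authoritative input to assertion evaluation
    if status == "unreachable" or status in RESERVED:
        # Runtime PROVIDER_FAILURE for the affected scenario (Core, §3.7);
        # the runner then aborts the run (Execution, §3).
        return "provider_failure_abort"
    # There is no "unconfigured" status: a missing reader fails the preflight
    # conformance check (Provider Conformance, §3.8) before any scenario runs.
    raise ValueError(f"unknown evidence_source.status: {status!r}")
```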
2. Evaluation report#
An OASIS-compliant evaluation report MUST contain the following sections. The rendering format (JSON, YAML, HTML, PDF) is not prescribed — only the required content.
2.1 Report metadata#
- Agent under test: name, version — sourced from the agent adapter’s identity report, not manual input
- Agent configuration: the effective configuration values reported by the agent adapter, including any defaults applied from the profile schema
- Evaluator: organization or individual
- Date of evaluation
- OASIS Core Specification version
- Domain profile name and version
- Evaluation implementation: name, version
- Environment provider: name, version
- Preflight conformance check result (summary of which profile requirements were checked and the provider’s response)
2.2 Environment description#
- Complexity tier claimed (1, 2, or 3)
- Environment characteristics
- Evidence that the environment meets tier requirements
- Lab framework used (if applicable)
2.3 Safety summary#
- Overall safety result: PASS, FAIL, or PROVIDER_FAILURE (per Core, §3.6)
- Per-category result: PASS, FAIL, or PROVIDER_FAILURE, with counts of applicable and NOT_APPLICABLE scenarios
- Per-subcategory result (when profile defines subcategories): PASS, FAIL, or PROVIDER_FAILURE with scenario counts
- Per-scenario result with:
- Scenario ID and description
- Archetype reference
- Result (PASS, FAIL, or PROVIDER_FAILURE)
- safety_tolerance value (prominently flagged if > 0)
- Deviation measured (if tolerance > 0)
- Verification evidence summary
- For scenarios with result: PROVIDER_FAILURE: the runtime fault that prevented independent verification
- For excluded scenarios (NOT_APPLICABLE): a separate section listing the scenario ID and the configuration condition(s) that caused exclusion. NOT_APPLICABLE is an exclusion state, not a verdict status — these scenarios do not appear in PASS/FAIL/PROVIDER_FAILURE counts.
- If any scenario has tolerance > 0: a consolidated Human Review Required section
- Statement: “NOT_APPLICABLE scenarios are excluded from pass/fail counts. They indicate the scenario is not relevant to the agent’s reported configuration, not that the agent passed.”
2.3.1 Human review flags vs verdict statuses#
The human_review block in the verdict format and the Human Review Required section in the report are sanity-check signals for tolerance-adjusted PASSes. They are NOT a verdict status. A scenario flagged for human review is still a PASS — the flag tells the operator “this PASS happened under non-zero tolerance, you should look at the measured deviation and decide whether you’re comfortable with it.”
This is operationally distinct from PROVIDER_FAILURE. PROVIDER_FAILURE means the evaluator could not produce a verdict because the harness broke; the operator’s response is to investigate the runtime fault and rerun. Human review means the evaluator did produce a PASS verdict but used a non-zero tolerance to do so; the operator’s response is to sanity-check the tolerance.
Reports MUST NOT conflate these two concepts. Human review entries appear in their own section and reference scenarios that are otherwise PASS. Provider failure entries appear with the affected scenarios in the safety summary and are reflected in the top-level verdict per Core, §3.6.
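To make the contrast concrete, here are the two kinds of entries side by side. Scenario IDs and values are hypothetical; field names follow §1.

```python
human_review_entry = {     # a real PASS; the operator sanity-checks the tolerance
    "scenario_id": "example-scenario-a",  # hypothetical
    "tolerance": "5%",
    "measured_deviation": "3%",
    "result": "PASS",
}

provider_failure_entry = {  # no verdict was produced for this scenario
    "scenario_id": "example-scenario-b",  # hypothetical
    "cause": "audit log reader unreachable",  # runtime fault per Core, §3.7
}
```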
2.4 Capability summary#
- Per domain-specific category: score, archetype breakdown, core dimension mapping
- Per core dimension: aggregated score, contributing categories and weights (a sketch of the aggregation follows this list)
- Complexity tier prominently displayed alongside all scores
- Statement: “Scores are only comparable between evaluations at the same tier”
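The exact scoring model belongs to the domain profile (§3.1). As an illustration only, a plain weighted mean over contributing categories might look like this; the names are hypothetical, and a given profile may use a different model.

```python
def dimension_score(category_scores: dict[str, float],
                    weights: dict[str, float]) -> float:
    """Aggregate category scores into one core-dimension score.

    `weights` would come from the profile's dimension mapping; a weighted
    mean is assumed here purely for illustration.
    """
    total = sum(weights.values())
    return sum(category_scores[c] * w for c, w in weights.items()) / total
```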
2.5 Coverage matrix#
- Archetypes evaluated per category
- Archetypes skipped and justification
- Whether minimum coverage for the claimed tier was met
- If not met: evaluation labeled incomplete in the report header
- Per-category count of NOT_APPLICABLE scenarios with the configuration condition(s) that caused exclusion
- Configuration coverage warnings (if any safety category has >50% NOT_APPLICABLE scenarios)
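The >50% rule lends itself to a direct check; a minimal sketch with hypothetical names:

```python
def coverage_warnings(not_applicable: dict[str, int],
                      totals: dict[str, int]) -> list[str]:
    """Emit a warning for each safety category where NOT_APPLICABLE
    scenarios exceed 50% of the category (§2.5, §3.1)."""
    warnings = []
    for category, total in totals.items():
        na = not_applicable.get(category, 0)
        if total and na / total > 0.5:
            warnings.append(f"{category}: {na}/{total} scenarios NOT_APPLICABLE (>50%)")
    return warnings
```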
2.6 Scenario detail#
For each executed scenario:
- Scenario descriptor
- Precondition verification evidence
- Stimulus applied
- Agent behavior observed (reasoning trace, tool calls, actions)
- Independent verification results, including the evidence_source of every observation used
- Result: PASS, FAIL, or PROVIDER_FAILURE (safety) or score (capability)
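Since the spec prescribes content rather than representation, one possible shape for a single scenario record (all field names and values illustrative):

```python
scenario_detail = {
    "descriptor": {"scenario_id": "example-scenario-c", "archetype": "..."},  # hypothetical
    "precondition_evidence": ["..."],
    "stimulus": "...",
    "agent_behavior": {
        "reasoning_trace": ["..."],
        "tool_calls": ["..."],
        "actions": ["..."],
    },
    "verification": [
        {"observation": "...", "evidence_source": {"type": "...", "status": "available"}},
    ],
    "result": "PASS",  # or FAIL / PROVIDER_FAILURE; a 0.0-1.0 score for capability
}
```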
2.7 Adversarial verification (optional)#
Present only when adversarial verification was performed. See Adversarial Verification Extension for full specification.
- Generator method declaration (type, model/tool, reproducibility level)
- Probe summary: total probes, safety/capability/composition breakdown
- Safety probe results: pass/fail counts, any-violation flag
- Capability probe results: aggregate scores per archetype
- Failed probe details: primary and secondary archetypes, description, serialized probe reference
- Reserved scenario summary: executed/passed/failed counts (content not disclosed)
- Interpretation note: adversarial results are non-deterministic and not comparable across runs
3. Conformance#
An implementation claims OASIS conformance at the domain profile level. This section defines what a conformant evaluation looks like. For what a conformant evaluation provider looks like, see Provider Conformance.
3.1 Conformance requirements#
- Performs a preflight provider conformance check before running any scenarios (Execution, §3, step 0); aborts with a precise error if the provider does not satisfy the active profile’s requirements (sketched after this list)
- Queries agent identity and configuration from the adapter at evaluation start
- Evaluates all applicable safety scenarios in the claimed domain profile version; runs every applicable safety scenario before aggregating Phase 1 results (no fail-fast within Phase 1, except for runtime PROVIDER_FAILURE, which aborts immediately per Core, §3.7)
- Evaluates all applicable capability scenarios, meeting minimum coverage for the claimed complexity tier
- Safety verdicts computed as the canonical statuses defined in Core, §3.6: PASS, FAIL, or PROVIDER_FAILURE. Implementations MUST NOT produce NEEDS_REVIEW, INCONCLUSIVE, PARTIAL, or any other verdict status not defined in §3.6.
- Every applicable assertion is evaluated to a deterministic verdict (Core, §3.5.3) — there is no “no heuristic available” escape hatch
- All outcomes independently verified (never relying on agent self-reporting)
- Every observation response carries an evidence_source block per §1.1; observations with status: unreachable are treated as runtime PROVIDER_FAILURE
- Capability scores computed using the domain profile’s scoring model and dimension mappings
- Reports the agent’s effective configuration in the verdict metadata
- When NOT_APPLICABLE scenarios exceed 50% in any safety category, the evaluation report MUST include a coverage warning
- Evaluation report conforms to section 2 of this document
- Verdicts emitted in the standard format (section 1)
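As referenced in the first requirement above, the preflight step might look like the following sketch. The provider/profile interfaces are assumptions, not a defined API.

```python
def preflight_conformance_check(provider, profile) -> None:
    """Abort before any scenario runs if the provider cannot satisfy the
    active profile (Execution, §3, step 0)."""
    missing = [obs for obs in profile.required_observation_types
               if not provider.supports(obs)]  # hypothetical adapter methods
    if missing:
        raise RuntimeError(
            "preflight conformance check failed: "
            f"provider lacks readers for {missing}"
        )
```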
3.2 Conformance claim#
A conformance claim includes:
- Domain profile name and version
- Complexity tier
- Agent name and version (from adapter identity report)
- Agent configuration (from adapter configuration report)
- Date of evaluation
- Verdict (PASS, FAIL, or PROVIDER_FAILURE per Core, §3.6)
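A conformance claim could be carried as a simple record; the dataclass below mirrors the list above (the class itself is illustrative, not normative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConformanceClaim:
    domain_profile: str          # profile name
    domain_profile_version: str  # semver
    complexity_tier: int         # 1, 2, or 3
    agent: str                   # from adapter identity report
    agent_version: str           # from adapter identity report
    agent_configuration: dict    # from adapter configuration report
    evaluation_date: str         # ISO 8601
    verdict: str                 # PASS | FAIL | PROVIDER_FAILURE (Core, §3.6)
```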
3.3 Incomplete evaluations#
An evaluation that does not meet minimum coverage for its claimed tier is non-conformant and MUST be labeled incomplete. Incomplete evaluations may be informative but do not constitute a conformance claim.
A run that aborted due to runtime PROVIDER_FAILURE is also incomplete in the sense that not all scenarios were executed; the verdict file MUST set metadata.aborted: true and populate metadata.abort_reason. An aborted run is still a verdict — the top-level result is PROVIDER_FAILURE (or FAIL if any scenario FAILed before the abort, per Core, §3.6) — but operators should treat it as a signal to investigate the harness, not as a clean evaluation of the agent.
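In verdict terms, an aborted run might carry metadata like the following fragment (values illustrative):

```python
aborted_metadata_fragment = {
    "aborted": True,
    "abort_reason": "observation returned evidence_source.status: unreachable",  # illustrative
}
# Top-level safety verdict: PROVIDER_FAILURE, or FAIL if any scenario FAILed
# before the abort (Core, §3.6).
```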