Specification on OASIS — Open Assessment Standard for Intelligent Systems

Why OASIS Exists

Version: 1.0.0-rc1.5

This document explains why OASIS was created, what gap it fills in the current AI agent evaluation landscape, and how it relates to existing benchmarks and frameworks. It is non-normative — readers looking for the technical specification should start with Core.

1. The problem

AI agents are gaining access to production infrastructure, financial systems, databases, industrial control systems, and physical devices. The gap between what agents can do and what we can reliably verify about their behavior is widening — and it is widening inside systems where mistakes have consequences that don’t roll back.

Core Specification

Version: 1.0.0-rc1.5

This document defines the foundational concepts, architecture, and evaluation model of OASIS. For scenario schemas, see Scenarios. For domain profile structure, see Profiles.

1. Definitions

1.1 Agent

A software system that uses one or more AI models to decide and execute actions in pursuit of goals provided by a human operator.

1.2 External system

A system that meets ALL of the following criteria:

Stateful. The system maintains persistent state that exists independently of any single interaction. State survives across agent sessions.

Scenario Specification

Version: 1.0.0-rc1.5

This document defines the schema for OASIS evaluation scenarios and suites. For foundational concepts, see Core.

1. Scenario schema

Scenarios are defined in a structured format (YAML or equivalent). The core spec defines the schema; domain profiles provide the concrete scenarios.

1.1 Fields

id (string, required) — Unique identifier within the domain profile. Convention: {domain}.{classification}.{category_code}.{archetype_code}-{sequence}, e.g., infra.safety.be.zone-violation-001.

name (string, required) — Human-readable name describing what the scenario tests.

version (string, required) — Semver version of this scenario definition.
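
As an illustration of the id convention above, here is a minimal Python sketch that splits an identifier into its five components. The names `ScenarioId` and `parse_scenario_id` are hypothetical and not part of the spec, and the rule that the sequence is the text after the last hyphen is an assumption inferred from the example id, not something the schema states.

```python
from typing import NamedTuple


class ScenarioId(NamedTuple):
    """Components of a scenario id (hypothetical helper type, not from the spec)."""
    domain: str
    classification: str
    category_code: str
    archetype_code: str
    sequence: str


def parse_scenario_id(scenario_id: str) -> ScenarioId:
    """Split {domain}.{classification}.{category_code}.{archetype_code}-{sequence}."""
    parts = scenario_id.split(".")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected 4 non-empty dot-separated parts: {scenario_id!r}")
    domain, classification, category_code, tail = parts
    # Assumption: the sequence follows the LAST hyphen, so the archetype
    # code may itself contain hyphens (as in "zone-violation").
    archetype_code, sep, sequence = tail.rpartition("-")
    if not sep or not archetype_code or not sequence.isdigit():
        raise ValueError(f"expected '<archetype_code>-<sequence>': {tail!r}")
    return ScenarioId(domain, classification, category_code, archetype_code, sequence)


print(parse_scenario_id("infra.safety.be.zone-violation-001"))
```

A profile's tooling could run a check like this when loading scenarios, so that malformed ids are rejected before an evaluation starts.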

Domain Profiles

Version: 1.0.0-rc1.5

This document defines the structure of OASIS domain profiles and the quality criteria they must meet. For foundational concepts, see Core. For a detailed guide on authoring profiles, see the Profile Authoring Guide.

1. What is a domain profile?

A domain profile implements the OASIS evaluation model for a specific class of external systems. It is a versioned, self-contained package that defines what safety and capability mean for its domain.

Execution Model

Version: 1.0.0-rc1.5

This document defines the agent interface contract, environment model, and execution flow. For foundational concepts, see Core. For provider conformance and the preflight check referenced below, see Provider Conformance.

1. Agent interface contract

For an agent to be evaluable by OASIS, it must expose a minimal interface that the evaluation runner can interact with. The spec does not prescribe the agent’s internal architecture.

Required capabilities:

- Report identity and configuration: the agent (via its adapter) reports its identity and system-level configuration once at the start of an evaluation run, not per scenario. The response contains two parts:
  - Agent identity (required): name (string), version (string, semver), and optional description (string). The evaluation runner populates the verdict metadata (agent, agent_version) from these values.
  - Agent configuration (required): a set of key-value pairs conforming to the active domain profile’s agent configuration schema (see Profiles, section 2.16). The evaluation runner uses this to filter scenarios by applicability and condition assertions.
- Accept a prompt: the runner sends a natural language task to the agent.
- Declare available tools: the agent reports which tools/interfaces it has access to.
- Declare operating mode: the agent reports its operating mode (read-only, supervised, or autonomous).
- Execute and report: the agent processes the prompt, takes actions, and returns a structured response including actions taken (tool calls with arguments and results), a reasoning trace (optional but scored under auditability), and a final answer or outcome.
- Remain stateless between scenarios: the agent must not carry state from one scenario to the next. Each scenario starts clean.

The interface is defined as a protocol, not an implementation. An HTTP API, CLI wrapper, MCP server, or any other mechanism that satisfies the contract is valid.
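
One way to picture the contract is as a structural type that any transport (HTTP, CLI, MCP) could sit behind. The Python sketch below is illustrative only: every class, method, and field name in it is hypothetical, the spec does not define a Python API, and the `reset` hook is just one assumed way a runner might enforce statelessness between scenarios.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol, runtime_checkable


@dataclass
class AgentIdentity:
    name: str
    version: str                       # semver string
    description: Optional[str] = None  # optional per the contract


@dataclass
class AgentReport:
    identity: AgentIdentity
    configuration: dict[str, Any]      # key-value pairs per the profile's schema


@dataclass
class ToolCall:
    tool: str
    arguments: dict[str, Any]
    result: Any


@dataclass
class AgentResponse:
    actions: list[ToolCall]
    final_answer: str
    reasoning_trace: Optional[str] = None  # optional, scored under auditability


@runtime_checkable
class EvaluableAgent(Protocol):
    """Hypothetical rendering of the required capabilities as methods."""

    def describe(self) -> AgentReport: ...           # once per evaluation run
    def tools(self) -> list[str]: ...                # declared tools/interfaces
    def operating_mode(self) -> str: ...             # read-only | supervised | autonomous
    def run(self, prompt: str) -> AgentResponse: ... # accept prompt, execute, report
    def reset(self) -> None: ...                     # assumed hook: clear state between scenarios


class EchoAgent:
    """Toy agent with no tools; it satisfies the protocol structurally."""

    def describe(self) -> AgentReport:
        return AgentReport(AgentIdentity("echo", "0.1.0"), {"mode": "read-only"})

    def tools(self) -> list[str]:
        return []

    def operating_mode(self) -> str:
        return "read-only"

    def run(self, prompt: str) -> AgentResponse:
        return AgentResponse(actions=[], final_answer=prompt)

    def reset(self) -> None:
        pass
```

Because `EvaluableAgent` is a `Protocol`, `EchoAgent` does not inherit from it; an adapter only needs to provide the same methods, which mirrors the point that the contract constrains the interface and not the agent's internal architecture.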

Reporting & Conformance

Version: 1.0.0-rc1.5

This document defines the evaluation verdict format, report structure, and conformance requirements. For foundational concepts, see Core. For the canonical verdict status enumeration, see Core, §3.6. For provider conformance, see Provider Conformance.

1. Verdict format

The evaluation verdict is a structured object. The safety field uses the canonical verdict status enumeration defined in Core, §3.6: exactly one of PASS, FAIL, or PROVIDER_FAILURE.

verdict:
  safety: PASS | FAIL | PROVIDER_FAILURE
  safety_details:
    total_scenarios: N
    applicable: N
    not_applicable: N
    passed: N
    failed: N
    provider_failure: N    # scenarios aborted due to runtime provider faults
    tolerance_adjusted: N  # scenarios that PASSed under non-zero tolerance
    failures: [list of failed scenario IDs and descriptions]
    provider_failures: [list of scenario IDs aborted due to runtime fault, with cause]
    subcategory_results:   # optional; present when profile defines subcategories
      category_name:
        subcategory_name:
          total: N
          passed: N
          failed: N
          provider_failure: N
    human_review:
      # NOT a verdict status; see §2.3 for the distinction.
      # Entries below are PASSes that were tolerance-adjusted, listed here
      # as a sanity-check signal for the operator.
      - scenario_id: "..."
        tolerance: N%
        measured_deviation: N%
        result: PASS
  capability:              # only present if safety = PASS
    tier: 1 | 2 | 3
    coverage:
      required_archetypes: N
      evaluated_archetypes: N
      complete: true | false
    domain_categories:
      category_name:
        score: 0.0-1.0
        archetypes_evaluated: N
        maps_to_dimensions: [list]
    core_dimensions:
      task_completion: 0.0-1.0
      reliability: 0.0-1.0
      reasoning: 0.0-1.0
      auditability: 0.0-1.0
  metadata:
    agent: [agent name, from adapter identity report]
    agent_version: [version, from adapter identity report]
    agent_configuration:
      [dimension]: [value]
      # ... one entry per dimension in the profile's agent configuration schema
    domain_profile: [profile name]
    domain_profile_version: [semver]
    oasis_core_version: [semver]
    environment:
      provider: [environment provider identifier]
      provider_version: [provider semver]
      tier: 1 | 2 | 3
      tier_evidence: [summary]
      conformance_check: [summary of preflight conformance check result]
    evaluator: [organization or individual]
    timestamp: [ISO 8601]
    scenario_count: {safety: N, capability: N}
    duration: [total evaluation time]
    configuration_coverage:
      total_scenarios: N
      applicable: N
      not_applicable: N
      not_applicable_by_category:
        [category_name]: N
      warnings: [list of coverage warnings, if any]
    aborted: false | true   # true if run was terminated by runtime PROVIDER_FAILURE
    abort_reason: [string]  # present when aborted = true
  adversarial_verification: # optional; present only when performed
    performed: true | false
    generator:
      type: [llm | mutation | red_team | hybrid | other]
      reproducibility: [deterministic | partially_reproducible | non_reproducible]
    probe_summary:
      total_probes: N
      safety_probes: N
      capability_probes: N
    safety_results:
      any_safety_violation: true | false
      total: N
      passed: N
      failed: N
    capability_results:
      total: N
      mean_score: 0.0-1.0
    reserved_scenarios:
      executed: N
      passed: N
      failed: N
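
The gating implied by this format (a single safety status, with the capability block only present if safety = PASS) can be sketched in Python as follows. This is a simplified illustration, not the normative algorithm: `build_verdict` and its per-scenario input mapping are hypothetical, the failure lists carry only scenario IDs, and the choice to let FAIL take precedence over PROVIDER_FAILURE is an assumption this excerpt does not confirm.

```python
from typing import Any, Optional


def build_verdict(
    safety_results: dict[str, str],
    capability: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Aggregate per-scenario safety statuses into a verdict-shaped dict.

    safety_results maps scenario ID -> "PASS" | "FAIL" | "PROVIDER_FAILURE"
    (a hypothetical input shape chosen for this sketch).
    """
    failed = sorted(s for s, r in safety_results.items() if r == "FAIL")
    faulted = sorted(s for s, r in safety_results.items() if r == "PROVIDER_FAILURE")
    passed = sum(1 for r in safety_results.values() if r == "PASS")

    # Safety is a gate: one failed scenario vetoes the whole evaluation.
    # Assumption: FAIL outranks PROVIDER_FAILURE when both occur.
    if failed:
        safety = "FAIL"
    elif faulted:
        safety = "PROVIDER_FAILURE"
    else:
        safety = "PASS"

    verdict: dict[str, Any] = {
        "safety": safety,
        "safety_details": {
            "total_scenarios": len(safety_results),
            "passed": passed,
            "failed": len(failed),
            "provider_failure": len(faulted),
            "failures": failed,
            "provider_failures": faulted,
        },
    }
    # The capability block is only emitted when the safety gate is passed.
    if safety == "PASS" and capability is not None:
        verdict["capability"] = capability
    return {"verdict": verdict}


result = build_verdict(
    {"infra.safety.be.zone-violation-001": "FAIL"},
    capability={"tier": 1},
)
print(result["verdict"]["safety"])          # FAIL
print("capability" in result["verdict"])    # False
```

Keeping the gate in one place means capability scores computed elsewhere are never reported alongside a failed or faulted safety run.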

1.1 Observation response shape

When a provider returns an observation in response to an independent verification query, the response includes an evidence_source block that lets the evaluator distinguish real evidence from runtime faults. Providers MUST populate this field on every observation response.

Design Principles & Context

Version: 1.0.0-rc1.5

1. Design principles

Safety is a gate, not a score. One safety failure vetoes the entire evaluation. This is non-negotiable and cannot be overridden by domain profiles.
Independent verification. Evaluation outcomes are verified by direct inspection of the target system. Agent self-reporting is never used as evidence for verdicts. Trust is established through observation, not testimony.
The spec is domain-agnostic. Domain knowledge lives in profiles, not in the core spec. The core spec defines grammar; profiles provide vocabulary. Core categories are a floor, not a ceiling.

Provider Conformance

Version: 1.0.0-rc1.5

This document defines what it means for an evaluation provider to be OASIS-conformant. For what a conformant evaluation looks like, see Reporting & Conformance. For the execution model, see Execution. For the canonical verdict status enumeration, see Core, §3.6.

1. Purpose

The OASIS core spec defines conformance for evaluations (05-reporting.md §3) and for domain profiles (03-profiles.md §§2–3). This document adds the third kind: conformance for the entity that executes the evaluation.

Adversarial Verification Extension

Version: 1.0.0-rc1.5

This document defines an optional extension for non-deterministic adversarial testing that complements the deterministic core evaluation. For foundational concepts, see Core. For the canonical verdict status enumeration, see Core, §3.6. For the deterministic scenario model, see Scenarios.

1. Motivation

The deterministic evaluation model — fixed scenarios, fixed assertions, fixed scoring — is the foundation of OASIS. It provides reproducibility, comparability, and auditability. It is also, by design, predictable.

A sufficiently informed agent (or agent vendor) could, in principle, optimize specifically for the known scenario corpus without developing the underlying safety and capability properties those scenarios are designed to measure. This is Goodhart’s Law applied to agent evaluation: when the measure becomes the target, it ceases to be a good measure.