OASIS

Open Assessment Standard for Intelligent Systems

v1.0.0-rc1

OASIS is an open standard for evaluating AI agents that operate in real-world systems. It defines how to test whether an agent is safe to deploy and how capable it is — in that order. Safety is a gate, not a score: an agent that fails any safety scenario receives no capability score at all.


How an OASIS evaluation runs

Every OASIS evaluation proceeds through three phases. The order is normative.

Phase 0 — Provider Conformance Preflight

Before any scenario runs, the runner queries the provider's conformance endpoint and verifies that every requirement declared by the active profile is satisfied. A mismatch aborts the run before a single scenario executes. This guarantees that every reported result was produced under conditions the profile considers valid.
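The preflight check described above can be sketched as a comparison between what the profile requires and what the provider declares. This is a minimal illustration, not the reference runner's implementation: the requirement keys, the flat key/value shape, and the names `preflight` and `PreflightError` are all assumptions made for the example, and the provider's declaration is passed in as a plain dictionary rather than fetched from a real conformance endpoint.

```python
class PreflightError(Exception):
    """Raised when the provider does not satisfy the active profile."""


def preflight(profile_requirements: dict[str, str],
              provider_declared: dict[str, str]) -> None:
    """Raise PreflightError unless every profile requirement is met.

    Both arguments are hypothetical flat mappings; the real contract is
    defined by the Provider Conformance document.
    """
    # Collect every requirement the provider misses or declares differently.
    mismatches = {
        key: (expected, provider_declared.get(key))
        for key, expected in profile_requirements.items()
        if provider_declared.get(key) != expected
    }
    if mismatches:
        details = ", ".join(
            f"{key}: expected {exp!r}, got {got!r}"
            for key, (exp, got) in sorted(mismatches.items())
        )
        # Raising here models "abort before a single scenario executes".
        raise PreflightError(f"provider conformance mismatch: {details}")
```

A caller would invoke `preflight(...)` first and only start scenarios if it returns without raising.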

Phase 1 — Safety Gate

The runner executes every safety scenario in the profile and aggregates the verdicts. The gate is binary: any safety failure beyond the profile-defined tolerance halts the evaluation. No capability scenarios run. No capability score is produced. The report records the failure and stops.
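As a rough sketch of the aggregation step, the gate can be modeled as a single boolean computed over all safety verdicts. The `Verdict` type and the `safety_gate` function are hypothetical names, and representing the tolerance as a maximum count of failed scenarios (defaulting to zero) is an assumption; the form the tolerance actually takes is up to the profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Outcome of one safety scenario (hypothetical shape)."""
    scenario_id: str
    passed: bool


def safety_gate(verdicts: list[Verdict], tolerance: int = 0) -> bool:
    """Return True only if failures do not exceed the tolerance.

    Assumption: tolerance is a count of allowed failures. The result is
    binary; no partial score is derived from the verdicts.
    """
    failures = sum(1 for verdict in verdicts if not verdict.passed)
    return failures <= tolerance
```

With the default tolerance of zero, a single failed verdict closes the gate.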

Phase 2 — Capability Scoring

Only agents that clear the safety gate are scored on capability. The capability section of the report is meaningful precisely because it cannot exist without a passing safety gate above it.
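The normative ordering of the three phases can be illustrated with a short control-flow sketch. Everything named here (`run_evaluation`, the report keys, scenarios as zero-argument callables) is hypothetical, and the mean used to combine capability results is only a placeholder, since the source does not say how capability scores are aggregated.

```python
from typing import Callable, Optional


def run_evaluation(
    preflight_ok: Callable[[], bool],
    safety_scenarios: list[Callable[[], bool]],
    capability_scenarios: list[Callable[[], float]],
    tolerance: int = 0,
) -> dict[str, Optional[object]]:
    """Run preflight, then the safety gate, then capability scoring."""
    report: dict[str, Optional[object]] = {
        "preflight": "not_run", "safety": "not_run", "capability": None,
    }

    # Phase 0: a conformance mismatch aborts before any scenario runs.
    if not preflight_ok():
        report["preflight"] = "failed"
        return report
    report["preflight"] = "passed"

    # Phase 1: run every safety scenario, then aggregate the verdicts.
    failures = sum(1 for scenario in safety_scenarios if not scenario())
    if failures > tolerance:
        report["safety"] = "failed"
        return report  # capability scenarios are never invoked
    report["safety"] = "passed"

    # Phase 2: reached only after the gate passes. Mean is a placeholder.
    scores = [scenario() for scenario in capability_scenarios]
    report["capability"] = sum(scores) / len(scores) if scores else None
    return report
```

The early returns are the point of the sketch: a report with a failed safety phase has no path by which a capability value could be filled in.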

What makes OASIS different

Safety is a gate, not a score

Existing benchmarks average safety into capability, hiding catastrophic failures behind good aggregate numbers. OASIS refuses to score capability until safety passes completely. There is no weighted blend, no partial credit, no "mostly safe."
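To make the contrast concrete, the following sketch compares a blended score with a gated one. Both functions and the 50/50 weighting are illustrative assumptions, not formulas taken from OASIS or from any particular benchmark.

```python
from typing import Optional


def blended_score(safety_pass_rate: float, capability: float,
                  weight: float = 0.5) -> float:
    """Illustrative weighted blend of safety and capability."""
    return weight * safety_pass_rate + (1 - weight) * capability


def gated_score(safety_passed: int, safety_total: int,
                capability: float) -> Optional[float]:
    """Return the capability score only if every safety scenario passed."""
    if safety_passed < safety_total:
        return None  # no capability score exists for an unsafe agent
    return capability
```

For a hypothetical agent that passes 20 of 21 safety scenarios with a capability of 0.9, `blended_score(20 / 21, 0.9)` still comes out above 0.9, while `gated_score(20, 21, 0.9)` returns `None`.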

Domain profiles, not generic benchmarks

A spec defines structure; a profile defines what safe and capable mean for a specific domain. Software Infrastructure is the first profile. Generic benchmarks cannot capture domain-specific risk, and OASIS does not pretend they can.

Independent verification by design

Provider conformance is checked at runtime against a published contract. Adversarial verification is a first-class extension, not an afterthought. The standard is built to be audited by parties other than the agent's vendor.

The specification

The OASIS specification consists of nine documents: seven are normative, and two provide context.

Normative

Core — Foundational concepts, evaluation model, and architecture.
Scenarios — Schema for scenarios and scenario suites.
Profiles — Structure and quality criteria for domain profiles.
Execution — Agent interface contract and environment model.
Reporting & Conformance — Verdict format and report structure.
Provider Conformance — Requirements for OASIS-conformant evaluation providers.
Adversarial Verification — Optional extension for non-deterministic adversarial testing.

Non-normative

Motivation — Why OASIS exists and what gap it fills.
Design Principles — The principles the standard is built on.

Tools

Tooling for OASIS evaluations falls into two groups: tools that work across the spec, and tools that belong to a specific profile.

Spec-wide

oasisctl — the reference evaluation runner.
Drives the three-phase evaluation (preflight → safety gate → capability scoring) against any OASIS-conformant provider, for any profile. The contract it speaks is defined in Provider Conformance.

Profile-specific

Each profile may publish its own verification harnesses — environment providers, scenario fixtures, scoring tools — that satisfy that profile's provider conformance contract. These belong with the profile, not the spec.

Software Infrastructure — see the profile's provider implementation guide and conformance contract.

Reference evaluations

Reference evaluations will be published as conformant evaluation runs become available. The first will land alongside the v1.0 release.

Validated end-to-end

The release candidate was validated by executing the full Software Infrastructure (SI) v0.2 profile against a real AI agent (Joe) operating on a live Kubernetes cluster, not a simulation or mock environment.

Safety scenarios executed: 21 (all with deterministic PASS/FAIL verdicts)

Provision failures: 0 (every scenario provisioned and executed cleanly)

Missing-heuristic errors: 0 (every assertion resolved to a definitive verdict)

This confirms the spec is implementable, not theoretical. The evaluation pipeline, scenario schema, and verdict model all function as specified under real-world conditions.

Status

Current version: v1.0.0-rc1 (release candidate, under review)

Current profile: Software Infrastructure, with 7 safety categories (21 scenario archetypes) and 7 capability categories (30 scenario archetypes); every archetype is mapped to a real infrastructure failure mode

Reference tooling: oasisctl (the reference runner) plus a reference adapter, working end-to-end against the SI profile

Next: v1.0.0 final after the RC feedback period; a second domain profile; an adversarial verification reference implementation

OASIS is developed in the open. The specification, profiles, and tooling live at github.com/jaimegago/oasis-spec.