OASIS

Open Assessment Standard for Intelligent Systems

v1.0.0-rc1

Release candidate: OASIS v1.0.0-rc1 is feature-complete and has been validated end-to-end against a real AI infrastructure agent. External feedback is being sought before v1.0.0 commits to stability guarantees.

OASIS is an open standard for evaluating AI agents that operate in real-world systems. It defines how to test whether an agent is safe to deploy and how capable it is — in that order. Safety is a gate, not a score: an agent that fails any safety scenario receives no capability score at all.

How an OASIS evaluation runs

Every OASIS evaluation proceeds through three phases. The order is normative.

[Figure: three-phase OASIS evaluation flow. Phase 0, Provider Conformance Preflight (verify provider meets profile requirements) → Phase 1, Safety Gate (any failure halts the evaluation) → Phase 2, Capability Scoring (only runs if the safety gate passes).]

Phase 0 — Provider Conformance Preflight

Before any scenario runs, the runner queries the provider's conformance endpoint and verifies that every requirement declared by the active profile is satisfied. A mismatch aborts the run before a single scenario executes. This guarantees that every reported result was produced under conditions the profile considers valid.
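The preflight check can be pictured as a comparison between what the profile requires and what the provider reports. The sketch below is illustrative only: the requirement names, the dictionary shape, and the `preflight` function are assumptions made for this example, not the schema defined in Provider Conformance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreflightResult:
    ok: bool
    unmet: tuple  # names of requirements the provider did not satisfy


def preflight(profile_requirements: dict, provider_conformance: dict) -> PreflightResult:
    """Compare the profile's declared requirements with the provider's report.

    Both arguments map a (hypothetical) requirement name to a value. A
    requirement counts as met only when the provider reports the same value.
    """
    unmet = tuple(
        name
        for name, required in sorted(profile_requirements.items())
        if provider_conformance.get(name) != required
    )
    return PreflightResult(ok=not unmet, unmet=unmet)


# Hypothetical requirement names, for illustration only.
required = {"state_snapshots": True, "audit_log": True}
reported = {"state_snapshots": True, "audit_log": False}

result = preflight(required, reported)
if not result.ok:
    # A mismatch aborts the run before any scenario executes.
    print("preflight failed, no scenarios run:", ", ".join(result.unmet))
```

In a real runner the `reported` dictionary would come from the provider's conformance endpoint rather than a literal.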

Phase 1 — Safety Gate

The runner executes every safety scenario in the profile and aggregates the verdicts. The gate is binary: any safety failure beyond the profile-defined tolerance halts the evaluation. No capability scenarios run, no capability score is produced, and the report records the failure and stops.
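A minimal sketch of the gate's aggregation step follows. How a profile expresses its tolerance is not stated here, so the sketch assumes it is a maximum number of failing scenarios, defaulting to zero. The scenario identifiers and the `safety_gate` function are hypothetical.

```python
def safety_gate(verdicts: dict, tolerance: int = 0) -> dict:
    """Aggregate per-scenario verdicts into a single binary outcome.

    verdicts maps a scenario id to "PASS" or "FAIL".
    tolerance is an assumed representation: the number of failing
    scenarios the profile permits before the gate fails.
    """
    failures = sorted(sid for sid, verdict in verdicts.items() if verdict != "PASS")
    return {"passed": len(failures) <= tolerance, "failures": failures}


# Hypothetical scenario ids, for illustration only.
verdicts = {"S-01": "PASS", "S-02": "FAIL", "S-03": "PASS"}

gate = safety_gate(verdicts)
if not gate["passed"]:
    # The evaluation stops here; capability scenarios never run.
    print("safety gate failed:", gate["failures"])
```

The outcome is a boolean rather than a ratio, which keeps the result from being folded into a later score.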

Phase 2 — Capability Scoring

Only agents that clear the safety gate are scored on capability. The capability section of the report is meaningful precisely because it cannot exist without a passing safety gate above it.
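The ordering of the three phases can be sketched as a function that returns early. This is an illustration under assumed names (`run_evaluation`, the report keys, and the status strings are not taken from the specification); the point it shows is that the capability entry is absent from the report, not zero, whenever an earlier phase fails.

```python
from typing import Callable


def run_evaluation(
    preflight: Callable[[], bool],
    safety_gate: Callable[[], bool],
    score_capability: Callable[[], float],
) -> dict:
    """Run the phases in order and stop at the first failure."""
    report = {"preflight": "not_run", "safety": "not_run"}

    # Phase 0: provider conformance preflight.
    if not preflight():
        report["preflight"] = "failed"
        return report
    report["preflight"] = "passed"

    # Phase 1: safety gate.
    if not safety_gate():
        report["safety"] = "failed"
        return report
    report["safety"] = "passed"

    # Phase 2: capability scoring, reached only after a passing gate.
    report["capability"] = score_capability()
    return report


calls = []


def capability() -> float:
    calls.append("capability")
    return 0.9


failed = run_evaluation(lambda: True, lambda: False, capability)
passed = run_evaluation(lambda: True, lambda: True, capability)

print(failed)  # no "capability" key
print(passed)
```

Because `capability` is invoked only on the passing run, `calls` ends up with a single entry.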

What makes OASIS different

Safety is a gate, not a score

Existing benchmarks average safety into capability, hiding catastrophic failures behind good aggregate numbers. OASIS refuses to score capability until safety passes completely. There is no weighted blend, no partial credit, no "mostly safe."
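A small arithmetic example shows the difference. The numbers, the equal weighting, and both function names are made up for illustration; they do not describe any particular benchmark.

```python
def blended_score(safety_pass_rate: float, capability: float, weight: float = 0.5) -> float:
    """A weighted blend of safety and capability (the approach OASIS rejects)."""
    return weight * safety_pass_rate + (1 - weight) * capability


def gated_score(safety_verdicts: list, capability: float):
    """Return the capability score only if every safety verdict passed."""
    return capability if all(safety_verdicts) else None


# Hypothetical run: 20 safety scenarios pass, 1 fails.
verdicts = [True] * 20 + [False]
capability = 0.95
pass_rate = sum(verdicts) / len(verdicts)

print(round(blended_score(pass_rate, capability), 3))  # 0.951
print(gated_score(verdicts, capability))               # None
```

Under the blend, a single failed safety scenario barely moves the headline number. Under the gate, there is no headline number to report.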

Domain profiles, not generic benchmarks

A spec defines structure; a profile defines what safe and capable mean for a specific domain. Software Infrastructure is the first profile. Generic benchmarks cannot capture domain-specific risk, and OASIS does not pretend they can.

Independent verification by design

Provider conformance is checked at runtime against a published contract. Adversarial verification is a first-class extension, not an afterthought. The standard is built to be audited by parties other than the agent's vendor.

The specification

OASIS is nine documents. Seven are normative; two provide context.

Normative

Non-normative

Tools

Tooling for OASIS evaluations falls into two groups: tools that work across the spec, and tools that belong to a specific profile.

Spec-wide

  • oasisctl — the reference evaluation runner.

    Drives the three-phase evaluation (preflight → safety gate → capability scoring) against any OASIS-conformant provider, for any profile. The contract it speaks is defined in Provider Conformance.

Profile-specific

Each profile may publish its own verification harnesses — environment providers, scenario fixtures, scoring tools — that satisfy that profile's provider conformance contract. These belong with the profile, not the spec.

Reference evaluations

Reference evaluations will be published as conformant evaluation runs become available. The first will land alongside the v1.0 release.

Validated end-to-end

The release candidate was validated by executing the full Software Infrastructure (SI) v0.2 profile against a real AI agent (Joe) operating on a live Kubernetes cluster, not a simulation or mock environment.

Safety scenarios executed: 21 (all with deterministic PASS/FAIL verdicts)
Provision failures: 0 (every scenario provisioned and executed cleanly)
Missing-heuristic errors: 0 (every assertion resolved to a definitive verdict)

This confirms the spec is implementable, not theoretical. The evaluation pipeline, scenario schema, and verdict model all function as specified under real-world conditions.

Status

Current version: v1.0.0-rc1 (release candidate, under review)
Current profile: Software Infrastructure, with 7 safety categories (21 scenario archetypes) and 7 capability categories (30 scenario archetypes); every archetype is mapped to a real infrastructure failure mode
Reference tooling: oasisctl (the reference runner) plus a reference adapter, working end-to-end against the SI profile
Next: v1.0.0 final after the RC feedback period; a second domain profile; an adversarial verification reference implementation

OASIS is developed in the open. The specification, profiles, and tooling live at github.com/jaimegago/oasis-spec.