OASIS
Open Assessment Standard for Intelligent Systems
OASIS is an open standard for evaluating AI agents that operate in real-world systems. It defines how to test whether an agent is safe to deploy and how capable it is — in that order. Safety is a gate, not a score: an agent that fails any safety scenario receives no capability score at all.
How an OASIS evaluation runs
Every OASIS evaluation proceeds through three phases. The order is normative.
Phase 0 — Provider Conformance Preflight
Before any scenario runs, the runner queries the provider's conformance endpoint and verifies that every requirement declared by the active profile is satisfied. A mismatch aborts the run before a single scenario executes. This guarantees that every reported result was produced under conditions the profile considers valid.
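As a minimal sketch of the preflight idea, the check can be modeled as comparing the requirements a profile declares against what the provider reports. The data shapes, the `preflight` function, and the requirement names below are hypothetical illustrations, not the contract defined in Provider Conformance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreflightResult:
    ok: bool
    unmet: tuple  # names of requirements the provider does not satisfy


def preflight(required: dict, declared: dict) -> PreflightResult:
    """Compare profile requirements against the provider's declaration.

    Both mappings are assumed shapes: requirement name -> expected value.
    """
    unmet = tuple(
        name
        for name, expected in sorted(required.items())
        if declared.get(name) != expected
    )
    return PreflightResult(ok=not unmet, unmet=unmet)


# Hypothetical requirement names, for illustration only.
profile_requirements = {"state_injection": True, "audit_log": True}
provider_declaration = {"state_injection": True, "audit_log": False}

result = preflight(profile_requirements, provider_declaration)
if not result.ok:
    # A mismatch aborts the run before any scenario executes.
    print("preflight failed, aborting:", result.unmet)
```

Returning the list of unmet requirements, rather than a bare boolean, lets the runner report exactly why the run was aborted.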
Phase 1 — Safety Gate
The runner executes every safety scenario in the profile and aggregates the verdicts. The gate is binary: any safety failure beyond the profile-defined tolerance halts the evaluation. No capability scenarios run. No capability score is produced. The report records the failure and stops.
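The aggregation step might look like the sketch below. How a profile expresses its tolerance is not specified here, so this example assumes it is a maximum count of failing scenarios, defaulting to zero. The `Verdict` enum, the `safety_gate` function, and the scenario identifiers are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"


def safety_gate(verdicts: dict, tolerance: int = 0):
    """Return (gate_passed, failed_scenario_ids).

    Assumption: tolerance is the maximum number of failing scenarios
    the profile allows. The outcome is binary, never a score.
    """
    failed = sorted(sid for sid, v in verdicts.items() if v is Verdict.FAIL)
    return len(failed) <= tolerance, failed


# Hypothetical scenario identifiers.
verdicts = {
    "safety-001": Verdict.PASS,
    "safety-002": Verdict.FAIL,
    "safety-003": Verdict.PASS,
}

passed, failed = safety_gate(verdicts)
if not passed:
    # The evaluation halts here; no capability scenarios run.
    print("safety gate failed:", failed)
```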
Phase 2 — Capability Scoring
Only agents that clear the safety gate are scored on capability. The capability section of the report is meaningful precisely because it cannot exist without a passing safety gate above it.
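The ordering of the three phases can be sketched as a function that returns early at each gate. This is an illustration under assumed names (`run_evaluation`, the callables, and the report keys are hypothetical), not the behavior of any particular runner. The point it shows is structural: the capability entry is only ever written after both earlier phases succeed.

```python
from typing import Callable


def run_evaluation(
    preflight_ok: Callable[[], bool],
    safety_gate_ok: Callable[[], bool],
    score_capability: Callable[[], float],
) -> dict:
    """Run the phases in order, stopping at the first one that fails."""
    report: dict = {}

    # Phase 0: provider conformance preflight.
    if not preflight_ok():
        report["status"] = "aborted_preflight"
        return report

    # Phase 1: safety gate.
    if not safety_gate_ok():
        report["status"] = "safety_gate_failed"
        return report

    # Phase 2: capability scoring, reached only past the gate.
    report["status"] = "complete"
    report["capability"] = score_capability()
    return report


calls = []


def capability_probe() -> float:
    calls.append("capability")
    return 0.9


failed_report = run_evaluation(lambda: True, lambda: False, capability_probe)
passed_report = run_evaluation(lambda: True, lambda: True, capability_probe)
print(failed_report)
print(passed_report)
```

In the failing run the capability callable is never invoked, so the report cannot contain a capability score without a passing gate before it.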
What makes OASIS different
Safety is a gate, not a score
Existing benchmarks average safety into capability, hiding catastrophic failures behind good aggregate numbers. OASIS refuses to score capability until safety passes completely. There is no weighted blend, no partial credit, no "mostly safe."
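A small numeric illustration of the difference, using made-up numbers and hypothetical function names: one safety failure in twenty scenarios barely moves a blended score, while a gate withholds the capability score entirely. The gate below assumes the strict case where every safety scenario must pass.

```python
from typing import List, Optional


def blended_score(safety_pass_rate: float, capability: float,
                  safety_weight: float = 0.5) -> float:
    """Weighted blend, the approach the gate model rejects."""
    return safety_weight * safety_pass_rate + (1 - safety_weight) * capability


def gated_score(safety_results: List[bool], capability: float) -> Optional[float]:
    """Capability is reported only if every safety scenario passed."""
    return capability if all(safety_results) else None


# Made-up data: 19 passes and 1 failure across 20 safety scenarios.
safety_results = [True] * 19 + [False]
capability = 0.92

pass_rate = sum(safety_results) / len(safety_results)
blend = blended_score(pass_rate, capability)
gated = gated_score(safety_results, capability)

print(f"blended: {blend:.3f}")  # the single failure is nearly invisible
print(f"gated:   {gated}")      # no capability score at all
```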
Domain profiles, not generic benchmarks
A spec defines structure; a profile defines what safe and capable mean for a specific domain. Software Infrastructure is the first profile. Generic benchmarks cannot capture domain-specific risk, and OASIS does not pretend they can.
Independent verification by design
Provider conformance is checked at runtime against a published contract. Adversarial verification is a first-class extension, not an afterthought. The standard is built to be audited by parties other than the agent's vendor.
The specification
OASIS is nine documents. Seven are normative; two provide context.
Normative
- Core — Foundational concepts, evaluation model, and architecture.
- Scenarios — Schema for scenarios and scenario suites.
- Profiles — Structure and quality criteria for domain profiles.
- Execution — Agent interface contract and environment model.
- Reporting & Conformance — Verdict format and report structure.
- Provider Conformance — Requirements for OASIS-conformant evaluation providers.
- Adversarial Verification — Optional extension for non-deterministic adversarial testing.
Non-normative
- Motivation — Why OASIS exists and what gap it fills.
- Design Principles — The principles the standard is built on.
Tools
Tooling for OASIS evaluations falls into two groups: tools that work across the spec, and tools that belong to a specific profile.
Spec-wide
oasisctl is the reference evaluation runner. It drives the three-phase evaluation (preflight → safety gate → capability scoring) against any OASIS-conformant provider, for any profile. The contract it speaks is defined in Provider Conformance.
Profile-specific
Each profile may publish its own verification harnesses — environment providers, scenario fixtures, scoring tools — that satisfy that profile's provider conformance contract. These belong with the profile, not the spec.
- Software Infrastructure — see the profile's provider implementation guide and conformance contract.
Reference evaluations
Reference evaluations will be published as conformant evaluation runs become available. The first will land alongside the v1.0 release.
Validated end-to-end
The release candidate was validated by executing the full Software Infrastructure (SI) v0.2 profile against a real AI agent (Joe) operating on a live Kubernetes cluster, not a simulation or mock environment.
This confirms the spec is implementable, not theoretical. The evaluation pipeline, scenario schema, and verdict model all function as specified under real-world conditions.
Status
OASIS is developed in the open. The specification, profiles, and tooling live at github.com/jaimegago/oasis-spec.