The Center for AI Standards and Innovation (CAISI) announced on 5 May that it had signed agreements with Google DeepMind, Microsoft and xAI to evaluate their frontier AI models before public release. The announcement extended existing arrangements with OpenAI and Anthropic and represented the broadest US government pre-deployment evaluation programme to date. Within days, the announcement page disappeared. CAISI staff had reportedly been told to take it down, and no public explanation was offered for the removal.

The episode is the latest in a sequence of starts and stops that has characterised US federal AI policy since the change in administration in early 2025. Pre-deployment evaluation — the idea that government should test frontier models in confidential environments before they reach customers — has bipartisan support in principle and serial whiplash in practice.

What was announced

CAISI's 5 May announcement extended its existing pre-deployment evaluation framework to three additional labs. Under the agreements, Google DeepMind, Microsoft (covering both first-party models and licensed third-party models running on Azure) and xAI would each provide CAISI evaluators with confidential access to new frontier models before public release. The evaluators would assess capabilities relevant to national security and AI safety — including offensive cyber capability, biological-threat uplift, and autonomous-agent behaviour — and provide findings back to the labs and to relevant federal agencies.

The framework is voluntary. Labs are not legally required to participate, and CAISI's findings are not gating: a lab is free to release a model the evaluators have flagged, with the understanding that doing so creates friction in future federal procurement and policy interactions. The arrangement is closer to the UK's AI Safety Institute model — collaboration on capability discovery rather than approval-gate regulation — than to the EU AI Act's structured pre-deployment requirements.

Why this is unusual

Three things make the May announcement noteworthy beyond the participant list.

First, the inclusion of xAI. Elon Musk's lab had previously sat outside the CAISI arrangements, and its participation suggests that its incentives now align with the federal evaluation regime in a way the company's earlier public stance did not predict.

Second, the breadth of Microsoft's coverage. Because Microsoft licenses third-party models including OpenAI's, the arrangement effectively extends evaluation reach into model releases that flow through Azure even if their underlying labs have not signed parallel agreements.

Third, the timing. The announcement landed in the same week that Anthropic disclosed an annualised revenue run rate (ARR) of $44 billion as of the first quarter and a $30 billion fundraise, and only weeks after the Mythos model controversy drew regulators' attention to frontier offensive capability. The CAISI framework is exactly the kind of voluntary evaluation that lets federal agencies stay current with the capability frontier without imposing a heavier regulatory hand.

Then the page came down

Within days of the announcement, CAISI's web page describing the new agreements was taken offline. Multiple outlets reported that staff had been instructed to remove it without public explanation. The agreements themselves remain in effect, according to the affected labs and to people familiar with the matter — only the public announcement was withdrawn.

Several hypotheses about the reason are circulating in policy circles. The first is internal disagreement within the executive branch about how much pre-deployment evaluation should be made public, with concerns that publicising the framework gives competitors and adversaries a roadmap of US capability assessment. The second is friction between CAISI and other parts of the policy apparatus (White House AI offices, the National Security Council, individual cabinet departments) over which body owns the framework. The third, more straightforward, is that publication of the page was not fully cleared before it went live, and the takedown is the routine consequence.

None of the three explanations has been confirmed. What is clear is that the substantive arrangements continue. The labs say they are participating; CAISI says the framework is operating; only the public-facing description has been suppressed.

The capability the framework is trying to evaluate

Pre-deployment evaluation matters most for capabilities where a model going public produces irreversible consequences. Three classes dominate the policy literature.

Offensive cyber capability is the most acute. The Mythos model is the public-facing example: Anthropic has described a frontier model capable of identifying chains of low-severity bugs that compose into high-severity exploits, and it decided to gate access through Project Glasswing rather than release the model openly. The CAISI framework gives federal evaluators visibility into similar capabilities at Google DeepMind, Microsoft and xAI before the labs make release decisions.

Biological-threat uplift is the second class. Research published over the past year has shown that frontier models can provide non-trivial uplift to actors attempting to synthesise dangerous biological agents, though the actual magnitude of uplift remains contested. CAISI's evaluators include specialists from agencies that handle biosecurity threats.

Autonomous-agent behaviour is the third. As models gain longer-horizon planning capability and broader tool access, evaluators are looking for evidence of behaviour that diverges from operator intent — situational awareness, deception, self-preservation behaviours. These are softer evaluations with less mature methodology, but they are formally in scope.

How the labs are positioning

Google DeepMind, Microsoft and xAI each emphasised the voluntary nature of their participation in commentary at the time of the announcement. None framed it as a regulatory burden; each framed it as a contribution to a shared evaluation ecosystem. Anthropic and OpenAI, whose arrangements predate the May announcement, have made similar statements.

The labs' incentive to participate is straightforward. Voluntary participation forestalls compulsory regulation. If the federal government can credibly say it has visibility into capability development through a framework that does not block releases, the political pressure for stricter, slower regimes is reduced. The fact that the framework's findings can shape future procurement decisions — the federal government is a major customer for cloud AI services — adds a commercial logic on top of the political one.

What to watch

Three near-term markers will tell whether the framework holds up to the controversy of its rollout. First, whether the CAISI page returns to the public web — and what its language looks like when it does. Second, whether other major labs (Meta, the Chinese labs operating in US markets, mid-tier players like Cohere and Mistral via their US subsidiaries) sign on or stay out. Third, whether findings from any specific model evaluation become public, even at the level of capability category — that would be the cleanest evidence that the framework is operating in substance rather than just in form.

The bigger question — whether voluntary pre-deployment evaluation is the right durable answer for frontier AI — sits at the centre of US AI policy debate. Supporters argue that the alternative (compulsory licensing, EU-style structured assessment, or moratoriums) imposes costs out of proportion to the marginal safety gain. Critics argue that voluntary frameworks lose force the moment a lab decides participation is no longer in its interest. The next 12 months of capability gains will tell which case has more weight.

Comparisons with the UK AISI and EU AI Office

| Body | Jurisdiction | Posture | Findings publication |
| --- | --- | --- | --- |
| CAISI | United States | Voluntary | Largely private |
| UK AI Safety Institute | United Kingdom | Voluntary | Selective public methodology |
| EU AI Office | European Union | Binding under AI Act | Mandated incident reports |

CAISI's framework sits in a small cluster of comparable bodies that have emerged in major jurisdictions over the past two years. The UK AI Safety Institute, established under the previous government and continued under the current one, has been operating a similar voluntary pre-deployment evaluation arrangement with OpenAI, Anthropic, Google DeepMind and a handful of other labs. The institute has published parts of its evaluation methodology and a small subset of its findings, building a public-facing reputation that lends weight to its private assessments.

The EU AI Office, the regulatory body created to enforce the AI Act, sits at the opposite end of the spectrum. Where CAISI and the UK AISI rely on voluntary participation and informal findings, the EU AI Office operates within a binding legal framework that requires general-purpose AI providers to meet specific transparency, documentation and risk-management obligations. The provisions for general-purpose AI models with systemic risk — a category broad enough to cover every frontier model in current deployment — include mandated incident reporting, evaluation cooperation and post-market monitoring.

The three regimes are not mutually exclusive. The same lab can participate in CAISI's evaluations, the UK AISI's evaluations and the EU AI Office's compliance regime simultaneously. In practice this is what is happening. The labs maintain three separate evaluation streams running in parallel, with the operational cost of doing so absorbed as a category of regulatory overhead.

Whether the three regimes converge or diverge over the next 18 months is one of the more consequential questions in international AI policy. Convergence — shared evaluation methodology, interoperable findings, mutual recognition of capability assessments — would lower the operational cost on the labs and produce more usable safety data for everyone. Divergence — three different methodologies producing three different conclusions about the same models — would erode the credibility of all three regimes and accelerate calls for either international coordination or unilateral action by whichever regime moved most aggressively. The early signals from working-level conversations between CAISI, the UK AISI and the EU AI Office suggest convergence is the intended path, but the political environment for international AI coordination remains fragile.
