Singapore and Google spent roughly four months running AI agents against real government tasks, then published a joint whitepaper on what they learned. The Cyber Security Agency (CSA), the Government Technology Agency (GovTech) and the Infocomm Media Development Authority (IMDA) ran the exercise with Google. I work with agents like these every week, and the part worth reading is not the list of things they managed to automate. It is what the report admits is still unsolved.
What they actually did
The sandbox, launched in August 2025, tested computer-use agents — the kind that click through software the way a person would — on three government use cases pitched at different levels of risk. One was automated quality-assurance testing of government digital services. The findings, released around 20 May, point both ways: real promise for automating routine work, and clear gaps in oversight, cybersecurity, privacy and governance for any agent given room to act.
The air-gapped choice was the tell
Singapore ran this on Google's air-gapped cloud, becoming the first government in Asia to do so, per GovInsider. That decision tells you how the people running it think about risk. An air-gapped environment is cut off from the public internet, which limits the blast radius if an agent does something unexpected with data it should not touch. You do not build that for a system you fully trust. You build it for one you are still learning to trust.
Capability was never the question
Anyone who has used a modern agent knows it can already do impressive things. The open problem is authority. There is a wide gap between an agent that suggests an action and one that takes it — files the form, moves the money, changes the record. In government, a wrong action is not a bad demo; it is a citizen's case handled incorrectly. The whitepaper's value is that it names this instead of papering over it. CSA's framing treats agent oversight as a security problem, which is exactly right.
What a builder takes from it
I read this as a model for how to adopt agents without pretending the hard parts are solved. Start in a contained environment. Pick tasks where a mistake is recoverable. Keep a human in the loop on anything that acts rather than advises, and log everything the agent does so you can answer the only question that matters after something goes wrong: what did it touch, and why. Singapore running the experiment and publishing the gaps is more useful to the rest of us than another product launch claiming the gaps do not exist.
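To make the human-in-the-loop and logging advice concrete, here is a minimal sketch of an approval gate in Python. It is my own illustration, not anything taken from the whitepaper: the names `ProposedAction`, `ActionGate`, `read_page` and `submit_form` are all hypothetical, and the `acts` flag assumes you can already tell an action that changes state from one that only advises.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class ProposedAction:
    tool: str     # hypothetical tool name, e.g. "submit_form"
    target: str   # what the action would touch
    reason: str   # the agent's stated justification
    acts: bool    # True if it changes state, False if it only advises


@dataclass
class AuditEntry:
    timestamp: str
    tool: str
    target: str
    reason: str
    decision: str  # "auto", "approved" or "denied"


@dataclass
class ActionGate:
    # approver stands in for a human reviewer (assumption: returns True to allow)
    approver: Callable[[ProposedAction], bool]
    log: List[AuditEntry] = field(default_factory=list)

    def submit(self, action: ProposedAction) -> bool:
        """Return True if the action may run; record the decision either way."""
        if not action.acts:
            decision = "auto"        # advisory actions pass without review
        elif self.approver(action):
            decision = "approved"    # a human signed off on a state change
        else:
            decision = "denied"
        self.log.append(AuditEntry(
            timestamp=datetime.now(timezone.utc).isoformat(),
            tool=action.tool,
            target=action.target,
            reason=action.reason,
            decision=decision,
        ))
        return decision != "denied"


# Example: a stand-in reviewer that denies every state-changing action.
gate = ActionGate(approver=lambda action: False)
allowed_read = gate.submit(
    ProposedAction("read_page", "form-123", "check field labels", acts=False))
allowed_write = gate.submit(
    ProposedAction("submit_form", "form-123", "complete QA run", acts=True))
```

Every request lands in `gate.log` whether or not it was allowed, so the record of what the agent touched and why exists even for actions that never ran. A real deployment would need durable, tamper-resistant storage rather than an in-memory list.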