Scheduling five out of six pods in a distributed training job is not 83 per cent success — it is zero useful progress while 83 per cent of your GPUs sit idle. Kubernetes v1.36 "Haru", released 22 April 2026 with a deep-dive blog on the scheduling changes published 13 May, attacks that problem directly with a new PodGroup API designed for atomic, all-or-nothing gang scheduling baked into the core scheduler — currently at Alpha maturity.

The API Split That Makes Gang Scheduling Possible

The central architectural move in v1.36 is separating two concerns that were previously tangled together. The Workload object becomes a static policy template — it declares podGroupTemplates and scheduling rules but carries no runtime state. The new PodGroup object tracks live scheduling state for each group at runtime. Both now live under scheduling.k8s.io/v1alpha2.
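The split can be pictured as two data shapes, one frozen and one mutable. The sketch below is a minimal illustration in Python, not the real API schema: only the Workload, PodGroup and podGroupTemplates names come from the release, while `min_count`, `scheduled_pods`, `instantiate` and the naming scheme are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PodGroupTemplate:
    name: str
    min_count: int  # hypothetical stand-in for the gang quorum setting


@dataclass(frozen=True)
class Workload:
    """Static policy template: frozen, so no runtime state can land here."""
    name: str
    pod_group_templates: Tuple[PodGroupTemplate, ...]


@dataclass
class PodGroup:
    """Runtime object: one per live group, carrying its own status."""
    name: str
    workload_name: str
    template_name: str
    scheduled_pods: int = 0  # hypothetical status field


def instantiate(workload: Workload, template_name: str, replica: int) -> PodGroup:
    """Create a runtime PodGroup from one template of a Workload."""
    names = {t.name for t in workload.pod_group_templates}
    if template_name not in names:
        raise KeyError(template_name)
    # The "<workload>-<template>-<replica>" naming is an assumption for this sketch.
    return PodGroup(
        name=f"{workload.name}-{template_name}-{replica}",
        workload_name=workload.name,
        template_name=template_name,
    )
```

Updating `scheduled_pods` on one PodGroup touches neither its siblings nor the Workload, which is the property the split is meant to provide.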

This split matters for scale. Because each PodGroup carries its own status, runtime updates are spread across many small objects instead of all landing on the parent Workload. Funnelling every status write through a single Workload object is the pattern that breaks down badly at hundreds of concurrent training jobs. The scheduler itself is also simplified: it reads PodGroup directly rather than watching the Workload, reducing the object graph it must traverse on every scheduling cycle.

Pods reference their group via a new immutable spec.schedulingGroup.podGroupName field. Immutability is deliberate — it prevents a pod from being reassigned mid-scheduling, which would produce undefined gang behaviour.
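An immutability rule of this kind amounts to an update check that compares the old and new value of the field. The following is a rough sketch of that idea over plain dictionaries. The function name `validate_pod_update` is hypothetical and this is not the API server's validation code; only the `schedulingGroup.podGroupName` path is taken from the release.

```python
def validate_pod_update(old_spec: dict, new_spec: dict) -> None:
    """Reject any update that changes spec.schedulingGroup.podGroupName.

    Illustrative only: both arguments are plain dicts shaped like a pod spec.
    """
    old = old_spec.get("schedulingGroup", {}).get("podGroupName")
    new = new_spec.get("schedulingGroup", {}).get("podGroupName")
    if old != new:
        raise ValueError(
            f"spec.schedulingGroup.podGroupName is immutable: {old!r} -> {new!r}"
        )
```

Changes to any other field pass through untouched, so the check only guards group membership.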

Gang Scheduling: All-or-Nothing Enforced in Two Phases

When a Workload's schedulingPolicy.gang.minCount is set, the scheduler enforces quorum in two places. A PreEnqueue plugin holds pods in the queue until the minimum number of siblings are visible and schedulable together — no individual pod jumps the queue. A Permit plugin then applies the final admission barrier, holding assumed pods until the full quorum can bind simultaneously.
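The two barriers can be modelled with a toy simulation. This is a sketch of the idea only, not the scheduler plugin code: `GangState`, `pre_enqueue` and `permit` are hypothetical names, and real Permit handling also involves timeouts and rejection of waiting pods, which are left out here.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class GangState:
    min_count: int
    seen: Set[str] = field(default_factory=set)     # pods known to the queue
    waiting: Set[str] = field(default_factory=set)  # pods held at the Permit barrier
    bound: Set[str] = field(default_factory=set)    # pods released to bind


def pre_enqueue(gang: GangState, pod: str) -> bool:
    """Phase 1: a pod may enter the active queue only once enough siblings exist."""
    gang.seen.add(pod)
    return len(gang.seen) >= gang.min_count


def permit(gang: GangState, pod: str) -> List[str]:
    """Phase 2: hold the pod; release the whole group once quorum is reached.

    Returns the pods released to bind by this call (empty while still waiting).
    """
    gang.waiting.add(pod)
    if len(gang.waiting) + len(gang.bound) < gang.min_count:
        return []
    released = sorted(gang.waiting)
    gang.bound.update(released)
    gang.waiting.clear()
    return released
```

With `min_count=3`, the first two calls to `permit` return an empty list and the third releases all three pods together, which is the all-or-nothing behaviour described above.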

This removes the core reason teams historically bolted Volcano or Kueue onto Kubernetes: the built-in scheduler had no concept of a group. The feature is Alpha in v1.36, so production adoption requires enabling the relevant feature gates — WorkloadWithJob for Job controller integration and GenericWorkload for the Workload API itself. The architectural groundwork is now in tree rather than an external dependency, but teams should keep external schedulers in place until a future release promotes these APIs toward beta. As the official deep-dive post describes, v1.36 is explicitly positioned as a first pass, with topology-aware scheduling and workload-aware preemption flagged as follow-on work.

DRA Integration Brings GPU Claims to the Workload Level

Dynamic Resource Allocation (DRA) gets two notable advances in v1.36. DRA AdminAccess for ResourceClaims graduates to GA, giving privileged consumers — monitoring agents, cluster services — a defined path to access device resources without user-level claim semantics. DRA Partitionable Devices, covering GPU slicing and MIG profiles, moves to Beta.

Critically for AI teams, ResourceClaim support at the Workload level means GPU and specialised hardware can be declared once per workload rather than repeated per pod. Combine that with gang scheduling's quorum enforcement and a training run either acquires the full GPU allocation or waits — it does not start half-provisioned and waste compute.
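The "full allocation or wait" behaviour can be illustrated with a small placement function that commits nothing unless every pod fits. This is a sketch under simplifying assumptions: it is not the DRA API, the greedy first-fit strategy is chosen for brevity, and `try_allocate_gang` and the node names are hypothetical.

```python
from typing import Dict, Optional


def try_allocate_gang(
    free_gpus: Dict[str, int], gpus_per_pod: int, pod_count: int
) -> Optional[Dict[str, str]]:
    """Place every pod or none of them.

    free_gpus maps node name to free GPU count and is only modified
    when the whole gang fits. Returns a pod-to-node placement, or None.
    """
    remaining = dict(free_gpus)  # work on a copy so failure leaves no trace
    placement: Dict[str, str] = {}
    for i in range(pod_count):
        node = next(
            (n for n, free in remaining.items() if free >= gpus_per_pod), None
        )
        if node is None:
            return None  # quorum not reachable: commit nothing and wait
        remaining[node] -= gpus_per_pod
        placement[f"pod-{i}"] = node
    free_gpus.update(remaining)  # commit only after every pod has a home
    return placement
```

For instance, six pods at four GPUs each need 24 GPUs. On two nodes with eight free GPUs apiece the call returns `None` and leaves the free counts untouched, rather than starting four pods and stranding the rest.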

Security Hardenings That Reach GA

Three security features cross the stable line in this release.

User Namespaces is now GA after years in alpha and beta. A process running as UID 0 inside a container maps to an unprivileged host user — so a container escape no longer lands the attacker as root on the node. This is a meaningful default hardening for any cluster running multi-tenant workloads.
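The mapping itself is simple range arithmetic, in the same three-column form Linux exposes in `/proc/<pid>/uid_map` (ID inside the namespace, ID outside, length). The sketch below shows that translation. The function name and the host range in the usage note are illustrative assumptions, not values Kubernetes is guaranteed to pick.

```python
from typing import List, Optional, Tuple


def container_to_host_uid(
    container_uid: int, uid_map: List[Tuple[int, int, int]]
) -> Optional[int]:
    """Translate a UID inside a user namespace to the host UID.

    Each uid_map entry is (inside_start, outside_start, length).
    Returns None when the UID is not mapped at all.
    """
    for inside_start, outside_start, length in uid_map:
        if inside_start <= container_uid < inside_start + length:
            return outside_start + (container_uid - inside_start)
    return None
```

With an illustrative map of `[(0, 1000000, 65536)]`, container UID 0 resolves to host UID 1000000, an ordinary unprivileged account as far as the node is concerned.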

Mutating Admission Policies, using the Common Expression Language (CEL), reach GA and are enabled by default in v1.36. Teams can now express mutation logic as native Kubernetes objects instead of maintaining out-of-process webhook servers — fewer moving parts, no webhook latency, and no additional TLS endpoint to secure.

Fine-Grained Kubelet API Authorisation also reaches GA, letting operators scope which components can call which kubelet endpoints rather than relying on a single coarse, node-wide permission.

What the 70-Enhancement Count Signals

70 total enhancements in v1.36
18 graduating to Stable (GA)
25 moving to Beta
25 new Alpha features

The breadth — 70 enhancements across stable, beta, and alpha — reflects a release that is maturing existing bets (DRA, admission policies, security primitives) while planting new alpha flags for the next cycle. For platform engineers running GPU clusters, the practical near-term step is enabling the WorkloadWithJob feature gate on non-production clusters and stress-testing PodGroup behaviour before the API moves toward beta.