Cloudflare cut core server boot time from 4 hours to minutes by fixing UEFI timeouts

Cloudflare's core edge servers were taking 4 hours to reboot after firmware updates. A 4-hour reboot window means you can't do emergency patches, you can't cycle hardware fast, and you can't scale fleet operations. They published the forensics yesterday: the problem was UEFI timeout loops and iPXE automation assumptions that stopped working once the firmware changed.

The fix brought boot time back to minutes. This matters for anyone running agentic infrastructure at scale, especially if your agents boot from network images or need fast cold-start cycles.

What actually happened

Cloudflare's core servers boot via iPXE — they pull a network image, not a local disk. Before the firmware update, boot took ~10 minutes. After the update, it jumped to 4 hours.

The culprit was UEFI's boot device enumeration. The firmware was looping through every possible boot device with a 5-second timeout per device. Cloudflare's hardware has a lot of devices (NVMe drives, network adapters, USB ports). Each timeout added up. The new firmware also changed how it handled network boot priorities, which broke iPXE's assumptions about which device would be selected first.

The UEFI spec allows vendors to implement device enumeration however they want. Some vendors iterate sequentially with fixed timeouts. Others fail fast. Cloudflare's vendor picked the slow path.

The two-part fix

First, they eliminated unnecessary boot devices from the UEFI device list. If a drive or port isn't needed for boot, UEFI doesn't need to check it. They used UEFI shell commands to prune the list. That cut the timeout window from hours to ~30 minutes.

Second, they rewrote their iPXE automation to explicitly set the correct network adapter as the first boot device. The old automation assumed the firmware would pick the right adapter by default. The new firmware didn't. Explicit device selection fixed that.

Boot time dropped back to 8 minutes.

Why this matters for agentic infrastructure

If you're running voice agents, RAG backends, or multi-tenant platforms on edge hardware, boot time is cold-start time. A 4-hour cold start means you can't redeploy on demand. You can't cycle hardware for updates. You can't scale horizontally fast.

We run Goldie (our florist voice agent) on ElevenLabs' hosted infrastructure, so we don't own the boot cycle. But we've seen similar timeout issues with Docker container cold starts when the base image pulls over a slow network. A 30-second cold start becomes a 3-minute cold start if the image registry times out or retries.

The lesson is the same: timeouts compound. If your infrastructure has sequential timeout loops anywhere in the boot/start path, you need to profile it and fix it before you scale.

What Cloudflare's blog post doesn't say

The post doesn't mention whether they tested the fix on a staging fleet before rolling it to production. It also doesn't say how many servers were affected or how long the rollout took. Those are the operational details that matter if you're trying to replicate this for your own stack.

They also don't explain why the firmware vendor changed the boot enumeration behavior in the first place. Was it a security patch? A new UEFI spec requirement? A bug? Knowing that would help predict whether other vendors will make the same change.

What to check in your own stack

If you're running agents on bare metal or VM instances that boot from network images:

Profile your cold-start time end-to-end. Measure the BIOS/UEFI phase separately from the OS boot phase. Use IPMI serial console logs or equivalent to capture the UEFI output.
If you see multi-minute UEFI phases, check the boot device list. Remove anything you don't need.
If you're using PXE or iPXE, explicitly set the boot device order in your automation. Don't assume the firmware will pick the right one.

Cloudflare's edge deployment model is extreme — they boot thousands of servers from network images every day. Most agentic deployments don't operate at that scale. But the timeout-accumulation pattern shows up everywhere: Kubernetes pod starts, Lambda cold starts, Docker image pulls, SSH key exchanges. Anywhere you have a sequential retry loop with fixed timeouts, you need to measure it.

The Cloudflare post is worth reading in full if you run infrastructure at scale. It's one of the better public breakdowns of a boot-time regression we've seen this year.

Cloudflare cut core server boot time from 4 hours to minutes by fixing UEFI timeouts — here's the diff

What actually happened

The two-part fix

Why this matters for agentic infrastructure

What Cloudflare's blog post doesn't say

What to check in your own stack

Uber capped Claude Code usage after blowing four months of AI budget — here's what that means for enterprise rollout

SQLite shipped an AGENTS.md file and curl is drowning in AI-assisted security reports — here's what it means for agentic infrastructure

Tell us what you'd like built.