Migrating Goldie from Retell to ElevenLabs in four days

Goldie is the AI voice concierge we built for Golden Plate Events: a 24/7 inbound agent for corporate catering inquiries across the five boroughs. She answers calls, qualifies leads, captures the order context, and posts directly to the CRM. She's been live for months.

She used to run on Retell. Now she runs on ElevenLabs ConvAI. The migration took four working days end-to-end. Here's what the diff actually looked like.

Why we migrated

Two reasons, both operational.

Voice quality. Retell is solid for transactional flows. ElevenLabs is in a different class for prosody, hand-off naturalness, and the "this sounds like a person" threshold that matters when the call is the first impression of the brand. For a catering concierge whose job is to make a $5K to $50K event feel handled-with-care from second one, the voice quality is the product.

Tooling depth. ElevenLabs ConvAI now has what we needed for the next phase: structured tool calling that maps cleanly to our CRM ingest, native interruption handling, dynamic context injection mid-call, and a conversation evaluator we can wire into our own eval pipeline. Retell's tool calling worked, but we were starting to build around its limits.

Day-by-day

Day 1, spec extraction. Pulled Goldie's existing prompt from Retell, the tool definitions, the call-handling rules, and a representative sample of sixty transcripts from the last ninety days. The transcripts mattered more than the prompt. They told us what the agent actually does in production, which is always different from what the prompt says it should do.

Day 2, prompt port and tool re-binding. Rebuilt the prompt using ElevenLabs' structure of a system message plus dynamic variables. Rewrote the four tools (qualify_lead, capture_event_context, schedule_callback, post_to_crm). The CRM ingest endpoint stayed exactly the same. The only thing changing was the agent calling it. What took longest was tuning the conversation flow so the agent didn't ask the same question twice. In our experience ElevenLabs agents have stronger short-term memory than Retell's by default, so a workaround we had needed in Retell became unnecessary in ElevenLabs and had to be removed before it caused new bugs.

Day 3, Twilio re-pointing and parallel deploy. We didn't replace Goldie. We deployed the new ElevenLabs agent on a separate Twilio number, ran both numbers live, and routed 10% of incoming calls to the new one for two days. Both agents posted to the CRM with a version tag so we could compare lead quality side-by-side. Parallel deploy is non-negotiable for voice migrations. The cost of failure is too high for a hard cutover.
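The split and the version tag can be sketched in a few lines. This is an illustration under stated assumptions, not the routing we ran: the function names, the agent labels, and the idea of hashing the caller's number are all hypothetical, and nothing here touches the Twilio API. Hashing the caller rather than rolling a random number means a repeat caller always lands on the same agent, which keeps the side-by-side comparison clean.

```python
import hashlib

NEW_AGENT_PERCENT = 10  # share of callers sent to the new agent (10% in our rollout)


def pick_agent(caller_number: str, new_agent_percent: int = NEW_AGENT_PERCENT) -> str:
    """Deterministically assign a caller to the old or the new agent.

    The labels "retell" and "elevenlabs" are illustrative version tags.
    """
    digest = hashlib.sha256(caller_number.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto a bucket in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "elevenlabs" if bucket < new_agent_percent else "retell"


def tag_lead(lead: dict, agent_version: str) -> dict:
    """Return a copy of the CRM payload carrying the version tag."""
    return {**lead, "agent_version": agent_version}


if __name__ == "__main__":
    caller = "+12125550100"  # made-up number
    agent = pick_agent(caller)
    print(tag_lead({"caller": caller, "headcount": 40}, agent))
```

Raising the percentage in steps gives a gradual ramp, and setting it to 100 is the cutover.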

Day 4, cutover and eval baseline. Lead quality on the new agent was equal-or-better across all four scoring dimensions. Voice naturalness scores from the customer-side rubric were materially higher. We swapped the production Twilio number to point at ElevenLabs and decommissioned the Retell agent. The first eval suite run on the new agent went into the dashboard the same day.

What got better

Naturalness. Three customers in week one mentioned, unprompted, that they didn't realize they were talking to an AI until the agent told them. That didn't happen on Retell.

Interruption handling. Customers talk over agents constantly. ElevenLabs handles it cleanly. Retell's interruption support worked, but it felt mechanical.

Structured tool output. The CRM payload coming out of ElevenLabs is cleaner: fewer null fields, more accurate event-date parsing, better at distinguishing a single inquiry from a multi-event ask.

What broke

Latency on the first turn. ElevenLabs' first-response time was about 400ms slower than Retell's. We added a pre-warmed greeting and a brief preamble. The agent says "Hi, you've reached Golden Plate, this is Goldie" while it's still loading the call context. By the time the customer responds, the agent is fully ready.
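The shape of that fix is "start the slow work, then talk while it runs." Here is a minimal asyncio sketch of the pattern only. The function names and the sleep durations are stand-ins we made up for illustration; none of this is the ElevenLabs API or the actual agent configuration.

```python
import asyncio

GREETING = "Hi, you've reached Golden Plate, this is Goldie"


async def load_call_context(caller_number: str, events: list) -> dict:
    """Stand-in for the slow context fetch (CRM lookup, dynamic variables)."""
    await asyncio.sleep(0.05)
    events.append("context_loaded")
    return {"caller": caller_number}


async def play_greeting(text: str, events: list) -> None:
    """Stand-in for streaming a pre-rendered greeting to the caller."""
    events.append("greeting_started")
    await asyncio.sleep(0.2)  # the greeting takes longer than the context load
    events.append("greeting_finished")


async def start_call(caller_number: str) -> tuple:
    events = []
    # Schedule the context load first so it runs while the greeting plays.
    context_task = asyncio.create_task(load_call_context(caller_number, events))
    await play_greeting(GREETING, events)
    context = await context_task  # already finished by the time we get here
    return context, events


if __name__ == "__main__":
    context, events = asyncio.run(start_call("+12125550100"))
    print(events)
```

The caller hears the greeting immediately, and the context load finishes somewhere in the middle of it instead of in front of it.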

Calendar tool returning timezone-aware dates. Retell stripped timezones aggressively. ElevenLabs preserves them, which is correct, but our CRM ingest didn't expect them. It was a one-line fix in the ingest API, but it would have been a multi-hour mystery if parallel deploy hadn't caught it.
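For readers who hit the same class of bug, here is one way an ingest layer can accept both shapes. It is a sketch, not our actual one-line fix: the function name, the ISO 8601 input format, and the choice to store UTC are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone, tzinfo

# Fixed offset used only to keep the example dependency-free. A real service
# would want a DST-aware zone, e.g. zoneinfo.ZoneInfo("America/New_York").
EASTERN_STANDARD = timezone(timedelta(hours=-5))


def normalize_event_time(raw: str, default_tz: tzinfo = EASTERN_STANDARD) -> datetime:
    """Parse an ISO 8601 timestamp and return it as timezone-aware UTC.

    Naive inputs (the old agent's behavior) are assumed to be in default_tz.
    Aware inputs (the new agent's behavior) keep their own offset.
    """
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=default_tz)
    return parsed.astimezone(timezone.utc)


if __name__ == "__main__":
    print(normalize_event_time("2025-03-01T18:00:00"))        # naive input
    print(normalize_event_time("2025-03-01T18:00:00-05:00"))  # aware input
```

Both calls resolve to the same instant, so the ingest no longer cares which agent produced the payload.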

Voice ID drift. The exact voice we wanted was on a paid tier we hadn't subscribed to yet. The closest free-tier match was about 90% there, but the brand sounded slightly different. Solved by upgrading. Budget for it in advance next time.

What we'd do differently

Next time, parallel deploy starts on day one. Building in isolation for two days and only routing traffic on day three meant we caught issues later than we should have. Those two extra days of dual-version traffic would have shaved another half-day off the timeline and surfaced the timezone bug 24 hours earlier.

We're also writing the eval suite before porting the prompt next time. Goldie's old eval suite was Retell-shaped. It tested for behaviors the old agent had. The new agent passed those tests easily but had its own failure modes the old suite didn't cover. We rewrote the suite around behavior outcomes (lead quality, naturalness, completion rate) rather than prompt-shape, and that's the version we keep.
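A behavior-shaped check can be very small. The sketch below compares two agent versions on the three outcomes named above and reports any metric where the candidate falls behind. The record layout, the 0-to-1 score scale, and the tolerance value are hypothetical; our real rubric is not reproduced here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallOutcome:
    agent_version: str   # version tag posted to the CRM, e.g. "retell"
    lead_quality: float  # hypothetical 0-1 score from a lead rubric
    naturalness: float   # hypothetical 0-1 score from a customer-side rubric
    completed: bool      # did the call reach a CRM post?


def summarize(calls: list, version: str) -> dict:
    """Average each outcome metric over the calls handled by one version."""
    rows = [c for c in calls if c.agent_version == version]
    if not rows:
        raise ValueError(f"no calls recorded for version {version!r}")
    return {
        "lead_quality": mean([c.lead_quality for c in rows]),
        "naturalness": mean([c.naturalness for c in rows]),
        "completion_rate": mean([1.0 if c.completed else 0.0 for c in rows]),
    }


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list:
    """Names of metrics where the candidate trails the baseline by more than tolerance."""
    return [name for name in baseline if candidate[name] < baseline[name] - tolerance]


if __name__ == "__main__":
    calls = [  # made-up data for illustration
        CallOutcome("retell", 0.5, 0.5, True),
        CallOutcome("retell", 0.75, 0.5, False),
        CallOutcome("elevenlabs", 0.75, 0.75, True),
        CallOutcome("elevenlabs", 0.75, 1.0, True),
    ]
    print(regressions(summarize(calls, "retell"), summarize(calls, "elevenlabs")))
```

Because nothing in it refers to prompt wording or a provider's transcript format, the same suite can run against either agent during parallel deploy.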

If your provider's tooling is starting to feel like a ceiling instead of a floor, migration is more tractable than it feels. Four days for a complex agent with four tools and a live customer base. The disciplines that mattered: parallel deploy, behavior-shaped evals, and not trying to make the new agent identical to the old one.