How to Build a Voice AI Agent for Phone Calls
Phones are back — and the businesses winning right now are the ones whose phones get answered in two rings at any hour of the day. This is the 2026 playbook for building a voice AI agent that sounds human, responds under a second, and actually closes the loop on what the caller needed.
Key Takeaways
- A production voice AI agent is a real-time pipeline: streaming ASR, a fast LLM, tool calls, streaming TTS, and telephony — all running in under 800ms per turn.
- Latency beats intelligence. A slow GPT-5 is worse than a fast Claude Haiku for phone calls, because callers hang up on awkward pauses.
- Budget 4 to 8 weeks for a done-for-you build. All-in running cost is typically $0.08 to $0.22 per minute.
- The two killers of DIY voice agents are poor interruption handling and no tool-calling. Both are solvable but neither is free.
Why voice is having a moment again
Here is a paradox. The average American checks their phone 144 times a day, yet most small and mid-market businesses miss one in three inbound calls. The ones they do answer pull a senior team member out of real work to book a simple appointment. In high-stakes industries — healthcare, legal, home services, financial services — a missed call is often the only shot at that customer, because competitors are just one Google search away.
Voice AI agents in 2026 are finally good enough to answer that phone. The combination of sub-second streaming models, neural voices that sound indistinguishable from a human on a phone call, and reliable tool-calling means you can put an AI on the line that books the appointment, updates the CRM, sends the confirmation text, and hands off gracefully when something gets complicated.
The business case is also starker than for most AI agent categories. A full-time receptionist or SDR costs $45k–$80k fully loaded in the US. A voice AI agent that handles 80 percent of inbound calls runs closer to $800–$3,000 a month in all-in infrastructure. The delta pays for the entire build in 60 to 90 days.
The real-time voice AI pipeline
A voice AI agent is not one model — it's a pipeline of models running in real time. A caller speaks, audio streams into speech-to-text, text streams into a language model, the model's response streams out as synthesized speech, all while the caller might interrupt. Every stage has a latency budget. Every stage can break naturalness if it's off by 200ms.
- Telephony ingress. The call arrives via SIP or a provider like Twilio, Vonage, or Telnyx. Audio is streamed in 20ms packets.
- Streaming speech-to-text (ASR). Whisper, Deepgram Nova-3, AssemblyAI Universal-Streaming, or Google USM produces partial and final transcripts as the caller speaks.
- Voice activity detection and turn-taking. A small model detects when the caller has finished a turn — and also when they interrupt the agent.
- LLM with tool-calling. The transcript plus conversation state plus tool definitions goes to an LLM. The LLM either returns text or decides to call a tool.
- Tool execution. Book the appointment, look up the order, create the ticket. Results feed back to the model.
- Streaming text-to-speech (TTS). ElevenLabs, Cartesia Sonic, OpenAI TTS, or PlayHT turns the response into audio as the LLM is still producing tokens.
- Telephony egress. Audio streams back to the caller, ideally within 800ms of them finishing their sentence.
Every component has to stream. If any stage waits for the previous stage to finish before starting, you've already lost the latency game.
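The stage-to-stage streaming above can be sketched as a chain of async generators, where each stage yields output as soon as it has anything rather than waiting for its input to finish. This is a minimal illustration with stub stages — the function names and fake data are placeholders, not any real provider's API:

```python
import asyncio

# Hypothetical stub stages. In production each one wraps a streaming
# API (Deepgram, an LLM, ElevenLabs, etc.). The point is the shape:
# every stage is an async generator that yields immediately.

async def asr(audio_packets):
    """Yield partial transcripts as audio arrives (stub)."""
    words = []
    async for packet in audio_packets:
        words.append(packet)        # pretend each packet decodes to a word
        yield " ".join(words)       # partial transcript, refined per packet

async def llm(transcripts):
    """Consume partials as they come; emit response tokens (stub)."""
    async for _ in transcripts:
        pass                        # a real model conditions on partials
    for token in ("Sure,", " booking", " that", " now."):
        yield token

async def tts(tokens):
    """Synthesize audio chunk-by-chunk while tokens stream in (stub)."""
    async for token in tokens:
        yield f"<audio:{token.strip()}>"

async def call_turn():
    async def mic():
        for word in ("book", "me", "tuesday"):
            await asyncio.sleep(0.02)   # 20ms packet cadence
            yield word
    # The whole pipeline is one chained stream: no stage blocks on the
    # previous stage finishing, which is what keeps a turn under ~800ms.
    return [chunk async for chunk in tts(llm(asr(mic())))]

print(asyncio.run(call_turn()))
```

The design choice worth noticing is that nothing in the chain ever materializes a full intermediate result; swap any stub for a real streaming client and the shape stays the same.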
The 2026 voice AI stack
There are now three legitimate paths to stand up this pipeline, each with real tradeoffs.
| Path | What it is | Best for | Tradeoffs |
|---|---|---|---|
| Unified realtime API (OpenAI Realtime, Gemini Live) | Single API handling ASR + LLM + TTS in one duplex stream | Fastest time to prototype; simple use cases | Less control, vendor lock-in, higher per-minute cost |
| Voice agent platform (Vapi, Retell, Bland, LiveKit Agents) | Managed orchestration over best-of-breed ASR/LLM/TTS | Production builds in 4 to 8 weeks | Some platform lock-in, but most allow bring-your-own models |
| Fully custom pipeline | You wire ASR, LLM, TTS, and telephony together yourself | High-volume or regulated deployments needing total control | 3 to 6 month engineering effort; ongoing ops burden |
At Bananalabs, we build most production voice agents on a platform like LiveKit Agents or Retell, because they give us 80 percent of the custom-pipeline control at 20 percent of the engineering cost. For very high call volumes or specialized compliance needs, we go fully custom. For proofs of concept, OpenAI Realtime is genuinely impressive out of the box. The same layered decision logic applies across agent builds in general — see how to build an AI agent for the generic framing.
Why latency is the whole game
If you take one thing from this post, take this: on a phone call, latency beats intelligence. A 600ms response from a smaller model will outperform a 2.4 second response from a state-of-the-art frontier model. Callers interpret silence as confusion. They start talking over the agent. Conversations go sideways.
Here's how you get there. Use streaming everywhere — partial transcripts feed the LLM before the caller finishes their turn, and TTS starts synthesizing the first sentence of the response before the LLM is done generating. Use "filler words" (the agent saying "let me check that for you..." while a tool call is running) to hide tool latency. Run your infra in the same region as your telephony provider's media servers. Pick an LLM tier that's fast enough: Claude 3.5 Haiku, GPT-5 mini, or Gemini 2 Flash are the workhorses in 2026.
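One common way to implement the "TTS starts before the LLM is done" piece is to flush the token stream to the synthesizer at sentence boundaries. A minimal sketch, assuming a plain iterator of tokens and a deliberately crude boundary rule:

```python
import re

def sentence_chunks(token_stream):
    """Group an LLM token stream into sentence-sized chunks for TTS.

    Flushing at sentence boundaries, rather than waiting for the full
    reply, lets synthesis start while the model is still generating.
    Sketch only: real pipelines handle abbreviations, numbers, and
    multi-sentence flushes more carefully.
    """
    buf = ""
    for token in token_stream:
        buf += token
        # Crude boundary check: a ., ?, or ! followed by whitespace
        # already sitting in the buffer means a sentence is complete.
        match = re.search(r"[.?!]\s", buf)
        if match:
            cut = match.end()
            yield buf[:cut].strip()   # ship this sentence to TTS now
            buf = buf[cut:]
    if buf.strip():
        yield buf.strip()             # flush whatever is left at end of stream

tokens = ["Sure", ", I can", " help. ", "What", " time", " works?"]
print(list(sentence_chunks(tokens)))
```

The first sentence here reaches the synthesizer while half the tokens are still "in flight", which is exactly the overlap that buys back hundreds of milliseconds per turn.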
Designing conversation flow
Voice conversations have different design rules than text. Messages are shorter. Turns are faster. The caller can't scroll back. You can't show options as buttons. The agent must be ruthless about keeping responses under two sentences unless the caller explicitly asks for more detail.
A few design heuristics that make a visible difference in CSAT:
- Open with identity and scope. "Hi, this is Ava, the AI assistant at Acme Clinic. I can help you book, reschedule, or cancel an appointment. What do you need?"
- One question at a time. Never stack three questions in a single turn.
- Confirm commitments aloud. "Booking you for Tuesday April 23rd at 3 PM with Dr. Chen — does that work?"
- Signpost when you're working. "One moment while I check the calendar." This hides 1 to 3 seconds of tool latency.
- Offer the escape hatch early and often. "You can say 'human' anytime to be transferred."
The 10-step build process
This is the sequence we follow on done-for-you voice AI engagements. It looks similar to a WhatsApp AI build but with meaningful differences around latency, interruption handling, and telephony.
Step 1 — Pick a single call type
Do not try to handle every call your business receives in version one. Pick the single most common call type — appointment booking, order status, lead intake, or after-hours overflow — and build that first. Expand scope in month two.
Step 2 — Choose telephony and numbers
Twilio, Telnyx, or Vonage are the default choices in 2026. Port your existing business number or buy a new one. Set up call recording (essential for QA and dispute resolution — see the compliance section for consent rules), DTMF handling (for accessibility), and SIP trunking if you have an existing PBX.
Step 3 — Write the conversation spec
Write a full happy-path transcript. Then write five branch conversations for common variations. Then write three disaster conversations — angry caller, silent caller, multi-topic caller. These become your test suite.
Step 4 — Build tool integrations
The tools are the reason you're building this. A check_calendar tool, a book_appointment tool, a lookup_order tool, a transfer_call tool. Each has a typed schema, input validation, and audit logging. For a deeper primer, see our notes on AI agent architecture.
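As a concrete illustration, here is what a typed tool definition plus minimal input validation might look like, in the JSON-schema style most LLM tool-calling APIs accept. The tool name and fields are hypothetical, and a production build would use a schema library rather than hand-rolled checks:

```python
# Illustrative tool definition -- field names are examples, not a
# template from any specific platform.
BOOK_APPOINTMENT = {
    "name": "book_appointment",
    "description": "Book an appointment slot for the caller.",
    "parameters": {
        "type": "object",
        "properties": {
            "patient_name": {"type": "string"},
            "slot_iso": {"type": "string", "description": "e.g. 2026-04-23T15:00"},
            "reason": {"type": "string"},
        },
        "required": ["patient_name", "slot_iso"],
    },
}

def validate_args(tool, args):
    """Reject bad tool-call arguments before touching real systems."""
    schema = tool["parameters"]
    missing = [k for k in schema["required"] if k not in args]
    if missing:
        raise ValueError(f"missing required args: {missing}")
    for key, value in args.items():
        if key not in schema["properties"]:
            raise ValueError(f"unexpected arg: {key}")
        if schema["properties"][key]["type"] == "string" and not isinstance(value, str):
            raise ValueError(f"{key} must be a string")
    return args

print(validate_args(BOOK_APPOINTMENT,
                    {"patient_name": "Pat", "slot_iso": "2026-04-23T15:00"}))
```

Validation failures should route back to the model as tool errors it can recover from conversationally, never crash the call.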
Step 5 — Pick your voice
The voice matters more than most teams think. Test three to five voices with real callers before committing. Warmth and clarity matter more than "impressive" range. ElevenLabs and Cartesia have the deepest libraries; OpenAI Realtime ships strong default voices.
Step 6 — Write and test the system prompt
Keep it under 2,500 tokens. Cover identity, scope, tone, escalation triggers, and tool-use rules. Add specific instructions about interruption handling and short-turn style. Test with your conversation spec.
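To make that concrete, here is a sketch of a prompt skeleton with a crude budget check (roughly four characters per token is a common approximation). The persona, scope, and rules are illustrative, not a template from any particular platform:

```python
# Illustrative system-prompt skeleton for a voice agent. The persona
# ("Ava" / "Acme Clinic") and every rule below are example content.
SYSTEM_PROMPT = """\
You are Ava, the AI phone assistant for Acme Clinic.
Scope: book, reschedule, or cancel appointments. Nothing else.
Style: spoken conversation. Maximum two short sentences per turn.
Ask one question at a time. Confirm bookings aloud before finalizing.
If the caller talks over you, stop immediately and listen.
Say "One moment while I check the calendar" before any tool call.
Transfer to a human if the caller says "human", sounds distressed,
or asks about billing disputes or medical advice.
"""

def approx_tokens(text: str) -> int:
    # ~4 chars/token is a rough heuristic, good enough for budgeting.
    return len(text) // 4

assert approx_tokens(SYSTEM_PROMPT) < 2500, "keep the prompt lean"
print(approx_tokens(SYSTEM_PROMPT), "tokens (approx)")
```

Running the budget check in CI keeps the prompt from quietly bloating as the team adds rules over time.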
Step 7 — Wire the pipeline
Stream audio through VAD, ASR, LLM, and TTS. Implement interruption handling — when the caller talks over the agent, cut TTS immediately and start listening. Add filler-word generation during tool calls.
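The interruption-handling piece boils down to: run TTS playback as a cancellable task, and cancel it the instant VAD detects caller speech. A toy sketch, with timed sleeps standing in for real audio playback and a hard-coded moment standing in for the VAD event:

```python
import asyncio

async def speak(chunks, cut_note):
    """Play TTS chunks; stops mid-utterance when cancelled (barge-in)."""
    played = []
    try:
        for chunk in chunks:
            await asyncio.sleep(0.05)    # stand-in for playing one audio chunk
            played.append(chunk)
    except asyncio.CancelledError:
        cut_note.append(len(played))     # record where we were cut off
        raise
    return played

async def turn_with_barge_in():
    note = []
    tts_task = asyncio.create_task(
        speak(["We have", " openings", " Tuesday", " at three", " and..."], note)
    )
    await asyncio.sleep(0.12)   # VAD fires: the caller started talking over us
    tts_task.cancel()           # cut playback immediately, go back to listening
    try:
        await tts_task
    except asyncio.CancelledError:
        pass
    return note[0]              # chunks played before the interruption

print("chunks played before barge-in:", asyncio.run(turn_with_barge_in()))
```

The same cancel-and-listen move also needs to flush any audio already buffered at the telephony layer, otherwise the caller hears the agent "finish" a sentence it was supposed to abandon.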
Step 8 — Build the handoff path
Warm transfers matter. When the agent transfers to a human, the human should see a Slack message or CRM popup with the caller's name, phone number, intent, and transcript summary. Nothing kills CSAT like a human asking the caller to repeat themselves.
Step 9 — Test at volume
Before live traffic, run 50 to 100 synthetic calls through the agent, plus 10 to 20 real calls from your team. Measure end-to-end latency, task completion rate, and voice quality. Iterate.
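When summarizing latency across those test calls, report percentiles rather than averages: one 3-second stall feels broken even if the mean turn is fast. A small sketch with made-up numbers:

```python
def latency_report(turn_latencies_ms):
    """Summarize end-to-end turn latencies from a batch of test calls.

    Illustrative helper; real data comes from per-turn timestamps
    emitted by your pipeline.
    """
    xs = sorted(turn_latencies_ms)
    p50 = xs[len(xs) // 2]
    p95 = xs[min(len(xs) - 1, int(len(xs) * 0.95))]
    return {
        "p50_ms": p50,
        "p95_ms": p95,
        "over_800ms": sum(x > 800 for x in xs),  # turns that blew the budget
    }

sample = [540, 610, 590, 720, 655, 980, 630, 705, 760, 1240]
print(latency_report(sample))
```

Tracking "turns over 800ms" as a count, not just a percentile, makes regressions visible even when the distribution's tail is thin.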
Step 10 — Launch with a safety net
Start by routing overflow calls (after hours or when the main queue is full) to the agent. Once that's stable, promote to primary. Keep a human takeover button accessible and watch the first 500 calls manually.
Realistic cost model
Unlike most AI agent categories, voice has a predictable per-minute cost. Here's the typical breakdown in 2026:
| Component | Cost per minute |
|---|---|
| Telephony (Twilio, Telnyx) | $0.008 – $0.015 |
| Streaming ASR (Deepgram, AssemblyAI) | $0.015 – $0.042 |
| LLM inference (Haiku / GPT-5 mini class) | $0.020 – $0.080 |
| Streaming TTS (ElevenLabs, Cartesia) | $0.030 – $0.080 |
| Orchestration platform / compute | $0.010 – $0.030 |
| All-in total | $0.08 – $0.22 |
A 4-minute booking call costs under $1 in infrastructure. A dental office doing 500 inbound calls per week at an average 3 minutes (about 6,500 minutes a month) is running roughly $520 to $1,430 a month — replacing about 25 hours of receptionist time a week.
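Plugging mid-range figures from the table into a quick calculator (the rates here are picked from inside this post's own per-minute ranges, and the call volume is the dental-office example):

```python
# Illustrative mid-range per-minute rates from the cost table above.
PER_MINUTE = {
    "telephony": 0.012,
    "asr": 0.028,
    "llm": 0.050,
    "tts": 0.055,
    "orchestration": 0.020,
}

def call_cost(minutes: float) -> float:
    """Infrastructure cost of a single call at the rates above."""
    return round(minutes * sum(PER_MINUTE.values()), 3)

def monthly_cost(calls_per_week: int, avg_minutes: float,
                 weeks: float = 4.33) -> float:
    """All-in monthly infrastructure cost for a given call volume."""
    return round(calls_per_week * avg_minutes * weeks * sum(PER_MINUTE.values()), 2)

print(call_cost(4))          # a 4-minute booking call
print(monthly_cost(500, 3))  # the dental-office example
```

Swap in your actual vendor rates and call volume; the structure stays the same, and the orchestration line is the one most builds forget to include.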
Disclosure, recording, and compliance
Voice AI has sharper compliance rules than most agent categories. Hit these early or you will be cleaning them up in production:
- AI disclosure. Most US states and the EU AI Act require identifying the caller as AI. Do this in the opener.
- Call recording consent. Two-party-consent states (California, Florida, and others) require explicit consent. A short greeting — "This call may be recorded for quality" — handles the standard case.
- TCPA and DNC. For outbound calls, honor the Do Not Call registry and TCPA rules. Inbound is lower risk but still requires prudence.
- HIPAA. For healthcare, you need a BAA with every vendor in the pipeline. Most major voice platforms now offer HIPAA-compliant tiers. See AI agents for healthcare for the full picture.
- Data residency. If you're serving EU callers, run infrastructure in EU regions and pick vendors with EU processing options.
Five ways voice agents fail
- No interruption handling. If the agent keeps talking after the caller starts talking, callers hang up. Implement proper barge-in from day one.
- Too-long turns. A single three-sentence reply feels like a lecture on a phone call. Keep turns short.
- Silent tool calls. If a tool takes 2 seconds, the agent must say something in that gap. "Let me check that for you" is your friend.
- Unreliable transfers. If the warm transfer drops or the human doesn't get context, customers feel worse than if there had been no AI at all.
- Treating it like a chatbot. Voice is its own medium. Don't reuse your chatbot prompt. Re-think everything for real-time turn-taking.
If you're weighing whether a voice agent is the right first project, or whether you should start with email, WhatsApp, or web chat, our overview of what AI agents can do will help you prioritize.
Frequently Asked Questions
What is a voice AI agent?
A voice AI agent is an autonomous system that answers phone calls, understands natural speech, and takes real actions — booking appointments, qualifying leads, transferring calls, or updating systems — using a combination of speech-to-text, a large language model, tool-calling, and neural text-to-speech. Unlike a traditional IVR, it holds a real two-way conversation and adapts to what the caller actually says.
How fast does a voice AI agent need to respond?
Target end-to-end response latency under 800 milliseconds for a natural conversation. The human ear notices awkwardness above 1 second and experiences clear unease above 1.5 seconds. Achieving sub-800ms requires streaming speech-to-text, a low-latency LLM (Claude Haiku, GPT-5 mini, or Gemini Flash), streaming TTS (ElevenLabs Flash or Cartesia), and co-located infrastructure.
How much does it cost to run a voice AI agent per minute?
Voice AI agents cost roughly $0.08 to $0.22 per minute all-in in 2026. This covers telephony (around $0.01/min), speech-to-text ($0.02 to $0.04), LLM inference ($0.02 to $0.08), text-to-speech ($0.03 to $0.08), and orchestration/compute ($0.01 to $0.03). A 4-minute booking call typically costs under $1 in infrastructure — often 1/30th of the labor it replaces.
Can a voice AI agent transfer a call to a human?
Yes. A well-built voice AI agent performs warm transfers to humans using the telephony provider's call-control API (SIP REFER or Twilio TwiML dial). Best practice is to pass structured context — caller name, intent, relevant account data — to the human agent via a CRM popup or Slack message so the human doesn't have to ask the caller to repeat themselves.
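For the Twilio path, the dial step of a warm transfer can be expressed in TwiML using the `<Dial>` verb. A sketch, where the destination number and the `action` callback URL are placeholders:

```xml
<!-- Illustrative TwiML: announce the transfer, then dial the human
     line. action="/transfer-result" is a hypothetical webhook that
     receives the dial outcome so failed transfers can be handled. -->
<Response>
  <Say>One moment, transferring you to a team member.</Say>
  <Dial action="/transfer-result" timeout="20">
    <Number>+15551234567</Number>
  </Dial>
</Response>
```

The structured context handoff (CRM popup, Slack message) happens out of band, triggered by your backend at the same moment the dial is issued.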
Is it legal to use an AI voice agent on customer calls?
Yes in most jurisdictions, but disclosure is required. In the US, the FCC's 2024 TCPA ruling and many state laws require identifying the caller as AI. The EU AI Act requires disclosure of any AI system interacting with humans. A one-line opener like 'Hi, this is Ava, the AI assistant at Acme — how can I help?' satisfies most regimes and is the industry standard in 2026.