Vapi Review 2026: Voice Agent Infrastructure for Developers Who Ship

This site contains affiliate links. We may earn a commission at no extra cost to you.

Vapi is voice agent infrastructure. Not a chatbot builder with a microphone bolted on, not a no-code demo platform—it is the orchestration layer that connects speech-to-text, large language models, and text-to-speech into a single real-time voice pipeline. If Twilio is the infrastructure layer for traditional telephony, Vapi is positioning itself as the equivalent for AI-powered voice conversations.

People searching for a Vapi review generally fall into two groups. The first group consists of developers who have already tried wiring together Deepgram, an LLM, and ElevenLabs manually, hit the latency and turn-taking problems that make real-time voice brutally hard, and want to know whether Vapi actually solves those problems. The second group consists of technical founders who are evaluating voice AI platforms for a production product (a customer service line, an AI receptionist, an outbound sales agent) and need to understand pricing, reliability, and lock-in risk before committing.

This review covers the architecture, provider ecosystem, pricing, and real limitations based on published documentation, developer community reports, and the platform's public API surface. No synthetic benchmarks from a weekend test project. Just what the platform does, what it costs, and where it breaks down.

What Vapi Does: The Voice Agent Orchestration Layer

At its core, Vapi manages the three-stage pipeline that every voice agent requires: converting speech to text (STT), sending that text to a language model for reasoning and response generation (LLM), and converting the response back to speech (TTS). This sounds simple on paper. In practice, doing it in real time with natural conversation dynamics is an engineering challenge that has consumed entire teams.

Vapi handles the hard parts that sit between those three stages. Turn-taking detection—knowing when a user has finished speaking versus pausing mid-sentence. Interruption handling—stopping TTS playback when the user cuts in. Endpointing—determining the boundary between speech and silence with low enough latency that the conversation feels natural rather than laggy. Backchanneling—the subtle audio cues that signal the agent is listening.
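To make the endpointing problem concrete, here is a minimal sketch of the simplest possible approach: declare the turn over once speech has been heard and a fixed amount of silence follows. This is an illustration only, not Vapi's algorithm. The function name, the energy threshold, and the 600 ms silence window are all assumptions chosen for the example.

```python
from typing import Optional, Sequence


def find_end_of_turn(
    frame_energies: Sequence[float],
    frame_ms: int = 20,
    energy_threshold: float = 0.01,
    silence_ms: int = 600,
) -> Optional[int]:
    """Return the index of the frame at which the turn is declared over, or None.

    A turn ends once speech has been heard and is followed by at least
    `silence_ms` of consecutive frames below `energy_threshold`.
    """
    frames_needed = -(-silence_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0  # more speech arrived, so the pause was mid-sentence
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None
```

The trade-off is visible in the one parameter that matters: a short silence window cuts users off mid-thought, while a long one adds dead air to every response. Production systems typically layer smarter signals on top of a fixed timer like this one.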

Beyond the core pipeline, Vapi provides telephony infrastructure. You can provision phone numbers directly through the platform, receive inbound calls, make outbound calls, handle call transfers (both warm and cold), detect voicemail, and process DTMF tones. This is the piece that separates Vapi from lower-level tools like LiveKit or Daily—you get the full stack from phone number to functioning voice agent.

Architecture: How the Pipeline Works

Vapi's architecture is built around WebSocket streaming. When a call connects, audio streams from the telephony layer into the STT provider via WebSocket. Transcribed text flows to the configured LLM, which generates a response. That response streams token-by-token to the TTS provider, which begins generating audio before the full response is complete. The resulting audio streams back to the caller.

The critical design decision is streaming at every stage. Vapi does not wait for complete STT transcription before sending to the LLM. It does not wait for the full LLM response before starting TTS. This pipelining is what makes sub-second response times possible—each stage begins processing as soon as it receives partial input from the previous stage.
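The pipelining idea can be sketched with plain Python generators. The example below assumes sentence-level chunking between the LLM and TTS stages, which is one common approach and not necessarily what Vapi does internally. The `llm_tokens` and `synthesize` functions are stand-ins for real providers, not real API calls.

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "?", "!")


def llm_tokens(reply: str) -> Iterator[str]:
    """Stand-in for a streaming LLM: yields the reply one token at a time."""
    for word in reply.split():
        yield word + " "


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences so TTS can start on the first one."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


def synthesize(sentences: Iterable[str]) -> Iterator[str]:
    """Stand-in for streaming TTS: emits one audio chunk label per sentence."""
    for sentence in sentences:
        yield f"<audio:{sentence}>"


if __name__ == "__main__":
    reply = "Sure, I can help. What day works for you?"
    for chunk in synthesize(sentences_from_tokens(llm_tokens(reply))):
        print(chunk)
```

Because every stage is lazy, the first audio chunk is produced as soon as the first sentence is complete, while the rest of the reply is still being generated.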

Function calling (tool use) is integrated into the pipeline. You define tools in your assistant configuration—check appointment availability, look up a customer record, transfer the call—and the LLM can invoke them mid-conversation. Vapi handles the orchestration of pausing TTS, executing the function via your webhook or server URL, feeding the result back to the LLM, and resuming the conversation. Server-side tools execute synchronously on your infrastructure. Client-side tools execute in the browser for web-based agents.
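On the server side, a tool handler usually boils down to a dispatcher: look up the requested tool, run it, and return a result the LLM can read. The sketch below assumes a simplified request body of the form `{"name": ..., "arguments": {...}}`. That shape, the `tool` decorator, and `check_availability` are hypothetical and are not Vapi's actual webhook schema, so check the current API reference before wiring this to a real assistant.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool registry; names and payload shape are illustrative only.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Register a function under a tool name the LLM can request."""
    def decorator(fn):
        TOOLS[name] = fn
        return fn
    return decorator


@tool("check_availability")
def check_availability(date: str) -> dict:
    # Placeholder data; a real handler would query a calendar or database.
    booked = {"2026-03-02"}
    return {"date": date, "available": date not in booked}


def handle_tool_call(raw_body: str) -> str:
    """Run the requested tool and return a JSON result for the LLM."""
    call = json.loads(raw_body)
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        return json.dumps({"error": f"unknown tool: {call.get('name')}"})
    try:
        result = fn(**call.get("arguments", {}))
    except TypeError as exc:  # missing or unexpected arguments
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})
```

Returning errors as data rather than raising lets the LLM recover in conversation ("I could not check that date") instead of leaving the caller in silence while a webhook fails.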

The squad feature enables multi-agent architectures where different assistants handle different parts of a conversation. A receptionist agent qualifies the caller, then hands off to a scheduling agent, then transfers to a human if needed. Each agent in the squad has its own LLM configuration, system prompt, and tool set.
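Conceptually, a squad is a small routing table: each agent carries its own prompt and tools, plus rules for who takes over next. The sketch below models that idea with plain dataclasses. The `Agent` fields, the outcome labels, and the `"human"` sentinel are hypothetical and do not reflect Vapi's squad configuration schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Agent:
    name: str
    system_prompt: str
    tools: List[str] = field(default_factory=list)
    # Maps an outcome label to the name of the next agent.
    handoffs: Dict[str, str] = field(default_factory=dict)


def next_agent(squad: Dict[str, Agent], current: str, outcome: str) -> Optional[str]:
    """Return who takes over after `outcome`, or None to stay with `current`."""
    target = squad[current].handoffs.get(outcome)
    if target is not None and target != "human" and target not in squad:
        raise ValueError(f"{current} hands off to unknown agent {target!r}")
    return target


SQUAD = {
    "receptionist": Agent(
        name="receptionist",
        system_prompt="Greet the caller and find out what they need.",
        handoffs={"qualified": "scheduler", "needs_human": "human"},
    ),
    "scheduler": Agent(
        name="scheduler",
        system_prompt="Book an appointment for the caller.",
        tools=["check_availability"],
        handoffs={"needs_human": "human"},
    ),
}
```

For example, `next_agent(SQUAD, "receptionist", "qualified")` returns `"scheduler"`, and an outcome with no handoff rule returns `None`, meaning the current agent keeps the call.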

Supported Providers and Integrations

Vapi's value proposition depends heavily on its provider flexibility. Rather than building proprietary STT, LLM, or TTS, it integrates with the best options in each category and lets you mix and match.

Speech-to-Text Providers

Deepgram is the default and most commonly used STT provider on Vapi, offering low-latency streaming transcription with strong accuracy on conversational speech. AssemblyAI provides an alternative with different accuracy characteristics, particularly for specialized domains. Vapi also supports custom STT endpoints for teams running their own models, such as fine-tuned Whisper variants.

LLM Providers

OpenAI models (GPT-4o, GPT-4o-mini) are the most widely used on the platform due to their balance of speed and capability. Anthropic Claude models (Claude 4 Sonnet, Claude 4 Haiku) are supported for teams that want Claude's instruction-following strengths or need to stay within Anthropic's ecosystem. Google Gemini models are available as well. For teams with custom requirements, Vapi supports any OpenAI-compatible endpoint, which means you can bring fine-tuned models, self-hosted open-source models, or routing layers like OpenRouter.

Text-to-Speech Providers

ElevenLabs delivers the highest voice quality on the platform but adds latency and cost. Deepgram TTS offers significantly lower latency at the expense of some naturalness—a trade-off that makes sense for high-volume use cases where speed matters more than voice polish. PlayHT, LMNT, Cartesia, and Rime are also supported, each with different latency, quality, and pricing profiles. Azure TTS provides an enterprise option with Microsoft's neural voices.

Telephony Providers

Vapi provides built-in phone number provisioning, but also integrates with Twilio and Vonage for teams that already have telephony infrastructure. You can bring your own SIP trunks for maximum control over call routing and costs.

Capability and Pricing Overview

Capability | Details | Typical Cost Impact
Platform fee | Per-minute charge on all calls | $0.05/min
STT (Deepgram) | Streaming transcription, Nova-2 model | $0.0059/min
STT (AssemblyAI) | Streaming transcription, Universal model | $0.01/min
LLM (GPT-4o-mini) | Fast responses, lower cost | ~$0.005–0.02/min
LLM (GPT-4o) | Higher capability, higher latency | ~$0.02–0.06/min
LLM (Claude 4 Haiku) | Fast, strong instruction following | ~$0.005–0.02/min
TTS (Deepgram) | Low latency, functional quality | $0.015/min
TTS (ElevenLabs) | High quality, higher latency | $0.03–0.10/min
TTS (PlayHT) | Mid-range quality and latency | $0.02–0.05/min
Phone numbers | US/CA local and toll-free numbers | $2–5/mo per number
Call transfers | Warm and cold transfer via SIP/PSTN | Included in platform fee
Voicemail detection | Automated detection and handling | Included in platform fee
Function calling | Server-side and client-side tools | Included (you pay for webhooks)
Squad (multi-agent) | Multiple assistants per call | Included in platform fee
Dashboard & analytics | Call logs, transcripts, cost tracking | Included on all plans

The total per-minute cost for a typical production stack—Deepgram STT, GPT-4o-mini, Deepgram TTS—lands around $0.08/min. A premium stack with ElevenLabs TTS and GPT-4o can reach $0.15/min or higher. At 10,000 minutes per month, you are looking at $800 to $1,500 in Vapi costs alone, before your own infrastructure expenses.

Pricing Deep Dive

Vapi's pricing model is straightforward, but the total cost adds up faster than the $0.05/min headline suggests. The platform fee is a flat per-minute charge that applies for the full duration of every call. On top of that, Vapi passes through the actual cost of each provider in your pipeline, so the real per-minute figure is the platform fee plus your STT, LLM, and TTS charges.

For a concrete example: a five-minute customer service call using Deepgram STT ($0.0059/min), GPT-4o-mini (approximately $0.01/min depending on conversation length), and Deepgram TTS ($0.015/min) costs roughly $0.08/min total, or $0.40 for the full call. The same call with ElevenLabs TTS instead of Deepgram bumps to approximately $0.12/min, or $0.60 for five minutes.
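The arithmetic behind those figures is easy to reproduce. The sketch below uses the per-minute rates quoted in this review. The ElevenLabs rate of $0.055/min is an assumed point inside the $0.03–0.10 range, chosen so the total lands near the $0.12/min quoted above. Actual provider pricing changes, so treat every constant as a placeholder.

```python
PLATFORM_FEE = 0.05  # USD per minute, from the pricing table

# Per-minute provider rates in USD, taken from this review's figures.
STT = {"deepgram": 0.0059, "assemblyai": 0.01}
LLM = {"gpt-4o-mini": 0.01}  # approximate; varies with conversation length
TTS = {"deepgram": 0.015, "elevenlabs": 0.055}  # ElevenLabs value is an assumption


def cost_per_minute(stt: str, llm: str, tts: str) -> float:
    """Platform fee plus the pass-through cost of each pipeline stage."""
    return PLATFORM_FEE + STT[stt] + LLM[llm] + TTS[tts]


def call_cost(minutes: float, stt: str, llm: str, tts: str) -> float:
    """Total cost of a single call, rounded to cents."""
    return round(minutes * cost_per_minute(stt, llm, tts), 2)


if __name__ == "__main__":
    print(call_cost(5, "deepgram", "gpt-4o-mini", "deepgram"))
    print(call_cost(5, "deepgram", "gpt-4o-mini", "elevenlabs"))
```

Swapping a single provider key is enough to see how much of the bill each stage drives, which is a useful sanity check before committing to a stack.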

At scale, these numbers matter. A business handling 1,000 calls per day averaging three minutes each is looking at 90,000 minutes per month. At $0.08/min, that is $7,200/month. At $0.12/min, it is $10,800/month. The Vapi platform fee alone accounts for $4,500 of either figure. Enterprise plans with custom pricing exist for high-volume customers, but Vapi does not publish those rates.
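A short projection makes the platform fee's share of the bill explicit. This sketch assumes a 30-day month, which is what the 90,000-minute figure above implies. The function names are illustrative.

```python
PLATFORM_FEE_PER_MIN = 0.05  # USD, from the pricing table


def monthly_minutes(calls_per_day: int, avg_call_minutes: float, days: int = 30) -> float:
    """Total call minutes in a month of `days` days."""
    return calls_per_day * avg_call_minutes * days


def monthly_bill(minutes: float, all_in_rate: float) -> dict:
    """Total monthly spend and the portion that is Vapi's platform fee."""
    total = minutes * all_in_rate
    platform = minutes * PLATFORM_FEE_PER_MIN
    return {
        "total_usd": round(total, 2),
        "platform_fee_usd": round(platform, 2),
        "platform_share": round(platform / total, 3),
    }


if __name__ == "__main__":
    minutes = monthly_minutes(calls_per_day=1000, avg_call_minutes=3)
    for rate in (0.08, 0.12):
        print(rate, monthly_bill(minutes, rate))
```

At $0.08/min the platform fee is 62.5% of the total, and at $0.12/min it is still roughly 42%, so the cheaper your provider stack, the more the orchestration fee dominates.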

The free tier provides a limited number of minutes for development and testing—enough to build a proof of concept but not enough for any production traffic.

Dashboard and Developer Experience

The Vapi dashboard provides call logs with full transcripts, per-call cost breakdowns, latency metrics, and error tracking. You can test assistants directly from the browser, which accelerates the development loop compared to platforms that require you to make an actual phone call for every test.

The API is REST-based for configuration (creating assistants, provisioning numbers, managing tools) and WebSocket-based for real-time call interaction. SDKs exist for Python, Node.js, and web (React). The documentation is reasonably comprehensive, though the pace of feature releases means some newer capabilities are documented primarily through changelog entries and Discord discussions rather than polished guides.

Prompt engineering for voice agents is meaningfully different from text-based LLM applications, and Vapi's documentation acknowledges this. System prompts need to account for conversational dynamics, graceful error handling when the user says something unexpected, and explicit instructions about when to use tools versus when to respond directly. The platform provides template assistants for common use cases—appointment scheduling, FAQ handling, lead qualification—that serve as reasonable starting points.

For developers who prefer working in code, Cursor with Vapi's TypeScript SDK provides a strong workflow. Define your assistant configuration as code, version control your system prompts and tool definitions, and deploy changes through your standard CI/CD pipeline rather than clicking through a dashboard.

Latency: The Number That Matters Most

In voice AI, latency is the metric that determines whether your agent feels like a conversation or an IVR menu from 2005. Vapi targets sub-800ms voice-to-voice latency, measured from the end of the user's speech to the beginning of the agent's response audio.

Achieving that target depends entirely on your provider stack. The fastest documented configuration—Deepgram Nova-2 for STT, GPT-4o-mini for LLM, Deepgram Aura for TTS—can hit 500-700ms in optimal conditions. Replace Deepgram TTS with ElevenLabs and you add 200-400ms. Replace GPT-4o-mini with GPT-4o or Claude 4 Sonnet and you add another 100-300ms depending on response length.
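Those ranges can be combined into a rough latency budget. The sketch below simply adds the best-case and worst-case figures quoted above and compares the result with the sub-800ms target. It is a back-of-the-envelope model, not a measurement, and real stages do not always stack this linearly.

```python
from typing import Tuple

Range = Tuple[int, int]  # (best case ms, worst case ms)

BASELINE: Range = (500, 700)                # Deepgram STT + GPT-4o-mini + Deepgram TTS
ELEVENLABS_TTS_PENALTY: Range = (200, 400)  # added by swapping in ElevenLabs TTS
LARGER_LLM_PENALTY: Range = (100, 300)      # added by GPT-4o or Claude 4 Sonnet
TARGET_MS = 800


def add_ranges(*ranges: Range) -> Range:
    """Sum best cases and worst cases independently."""
    return (sum(r[0] for r in ranges), sum(r[1] for r in ranges))


def meets_target(budget: Range, target_ms: int = TARGET_MS) -> str:
    """Classify a budget against a strict 'under target' requirement."""
    low, high = budget
    if high < target_ms:
        return "always"
    if low < target_ms:
        return "sometimes"
    return "never"


if __name__ == "__main__":
    premium = add_ranges(BASELINE, ELEVENLABS_TTS_PENALTY, LARGER_LLM_PENALTY)
    print(premium, meets_target(premium))
```

Under these numbers, only the baseline stack stays below the target throughout its range. Adding ElevenLabs alone gives 700-1100ms, and adding both upgrades gives 800-1400ms, which never gets under 800ms even in the best case.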

Network conditions, geographic proximity to provider endpoints, and the complexity of the LLM's reasoning task all introduce variance. A simple FAQ response will be faster than a response that requires a function call to check database state. Vapi's streaming architecture helps minimize the impact of longer LLM responses—TTS begins generating audio as soon as the first tokens arrive—but the initial time-to-first-token from the LLM is the main bottleneck in most configurations.

For production deployments, expect to spend meaningful time optimizing your provider selection and system prompt to minimize latency. The difference between a well-optimized and poorly-optimized Vapi deployment can be 500ms or more, which is the difference between natural and noticeably awkward.

When Vapi Falls Short: Five Real Limitations

1. Cost Accumulation at Scale

The $0.05/min platform fee is competitive for low to moderate volume. At high volume, it becomes the single largest line item in your voice AI budget. A company running 100,000 minutes per month pays $5,000/month just for Vapi's orchestration layer, on top of all provider costs. At that scale, building your own orchestration with LiveKit or a custom WebSocket server starts to make financial sense—the engineering investment pays back within a few months. Vapi's value proposition is strongest for teams that need to ship fast and have fewer than 50,000 minutes per month.

2. Vendor Lock-In Through Orchestration

While Vapi lets you swap individual providers (change your TTS from ElevenLabs to Deepgram with a config change), the orchestration logic itself is proprietary and opaque. Your system prompts, tool definitions, and assistant configurations are portable in theory but tied to Vapi's API schema and behavior in practice. If you need to migrate away from Vapi, you are rebuilding the turn-taking logic, interruption handling, and tool execution pipeline from scratch. The individual providers are swappable. The glue that connects them is not.

3. Voice Quality Is Not a Vapi Feature

Vapi frequently gets praised or criticized for voice quality, but Vapi does not generate any audio. Your voice quality is entirely determined by your TTS provider choice. Teams that choose Deepgram TTS for its low latency will have functional but noticeably synthetic voices. Teams that choose ElevenLabs will have excellent voice quality but higher latency and cost. Vapi cannot improve the quality of the audio its TTS providers generate. If you need a specific voice quality standard, evaluate TTS providers independently before choosing Vapi as your orchestration layer.

4. Debugging Distributed Pipeline Failures

When a voice agent misbehaves—gives a wrong answer, stutters, drops a call, fails to execute a tool—diagnosing the root cause requires understanding which stage of the pipeline failed. Was the STT transcription inaccurate? Did the LLM hallucinate? Did a function call time out? Was TTS audio corrupted? Vapi's dashboard provides call transcripts and logs, but tracing a specific failure through the STT-LLM-TTS pipeline often requires cross-referencing multiple log sources. For complex tool-calling flows with multiple sequential function invocations, debugging can be genuinely difficult. This is not unique to Vapi—it is inherent to distributed voice pipelines—but Vapi's observability tooling does not fully solve it.
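One practical way to do that cross-referencing is to merge the logs for a single call into one timeline and look for the largest gap. The sketch below assumes each source is already sorted by timestamp and has been normalized into a simple `Event` record. The record shape and the sample events are hypothetical and are not an export format that Vapi or any provider guarantees.

```python
import heapq
from typing import Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    ts_ms: int   # milliseconds since the start of the call
    stage: str   # "stt", "llm", "tool", "tts", ...
    detail: str


def call_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Merge per-stage logs (each already sorted by time) into one timeline."""
    return list(heapq.merge(*sources, key=lambda e: e.ts_ms))


def largest_gap(timeline: List[Event]) -> Tuple[Event, Event]:
    """Return the adjacent pair of events with the longest delay between them.

    Raises ValueError if the timeline has fewer than two events.
    """
    pairs = zip(timeline, timeline[1:])
    return max(pairs, key=lambda pair: pair[1].ts_ms - pair[0].ts_ms)


if __name__ == "__main__":
    stt = [Event(0, "stt", "final transcript")]
    llm = [Event(180, "llm", "first token"), Event(950, "llm", "tool call issued")]
    tool = [Event(2900, "tool", "webhook responded")]
    tts = [Event(3100, "tts", "first audio")]
    before, after = largest_gap(call_timeline(stt, llm, tool, tts))
    print(before.stage, "->", after.stage, after.ts_ms - before.ts_ms, "ms")
```

In the sample data the longest stall sits between the LLM issuing the tool call and the webhook responding, which points the investigation at your own server, not at the STT or TTS provider.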

5. No Offline or Edge Deployment

Vapi is a cloud-hosted service. Every call routes through Vapi's servers to your configured providers. There is no on-premises deployment option, no edge deployment, and no offline capability. For use cases that require data residency guarantees, air-gapped environments, or operation without reliable internet connectivity, Vapi is not a viable option. Regulated industries (healthcare, financial services, government) may face compliance challenges with the multi-vendor data flow that Vapi's architecture requires.

The Bottom Line

Vapi is the strongest option available for developers who need to ship a production voice agent without spending months building real-time audio infrastructure. The provider flexibility is genuine—you can optimize your STT/LLM/TTS stack for your specific cost, latency, and quality requirements. The telephony integration is complete enough for most use cases. The API and SDK experience is solid.

Use Vapi if you are building a voice agent product and need to reach production in weeks rather than months, if your volume is under 50,000 minutes per month, and if cloud-hosted multi-vendor architecture is acceptable for your compliance requirements.

Look elsewhere if you are processing more than 100,000 minutes per month and the platform fee dominates your budget, if you need on-premises or edge deployment, if you require deep control over turn-taking and interruption logic that Vapi's abstractions do not expose, or if you need to avoid vendor lock-in on the orchestration layer specifically.

The voice AI infrastructure space is moving fast. Vapi's position as the orchestration layer is strong today, but the competitive landscape includes Retell AI, Bland AI, and open-source alternatives like Vocode and Pipecat that are closing the feature gap. Evaluate Vapi against your specific requirements rather than adopting it as a default. The platform fee compounds. The lock-in is real. But for the right use case, the engineering time it saves is substantial.

Disclosure: Some links in this article are affiliate links. We may earn a commission if you sign up through them. This does not influence our assessments—we cover tools based on their technical merits and documented capabilities.

FAQ

What does Vapi actually do?
Vapi orchestrates the full real-time voice AI pipeline: speech-to-text, LLM reasoning, and text-to-speech. It handles telephony infrastructure, WebSocket streaming, turn-taking, function calling, and call transfers so developers can build voice agents without managing audio infrastructure directly.
How much does Vapi cost per minute?
Vapi charges a $0.05/min platform fee on top of your provider costs. Typical all-in cost ranges from $0.08 to $0.15 per minute depending on which STT, LLM, and TTS providers you select. Enterprise plans with custom pricing exist for high-volume customers, but Vapi does not publish those rates.
What LLM providers does Vapi support?
Vapi supports OpenAI (GPT-4o, GPT-4o-mini), Anthropic (Claude 4 Sonnet, Claude 4 Haiku), Google (Gemini), and custom LLM endpoints via OpenAI-compatible APIs. You can also bring your own fine-tuned models through their custom LLM integration.
What is Vapi's typical response latency?
Vapi targets sub-800ms voice-to-voice latency. Actual latency depends on your provider stack: Deepgram STT plus a fast LLM like GPT-4o-mini plus Deepgram TTS can achieve 500-700ms. Heavier stacks with Claude or ElevenLabs typically land in the 800-1200ms range.
Can Vapi handle call transfers and voicemail?
Yes. Vapi supports warm and cold call transfers via SIP and PSTN, voicemail detection to avoid wasting agent minutes on answering machines, and DTMF tone handling. These telephony features work with both Vapi-provisioned numbers and numbers you bring from Twilio or Vonage.
What are the main alternatives to Vapi?
Key alternatives include Retell AI (similar orchestration layer, different pricing model), Bland AI (focused on outbound calling), Vocode (open-source voice pipeline), and building your own stack with LiveKit or Daily for real-time audio plus direct provider integrations. The right choice depends on whether you need managed telephony, care about vendor lock-in, or require specific provider combinations.
