Building an AI Interactive Voice Agent (IVA) with Gemini Multimodal Live & Twilio

· 14 min read
Christopher Brox
Building AI Agents @ Google

How to build an AI interactive voice agent (IVA) with Gemini Multimodal Live & Twilio.

Disclaimers:

  • At the time of this writing, I am employed by Google Cloud. However the thoughts expressed here are my own and do not represent my employer.
  • The code provided here is sample code for educational purposes only. Please write your own production code.

Introduction

The boundaries of what's possible with AI are being pushed every day, and 2025 is shaping up to be a groundbreaking year for AI agents. AI agents are poised to revolutionize industries by automating tasks, improving decision-making, and creating entirely new possibilities.

Voice AI agents, in particular, offer a natural and intuitive interface for users to engage with technology. This hands-free, conversational approach is not only convenient but also enhances accessibility for users with disabilities or those who prefer auditory interactions. The ability to personalize responses, recognize individual users, and provide quick, efficient support makes voice AI agents a powerful tool for businesses and individuals alike. For example:

  • Customer service: Tired of waiting on hold? AI agents can provide instant support, answer frequently asked questions, and resolve simple issues 24/7. This frees up human agents to focus on more complex problems and enhances customer satisfaction with quicker response times.
  • Sales and marketing: AI agents can analyze customer data to personalize product recommendations, generate targeted content, and automate lead nurturing. This personalized approach can lead to increased conversion rates and build stronger customer relationships.
  • Healthcare: From assisting with diagnoses to monitoring patients remotely, AI agents can improve patient outcomes and reduce healthcare costs. Imagine an AI agent that reminds you to take your medication, schedules appointments, and even provides personalized health advice.

Google’s Gemini 2.0 was recently released, and with it the Multimodal Live API, which enables low-latency, two-way interactions using text, audio, and video input, with audio and text output. By integrating Gemini with Twilio, we can build a new generation of AI-powered Interactive Voice Assistants (IVAs) that are low-latency, intelligent, and more capable than ever before.

What We’ll Build

In this tutorial, we will build a server application using Quart that will act as a bridge between Twilio Voice Calls and Google Gemini AI. This application will allow us to create a powerful AI-powered Interactive Voice Assistant (IVA) capable of engaging in natural, dynamic conversations with users over the phone.

  1. Receive Incoming Calls: Our Quart application will be configured to receive incoming calls from Twilio.
  2. Streaming Audio to Gemini: We'll capture the caller's voice and stream it in real-time to Gemini's Multimodal Live API.
  3. AI-Powered Understanding and Response: Gemini will process the caller's audio and generate a spoken audio response.
  4. Sending Audio Back to the Caller: Finally, we'll send the generated audio back to the Twilio call, allowing the IVA to respond to the caller in real-time.

By the end of this tutorial, you'll have a working IVA that can respond to spoken requests and provide information and assistance to callers over the phone. This foundation can be expanded to create sophisticated IVAs for various use cases, including customer support, appointment scheduling, interactive surveys, and much more.

Prerequisites

Before we begin, ensure you have the following:

  • Google Cloud Platform Account: If you don't have a Google Cloud account you can sign up for free.
  • Twilio Account: We will be using Twilio for our phone calls. Sign up for a free Twilio account (or use an existing one).
  • Python 3.7–3.12: Make sure you have a compatible version of Python installed on your system. (The standard-library audioop module used below was deprecated in Python 3.11 and removed in 3.13.)
  • ngrok: This handy tool will allow us to expose our local Quart server to the internet so Twilio can communicate with it. Download and install it from the official ngrok website.
  • Code Editor: Choose your favorite code editor or IDE for working with the Python code (Google Cloud Shell Editor, VS Code, etc.).

Setting Up the Server

First we’ll need a web server. This will serve as the bridge between our Gemini AI agent and the phone call. We will use Quart, a Python web microframework that excels at asynchronous Python. Its primary function is to receive audio data from a Twilio phone call, send it to Gemini to create a response, and then send Gemini’s response back to the Twilio call. This allows for real-time audio interaction powered by Google Gemini's AI capabilities.

Let's break down the provided Python code step-by-step to understand how this integration works:

Install the dependencies

First we’ll need to install the required Python packages. This simple app only requires two dependencies:

  • google-genai: This library allows us to interact with Google's Gemini API.
  • Quart: This is a web framework used to build the server that will handle communication between Twilio and Gemini.

To install the packages, run the following command in your terminal:

pip install google-genai Quart

Websocket Handler (gemini_twilio.py)

Next we need a handler to connect Twilio with Gemini. Create a file called gemini_twilio.py and add the following code:

# gemini_twilio.py

from google import genai
from quart import websocket
import json
import base64
import audioop

class GeminiTwilio:
    def __init__(self):
        # Initialize the Google Cloud GenAI SDK: https://googleapis.github.io/python-genai/#create-a-client
        self.client = genai.Client(vertexai=True, project='<GCP-PROJECT-NAME>', location='us-central1')

        # Set the model_id and config. System instructions and functions can be added to the config for more complex use cases
        self.model_id = "gemini-2.0-flash-exp"
        self.config = {"response_modalities": ["AUDIO"]}

        # This will hold the StreamSid sent from Twilio
        self.stream_sid = None

    async def twilio_audio_stream(self):
        '''
        Async generator to handle the incoming Twilio media stream: https://www.twilio.com/docs/voice/media-streams
        Messages come in as JSON strings; review the "event" key and handle the events: https://www.twilio.com/docs/voice/media-streams/websocket-messages

        Start Event - extract the StreamSid and set self.stream_sid
        Media Event - convert from mu-law to PCM audio and yield the resulting data
        Stop Event - print that the stream stopped
        '''
        while True:
            message = await websocket.receive()
            data = json.loads(message)
            if data['event'] == 'start':
                self.stream_sid = data['start']['streamSid']
                print(f"Stream started - {self.stream_sid}")
            elif data['event'] == 'media':
                audio_data = data['media']['payload']  # Base64 encoded audio
                decoded_audio = base64.b64decode(audio_data)  # Decode the audio
                pcm_audio = audioop.ulaw2lin(decoded_audio, 2)  # Convert to PCM
                yield pcm_audio
            elif data['event'] == 'stop':
                print("Stream stopped")

    def convert_audio_to_mulaw(self, audio_data: bytes) -> str:
        '''
        Converts audio bytes to mu-law and returns a base64 string
        Args:
            audio_data: (bytes) - the raw PCM audio data
        '''
        data, _ = audioop.ratecv(audio_data, 2, 1, 24000, 8000, None)  # Downsample from a 24000 Hz sample rate to 8000 Hz
        mulaw_audio = audioop.lin2ulaw(data, 2)  # Convert to mu-law
        encoded_audio = base64.b64encode(mulaw_audio).decode('utf-8')  # Convert to a base64 encoded string
        return encoded_audio

    async def gemini_websocket(self):
        '''
        Establishes a session (genai.types.AsyncSession) and starts a stream to process incoming audio and handle responses from Gemini
        '''
        print("New websocket connection established")
        async with self.client.aio.live.connect(model=self.model_id, config=self.config) as session:
            try:
                async for response in session.start_stream(stream=self.twilio_audio_stream(), mime_type='audio/pcm'):
                    if data := response.data:
                        message = {
                            "event": "media",
                            "streamSid": self.stream_sid,
                            "media": {
                                "payload": self.convert_audio_to_mulaw(data)
                            }
                        }
                        await websocket.send(json.dumps(message))
            except Exception as e:
                print(f'Unexpected error in gemini_websocket: {e}')
            finally:
                print('Closing session')
                await websocket.close(code=1000)
                await session.close()

You'll need to replace <GCP-PROJECT-NAME> with your Google Cloud project ID.

The GeminiTwilio Class: This class forms the heart of the server's logic. Let's break down what each method in this class does:

  • __init__: This is the constructor of the GeminiTwilio class. It initializes the stream_sid to None. The stream_sid is used to identify a specific Twilio stream. It will be set to an actual value when a new stream is initiated by a call to the Twilio number. We use None as the initial value because no stream exists when the GeminiTwilio object is first created.

    • genai.Client: This code initializes the genai.Client with your project credentials. In the code example, we’re using Vertex AI but if you are using Google AI for Developers, you can initialize the Client with an API key.

    • The model_id is set to "gemini-2.0-flash-exp", which is a powerful and versatile model. You can explore other available Gemini models in the Google Cloud documentation if needed. Finally, the config is set to only receive audio responses from Gemini.

  • twilio_audio_stream: This asynchronous generator function manages the incoming audio from Twilio. It continuously receives WebSocket messages containing audio data. The function then parses the JSON data from these messages to extract the audio payload. The audio is sent from Twilio in Base64 format, so the code decodes it. Finally, the function converts the audio from the mu-law (u-law) format used by Twilio to Pulse Code Modulation (PCM) format using audioop.ulaw2lin. This PCM audio data is yielded by the function piece by piece, making it ready to be streamed to Google Gemini for processing.
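To make the dispatch logic concrete, here is a standalone sketch of how those three event types are handled, using abbreviated sample messages (real Twilio messages carry additional fields such as sequenceNumber and accountSid, and the streamSid here is a made-up placeholder):

```python
import base64
import json

# Abbreviated sample messages modeled on Twilio's media-stream format
start_msg = '{"event": "start", "start": {"streamSid": "MZXXXX"}}'
media_msg = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\x7f\x7f\x7f\x7f").decode("utf-8")},
})
stop_msg = '{"event": "stop"}'

def handle(message: str):
    """Dispatch on the 'event' key, mirroring twilio_audio_stream."""
    data = json.loads(message)
    if data["event"] == "start":
        return ("start", data["start"]["streamSid"])
    if data["event"] == "media":
        # Base64-decode the mu-law payload; the real handler then
        # converts it to PCM with audioop.ulaw2lin before yielding it.
        return ("media", base64.b64decode(data["media"]["payload"]))
    return (data["event"], None)

print(handle(start_msg))  # ('start', 'MZXXXX')
print(handle(media_msg))  # ('media', b'\x7f\x7f\x7f\x7f')
print(handle(stop_msg))   # ('stop', None)
```

The mu-law decoding step is omitted here so the sketch stays dependency-free; in the app it is a single audioop.ulaw2lin call on the decoded bytes.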

  • convert_audio_to_mulaw: Twilio expects audio responses to be sent back in mu-law format. This function handles converting the audio from PCM (the format used internally) to the mu-law format and then encodes it using Base64 for sending it back to Twilio.

  • gemini_websocket: This asynchronous function handles the WebSocket connection to the Google Gemini API. It starts by establishing a connection to Gemini using the genai.Client that we initialized earlier. It then uses the twilio_audio_stream function to receive the audio data from the Twilio call. This audio data is then streamed to Gemini for processing. The gemini_websocket function then receives the processed audio (Gemini's response) and uses convert_audio_to_mulaw to convert the response to the correct format for Twilio. Finally, the function sends the encoded mu-law audio back to Twilio over the WebSocket, completing the loop.

Quart app (app.py)

Now we need to create a Quart app to run our code. Create an app.py file and copy the code below:

# app.py

from quart import Quart, websocket
from gemini_twilio import GeminiTwilio

app = Quart(__name__)

@app.websocket('/gemini')
async def talk_to_gemini():
    await GeminiTwilio().gemini_websocket()

if __name__ == "__main__":
    app.run(host='localhost', port=8080)

This code creates a Quart app and runs it on port 8080. It also creates a /gemini route where we can send websocket traffic. When a WebSocket connection is established at this route, an instance of our GeminiTwilio class runs the gemini_websocket method which initiates the communication process between Twilio, the server, and Gemini.

Running the Server

Now that we've gone through the code, let's see how to get our Quart server up and running.

  1. Start the Server: In the directory where you’ve added your Python files (app.py and gemini_twilio.py), run the following command to start the Quart server:
python app.py
  2. Using ngrok: Since our Quart server is running locally, we'll use ngrok to create a secure tunnel that makes it accessible from the internet. This allows Twilio to communicate with our server. In a new terminal window, run the following command (assuming your server is running on port 8080):
ngrok http 8080

Copy the HTTPS forwarding URL that ngrok provides. We will need this for our Twilio configuration.
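When you later paste this URL into Twilio, you'll swap the https:// scheme for wss:// and append the /gemini route. A quick sketch of that transformation (the ngrok hostname here is a made-up placeholder; yours will differ):

```python
def stream_url(ngrok_https_url: str, route: str = "/gemini") -> str:
    """Turn an ngrok HTTPS forwarding URL into the wss:// Stream URL for TwiML."""
    return ngrok_https_url.replace("https://", "wss://", 1).rstrip("/") + route

print(stream_url("https://abc123.ngrok.io"))  # wss://abc123.ngrok.io/gemini
```

The wss:// scheme is required because Twilio's Stream verb opens a WebSocket connection, not a plain HTTPS request.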

Setting Up Twilio

Now, we'll configure Twilio to handle incoming calls and route them to our Quart server.

Configuring the TwiML Bin

Let's start by setting up a TwiML Bin in your Twilio account. This bin will hold the TwiML instructions, which are written in XML, that will control how Twilio handles incoming phone calls. Our TwiML instructions will direct Twilio to connect incoming calls to the Quart server we set up.

Steps for TwiML Bin Setup

  1. Log in to Twilio: Log in to your Twilio account.

  2. Navigate to TwiML Bins: On the left-hand side of your Twilio dashboard, click on Explore Products and then select TwiML Bins (you can also search for TwiML Bins using the search bar).

  3. Create a New Bin: Click the blue Create new TwiML Bin button to create a new TwiML Bin.

  4. Name the Bin: Give the bin a descriptive name, such as "Gemini Integration" or "AI Voice Assistant."

  5. Paste the TwiML Code: Paste the following TwiML code snippet into the TwiML Bin editor:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://YOUR_NGROK_URL/gemini" />
  </Connect>
</Response>

  6. Replace the Placeholder: In the TwiML code, replace YOUR_NGROK_URL with the HTTPS forwarding URL you copied from ngrok, swapping the https:// scheme for wss://.

  7. Save the Bin: Click the Save button to save your TwiML Bin.

Purchasing a Twilio Phone Number

Next, you'll need to purchase a Twilio phone number. This number will be the entry point for users to initiate calls that will be processed by Google Gemini. Keep in mind that purchasing a phone number may incur a small cost, depending on Twilio's pricing.

Steps to Buy a Number

  1. Navigate to Phone Numbers: In your Twilio console, click on Phone Numbers on the left-hand side of the page.

  2. Search for Available Numbers: You can search for available numbers by country, area code, and capabilities (voice, SMS, MMS, and Fax). Select a number that meets your needs.

  3. Purchase a Number: Once you've found a number you like, click the Buy button. You may be prompted to provide billing information.

Connecting the Phone Number to the TwiML Bin

Now, let's connect your new Twilio phone number to the TwiML Bin we configured earlier. This connection ensures that when someone calls this number, Twilio will follow the instructions in our TwiML Bin, which will route the audio to our Quart server for processing by Google Gemini.

Steps to Connect Number and Bin

  1. Access Phone Number Configuration: Go to the Active Numbers page and click on the phone number you just purchased to access its configuration settings.
  2. Configure Voice Settings: Scroll down to the Voice section.
  3. Select 'TwiML Bin': Under "A CALL COMES IN," select the "TwiML Bin" option. This allows us to specify a URL that Twilio will send a request to when a call is received.
  4. Select The TwiML Bin: Select the name of the TwiML Bin you created.
  5. Save the Configuration: Click the Save button at the bottom of the page to save your configuration.

Calling Your Agent

Now for the exciting part - let's test our setup to see if everything is working as expected!

Steps to Test the Connection:

  1. Start Your Servers: Make sure your Quart server is running and that you have an active ngrok tunnel pointing to it.
  2. Call Your Twilio Number: Use your phone to call the Twilio phone number you purchased and configured.
  3. Observe the Logs: Keep an eye on your Quart server's console output. You should see messages indicating that Twilio is connecting and sending audio data to your server.
  4. Listen for Gemini's Response: If everything is set up correctly, you should hear Google Gemini's response to your spoken input playing back through the phone call. The exact response will depend on how you've configured your Gemini model and what you say during the call.

Conclusion

🎉 Congratulations, you've now successfully built an AI interactive voice agent (IVA)! You have a Quart server acting as a bridge, receiving audio from Twilio, sending it to Gemini for processing, and then routing Gemini's response back to the caller. This setup provides a solid base upon which you can build a wide variety of AI-powered voice applications.

Next Steps and Use Cases

With this foundation in place, you're ready to explore a world of possibilities. Here are some ideas for projects you can build upon this integration:

  • Interactive Voice Assistants: Create voice-activated assistants that can answer questions, schedule appointments, set reminders, and interact with other services.
  • Advanced IVR Systems: Design dynamic and engaging Interactive Voice Response (IVR) systems that guide users through menus and options using natural language, enhancing the user experience.
  • Real-Time Call Analysis: Analyze the content of phone calls in real-time for sentiment analysis, keyword detection, or other insights. This could be incredibly valuable for customer service, sales calls, and more.
  • Personalized Audio Content: Deliver personalized audio experiences tailored to individual user preferences. Imagine a system that reads out customized news feeds, plays music based on user mood, or provides location-based information.
  • Deploy to Cloud Run: We've run this code locally using ngrok but you can deploy your Quart application to Google Cloud Run, a serverless platform that allows us to run our application in a scalable and cost-effective manner.