
Voice AI systems based on Jambonz: from telephony to human-machine dialogue

From Popov's experiments with radio communication to AI calls — how telephony became the foundation of smart communications


Telephony has long ceased to be just voice communication. Today it has become a digital infrastructure where calls can be analyzed, managed programmatically, and connected to artificial intelligence. This turns a regular conversation into a dialogue between a person and a system — alive, personalized, and data-driven.

I graduated from the Odesa National Academy of Telecommunications named after A. S. Popov, where I studied signal physics, frequencies, and switching principles. Back then, it seemed like pure engineering — calculations, schematics, and laboratory instruments. But today, the same principles underlie Voice AI systems, where voice turns into a stream of data, and the logic of communication — into code.

In the photo: the building of the Odesa National Academy of Telecommunications

My name is Mykhailo Kapustin, I am the Chief Technical Officer (CTO) and co-founder of the transatlantic holding Advanced Scientific Research Projects. We conduct applied research in the field of artificial intelligence, neurointerfaces (BCI), and cognitive technologies, including dream analysis and modeling human perception.

Today we are approaching the era of neurointerfaces (BCI), where communication can occur directly, from mind to mind.
I believe that in the future there will be systems of digital telepathy that transmit thoughts directly into the mind of the interlocutor, or that humans will evolve this ability themselves.

But the essence of communication will not change: it will still be based on the transmission of meanings and routing of signals between minds.
Telephony already performs this role today — only instead of neural impulses, it works with voice, protocols, and code.

Alexander Stepanovich Popov, the inventor of radio and one of the founders of modern communication. In the background: the diagram and the first radio receiver, demonstrated on May 7, 1895.

Before moving on to the practical part and showing how to build an AI voice agent, let's look at what telephony consists of: what PSTN, SIP, and SIP trunks are, and why Jambonz has become a key link between traditional telephony and artificial intelligence.

Caller (PSTN / SIP): where the voice starts

Any voice call starts with the caller, the party that initiates the call. This can be a mobile phone, a landline, or a VoIP client: a software application that transmits voice over the internet (for example, Zoom, Microsoft Teams, Telegram, or specialized corporate software).

Next, everything depends on which network the signal passes through:

  • PSTN (Public Switched Telephone Network) — the classic telephone network through which regular calls are made. It provides a stable connection but is not suitable for direct integration with digital systems.

  • SIP (Session Initiation Protocol) — a protocol that converts voice into digital format and allows it to be transmitted over the internet (Voice over IP). It manages the establishment, routing, and termination of calls.
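To get a feel for what SIP looks like on the wire, here is a simplified INVITE request, the message that opens a call session (addresses are placeholders and headers are trimmed; the pattern follows RFC 3261):

INVITE sip:bob@example.com SIP/2.0
Via: SIP/2.0/UDP client.example.org;branch=z9hG4bK776asdhds
From: <sip:alice@example.org>;tag=1928301774
To: <sip:bob@example.com>
Call-ID: a84b4c76e66710@client.example.org
CSeq: 314159 INVITE
Contact: <sip:alice@client.example.org>
Content-Type: application/sdp
Content-Length: 142

The SDP body (omitted here) describes the audio streams; the voice itself then travels separately over RTP.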

To connect these two worlds, the analog (PSTN) and the digital (SIP), a SIP trunk is used. This is a virtual channel through which calls from the telephone network enter the internet system, for example, into Jambonz. The SIP trunk acts as an interface between the telecom operator and your application: it receives the call and converts it into a SIP session, after which control passes to Jambonz, where the processing logic is already defined programmatically.
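For illustration, here is roughly the set of parameters a SIP trunk configuration holds, whatever the provider (the field names in this sketch are hypothetical, not Jambonz's actual API):

// Illustrative sketch: typical parameters a carrier issues for a SIP trunk
const sipTrunk = {
  name: 'my-carrier',                    // label for this trunk
  sipGateway: 'sip.example-carrier.com', // operator's SIP signaling endpoint
  port: 5060,                            // standard SIP port
  protocol: 'udp',                       // transport for SIP messages
  username: 'trunk-user',                // credentials issued by the carrier
  password: 'secret'
};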

| Stage | What happens | Technology / Protocol | Result |
| --- | --- | --- | --- |
| 1. Caller | The user initiates a call from a phone or VoIP client. | PSTN or SIP | The voice signal is sent to the network. |
| 2. SIP trunk | The call comes through a virtual channel from the operator into the digital environment. | SIP | The call is converted into a SIP session. |
| 3. Jambonz | Receives the call, creates a session, and sends a webhook to your application. | SIP / HTTP | The call event is sent to the backend service. |
| 4. Your backend | Processes the event and returns JSON with instructions (verbs). | HTTP / WebSocket | The logic of the conversation is defined. |
| 5. Jambonz (execution) | Executes verbs: speech synthesis, audio collection, routing. | STT / TTS / RTP | The user hears the AI's response. |

This sequence reflects the entire path of the voice call — from the call initiator to the processing and response from the AI. Below is a diagram showing how these components connect with each other at the protocol and data stream levels:

Figure 1. The voice call stream from the subscriber to the AI response via SIP, Jambonz, and the backend system.

What is Jambonz and why is it needed

After the call passes through the PSTN, SIP, and the SIP trunk, control passes to Jambonz, a platform that makes phone calls programmatically controllable.
Jambonz is used to create Voice AI applications, where calls are processed at the code level: they are received, routed, converted to text, voiced through TTS, and connected to external AI models.

The platform is open and flexible: it supports any SIP trunks, STT/TTS services, and LLM providers, allowing you to build your own voice pipelines.
Interaction between Jambonz and the application occurs via webhooks or WebSocket connections. The system sends call events, and the application responds with JSON instructions: a sequence of commands (verbs) that defines the conversation logic.
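For example, in response to a new-call event, the application might return a scenario as simple as this (a minimal sketch in the spirit of the code examples later in this article):

[
  { "verb": "say", "text": "Hello! How can I help you?" },
  { "verb": "hangup" }
]

Jambonz executes the verbs in order: it speaks the text through TTS, then ends the call.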

When I first started studying telephony, much of it resembled working with analog measuring instruments. In laboratories, we recorded signals using millivoltmeters, analyzed distorted waveforms, and drew switching diagrams on large sheets. Back then, cloud APIs did not exist, but the principles were the same: measure, transmit, process, interpret. Today, the same tasks are solved digitally — only instead of cables and oscilloscopes, we work with SIP streams, webhooks, and LLM models.

Photo: laboratory millivoltmeter B3-38A, Odesa Academy of Communications

Photo: drawings of circuit diagrams — a traditional practice of engineering signal analysis

And yet the principles remain the same — only now instead of analog circuits we manage data streams. Jambonz allows describing these processes not at the schematic level, but using declarative commands called verbs. They determine the behavior of the voice call: when to answer, what to say, and how to connect AI to the conversation.

Key commands (verbs) in Jambonz

Each verb describes a specific action: answer the call, play speech, recognize voice, connect participants, or end the conversation. By combining verbs in a JSON scenario, the developer shapes the call behavior from start to finish.

| Category | Examples of verbs | Purpose |
| --- | --- | --- |
| Call management | answer, hangup, redirect, pause | Receiving, ending, or redirecting a call. |
| Working with speech and audio | say, play, gather, listen, transcribe | Speech synthesis, sound playback, collection and recognition of speech. |
| Integration with AI and NLP | llm, dialogflow, rasa | Connecting external AI models and dialogue platforms. |
| Managing connections | dial, conference, enqueue, dequeue, leave | Connecting participants, organizing conferences and queues. |
| Working with SIP | sip:decline, sip:refer, sip:request | Managing low-level SIP sessions. |
| Service commands | tag, config, alert | Adding tags, configuring parameters, notifications. |

This set of commands allows describing voice logic completely declaratively — from simple IVR to complex AI dialogues.
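As a small sketch of this declarative style, here is how two verbs combine into a basic IVR menu (the prompt and webhook path are illustrative):

[
  { "verb": "say", "text": "Press 1 for support, press 2 for sales." },
  { "verb": "gather", "input": ["digits"], "numDigits": 1, "actionHook": "/webhooks/menu" }
]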

Creating an application with AI and telephony

Now that we have figured out how the telephony architecture works and what role Jambonz plays in it, we can move on to the most interesting part: practical integration of AI into voice calls. Below is a basic scenario on which most Voice AI applications are built.

1. Incoming call and event processing

When a user makes a call, Jambonz receives it via SIP and calls your backend, sending a webhook with call information: the number, time, session identifier (call_sid), and call direction. The backend responds with a JSON scenario indicating what to do next. The simplest example is to greet the user and start listening for speech:


const express = require('express');
const app = express();
app.use(express.json()); // parse Jambonz's JSON webhook bodies

// Jambonz POSTs here on a new call; respond with verbs to execute
app.post('/webhooks/call', (req, res) => {
  res.json([
    { verb: 'say', text: 'Hello. Describe your dream.' },
    { verb: 'gather', input: ['speech'], actionHook: '/webhooks/gather' }
  ]);
});
app.listen(3000);
  • say — synthesizes speech via TTS;

  • gather — turns on speech recognition (STT) and waits for the user's response.
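The gather verb also accepts tuning options; a hedged sketch (exact parameter support depends on the Jambonz version and the STT vendor):

{
  verb: 'gather',
  input: ['speech'],
  actionHook: '/webhooks/gather',
  timeout: 10,                        // seconds to wait for user input
  recognizer: { language: 'en-US' }   // speech recognition settings
}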

2. Speech processing and AI connection

When the user speaks, Jambonz sends the recognized text to your backend via the actionHook. You can then use any LLM (e.g., GPT or Gemini) to generate a response:

app.post('/webhooks/gather', async (req, res) => {
  // Jambonz delivers recognition results in speech.alternatives
  const userInput = req.body.speech.alternatives[0].transcript;
  const reply = await ai.generate(userInput); // call any LLM API (sketch below)
  res.json([
    { verb: 'say', text: reply },
    { verb: 'gather', input: ['speech'], actionHook: '/webhooks/gather' }
  ]);
});

Thus, a dialogue cycle is created, where Jambonz responds with voice and the AI manages the content of the conversation.
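The ai.generate helper above is a placeholder. A minimal sketch of it, assuming the official OpenAI Node.js SDK (any other LLM provider would work the same way):

const OpenAI = require('openai');
const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const ai = {
  async generate(userInput) {
    const completion = await client.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: 'You are a voice assistant. Keep answers short: they will be read aloud.' },
        { role: 'user', content: userInput }
      ]
    });
    return completion.choices[0].message.content;
  }
};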

3. Call termination

When the conversation is over, the application can return final commands:

res.json([
  { verb: 'say', text: 'Thank you for the conversation. Goodbye.' },
  { verb: 'hangup' }
]);

4. Components and Protocols

  • SIP / RTP — carry call signaling and audio;

  • HTTP / WebSocket — provide communication between Jambonz and the application;

  • STT / TTS — convert speech to text and back;

  • LLM API — handles the semantic part of the dialogue.

Thanks to this architecture, all the behavior of the call can be described in code, and the logic of communication can be adapted to specific tasks: consultations, customer support, interviews, research sessions.

Conclusion

Today we manage calls just as engineers once managed radio signals; only now the circuit diagram lives in a JSON file, and instead of an antenna, an LLM does the work.

There is something symbolic in this: every communication technology, from radio to neural interfaces, solves the same problem of being heard and being understood.
Jambonz shows how old principles can be translated into a new context, where voice becomes code and code becomes part of a live conversation.
