WP - Powering voice agents using IBM Watson

WHITEPAPER

www.persistent.com

Voice agent pool:

A pool of voice agents waiting to engage with users.
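One way to picture such a pool is as a queue of idle agents: an incoming call takes an agent from the queue and returns it when the call ends. The following is a minimal sketch under that assumption. `VoiceAgent`, `VoiceAgentPool`, `acquire` and `release` are hypothetical names chosen for illustration and are not part of any SIP library or Watson SDK.

```java
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical stand-in for a SIP endpoint; a real agent would wrap a SIP stack. */
class VoiceAgent {
    final String id;

    VoiceAgent(String id) {
        this.id = id;
    }
}

/** Fixed-size pool of idle voice agents waiting to engage with callers. */
class VoiceAgentPool {
    private final BlockingQueue<VoiceAgent> idle;

    VoiceAgentPool(int size) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 1; i <= size; i++) {
            idle.add(new VoiceAgent("agent-" + i));
        }
    }

    /** Takes an idle agent without blocking; empty if every agent is on a call. */
    Optional<VoiceAgent> acquire() {
        return Optional.ofNullable(idle.poll());
    }

    /** Returns an agent to the pool once its call has ended. */
    void release(VoiceAgent agent) {
        idle.offer(agent);
    }

    /** Number of agents currently waiting for a call. */
    int available() {
        return idle.size();
    }
}
```

When `acquire()` returns an empty result, the caller of this sketch decides what happens next, for example leaving the call queued at the PBX until an agent is released.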

Voice agent:

A voice agent is a SIP endpoint built on top of existing SIP libraries such as JSIP or peers-lib. It comprises an API manager, which orchestrates API calls between multiple services offered by IBM Watson, and a dialog assistant, which helps tailor conversation dialogs to the organization's needs.

Orchestrator:

The Orchestrator choreographs multiple Watson APIs, such as speech to text and text to speech.
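The chaining performed by the Orchestrator can be sketched as three pluggable functions applied in sequence for each conversational turn. This is an illustrative sketch only: the class shape, the `handleTurn` method and the use of `java.util.function.Function` are assumptions, the conversation step is taken from the workflow described in this paper, and no real Watson SDK calls are shown.

```java
import java.util.function.Function;

/** Chains the three services for one conversational turn (hypothetical design). */
class Orchestrator {
    private final Function<byte[], String> speechToText; // caller audio -> transcript
    private final Function<String, String> conversation; // transcript -> reply text
    private final Function<String, byte[]> textToSpeech; // reply text -> reply audio

    Orchestrator(Function<byte[], String> speechToText,
                 Function<String, String> conversation,
                 Function<String, byte[]> textToSpeech) {
        this.speechToText = speechToText;
        this.conversation = conversation;
        this.textToSpeech = textToSpeech;
    }

    /** Runs one turn: transcribe the caller, fetch the reply, synthesize it. */
    byte[] handleTurn(byte[] callerAudio) {
        String transcript = speechToText.apply(callerAudio);
        String reply = conversation.apply(transcript);
        return textToSpeech.apply(reply);
    }
}
```

Passing the services in as functions keeps the Orchestrator independent of any one vendor, so each stage can be stubbed in tests or swapped for another provider.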

Dialog Assistant:

Interfaces with the Watson Conversation API and combines its responses with business-specific intelligence and data before passing the response back to the API manager.

For example, the Watson Conversation API may be configured to return the following response:

Your next dental checkup is scheduled on {next_schedule_date}

Here, the dialog assistant can fetch the scheduled date from a database and embed it in the dialogue.
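The substitution step might look like the sketch below, which replaces `{name}` placeholders in a response template with values looked up elsewhere. The `TemplateFiller` class and its `fill` method are hypothetical, and the `Map` stands in for whatever database or service supplies the values.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Fills {placeholder} tokens in a response template (hypothetical helper). */
class TemplateFiller {
    // Matches tokens such as {next_schedule_date}; group 1 is the name inside the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    static String fill(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            // Unknown placeholders are left untouched rather than dropped.
            String replacement = (value != null) ? value : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Calling `fill` with the template above and a map containing an entry for `next_schedule_date` produces the sentence with the date embedded. A template whose placeholder has no matching entry is returned unchanged, which makes missing data easy to spot during testing.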

Workflow:

1. Customer places a call to a PSTN number for a service request or assistance.

2. The PSTN number is configured with a SIP trunk, which forwards the audio streams from the traditional phone system. This is dictated by the origination scheme of your SIP trunk setup.

3. The SIP trunk connects to an organization-wide IP-PBX, which then forwards SIP traffic to endpoints (voice agents).

4. A voice agent (SIP endpoint) forwards the audio stream to the Watson speech-to-text service over a WebSocket connection and receives text transcriptions in real time.

5. When a pause in speech is detected, the transcriptions are sent to Watson conversation service to fetch

the next dialogue.

6. The dialogue is parsed and processed further according to business requirements. The dialog assistant has access to dialogue transcriptions, intents, entities, confidence scores and alternative transcriptions. Third-party APIs and database calls can also be incorporated at this stage to further enrich dialogues.

7. Once the final dialogue is composed, the text is sent to the Watson text-to-speech instance, and the resulting audio stream is received and forwarded to the SIP endpoint.
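The pause handling in step 5 can be illustrated with a small buffer that collects transcript fragments and releases them as one utterance once no new fragment has arrived for a given interval. This is a sketch under stated assumptions: `UtteranceBuffer`, its methods and the timestamp-based silence check are hypothetical, and a real deployment might instead rely on whatever end-of-utterance signal the speech-to-text service provides.

```java
import java.util.Optional;

/** Collects transcript fragments and releases them as one utterance after a pause. */
class UtteranceBuffer {
    private final long pauseMillis;
    private final StringBuilder pending = new StringBuilder();
    private long lastFragmentAt;

    UtteranceBuffer(long pauseMillis) {
        this.pauseMillis = pauseMillis;
    }

    /** Records a transcript fragment together with the time it arrived. */
    void onFragment(String text, long nowMillis) {
        if (pending.length() > 0) {
            pending.append(' ');
        }
        pending.append(text.trim());
        lastFragmentAt = nowMillis;
    }

    /** Returns the buffered utterance once the caller has been silent long enough. */
    Optional<String> pollUtterance(long nowMillis) {
        if (pending.length() == 0 || nowMillis - lastFragmentAt < pauseMillis) {
            return Optional.empty();
        }
        String utterance = pending.toString();
        pending.setLength(0); // start fresh for the caller's next utterance
        return Optional.of(utterance);
    }
}
```

The agent would call `pollUtterance` periodically and forward any returned text to the conversation service. Timestamps are passed in explicitly so that the pause threshold can be tuned and tested without waiting on a real clock.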

Advantages:

1. Built on top of existing SIP stack and libraries

2. Integrates closely with the organization's own IP-PBX, such as Asterisk or 3CX, hosted on premises or in the cloud. Most organizations already have some form of organization-wide PBX set up.

3. Scale up or scale down infrastructure and number of voice-agents at any time

4. Multi-tenancy by connecting one or more direct inward dialing (DID) numbers, as defined in Asterisk configuration files, to voice agents, or by having a pool of voice agents waiting to engage with customers

5. Flexible plug-in architecture: Watson Speech to Text and Watson Text to Speech can be replaced with other services as per business requirements. Voice agents are designed to be modular: one can extend and override voice agents to use other services for speech to text or text to speech. For example, by overriding the method void speak(String text), one can incorporate another TTS service or SDK such as FreeTTS or espeak.
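The plug-in point described in the last item might look like the following sketch, in which a subclass overrides `speak(String text)` to route synthesis through a different engine. Only the `speak` signature comes from this paper. `WatsonVoiceAgent`, `LocalTts`, `LocalTtsVoiceAgent` and the `outboundAudio` list are hypothetical, and the default implementation is a placeholder rather than a real Watson call.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical default agent whose speak() would use Watson Text to Speech. */
class WatsonVoiceAgent {
    /** Audio queued for the SIP call; a real agent would stream it to the caller. */
    protected final List<byte[]> outboundAudio = new ArrayList<>();

    void speak(String text) {
        // Placeholder: the default agent would send the text to Watson Text to Speech here.
        outboundAudio.add(("watson:" + text).getBytes(StandardCharsets.UTF_8));
    }
}

/** Hypothetical adapter around another engine, for example FreeTTS or espeak. */
interface LocalTts {
    byte[] synthesize(String text);
}

/** Agent that overrides speak() to use the alternative engine instead. */
class LocalTtsVoiceAgent extends WatsonVoiceAgent {
    private final LocalTts engine;

    LocalTtsVoiceAgent(LocalTts engine) {
        this.engine = engine;
    }

    @Override
    void speak(String text) {
        outboundAudio.add(engine.synthesize(text));
    }
}
```

Because the rest of the agent only calls `speak`, swapping the engine does not require changes to SIP handling or dialog logic. The actual FreeTTS or espeak integration would live behind the `LocalTts` adapter.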