You talkin' to me?

Australian Accessibility Conference, Online December 2020

Léonie Watson, TetraLogical

Voice XML (VXML)

VXML 1.0 W3C Note (2000)
VXML 2.0 W3C Recommendation (2004)
VXML 2.1 W3C Recommendation (2007)
VXML 3.0 W3C Working Draft (2010)

Voice browsers

VXML is parsed by voice browsers to create voice interfaces

The `<prompt>` element

<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
 <form>
  <block>
   <prompt bargein="false">Hello and welcome to hell!</prompt>
  </block>
 </form>
</vxml>

The `<menu>` element

<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
 <menu dtmf="true">
  <prompt>Hello and welcome to hell!<enumerate/></prompt>
  <choice next="https://hell.com/1.vxml">Press 1 to make the wrong choice.</choice>
  <choice next="https://hell.com/2.vxml">Press 2 to listen to elevator music.</choice>
  <choice next="https://hell.com/3.vxml">Press 3 if you've lost the will.</choice>
  <noinput>Press 4 to hear these options again.<enumerate/></noinput>
 </menu>
</vxml>

Conversation architecture

Speak
Listen
Understand
Respond

Automatic Speech Recognition (ASR)

Recognises human speech and converts it into text

Apple voice recognition (1993)

Natural Language Processing (NLP)

Takes text and converts it into structured data

Machine Learning

Takes structured data, processes it, and returns structured data

Natural Language Generation (NLG)

Takes structured data and converts it into text

Text To Speech (TTS)

Takes text and converts it into synthetic speech
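
The stages above chain together: ASR produces text, NLP turns it into structured data, the service processes it, NLG turns the result back into text, and TTS speaks it. Below is a minimal sketch of that data flow. Every function name is a hypothetical stand-in, real systems use trained models rather than string matching, and the ASR and TTS ends are left out so the example runs on text alone.

```javascript
// Hypothetical stand-ins for each stage; none of this is a real speech API.

// NLP: text in, structured data out.
function understand(text) {
  const match = /^tell me about (.+)$/i.exec(text.trim());
  return match
    ? { intent: "AboutConferenceIntent", slots: { conference: match[1] } }
    : { intent: "FallbackIntent", slots: {} };
}

// Service logic: structured data in, structured data out.
function fulfil(request) {
  if (request.intent === "AboutConferenceIntent") {
    return { type: "about", conference: request.slots.conference };
  }
  return { type: "unknown" };
}

// NLG: structured data in, text out.
function generate(result) {
  return result.type === "about"
    ? `Here is what I know about ${result.conference}.`
    : "Sorry, I didn't catch that.";
}

// Text-only pipeline: ASR would supply `heard`, TTS would speak the result.
function converse(heard) {
  return generate(fulfil(understand(heard)));
}

console.log(converse("Tell me about Web Stories"));
```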

Voder synthesizer (1939)

Bell Labs IBM 704 (1961)

Parts of a skill

Skill interface
Skill service

Interaction schema

{
	"interactionModel": {
		"languageModel": {
			"invocationName": "conference info",
			"intents": [],
			"types": []
		}
	}
}

Default intents

AMAZON.CancelIntent
AMAZON.HelpIntent
AMAZON.StopIntent
AMAZON.NavigateHomeIntent

Built-in intents

25 available, including:

AMAZON.FallbackIntent
AMAZON.NoIntent
AMAZON.YesIntent
AMAZON.RepeatIntent

AMAZON.HelpIntent schema

{
	"interactionModel": {
		"languageModel": {
			"invocationName": "conference info",
			"intents": [
				{
					"name": "AMAZON.HelpIntent",
					"samples": []
				}
			]
		}
	}
}

AboutConferenceIntent schema

{
	"interactionModel": {
		"languageModel": {
			"invocationName": "conference info",
			"intents": [
				{
					"name": "AboutConferenceIntent",
					"samples": [ "about Australian Accessibility Conference" ]
				}
			],
			"types": []
		}
	}
}

Adding a slot, part 1 - intents

{
	"intents": [
		{
			"name": "AboutConferenceIntent",
			"slots": [
				{
					"name": "conference",
					"type": "conference"
				}
			],
			"samples": [ "about {conference}" ]
		}
	],
	...

Adding a slot, part 2 - types

...
	"types": [
		{
			"name": "conference",
			"values": [
				{
					"name": { "value": "Australian Accessibility Conference" }
				},
				{
					"name": { "value": "Sight Tech Global" }
				},
				{
					"name": { "value": "Web Stories" }
				}
			]
		}
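
To illustrate how a sample utterance such as "about {conference}" and a slot type work together, here is a rough sketch of slot filling. It is not how Alexa resolves slots (its language understanding is far more flexible than an exact lookup). The `matchIntent` function and the flattened `types` object are assumptions made for the example.

```javascript
// Simplified stand-in for slot filling, using flattened copies of the schema above.
const types = {
  conference: ["Australian Accessibility Conference", "Sight Tech Global", "Web Stories"],
};
const intent = {
  name: "AboutConferenceIntent",
  slots: [{ name: "conference", type: "conference" }],
  samples: ["about {conference}"],
};

// Try each sample utterance; return the filled slots, or null if nothing matches.
function matchIntent(utterance, intent, types) {
  for (const sample of intent.samples) {
    // Turn "about {conference}" into /^about (?<conference>.+)$/i.
    const pattern = sample.replace(/\{(\w+)\}/g, "(?<$1>.+)");
    const match = new RegExp(`^${pattern}$`, "i").exec(utterance.trim());
    if (!match) continue;

    let slots = {};
    for (const slot of intent.slots) {
      const spoken = (match.groups || {})[slot.name];
      // Only accept values listed for the slot's type (case-insensitive).
      const value = spoken === undefined
        ? undefined
        : types[slot.type].find(v => v.toLowerCase() === spoken.toLowerCase());
      if (value === undefined) { slots = null; break; }
      slots[slot.name] = value;
    }
    if (slots) return { intent: intent.name, slots };
  }
  return null;
}

console.log(matchIntent("about Sight Tech Global", intent, types));
```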

Alexa Skills Kit

const Alexa = require("ask-sdk-core");

Request types

LaunchRequest
IntentRequest
SessionEndedRequest
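
The request type decides which handler runs. The sketch below mimics that routing with a hypothetical `dispatch` function: handlers are tried in order and the first one whose `canHandle()` returns true handles the request. It is an illustration of the idea, not the SDK's own code.

```javascript
// Hypothetical minimal dispatcher keyed on the request type.
const handlers = [
  {
    canHandle: envelope => envelope.request.type === "LaunchRequest",
    handle: () => "launch",
  },
  {
    canHandle: envelope => envelope.request.type === "IntentRequest",
    handle: envelope => `intent:${envelope.request.intent.name}`,
  },
  {
    canHandle: envelope => envelope.request.type === "SessionEndedRequest",
    handle: () => "ended",
  },
];

// Find the first handler that accepts the request and run it.
function dispatch(envelope) {
  const handler = handlers.find(h => h.canHandle(envelope));
  if (!handler) throw new Error(`No handler for ${envelope.request.type}`);
  return handler.handle(envelope);
}

console.log(dispatch({ request: { type: "LaunchRequest" } }));
```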

JSON launch request

{
	"version": "1.0",
	"session": {
		"new": true,
		"sessionId": "amzn1.echo-api.session.b65841a9-f6cf-406d-aa6b-5c89097f4de2",
		"application": {}
	},
	"context": {
		"System": {
			"device": {}
		}
	},
	"request": {
		"type": "LaunchRequest",
		"requestId": "amzn1.echo-api.request.8a52f508-67b6-445b-a0ac-74273ed5ab13",
		"timestamp": "2020-12-08T15:20:25Z",
		"locale": "en-AU",
		"shouldLinkResultBeReturned": false
	}
}

LaunchRequestHandler

const LaunchRequestHandler = {
    canHandle(handlerInput) {
        return Alexa.getRequestType(handlerInput.requestEnvelope) === "LaunchRequest";
    },
    handle(handlerInput) {
        let welcomePrompt = "Hello. Which conference do you want to know about?";
        let welcomeReprompt = "Let me know which conference you're interested in?";
        return handlerInput.responseBuilder
            .speak(welcomePrompt)
            .reprompt(welcomeReprompt)
            .getResponse();
    }
};
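
An `IntentRequest` can be handled in the same shape. The following is a sketch of a handler for `AboutConferenceIntent` that reads the `conference` slot. It inspects the request envelope directly so that it runs without imports, and the spoken wording is placeholder text rather than anything from the original skill.

```javascript
// Sketch only: reads the request envelope directly, so it needs no imports.
const AboutConferenceIntentHandler = {
  canHandle(handlerInput) {
    const request = handlerInput.requestEnvelope.request;
    return request.type === "IntentRequest"
      && request.intent.name === "AboutConferenceIntent";
  },
  handle(handlerInput) {
    const slots = handlerInput.requestEnvelope.request.intent.slots || {};
    const conference = slots.conference && slots.conference.value;
    // Placeholder wording; a real skill would look up conference details here.
    const speech = conference
      ? `You asked about ${conference}.`
      : "Which conference do you want to know about?";
    return handlerInput.responseBuilder
      .speak(speech)
      .reprompt("Which conference do you want to know about?")
      .getResponse();
  }
};
```

With `ask-sdk-core`, handlers like this are typically registered through `Alexa.SkillBuilders.custom().addRequestHandlers(...)`.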

JSON launch request response

{
	"body": {
		"version": "1.0",
		"response": {
			"outputSpeech": {
				"type": "SSML",
				"ssml": "<speak>Hello. Which conference do you want to know about?</speak>"
			},
			"reprompt": {
				"outputSpeech": {
					"type": "SSML",
					"ssml": "<speak>Let me know which conference you're interested in?</speak>"
				}
			},
			"shouldEndSession": false,
			"type": "_DEFAULT_RESPONSE"
		},
		"sessionAttributes": {},
		"userAgent": "ask-node/2.7.0 Node/v10.17.0"
	}
}
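
The response above has a regular shape: each piece of speech is wrapped in `<speak>` and tagged as SSML. As a rough illustration, a hand-rolled helper (hypothetical, and covering only the core fields) could produce the same structure:

```javascript
// Hypothetical helper mirroring the core fields of the response shown above.
function buildResponse(speech, reprompt) {
  // Wrap plain text in <speak> and mark it as SSML output.
  const ssml = text => ({ type: "SSML", ssml: `<speak>${text}</speak>` });
  return {
    version: "1.0",
    response: {
      outputSpeech: ssml(speech),
      reprompt: { outputSpeech: ssml(reprompt) },
      shouldEndSession: false,
    },
    sessionAttributes: {},
  };
}

console.log(JSON.stringify(
  buildResponse(
    "Hello. Which conference do you want to know about?",
    "Let me know which conference you're interested in?"
  ),
  null,
  2
));
```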

Speech Synthesis Markup Language (SSML)

SSML W3C Recommendation (2004)
SSML 1.1 W3C Recommendation (2010)

Alexa, in her own voice(s)

"Hello, I'm Alexa and this is how I sound in India.
This is how I sound in America, and this is me in Canada.
Here I am in Australia, and this is me in the United Kingdom."

The `<p>` element

let responsePrompt = `<p>The terror, which would not end
					  for another 28 years – if it ever did end – began,
					  so far as I know or can tell, with a boat made from
					  a sheet of newspaper floating down a gutter swollen
					  with rain.</p>`;

The `<voice>` element

let responsePrompt = `<voice name='Matthew'>
					  	<p>The terror, which would not end
						for another 28 years – if it ever did end – began,
						so far as I know or can tell, with a boat made from
					  	a sheet of newspaper floating down a gutter swollen
					  	with rain.</p>
					  </voice>`;

The `<emphasis>` element

let responsePrompt =
	`<voice name='Matthew'>
	  <p>The <emphasis level='moderate'>terror</emphasis>, which would not end
	  for another 28 years – if it ever <emphasis level='moderate'>did</emphasis> end – began,
	  <emphasis level='reduced'>so far as I know or can tell</emphasis>, with a boat made from
	  a sheet of newspaper floating down a gutter <emphasis level='moderate'>swollen</emphasis>
	  with <emphasis level='moderate'>rain</emphasis>.</p>
	</voice>`;

The `<voice>` element again

let responsePrompt = `<voice name='Matthew'>
						<p>The Terror, which would not end
						for another 28 years...</p>
					  </voice>`;

The `<lang>` element

let responsePrompt = `<voice name='Matthew'><lang xml:lang='en-US'>
						<p>The Terror, which would not end
						for another 28 years...</p>
					  </lang></voice>`;

The `<p>` element again

let responsePrompt = `<p>Hello, my name is Inigo Montoya.
					  You killed my father. Prepare to die!</p>`;

The `<break>` element

let responsePrompt = `<voice name='Enrique'><lang xml:lang='es-ES'>
					  	<p>Hello, <break time='500ms'/> my name is Inigo Montoya.
						 You killed my father.
						 <break strength='x-strong'/>Prepare to die!</p>
					  </lang></voice>`;

The `<s>` element

let responsePrompt = `<s>Piglet sidled up to Pooh from behind.</s>
					  <s>'Pooh!' he whispered.</s>
					  <s>'Yes, Piglet?'</s>
					  <s>'Nothing', said Piglet, taking Pooh's paw.
						 'I just wanted to be sure of you'.</s>`;

The `<prosody>` element

let responsePrompt =
	`<voice name='Brian'>
	  <s>Piglet sidled up to Pooh from behind.</s>
	  <s><prosody pitch='x-high' rate='medium' volume='soft'>'Pooh'</prosody> he whispered.</s>
	  <s>'Yes, Piglet?'</s><s><prosody pitch='x-high' rate='x-fast'>'Nothing',</prosody>
		 said Piglet, taking Pooh's paw.
		 <prosody pitch='x-high' rate='x-fast'>'I just wanted to be sure of you'.</prosody></s>
	 </voice>`;
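
Writing nested SSML by hand gets error-prone as the markup grows. One possible approach is a small helper that builds tags and escapes text; the sketch below assumes two hypothetical functions, `escapeXml` and `tag`, and is not part of the Alexa SDK.

```javascript
// Escape characters that would otherwise break the SSML markup.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build <name attr='value'>content</name>. `content` is assumed to be
// already-escaped text or nested SSML produced by this same helper.
function tag(name, attrs, content) {
  const attrText = Object.entries(attrs)
    .map(([key, value]) => ` ${key}='${escapeXml(String(value))}'`)
    .join("");
  return `<${name}${attrText}>${content}</${name}>`;
}

const sentence = tag("s", {},
  tag("prosody", { pitch: "x-high", rate: "medium", volume: "soft" }, escapeXml("'Pooh!'")) +
  " he whispered.");

let responsePrompt = tag("voice", { name: "Brian" }, sentence);
console.log(responsePrompt);
```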

The `<amazon:effect>` element

let responsePrompt = `<amazon:effect name='whispered'>
						Shhh. Be very, very quiet.
					  </amazon:effect>`;

The `<p>` element again

let responsePrompt = `<p>I'm so excited (and I just can't hide it)!</p>`;

The `<amazon:emotion name="excited">` element

let responsePrompt = `<amazon:emotion name='excited' intensity='high'>
						<p>I'm so excited (and I just can't hide it)!</p>
					  </amazon:emotion>`;

The `<amazon:emotion name="disappointed">` element

let responsePrompt = `<amazon:emotion name='disappointed' intensity='high'>
						<p>I'm so excited (and I just can't hide it)!</p>
					  </amazon:emotion>`;

The `<amazon:domain name="news">` element

let responsePrompt = `<p>Now it's me, with the news. <amazon:domain name='news'>
					  Today absolutely nothing happened. Everybody went
					  and had a nice cup of tea instead.</amazon:domain></p>`;

The `<say-as interpret-as="expletive">` element

let responsePrompt = `<voice name='Joey'><lang xml:lang='en-US'>
						Frankly my dear, I don't give-a
						<say-as interpret-as='expletive'>damn</say-as>!
					  </lang></voice>`;

The `<say-as interpret-as="interjection">` element UK

let responsePrompt = `<say-as interpret-as='interjection'>
						Ace. As if. Bah humbug. Blimey. Cha ching. Crikey.
						Eek. Good Grief. Hiss. Moo. Nom nom. Oof. Oy. Simples.
						Tick tick tick. Uh huh. Whee.
					  </say-as>`;

The `<say-as interpret-as="interjection">` element Australia

let responsePrompt = `<say-as interpret-as='interjection'>
						aaaarrrr. As if. aussie aussie aussie. aw man. blimey.
						bummer. checkmate. fair dinkum. fair go. g'day.
						good onya. no worries. strewth. you beauty.
					  </say-as>`;

Web Speech API

W3C Draft Community Group Report (2012)
W3C Draft Community Group Report (2019)

Web Speech interfaces

SpeechRecognition interface
SpeechSynthesis interface

The `SpeechSynthesisUtterance` object

var utterance = new SpeechSynthesisUtterance();

The `.text` attribute

utterance.text = "Tequila!";

The `.speak()` method

window.speechSynthesis.speak(utterance);

The `.getVoices()` method

var tts = speechSynthesis.getVoices();

The `.voice` attribute

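// .voice expects a SpeechSynthesisVoice object from getVoices(), not a name string
utterance.voice = tts.find(v => v.name === "Microsoft Hazel desktop - English (Great Britain)") || null;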
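
Voice names differ between platforms, so selecting by language is often more portable than selecting by name. The sketch below assumes a hypothetical `pickVoice` helper. Its browser-only part is guarded so the snippet also runs where `speechSynthesis` does not exist, and it waits for the `voiceschanged` event because some browsers load the voice list asynchronously.

```javascript
// Pick a voice for a language tag: exact match first, then any voice
// sharing the primary language (e.g. "en"), otherwise null.
function pickVoice(voices, lang) {
  const wanted = lang.toLowerCase();
  const primary = wanted.split("-")[0];
  return voices.find(v => v.lang.toLowerCase() === wanted)
    || voices.find(v => v.lang.toLowerCase().split("-")[0] === primary)
    || null;
}

// Browser-only part: skipped when speechSynthesis is unavailable.
if (typeof speechSynthesis !== "undefined") {
  const speak = () => {
    const utterance = new SpeechSynthesisUtterance("Tequila!");
    utterance.voice = pickVoice(speechSynthesis.getVoices(), "en-GB");
    speechSynthesis.speak(utterance);
  };
  // If the voice list is still empty, wait until the browser has loaded it.
  if (speechSynthesis.getVoices().length > 0) speak();
  else speechSynthesis.addEventListener("voiceschanged", speak, { once: true });
}
```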

SpeechSynthesis demo

CSS Speech module

CSS Level 2 Aural Style Sheets W3C Recommendation (1998)
CSS Level 3 Speech module W3C Note (2018)
CSS Level 3 Speech module Candidate Recommendation (2020)

CSS Speech properties

Describe the characteristics of content when spoken by synthetic speech

The `speak:;` property

.content {
	speak: auto;
}

The `voice-volume:;` property

.content {
	voice-volume: loud;
}

The `voice-rate:;` property

.content {
	voice-rate: x-fast;
}

The `voice-pitch:;` property

.content {
	voice-pitch: x-low;
}

Screen reader demo

<h2 class="headline">News headline: Nothing happened!</h2>
<p class="date">Reported on 1 April</p>
<p>Nothing happened today.
   Everybody went and had a nice cup of tea instead.</p>

Screen reader + CSS Speech demo

.headline {
	speak: auto;
	voice-volume: x-loud;
	voice-rate: fast;
	voice-pitch: high;
}

.date {
	speak: auto;
	voice-volume: soft;
	voice-rate: x-fast;
	voice-pitch: low;
}

IBM Shoebox (1961)

Amazon Echo (2014)

Thank you
