You talkin' to me?

Léonie Watson, TetraLogical

Frontend North, Sheffield, February 2020

IBM Shoebox (1961)

VoiceXML (VXML)

Voice browsers

VXML is parsed in voice browsers to create voice interfaces

The <prompt> element

<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
 <form>
  <block>
   <prompt bargein="false">Hello and welcome to hell!</prompt>
  </block>
 </form>
</vxml>

The <menu> element

<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
 <menu>
  <prompt>Hello and welcome to hell!<enumerate/></prompt>
  <choice next="https://hell.com/1.vxml">Press 1 to make the wrong choice.</choice>
  <choice next="https://hell.com/2.vxml">Press 2 to listen to elevator music.</choice>
  <choice next="https://hell.com/3.vxml">Press 3 if you've lost the will.</choice>
  <noinput>Press 4 to hear these options again.<enumerate/></noinput>
 </menu>
</vxml>

Conversation architecture

Automatic Speech Recognition (ASR)

Recognises human speech and converts it into text

Apple voice recognition (1993)

Natural Language Processing (NLP)

Takes text and converts it into structured data

Machine Learning

Takes structured data, processes it, and returns structured data

Natural Language Generation (NLG)

Takes structured data and converts it into text

Text To Speech (TTS)

Takes text and converts it into synthetic speech
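
The five stages above can be read as a chain of functions, each consuming the previous stage's output. The sketch below is a toy illustration only: every function name is hypothetical, the "audio" is faked as an object carrying a transcript, and the "speech" is faked as an SSML string. Real assistants use trained models at each step.

```javascript
// Toy stand-ins for each stage; all names here are hypothetical.

// ASR: audio in, text out. The audio is faked as an object with a transcript.
function recogniseSpeech(audio) {
  return audio.transcript.trim().toLowerCase();
}

// NLP: text in, structured data out.
function parseText(text) {
  const match = /^tell me about (.+)$/.exec(text);
  return match
    ? { intent: "AboutConference", conference: match[1] }
    : { intent: "Unknown" };
}

// Processing (the "machine learning" box): structured data in, structured data out.
const knownConferences = { "front end north": { city: "Sheffield", year: 2020 } };
function processRequest(request) {
  const details = knownConferences[request.conference];
  return details
    ? { found: true, name: request.conference, ...details }
    : { found: false };
}

// NLG: structured data in, text out.
function generateText(result) {
  return result.found
    ? `${result.name} takes place in ${result.city} in ${result.year}.`
    : "Sorry, I don't know that conference.";
}

// TTS: text in, speech out. Faked here as an SSML string rather than audio.
function synthesise(text) {
  return `<speak>${text}</speak>`;
}

// The whole conversation turn is the five stages applied in order.
function converse(audio) {
  return synthesise(generateText(processRequest(parseText(recogniseSpeech(audio)))));
}

console.log(converse({ transcript: "Tell me about Front End North" }));
```

Keeping each stage as a separate function mirrors the architecture: any one stage could be swapped out without touching the others, as long as its input and output types stay the same.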

Voder synthesizer (1939)

Bell Labs IBM 704 (1961)

Amazon Echo (2014)

Parts of a skill

Interaction schema

{
	"interactionModel": {
		"languageModel": {
			"invocationName": "conference info",
			"intents": [],
			"types": []
		}
	}
}

Default intents

Built-in intents

25 available, including:

AMAZON.HelpIntent schema

{
	"interactionModel": {
		"languageModel": {
			"invocationName": "conference info",
			"intents": [
				{
					"name": "AMAZON.HelpIntent",
					"samples": []
				}
			]
		}
	}
}

AboutConferenceIntent schema

{
	"interactionModel": {
		"languageModel": {
			"invocationName": "conference info",
			"intents": [
				{
					"name": "AboutConferenceIntent",
					"samples": [ "about Front End North" ]
				}
			],
			"types": []
		}
	}
}

Adding a slot, part 1 - intents

{
	"intents": [
		{
			"name": "AboutConferenceIntent",
			"slots": [
				{
					"name": "conference",
					"type": "conference"
				}
			],
			"samples": [ "about {conference}" ]
		}
	],
	...

Adding a slot, part 2 - types

...
	"types": [
		{
			"name": "conference",
			"values": [
				{
					"name": { "value": "Joy of Coding" }
				},
				{
					"name": { "value": "Beyond Tellerrand" }
				},
				{
					"name": { "value": "Front End North" }
				}
			]
		}
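
One way to picture how a sample utterance and a slot type work together is a toy matcher that turns "about {conference}" into a pattern and fills the slot from the listed values. This is only a sketch under loud assumptions: Alexa's real language model is statistical rather than a regular expression, the function names are hypothetical, and only two of the slot values are included to keep it short.

```javascript
// A toy matcher, not how Alexa actually resolves utterances.
const intent = {
  name: "AboutConferenceIntent",
  slots: [{ name: "conference", type: "conference" }],
  samples: ["about {conference}"]
};
const types = {
  conference: ["Joy of Coding", "Front End North"]
};

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Turn "about {conference}" into a pattern with one named group per slot.
function sampleToPattern(sample, slots) {
  const source = sample
    .split(/(\{\w+\})/) // keep the {slot} tokens as separate parts
    .map((part) => {
      const slotMatch = /^\{(\w+)\}$/.exec(part);
      if (!slotMatch) return escapeRegExp(part);
      const slot = slots.find((candidate) => candidate.name === slotMatch[1]);
      const values = types[slot.type].map(escapeRegExp).join("|");
      return `(?<${slot.name}>${values})`;
    })
    .join("");
  return new RegExp(`^${source}$`, "i");
}

// Return the intent name and filled slots, or null when nothing matches.
function matchUtterance(utterance) {
  for (const sample of intent.samples) {
    const match = sampleToPattern(sample, intent.slots).exec(utterance.trim());
    if (match) return { intent: intent.name, slots: { ...match.groups } };
  }
  return null;
}

console.log(matchUtterance("about Front End North"));
```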

Alexa Skills Kit

const Alexa = require("ask-sdk-core");

Request types

JSON launch request

	"version": "1.0",
	"session": {
		"new": true,
		"sessionId": "amzn1.echo-api.session.b65841a9-f6cf-406d-aa6b-5c89097f4de2",
		"application": {}
	},
	"context": {
		"System": {
			"device": {}
		}
	},
	"request": {
		"type": "LaunchRequest",
		"requestId": "amzn1.echo-api.request.8a52f508-67b6-445b-a0ac-74273ed5ab13",
		"timestamp": "2020-02-05T15:50:25Z",
		"locale": "en-GB",
		"shouldLinkResultBeReturned": false
	}

LaunchRequestHandler

const LaunchRequestHandler = {
    canHandle(handlerInput) {
        return Alexa.getRequestType(handlerInput.requestEnvelope) === "LaunchRequest";
    },
    handle(handlerInput) {
        let welcomePrompt = "Hello. Which conference do you want to know about?";
        let welcomeReprompt = "Let me know which conference you're interested in?";
        return handlerInput.responseBuilder
            .speak(welcomePrompt)
            .reprompt(welcomeReprompt)
            .getResponse();
    }
};
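
The canHandle/handle pair suggests how a request finds its handler: each handler is asked in turn whether it can handle the request, and the first one to say yes handles it. The sketch below imitates that routing without the SDK. The `dispatch` helper and the two tiny handlers are hypothetical, not part of ask-sdk-core, and the real SDK does considerably more (interceptors, error handlers, response building).

```javascript
// A simplified stand-in for the SDK's routing; `dispatch` is a hypothetical helper.
function dispatch(handlers, handlerInput) {
  const handler = handlers.find((candidate) => candidate.canHandle(handlerInput));
  if (!handler) {
    throw new Error("No handler can handle this request");
  }
  return handler.handle(handlerInput);
}

// Two minimal handlers that inspect the request type directly.
const launchHandler = {
  canHandle: (input) => input.requestEnvelope.request.type === "LaunchRequest",
  handle: () => "launch"
};
const sessionEndedHandler = {
  canHandle: (input) => input.requestEnvelope.request.type === "SessionEndedRequest",
  handle: () => "ended"
};

// Order matters: the first handler whose canHandle returns true wins.
const handlers = [launchHandler, sessionEndedHandler];
const result = dispatch(handlers, {
  requestEnvelope: { request: { type: "LaunchRequest" } }
});
console.log(result);
```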

JSON launch request response

	"body": {
		"version": "1.0",
		"response": {
			"outputSpeech": {
				"type": "SSML",
				"ssml": "<speak>Hello. Which conference do you want to know about?</speak>"
			},
			"reprompt": {
				"outputSpeech": {
					"type": "SSML",
					"ssml": "<speak>Let me know which conference you're interested in?</speak>"
				}
			},
			"shouldEndSession": false,
			"type": "_DEFAULT_RESPONSE"
		},
		"sessionAttributes": {},
		"userAgent": "ask-node/2.7.0 Node/v10.17.0"
	}
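
A handler for AboutConferenceIntent could follow the same shape as the launch handler, reading the conference slot from the request. This is a sketch, not the skill's actual code: the handler name and the response wording are assumptions, and it reads the request envelope directly rather than using the SDK's helper functions so that it runs without any dependencies.

```javascript
// Hypothetical handler for the AboutConferenceIntent defined in the interaction model.
const AboutConferenceIntentHandler = {
  canHandle(handlerInput) {
    const request = handlerInput.requestEnvelope.request;
    return request.type === "IntentRequest"
      && request.intent.name === "AboutConferenceIntent";
  },
  handle(handlerInput) {
    const slots = handlerInput.requestEnvelope.request.intent.slots || {};
    const conference = slots.conference && slots.conference.value;
    // Placeholder wording: the real response text is not shown in the slides.
    const speech = conference
      ? `You asked about ${conference}.`
      : "Which conference do you want to know about?";
    return handlerInput.responseBuilder
      .speak(speech)
      .reprompt("Let me know which conference you're interested in?")
      .getResponse();
  }
};
```

When the slot is empty the handler falls back to asking the question again, which keeps the session open in the same way as the launch response.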

Speech Synthesis Markup Language (SSML)

Alexa, in her own voice(s)

"Hello, I'm Alexa and this is how I sound in India.
This is how I sound in America, and this is me in Canada.
Here I am in Australia, and this is me in the United Kingdom."

The <p> element

let responsePrompt = `<p>The terror, which would not end
					  for another 28 years – if it ever did end – began,
					  so far as I know or can tell, with a boat made from
					  a sheet of newspaper floating down a gutter swollen
					  with rain.</p>`;

The <voice> element

let responsePrompt = `<voice name='Matthew'>
					  	<p>The terror, which would not end
						for another 28 years – if it ever did end – began,
						so far as I know or can tell, with a boat made from
					  	a sheet of newspaper floating down a gutter swollen
					  	with rain.</p>
					  </voice>`;

The <emphasis> element

let responsePrompt =
	`<voice name='Matthew'>
	  <p>The <emphasis level='moderate'>terror</emphasis>, which would not end
	  for another 28 years – if it ever <emphasis level='moderate'>did</emphasis> end – began,
	  <emphasis level='reduced'>so far as I know or can tell</emphasis>, with a boat made from
	  a sheet of newspaper floating down a gutter <emphasis level='moderate'>swollen</emphasis>
	  with <emphasis level='moderate'>rain</emphasis>.</p>
	</voice>`;

The <voice> element again

let responsePrompt = `<voice name='Matthew'>
						<p>The Terror, which would not end
						for another 28 years...</p>
					  </voice>`;

The <lang> element

let responsePrompt = `<voice name='Matthew'><lang xml:lang='en-US'>
						<p>The Terror, which would not end
						for another 28 years...</p>
					  </lang></voice>`;

The <p> element again

let responsePrompt = `<p>Hello, my name is Inigo Montoya.
					  You killed my father. Prepare to die!</p>`;

The <break> element

let responsePrompt = `<voice name='Enrique'><lang xml:lang='es-ES'>
					  	<p>Hello, <break time='500ms'/> my name is Inigo Montoya.
						 You killed my father.
						 <break strength='x-strong'/>Prepare to die!</p>
					  </lang></voice>`;
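
Because these prompts are XML inside template literals, text that arrives from elsewhere (a slot value, for example) can break the markup if it contains characters such as & or <. The sketch below shows one possible way to escape element content and to build pauses programmatically. The helper names are hypothetical and this is not part of the Alexa SDK.

```javascript
// Escape the characters that are special in XML element content.
function escapeSsml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build a <break> with a fixed duration in milliseconds.
function pause(milliseconds) {
  return `<break time='${milliseconds}ms'/>`;
}

// Join sentences into one paragraph with the same pause between each.
function paragraphWithPauses(sentences, milliseconds) {
  const separator = ` ${pause(milliseconds)} `;
  return `<p>${sentences.map(escapeSsml).join(separator)}</p>`;
}

console.log(paragraphWithPauses(["Hello,", "my name is Inigo Montoya."], 500));
```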

The <s> element

let responsePrompt = `<s>Piglet sidled up to Pooh from behind.</s>
					  <s>'Pooh!' he whispered.</s>
					  <s>'Yes, Piglet?'</s>
					  <s>'Nothing', said Piglet, taking Pooh's paw.
						 'I just wanted to be sure of you'.</s>`;

The <prosody> element

let responsePrompt =
	`<voice name='Brian'>
	  <s>Piglet sidled up to Pooh from behind.</s>
	  <s><prosody pitch='x-high' rate='medium' volume='soft'>'Pooh'</prosody> he whispered.</s>
	  <s>'Yes, Piglet?'</s><s><prosody pitch='x-high' rate='x-fast'>'Nothing',</prosody>
		 said Piglet, taking Pooh's paw.
		 <prosody pitch='x-high' rate='x-fast'>'I just wanted to be sure of you'.</prosody></s>
	 </voice>`;
	

The <amazon:effect> element

let responsePrompt = `<amazon:effect name='whispered'>
						Shhh. Be very, very quiet.
					  </amazon:effect>`;
					

The <p> element again

let responsePrompt = `<p>I'm so excited (and I just can't hide it)!</p>`;

The <amazon:emotion name="excited"> element

let responsePrompt = `<amazon:emotion name='excited' intensity='high'>
						<p>I'm so excited (and I just can't hide it)!</p>
					  </amazon:emotion>`;

The <amazon:emotion name="disappointed"> element

let responsePrompt = `<amazon:emotion name='disappointed' intensity='high'>
						<p>I'm so excited (and I just can't hide it)!</p>
					  </amazon:emotion>`;
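
Since the two examples differ only in the name attribute, the wrapper could be generated by a small function that also rejects values the element is not expected to accept. This is a sketch: the function name is hypothetical, and the allowed lists (the two emotions shown in the slides, and low, medium and high intensity) are assumed from the examples and may not match every locale or voice.

```javascript
// Values assumed to be accepted by <amazon:emotion>; treat these lists as assumptions.
const EMOTIONS = ["excited", "disappointed"];
const INTENSITIES = ["low", "medium", "high"];

// Wrap existing SSML in an <amazon:emotion> element after validating the attributes.
function withEmotion(name, intensity, ssml) {
  if (!EMOTIONS.includes(name)) {
    throw new RangeError(`Unknown emotion: ${name}`);
  }
  if (!INTENSITIES.includes(intensity)) {
    throw new RangeError(`Unknown intensity: ${intensity}`);
  }
  return `<amazon:emotion name='${name}' intensity='${intensity}'>${ssml}</amazon:emotion>`;
}

console.log(withEmotion("excited", "high", "<p>I'm so excited!</p>"));
```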

The <amazon:domain name="news"> element

let responsePrompt = `<p>Now it's me, with the news. <amazon:domain name='news'>
					  Today absolutely nothing happened. Everybody went
					  and had a nice cup of tea instead.</amazon:domain></p>`;

The <say-as interpret-as="expletive"> element

let responsePrompt = `<voice name='Joey'><lang xml:lang='en-US'>
						Frankly my dear, I don't give-a
						<say-as interpret-as='expletive'>damn</say-as>!
					  </lang></voice>`;

The <say-as interpret-as="interjection"> element UK

let responsePrompt = `<say-as interpret-as='interjection'>
						Ace. As if. Bah humbug. Blimey. Cha ching. Crikey.
						Eek. Good Grief. Hiss. Moo. Nom nom. Oof. Oy. Simples.
						Tick tick tick. Uh huh. Whee.
					  </say-as>`;

The <say-as interpret-as="interjection"> element Australia

let responsePrompt = `<say-as interpret-as='interjection'>
						aaaarrrr. As if. aussie aussie aussie. aw man. blimey.
						bummer. checkmate. fair dinkum. fair go. g'day.
						good onya. no worries. strewth. you beauty.
					  </say-as>`;

Web Speech API

Web Speech interfaces

The SpeechSynthesisUtterance object

var utterance = new SpeechSynthesisUtterance();

The .text attribute

utterance.text = "Tequila!";

The .speak() method

window.speechSynthesis.speak(utterance);

The .getVoices() method

var tts = speechSynthesis.getVoices();

The .voice attribute

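// .voice expects a SpeechSynthesisVoice object from getVoices(), not a string
utterance.voice = tts.find(voice => voice.name === "Microsoft Hazel desktop - English (Great Britain)");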
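
Putting the pieces together, a page might choose a voice by name, fall back to any voice for the right language, and wait for the voice list if it has not loaded yet. This is a minimal sketch: `pickVoice` and `speak` are hypothetical helper names, and the assumption that voices can load asynchronously and announce themselves with a "voiceschanged" event may not hold in every browser.

```javascript
// Pure helper: pick a voice by exact name, else the first voice for the
// language, else null (which leaves the browser's default voice in place).
function pickVoice(voices, name, lang) {
  return voices.find((voice) => voice.name === name)
    || voices.find((voice) => voice.lang === lang)
    || null;
}

// Browser-only: speak the text once the voice list is available.
function speak(text, name, lang) {
  const synth = window.speechSynthesis;
  const start = () => {
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.voice = pickVoice(synth.getVoices(), name, lang);
    synth.speak(utterance);
  };
  if (synth.getVoices().length > 0) {
    start();
  } else {
    // Assumed behaviour: the list is empty until "voiceschanged" fires.
    synth.addEventListener("voiceschanged", start, { once: true });
  }
}
```

In a browser this could be called as `speak("Tequila!", "Microsoft Hazel desktop - English (Great Britain)", "en-GB")`. Keeping `pickVoice` separate from the browser calls means the selection logic can be checked without a speech engine.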

SpeechSynthesis demo

CSS Speech module

CSS Speech properties

Describe the characteristics of content when spoken by synthetic speech

The speak:; property

.content {
	speak: auto;
}

The voice-volume:; property

.content {
	voice-volume: loud;
}

The voice-rate:; property

.content {
	voice-rate: x-fast;
}

The voice-pitch:; property

.content {
	voice-pitch: x-low;
}

Screen reader demo

<h2 class="headline">News headline: Nothing happened!</h2>
<p class="date">Reported on 1 April</p>
<p>Nothing happened today.
   Everybody went and had a nice cup of tea instead.</p>

Screen reader + CSS Speech demo

.headline {
	speak: auto;
	voice-volume: x-loud;
	voice-rate: fast;
	voice-pitch: high;
}

.date {
	speak: auto;
	voice-volume: soft;
	voice-rate: x-fast;
	voice-pitch: low;
}
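
Before relying on rules like these, a page could check whether the browser recognises the declarations at all. The sketch below uses the standard CSS.supports() method. The function name is hypothetical, and the check only reports whether the CSS engine accepts the declaration, which is assumed to say nothing about whether a given screen reader will act on it.

```javascript
// Returns true only where the browser's CSS engine recognises the declaration.
// Outside a browser (no global CSS object) it reports no support.
function supportsCssSpeech(property = "speak", value = "auto") {
  return typeof CSS !== "undefined"
    && typeof CSS.supports === "function"
    && CSS.supports(property, value);
}

// The declarations used for the headline in the demo above.
const speechProperties = [
  ["speak", "auto"],
  ["voice-volume", "x-loud"],
  ["voice-rate", "fast"],
  ["voice-pitch", "high"]
];

// Collect the names of any properties the current environment does not recognise.
const unsupported = speechProperties
  .filter(([property, value]) => !supportsCssSpeech(property, value))
  .map(([property]) => property);

console.log(unsupported);
```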

Thank you