Talking technology

Finch Frontend, Edinburgh, September 2019

Léonie Watson, TetraLogical

Conversation architecture

Speak
Listen
Understand
Respond

Automatic Speech Recognition (ASR)

Recognises human speech and converts it into text

Natural Language Processing (NLP)

Takes text and converts it into structured data

Machine Learning

Takes structured data, processes it, and returns structured data

Natural Language Generation (NLG)

Takes structured data and converts it into text

Text To Speech (TTS)

Takes text and converts it into synthetic speech
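
The five stages above chain together into a single conversational turn. A minimal sketch of that flow follows. It assumes hypothetical stand-in functions (`recognise`, `parse`, `decide`, `generate` and `synthesise` are illustrative names, not part of any real SDK) with canned logic in place of real models:

```javascript
// Hypothetical stand-ins for each stage; real systems use trained models here.

// ASR: audio in, text out. The "audio" here already carries a transcript.
const recognise = (audio) => audio.transcript;

// NLP: text in, structured data (an intent plus slots) out.
const parse = (text) => {
  const match = /weather in (\w+)/i.exec(text);
  return match
    ? { intent: "GetWeather", slots: { city: match[1] } }
    : { intent: "Unknown", slots: {} };
};

// Machine learning: structured data in, structured data out.
const decide = (request) =>
  request.intent === "GetWeather"
    ? { type: "forecast", city: request.slots.city, outlook: "rain" }
    : { type: "fallback" };

// NLG: structured data in, text out.
const generate = (result) =>
  result.type === "forecast"
    ? `Expect ${result.outlook} in ${result.city}.`
    : "Sorry, I did not understand that.";

// TTS: text in, synthetic speech out (represented here as a plain object).
const synthesise = (text) => ({ format: "speech", text });

// One full turn: listen, understand, respond.
const converse = (audio) =>
  synthesise(generate(decide(parse(recognise(audio)))));

console.log(converse({ transcript: "What is the weather in Edinburgh?" }).text);
// prints "Expect rain in Edinburgh."
```

Each function only sees the output of the previous stage, which is why the stages can be swapped out independently.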

Formant synthesis

For millions of years mankind lived just like the animals. Then something happened that unleashed the power of our imagination: we learned to talk.

Concatenative synthesis

For millions of years mankind lived just like the animals. Then something happened that unleashed the power of our imagination: we learned to talk.

Parametric synthesis

Never had much faith in love or miracles, never wanna put my heart on the line.

Voice XML (VXML)

VXML 1.0 W3C Note (2000)
VXML 2.0 W3C Recommendation (2004)
VXML 2.1 W3C Recommendation (2007)
VXML 3.0 W3C Working Draft (2010)

Voice browsers

VXML is parsed in voice browsers to create voice interfaces

VXML documents

Recognise spoken words and phrases
Control dialogue flow
Respond with spoken or audio prompts
Manage telephony control

The `<prompt>` element


<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <form>
    <block>
      <prompt bargein="false">Welcome!</prompt>
    </block>
  </form>
</vxml>

The `<menu>` element


<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <menu>
    <prompt>Choose from: <enumerate/></prompt>
    <choice next="https://tequila.com/blanco.vxml">Blanco</choice>
    <choice next="https://tequila.com/reposado.vxml">Reposado</choice>
    <noinput>Please say one of <enumerate/></noinput>
  </menu>
</vxml>

Alexa `LaunchRequestHandler`

const LaunchRequestHandler = {
  ...
  handle(handlerInput) {
    const speechOutput = "Hello world!";

    return handlerInput.responseBuilder
      .speak(speechOutput)
      .reprompt(speechOutput)
      .getResponse();
  },
};
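
A handler of this kind usually pairs `handle` with a `canHandle` method that tells the SDK which requests it accepts. The sketch below is an assumption about what a complete handler might look like, not the code behind the slide. `fakeBuilder` and `fakeInput` are hypothetical stand-ins so the handler can be exercised without the Alexa SDK installed:

```javascript
const LaunchRequestHandler = {
  // Accept only the request sent when the skill is opened.
  canHandle(handlerInput) {
    return handlerInput.requestEnvelope.request.type === "LaunchRequest";
  },
  handle(handlerInput) {
    const speechOutput = "Hello world!";
    return handlerInput.responseBuilder
      .speak(speechOutput)
      .reprompt(speechOutput)
      .getResponse();
  },
};

// Hypothetical stand-in that records calls the way a chained builder would.
const fakeBuilder = () => {
  const response = {};
  const builder = {
    speak(text) { response.speech = text; return builder; },
    reprompt(text) { response.reprompt = text; return builder; },
    getResponse() { return response; },
  };
  return builder;
};

// Hypothetical input carrying only the fields the handler reads.
const fakeInput = {
  requestEnvelope: { request: { type: "LaunchRequest" } },
  responseBuilder: fakeBuilder(),
};

if (LaunchRequestHandler.canHandle(fakeInput)) {
  console.log(LaunchRequestHandler.handle(fakeInput));
}
```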

Speech quality

Hey Jude, don't make it bad; take a sad song, and make it better.

Speech Synthesis Markup Language (SSML)

SSML 1.0 W3C Recommendation (2004)
SSML 1.1 W3C Recommendation (2010)

SSML elements

Describe the characteristics of synthetic speech

Alexa default

const speechOutput =
  `The terror, which would not end for another twenty-eight years – if it ever
  did end – began, so far as I know or can tell, with a boat made from a sheet
  of newspaper floating down a gutter swollen with rain.`;

The `<voice>` & `<emphasis>` elements

const speechOutput =
  `<voice name='Matthew'><lang xml:lang='en-US'>
    The <emphasis level='moderate'>terror</emphasis>, which would
    not end for another twenty-eight years – if it ever
    <emphasis level='moderate'>did</emphasis> end – began,
    <emphasis level='reduced'>so far as I know or can tell</emphasis>,
    with a boat made from a sheet of newspaper floating down a gutter
    <emphasis level='moderate'>swollen</emphasis> with
    <emphasis level='moderate'>rain</emphasis>.
  </lang></voice>`;

The `<p>` element

const speechOutput =
  "<p>Hello, my name is Inigo Montoya. You killed my father. Prepare to die.</p>";

The `<lang>` & `<break>` elements

const speechOutput =
  `<voice name='Enrique'><lang xml:lang='es-ES'>
  <p>Hello, <break time='500ms'/> my name is Inigo Montoya.
  You killed my father. <break strength='x-strong'/>Prepare to die.</p>
  </lang></voice>`;

The `<s>` element

const speechOutput =
  `<s>Piglet sidled up to Pooh from behind.</s>
  <s>'Pooh!' he whispered.</s>
  <s>'Yes, Piglet?'</s>
  <s>'Nothing', said Piglet, taking Pooh's paw.</s>
  <s>I just wanted to be sure of you.</s>`;

The `<prosody>` element

const speechOutput =
  `<voice name='Brian'>
  Piglet sidled up to Pooh from behind.
  <prosody pitch='x-high' rate='medium' volume='soft'>'Pooh!'</prosody> he whispered.
  'Yes, Piglet?'
  <prosody pitch='x-high' rate='x-fast'>'Nothing',</prosody> said Piglet, taking Pooh's paw.
  <prosody pitch='x-high' rate='x-fast'>I just wanted to be sure of you.</prosody>
  </voice>`;

The `<amazon:effect>` element

const speechOutput =
  "<amazon:effect name='whispered'>Shush! Be very, very quiet.</amazon:effect>";

The `<say-as>` element

const speechOutput =
  `<voice name='Joey'><lang xml:lang='en-US'>Frankly my dear, I don't give-a</lang>
  <say-as interpret-as='expletive'>damn</say-as>!
  </voice>`;

The `interpret-as="interjection"` attribute

const speechOutput = "<say-as interpret-as='interjection'>Cha ching!</say-as>";
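
Because SSML is XML, any dynamic text placed inside strings like these needs its reserved characters escaped, or the markup may be rejected. A minimal sketch, assuming hypothetical helpers named `escapeSsml` and `whisper` (neither is part of the Alexa SDK):

```javascript
// Replace the five XML reserved characters with their entities.
// The ampersand goes first so later entities are not escaped twice.
const escapeSsml = (text) =>
  text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");

// Wrap escaped text in the whispered effect.
const whisper = (text) =>
  `<amazon:effect name='whispered'>${escapeSsml(text)}</amazon:effect>`;

console.log(whisper("Salt & lime"));
// prints "<amazon:effect name='whispered'>Salt &amp; lime</amazon:effect>"
```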

SSML in browsers

SSML elements are not rendered in the DOM
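
One possible fallback, in line with the point above, is to reduce an SSML string to plain text before handing it to a browser speech engine. A rough sketch, assuming a hypothetical `ssmlToPlainText` helper; the regular expression is deliberately naive, is not an XML parser, and does not decode entities:

```javascript
const ssmlToPlainText = (ssml) =>
  ssml
    .replace(/<[^>]+>/g, " ") // drop every tag, leaving a space behind
    .replace(/\s+/g, " ")     // collapse the whitespace runs this creates
    .trim();

console.log(
  ssmlToPlainText("<p>Hello, <break time='500ms'/> my name is Inigo Montoya.</p>")
);
// prints "Hello, my name is Inigo Montoya."
```

All pauses, emphasis and voice changes are lost in the process, so this only preserves the words.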

Web Speech API

W3C Community Group Final Report (2012)
W3C Draft Community Group Report (2019)

Web Speech interfaces

SpeechRecognition interface
SpeechSynthesis interface
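
The slides that follow cover synthesis only, so here is a brief sketch of the recognition side. It assumes a hypothetical `listenOnce` helper and a browser that exposes the interface, possibly under the `webkit` prefix; outside such a browser the function simply returns `null`:

```javascript
// Some browsers expose the interface with a webkit prefix.
const Recognition =
  globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition;

// Listen for one phrase and pass its transcript to onText.
// Returns the recogniser, or null when the interface is unavailable.
function listenOnce(onText, lang = "en-GB") {
  if (!Recognition) {
    return null;
  }
  const recogniser = new Recognition();
  recogniser.lang = lang;
  recogniser.interimResults = false; // only report final results
  recogniser.onresult = (event) => {
    // First result, first (most likely) alternative.
    onText(event.results[0][0].transcript);
  };
  recogniser.start();
  return recogniser;
}

listenOnce((text) => console.log(`Heard: ${text}`));
```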

The `SpeechSynthesisUtterance` object

var utterance = new SpeechSynthesisUtterance();

The `.text` attribute

utterance.text = "Tequila!";

The `.speak()` method

window.speechSynthesis.speak(utterance);
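
The three steps above can be combined into one guarded function. This is a sketch, assuming a hypothetical helper named `say`; the `rate`, `pitch` and `volume` attributes it sets are part of `SpeechSynthesisUtterance`, but the option names and defaults here are illustrative choices:

```javascript
// Speak the given text if the Web Speech API is available.
// Returns true when the utterance was queued, false otherwise.
function say(text, { rate = 1, pitch = 1, volume = 1 } = {}) {
  const synth = globalThis.speechSynthesis;
  if (!synth || typeof SpeechSynthesisUtterance === "undefined") {
    return false; // for example, when run outside a browser
  }
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = rate;     // 0.1 to 10, default 1
  utterance.pitch = pitch;   // 0 to 2, default 1
  utterance.volume = volume; // 0 to 1, default 1
  synth.speak(utterance);
  return true;
}

say("Tequila!", { rate: 0.9, pitch: 1.2 });
```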

The `.getVoices()` method

var tts = speechSynthesis.getVoices();
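
In some browsers `getVoices()` returns an empty list until the voices have loaded, at which point a `voiceschanged` event fires. A sketch of waiting for them, assuming a hypothetical `loadVoices` helper that takes the synthesiser as a parameter so it can also be tried with a stand-in object:

```javascript
// Resolve with the voice list, waiting for "voiceschanged" if it is still empty.
function loadVoices(synth) {
  return new Promise((resolve) => {
    const voices = synth.getVoices();
    if (voices.length > 0) {
      resolve(voices);
      return;
    }
    synth.addEventListener(
      "voiceschanged",
      () => resolve(synth.getVoices()),
      { once: true }
    );
  });
}

// Only runs where the Web Speech API exists.
if (typeof speechSynthesis !== "undefined") {
  loadVoices(speechSynthesis).then((voices) => {
    voices.forEach((voice) => console.log(voice.name, voice.lang));
  });
}
```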

The `.voice` attribute

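// .voice expects a SpeechSynthesisVoice object from getVoices(), not a string
utterance.voice = tts.find((voice) => voice.name === "Microsoft Hazel desktop - English (Great Britain)");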
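
A named voice exists only on systems where it is installed, so matching on language is a common fallback. A sketch assuming a hypothetical `pickVoice` helper; the sample voice objects are made up and only mimic the `name` and `lang` fields of `SpeechSynthesisVoice`:

```javascript
// Prefer an exact name match, then any voice for the language, then null.
function pickVoice(voices, { name, lang }) {
  return (
    voices.find((voice) => voice.name === name) ||
    voices.find((voice) => voice.lang === lang) ||
    null
  );
}

// Made-up sample data standing in for speechSynthesis.getVoices().
const sampleVoices = [
  { name: "Example US voice", lang: "en-US" },
  { name: "Example GB voice", lang: "en-GB" },
];

const chosen = pickVoice(sampleVoices, { name: "Not installed", lang: "en-GB" });
console.log(chosen.name); // prints "Example GB voice"
```

Returning `null` when nothing matches works with the API, since a `null` voice leaves the choice to the browser's default.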

SpeechSynthesis demo

SpeechSynthesis hints demo

CSS Speech module

CSS Level 2 Aural Style Sheets W3C Recommendation (1998)
CSS Level 3 Speech module W3C Note (2018)

CSS Speech properties

Describe the characteristics of content when spoken by synthetic speech

The `speak:;` property

.content {
  speak: auto;
}

The `voice-volume:;` property

.content {
  voice-volume: loud;
}

The `voice-rate:;` property

.content {
  voice-rate: x-fast;
}

The `voice-pitch:;` property

.content {
  voice-pitch: x-low;
}

Screen reader demo

News headline: Nothing happened!
Reported on 1 April
Nothing happened today. Everybody went and had a cup of tea instead.

Screen reader + CSS Speech demo

.headline {
  speak: auto;
  voice-volume: x-loud;
  voice-rate: fast;
  voice-pitch: high;
}

.date {
  speak: auto;
  voice-volume: soft;
  voice-rate: x-fast;
  voice-pitch: low;
}
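
Support for these properties cannot be assumed, so it may be worth checking before relying on them. A sketch using `CSS.supports()`, assuming a hypothetical `detectSpeechSupport` helper; where the `CSS` object is missing, every property is reported as unsupported:

```javascript
// Report which CSS Speech declarations the current engine accepts.
function detectSpeechSupport() {
  const declarations = {
    speak: "auto",
    "voice-volume": "loud",
    "voice-rate": "fast",
    "voice-pitch": "high",
  };
  const canTest =
    typeof CSS !== "undefined" && typeof CSS.supports === "function";
  const report = {};
  for (const [property, value] of Object.entries(declarations)) {
    report[property] = canTest ? CSS.supports(property, value) : false;
  }
  return report;
}

console.log(detectSpeechSupport());
```

A `true` result only means the declaration parses; it does not show that a screen reader will act on it.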