Talking technology

Léonie Watson

Finch Frontend, Edinburgh September 2019

Léonie Watson, TetraLogical

Conversation architecture

Automatic Speech Recognition (ASR)

Recognises human speech and converts it into text

IBM Shoebox

Natural Language Processing (NLP)

Takes text and converts it into structured data

Machine Learning

Takes structured data, processes it, and returns structured data

Natural Language Generation (NLG)

Takes structured data and converts it into text

Text To Speech (TTS)

Takes text and converts it into synthetic speech
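
The five stages above chain together, each consuming the previous stage's output. A minimal sketch of that flow, assuming toy stand-ins for every stage: the function names, the `transcript` field, the `GetWeather` intent and the hard-coded forecast are all hypothetical, and real systems use trained models at each step.

```javascript
// ASR: speech -> text (assumes the audio already carries a transcript)
function recognise(audio) {
  return audio.transcript;
}

// NLP: text -> structured data
function understand(text) {
  var match = /weather in (\w+)/i.exec(text);
  return match ? { intent: "GetWeather", city: match[1] } : { intent: "Unknown" };
}

// Machine learning: structured data -> structured data (placeholder logic)
function decide(request) {
  if (request.intent === "GetWeather") {
    return { type: "Forecast", city: request.city, outlook: "rain" };
  }
  return { type: "Fallback" };
}

// NLG: structured data -> text
function generate(result) {
  if (result.type === "Forecast") {
    return "Expect " + result.outlook + " in " + result.city + ".";
  }
  return "Sorry, I did not understand that.";
}

// TTS: text -> speech (represented here as SSML for a synthesiser to render)
function synthesise(text) {
  return { ssml: "<speak>" + text + "</speak>" };
}

// The whole conversation: ASR -> NLP -> ML -> NLG -> TTS
function converse(audio) {
  return synthesise(generate(decide(understand(recognise(audio)))));
}
```

Calling `converse({ transcript: "What is the weather in Edinburgh" })` walks one utterance through all five stages.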

Formant synthesis

For millions of years mankind lived just like the animals. Then something happened that unleashed the power of our imagination: we learned to talk.

Concatenative synthesis

For millions of years mankind lived just like the animals. Then something happened that unleashed the power of our imagination: we learned to talk.

Parametric synthesis

Never had much faith in love or miracles, never wanna put my heart on the line.

Voice XML (VXML)

Voice browsers

VXML is parsed by voice browsers to create voice interfaces

VXML documents

The <prompt> element


<vxml version="2.1" xml:lang="en" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt bargein="false">Welcome!</prompt>
    </block>
  </form>
</vxml>


The <menu> element


<vxml version="2.1" xml:lang="en" xmlns="http://www.w3.org/2001/vxml">
  <menu>
    <prompt>Choose from: <enumerate/></prompt>
    <choice next="https://tequila.com/blanco.vxml">Blanco</choice>
    <choice next="https://tequila.com/reposado.vxml">Reposado</choice>
    <noinput>Please say one of <enumerate/></noinput>
  </menu>
</vxml>

Alexa LaunchRequestHandler

const LaunchRequestHandler = {
  ...
  handle(handlerInput) {
    const speechOutput = "Hello world!";

    return handlerInput.responseBuilder
      .speak(speechOutput)
      .reprompt(speechOutput)
      .getResponse();
  },
};
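
In the ASK SDK for Node.js (v2), a handler pairs `handle` with a `canHandle` method that tells the skill which requests it accepts. A hedged sketch of a complete handler follows. The name `ExampleLaunchRequestHandler` is hypothetical, and this is one plausible shape rather than the code elided from the slide.

```javascript
const ExampleLaunchRequestHandler = {
  // Accept only the request Alexa sends when the skill is opened.
  canHandle(handlerInput) {
    return handlerInput.requestEnvelope.request.type === "LaunchRequest";
  },
  // Speak a greeting, and repeat it if the user stays silent.
  handle(handlerInput) {
    const speechOutput = "Hello world!";

    return handlerInput.responseBuilder
      .speak(speechOutput)
      .reprompt(speechOutput)
      .getResponse();
  },
};
```

The SDK calls `canHandle` on each registered handler in turn and runs `handle` on the first one that returns true.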

Speech quality

Hey Jude, don't make it bad; take a sad song, and make it better.

Speech Synthesis Markup Language (SSML)

SSML elements

Describe the characteristics of synthetic speech

Alexa default

// A template literal (backticks) lets the string span several lines.
const speechOutput =
  `The terror, which would not end for another twenty-eight years – if it ever
  did end – began, so far as I know or can tell, with a boat made from a sheet
  of newspaper floating down a gutter swollen with rain.`;

The <voice> & <emphasis> elements

const speechOutput = 
  `<voice name='Matthew'><lang xml:lang='en-US'>
    The <emphasis level='moderate'>terror</emphasis>, which would
    not end for another twenty-eight years – if it ever 
    <emphasis level='moderate'>did</emphasis> end – began, 
    <emphasis level='reduced'>so far as I know or can tell</emphasis>, 
    with a boat made from a sheet of newspaper floating down a gutter 
    <emphasis level='moderate'>swollen</emphasis> with 
    <emphasis level='moderate'>rain</emphasis>.
  </lang></voice>`;

The <p> element

const speechOutput =
  "<p>Hello, my name is Inigo Montoya. You killed my father. Prepare to die.</p>";

The <lang> & <break> elements

const speechOutput =
  `<voice name='Enrique'><lang xml:lang='es-ES'>
  <p>Hello, <break time='500ms'/> my name is Inigo Montoya.
  You killed my father. <break strength='x-strong'/>Prepare to die.</p>
  </lang></voice>`;

The <s> element

const speechOutput = 
  `Piglet sidled up to Pooh from behind.
  'Pooh!' he whispered.
  'Yes, Piglet?'
  'Nothing', said Piglet, taking Pooh's paw.
  I just wanted to be sure of you.`;
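
The snippet above contains no `<s>` markup, so the synthesiser has to infer sentence boundaries from punctuation alone. A small sketch of marking each sentence explicitly, assuming a hypothetical helper named `wrapSentences` that is not part of any SDK:

```javascript
// Hypothetical helper: wrap each sentence in an SSML <s> element.
function wrapSentences(sentences) {
  return sentences
    .map(function (sentence) { return "<s>" + sentence + "</s>"; })
    .join("");
}

const markedUpSpeech = wrapSentences([
  "Piglet sidled up to Pooh from behind.",
  "'Pooh!' he whispered.",
  "'Yes, Piglet?'",
  "'Nothing', said Piglet, taking Pooh's paw.",
  "I just wanted to be sure of you."
]);
```

Each `<s>` element tells the synthesiser where a sentence starts and ends.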

The <prosody> element

const speechOutput = 
  `<voice name='Brian'>
  Piglet sidled up to Pooh from behind.
  <prosody pitch='x-high' rate='medium' volume='soft'>'Pooh!'</prosody> he whispered.
  'Yes, Piglet?'
  <prosody pitch='x-high' rate='x-fast'>'Nothing',</prosody> said Piglet, taking Pooh's paw.
  <prosody pitch='x-high' rate='x-fast'>I just wanted to be sure of you.</prosody>
  </voice>`;

The <amazon:effect> element

const speechOutput =
  "<amazon:effect name='whispered'>Shush! Be very, very quiet.</amazon:effect>";

The <say-as> element

const speechOutput = 
  `<voice name='Joey'><lang xml:lang='en-US'>Frankly my dear, I don't give-a</lang>
  <say-as interpret-as='expletive'>damn</say-as>!
  </voice>`;

The interpret-as="interjection" attribute

const speechOutput = "<say-as interpret-as='interjection'>Cha ching!</say-as>";

SSML in browsers

SSML elements are not rendered in the DOM

Web Speech API

Web Speech interfaces

The SpeechSynthesisUtterance object

var utterance = new SpeechSynthesisUtterance();

The .text attribute

utterance.text = "Tequila!";

The .speak() method

window.speechSynthesis.speak(utterance);

The .getVoices() method

var tts = speechSynthesis.getVoices();

The .voice attribute

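// .voice expects a SpeechSynthesisVoice object from getVoices(), not a string.
utterance.voice = speechSynthesis.getVoices().find(voice => voice.name === "Microsoft Hazel desktop - English (Great Britain)");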
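
The `.voice` attribute takes one of the voice objects returned by `.getVoices()`, and a named voice may not be installed on every device. A hedged sketch that ties the preceding steps together and falls back to the default voice: `findVoice` and `speakWith` are hypothetical helper names, and the synthesiser and utterance constructor are passed in as parameters so the sketch does not assume a browser.

```javascript
// Hypothetical helper: find a voice object by its name, or null if absent.
function findVoice(voices, name) {
  for (var i = 0; i < voices.length; i++) {
    if (voices[i].name === name) {
      return voices[i];
    }
  }
  return null;
}

// Hypothetical helper: speak `text`, using `voiceName` when it is available.
// In a page, `synth` would be window.speechSynthesis and `Utterance`
// would be SpeechSynthesisUtterance.
function speakWith(synth, Utterance, text, voiceName) {
  var utterance = new Utterance(text);
  var voice = findVoice(synth.getVoices(), voiceName);
  if (voice) {
    utterance.voice = voice; // otherwise the default voice is used
  }
  synth.speak(utterance);
  return utterance;
}
```

In a browser this could be called as `speakWith(window.speechSynthesis, SpeechSynthesisUtterance, "Tequila!", "Microsoft Hazel desktop - English (Great Britain)")`. Some browsers load voices asynchronously, so `.getVoices()` can return an empty list until the `voiceschanged` event has fired.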

SpeechSynthesis demo

SpeechSynthesis hints demo

CSS Speech module

CSS Speech properties

Describe the characteristics of content when spoken by synthetic speech

The speak property

.content {
  speak: auto;
}

The voice-volume property

.content {
  voice-volume: loud;
}

The voice-rate property

.content {
  voice-rate: x-fast;
}

The voice-pitch property

.content {
  voice-pitch: x-low;
}
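
Support for these properties may vary between browsers, so it can help to test for them before relying on them. A minimal feature-detection sketch, assuming a hypothetical helper named `supportedSpeechProperties` that is handed a function with the same signature as the browser's `CSS.supports(property, value)`:

```javascript
// The declarations used in the examples above.
var speechDeclarations = {
  "speak": "auto",
  "voice-volume": "loud",
  "voice-rate": "x-fast",
  "voice-pitch": "x-low"
};

// Hypothetical helper: return the names of the properties that the
// given `supports(property, value)` function reports as supported.
function supportedSpeechProperties(supports) {
  return Object.keys(speechDeclarations).filter(function (property) {
    return supports(property, speechDeclarations[property]);
  });
}
```

In a browser the helper could be called as `supportedSpeechProperties(function (p, v) { return CSS.supports(p, v); })`.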

Screen reader demo

News headline: Nothing happened!

Reported on 1 April

Nothing happened today. Everybody went and had a cup of tea instead.

Screen reader + CSS Speech demo

.headline {
  speak: auto;
  voice-volume: x-loud;
  voice-rate: fast;
  voice-pitch: high;
}

.date {
  speak: auto;
  voice-volume: soft;
  voice-rate: x-fast;
  voice-pitch: low;
}

Thank you!

Léonie Watson, TetraLogical