
How to add the Web Speech API to your site

Add a new level of interactivity to your site by letting the user control it with just their voice via the Web Speech API


There are three main ways to interact with the web: keyboard, mouse and touch. But there is another. Talking to your computer has always been a shaky affair, but the good people at the W3C and Google are pushing for a unified speech API for web developers. The possibilities are vast: done correctly, it opens up new ways of interacting with sites and new interaction patterns.
It’s currently supported by Chrome, and Firefox has hinted that it’s working on it as well. Speech input could be problematic for sensitive information, but you shouldn’t let that stop you from finding new ways to work with your site.
We’re going to build a voice interface for a music player that will integrate with the SoundCloud API. The user will be able to speak commands such as play, skip, pause and stop to control the site. Let’s get going!

Detecting support

First we’re going to detect support for the speech recognition API. Currently only Chrome supports it but rather than browser sniff or only use webkitSpeechRecognition, we’ll check for what future speech recognition could look like to try and be future-friendly. Browsers will probably have their own subtle differences so keep an eye out for them!

var speech = function () {
    // Prefer the unprefixed, standard name, then fall back to vendor prefixes.
    // Only the webkit prefix ships today; the ms and moz names are speculative.
    if (typeof SpeechRecognition !== 'undefined') {
        return new SpeechRecognition();
    } else if (typeof msSpeechRecognition !== 'undefined') {
        return new msSpeechRecognition();
    } else if (typeof mozSpeechRecognition !== 'undefined') {
        return new mozSpeechRecognition();
    } else if (typeof webkitSpeechRecognition !== 'undefined') {
        return new webkitSpeechRecognition();
    }
    throw new Error('No speech recognition API detected.');
};
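
The detection above throws when nothing is available. Where falling back to ordinary buttons is preferable, a variant that returns null can be easier to work with. The following is a minimal sketch rather than the tutorial's own code: findRecognitionConstructor is a hypothetical helper, and the ms- and moz-prefixed names are guesses at what other browsers might ship, not confirmed APIs.

```javascript
// Hypothetical helper: look up a speech recognition constructor on a given
// global object (pass `window` in the browser). Returns null when none exists.
function findRecognitionConstructor(scope) {
    // Candidate names, standard first. Only the webkit-prefixed name is known
    // to ship in Chrome; the moz/ms names are speculative.
    var names = [
        'SpeechRecognition',
        'webkitSpeechRecognition',
        'mozSpeechRecognition',
        'msSpeechRecognition'
    ];
    for (var i = 0; i < names.length; i++) {
        if (typeof scope[names[i]] === 'function') {
            return scope[names[i]];
        }
    }
    return null;
}

// Usage in a page (guarded so the snippet also loads outside a browser).
if (typeof window !== 'undefined') {
    var Recognition = findRecognitionConstructor(window);
    if (Recognition === null) {
        console.log('No speech recognition API detected; keep the buttons visible.');
    }
}
```

Passing the global object in as an argument keeps the helper easy to exercise with a stand-in object.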

Continuous recognition

Creating a new speech recognition object won’t cause it to start listening. First we can set some values, namely that we’re only after one result, so it won’t keep listening once a result is final. If continuous is set to true it’ll carry on listening after each result and deliver multiple final results, which is more useful for dictation.

var recognition = speech();
recognition.continuous = false;

Start listening

We want to show the user what we think they said. We can do so by setting interimResults to true and our language to English (by default it is the language of the browser). Finally, we can programmatically start listening by calling start(). This will trigger the browser to ask the user if microphone access is allowed.

recognition.interimResults = true;
recognition.lang = 'en-GB';
recognition.start();

Speech event listeners

It would be helpful to give the user some indication of what our application is actually doing. They’ve been asked whether we can have microphone access, but they don’t know whether what they’re saying is being picked up. The recognition object fires many events that we can listen for to remedy this. The start event is fired when recognition starts.

recognition.addEventListener('start', function () {
    feedback.innerHTML = 'Talk to me';
    button.style.display = 'none';
    for (var i = 0, len = mic.length; i < len; i++) {
        mic[i].style.fill = 'green';
    }
}, false);

Voice feedback HTML

We’ll create a small section that will be fixed to the bottom-right of the screen that will show the user when they’re being listened to, any event feedback we may have, and what we last detected they said. The specification requires that the browser also shows when it’s listening – in Chrome a pulsating record button is shown over the favicon.

<section class="voice-feedback">
    <img src="mic.svg" alt="Symbol of a microphone">
    <p id="feedback"></p>
    <p>Last command: <span id="last-command"></span></p>
    <button id="listen">Start Listening</button>
</section>

Speechstart event

There’s also an event for when the browser first detects (what it thinks is) speech that it will then transcribe (‘speechstart’). There are also ‘audiostart’ and ‘audioend’, which differ subtly from ‘start’: ‘audiostart’ fires when the browser starts capturing audio, while ‘start’ fires when it starts listening with the intent of transcribing.

recognition.addEventListener('speechstart', function () {
    feedback.innerHTML = 'Capturing';
}, false);
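
Since most of these events end up writing a short message to the feedback pod, the listeners can also be registered from a lookup table. This is a sketch under assumptions: attachFeedback and the message strings are illustrative, target is any object with an addEventListener method (the recognition object in the page), and show is whatever function updates the pod. The nomatch and error events are part of the same API but are not otherwise used in this tutorial.

```javascript
// Illustrative messages keyed by recognition event name.
var FEEDBACK_MESSAGES = {
    start: 'Talk to me',
    speechstart: 'Capturing',
    speechend: "I'm not listening",
    nomatch: "Sorry, I didn't catch that",
    error: 'Something went wrong'
};

// Register one listener per event; each passes its message to `show`.
function attachFeedback(target, show) {
    Object.keys(FEEDBACK_MESSAGES).forEach(function (type) {
        target.addEventListener(type, function () {
            show(FEEDBACK_MESSAGES[type]);
        }, false);
    });
}

// Possible use in the page:
//     attachFeedback(recognition, function (text) {
//         feedback.innerHTML = text;
//     });
```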

Speechend event

The sibling event of ‘speechstart’ is the aptly named ‘speechend’. We’ll update our voice feedback pod by returning the mic to a white colour and update the text to notify the user that they aren’t being listened to. Optionally, you can then reinitialise listening (this will trigger another notification).

recognition.addEventListener('speechend', function (event) {
    feedback.innerHTML = "I'm not listening";
    button.style.display = 'block';
    for (var i = 0, len = mic.length; i < len; i++) {
        mic[i].style.fill = '#fff';
    }
    init();
}, false);

Click to init

As well as trying to automatically relisten we’ll give the user the option to click a button and speak another command with this simple click event listener. The init function is a wrapper for all of the snippets that we’ve written up until now, so it establishes new event listeners and a new speech-recognition constructor.

listenButton.addEventListener('click', function () {
    init();
}, false);

Initialise SoundCloud

The SoundCloud SDK requires that you sign up and create a new application – use its credentials to fill in the blanks so that you can stream music on your website. You can optionally sign in as a user to get access to private tracks but for the purposes of this tutorial the basic app authentication is all that is required.

SC.initialize({
    client_id: 'Your client ID',
    redirect_uri: 'Publicly accessible URL'
});

Result event

That’s just about all we need to do with SoundCloud for now, so let’s look at how to actually use the results from the Speech API. The result event is triggered every time it detects a voice and because we set interimResults to true it’ll provide a live preview of what it thinks the user has just said.

recognition.addEventListener('result', function (event) {
    for (var i = event.resultIndex, len = event.results.length; i < len; i++) {
        lastCommand.innerHTML = event.results[i][0].transcript;
        lastCommand.style.color = 'gray';
    }
}, false);
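
With interimResults switched on, a single event can carry both settled and still-changing results. A minimal sketch of separating the two, using the isFinal flag that each result exposes: collectTranscripts is a hypothetical helper, and it only assumes that the event has a resultIndex and an array-like results list.

```javascript
// Split the results of a `result` event into final and interim text.
// Each entry in `event.results` is array-like: index 0 is the top
// alternative, and `isFinal` says whether the recogniser has settled on it.
function collectTranscripts(event) {
    var finalText = '';
    var interimText = '';
    for (var i = event.resultIndex; i < event.results.length; i++) {
        var result = event.results[i];
        if (result.isFinal) {
            finalText += result[0].transcript;
        } else {
            interimText += result[0].transcript;
        }
    }
    return { finalText: finalText, interimText: interimText };
}
```

The interim text suits a live preview, while only the final text should be treated as a command.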

Final result

Now we have updated the lastCommand text and made it grey, if the result is final then we’ll set its colour to white and write a function that will deal with the command. The event can return multiple results and alternative transcripts. Each transcript has a confidence rating between 0 and 1.

if (event.results[i].isFinal) {
    lastCommand.style.color = 'white';
    processSpeech(event.results[i][0].transcript);
}
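
The snippet above always takes the first alternative. To make use of the confidence rating, something like the following could pick the most confident alternative and discard weak ones. It is a sketch under assumptions: bestAlternative, acceptTranscript and the 0.5 threshold are illustrative, and the browser only returns more than one alternative if recognition.maxAlternatives has been raised from its default of 1.

```javascript
// Return the alternative with the highest confidence from one result.
// `result` is array-like: `result.length` alternatives, each with a
// `transcript` string and a `confidence` between 0 and 1.
function bestAlternative(result) {
    var best = result[0];
    for (var i = 1; i < result.length; i++) {
        if (result[i].confidence > best.confidence) {
            best = result[i];
        }
    }
    return best;
}

// Hypothetical threshold: ignore anything the recogniser is unsure about.
var MIN_CONFIDENCE = 0.5;

// Return the best transcript, or null if even the best guess is too weak.
function acceptTranscript(result) {
    var best = bestAlternative(result);
    return best.confidence >= MIN_CONFIDENCE ? best.transcript : null;
}
```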
 

Process speech

The transcript that the Speech API returns is simply a string and can be manipulated as you would any other string. We want to call various methods based on the contents of the command. indexOf is a way of asking ‘Is this text in this other bit of text?’. It returns the substring’s index if found, or -1 if not.

var processSpeech = function (command) {
    // Search the spoken command itself for the keyword.
    if (command.indexOf('play') > -1) {
        soundHandler.retrieveTracks(command.replace('play', ''));
    }
};
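
Two details of this approach are easy to trip over: indexOf is case-sensitive, and replace with a string pattern removes only the first match and leaves the surrounding spaces behind. A small sketch of a tidier search term, with extractQuery as a hypothetical helper name:

```javascript
// Hypothetical helper: turn "Play Daft Punk" into the search term "daft punk".
// Returns null when the transcript contains no "play" command.
function extractQuery(transcript) {
    var command = transcript.toLowerCase();
    if (command.indexOf('play') === -1) {
        return null;
    }
    // A string pattern replaces only the first occurrence; trim() removes
    // the whitespace left behind.
    return command.replace('play', '').trim();
}

console.log(extractQuery(' Play Daft Punk')); // prints: daft punk
```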

The sound handler

We referenced something called soundHandler, which will deal with all of our SoundCloud-related calls, so let’s write that now. The retrieveTracks method searches SoundCloud for whatever term the user spoke, then either plays the first track found or tells the user that nothing matched.

var soundHandler = {
    Sound: null,
    Tracks: null,
    retrieveTracks: function (query) {
        SC.get('/tracks', { q: query }, function (t) {
            if (t.length) {
                soundHandler.index = 0;
                soundHandler.Tracks = t;
                soundHandler.play(t[0]);
            } else {
                heading.innerHTML = 'Could not find a matching song';
            }
        });
    }
    // The play, pause, resume, stop and next methods from the following
    // steps also belong inside this object.
};

Playing a track

The play function starts streaming the SoundCloud track, attaches an onfinish event that will call the next track, and sets the HTML elements to display relevant information. Track provides metadata on the song (its title, album artwork, etc) and sound is the actual sound that has the play/pause methods.

play: function (track) {
    SC.stream(track.id, {
        autoPlay: true,
        onfinish: function () {
            soundHandler.next();
        }
    }, function (sound) {
        soundHandler.Sound = sound;
    });
    heading.innerHTML = track.title;
    image.src = track.artwork_url || track.waveform_url;
},

Pause, resume and stop

The next few methods are simple wrapper functions for convenience and consistency. Writing it this way means we can add additional functionality to each method without affecting other parts of the application – such as updating text or the favicon – to reflect the playback state. In this case this refers to the parent object, soundHandler.

pause: function () {
    this.Sound.pause();
},
resume: function () {
    this.Sound.resume();
},
stop: function () {
    this.Sound.stop();
},

Next method

The next() method combines a couple of the methods listed above, namely by stopping the current track, increasing the current index value and calling play() with the new track as the argument. Within processSpeech we can have multiple conditions that trigger this, such as skip and next – synonyms that sound very different but mean the same thing.

next: function () {
    this.stop();
    this.index++;
    this.play(this.Tracks[this.index]);
}
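
As written, next() will eventually step past the end of Tracks and hand undefined to play(). One way to guard against that is sketched here with a hypothetical nextIndex helper that lives outside soundHandler. Stopping at the end of the list is an assumption; wrapping back to the first track would be an equally reasonable choice.

```javascript
// Hypothetical helper: work out which track index follows `current`.
// Returns -1 when there is nothing left to play.
function nextIndex(current, trackCount) {
    var candidate = current + 1;
    return candidate < trackCount ? candidate : -1;
}

// Possible use inside soundHandler.next():
//     var i = nextIndex(this.index, this.Tracks.length);
//     if (i === -1) { this.stop(); return; }
//     this.stop();
//     this.index = i;
//     this.play(this.Tracks[i]);
```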

Expanding processSpeech

Now that our soundHandler is complete we can expand our processSpeech function to include the range of new methods that it exposes. Sometimes the speech recognition isn’t perfect: we found that ‘pause’ was consistently recognised as ‘Paul’, so we accept that as an alternative. Make sure that there is a sound to pause before calling the method, to avoid errors.

// These branches go inside processSpeech, where the parameter is `command`.
if (soundHandler.Sound && (command.indexOf('pause') > -1 || command.indexOf('paul') > -1)) {
    soundHandler.pause();
} else if (command.indexOf('stop') > -1 && soundHandler.Sound) {
    soundHandler.stop();
}
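
As the list of commands and mishearings grows, the chain of conditions can be swapped for a lookup table. This is only a sketch of that alternative: KEYWORDS, matchAction and dispatch are hypothetical names, and handler stands for any object shaped like soundHandler (a Sound property plus pause, resume, stop and next methods).

```javascript
// Map each spoken keyword (including known mishearings) to a method name.
var KEYWORDS = [
    { word: 'pause', action: 'pause' },
    { word: 'paul', action: 'pause' },   // frequent mishearing of "pause"
    { word: 'resume', action: 'resume' },
    { word: 'stop', action: 'stop' },
    { word: 'skip', action: 'next' },
    { word: 'next', action: 'next' }
];

// Return the first matching method name, or null if nothing matches.
function matchAction(transcript) {
    var command = transcript.toLowerCase();
    for (var i = 0; i < KEYWORDS.length; i++) {
        if (command.indexOf(KEYWORDS[i].word) > -1) {
            return KEYWORDS[i].action;
        }
    }
    return null;
}

// Call the matched method on `handler`, but only if a sound is loaded.
// Returns true when a method was called.
function dispatch(transcript, handler) {
    var action = matchAction(transcript);
    if (action !== null && handler.Sound) {
        handler[action]();
        return true;
    }
    return false;
}
```

Adding a synonym or a newly discovered mishearing then means adding one row to the table, for example dispatch(command, soundHandler) would treat ‘skip’ and ‘next’ identically.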

Managing sounds

SoundCloud uses SoundManager 2 (www.schillmania.com/projects/soundmanager2) to stream its tracks, so all of the methods available via SoundManager are exposed. When we call retrieveTracks we also strip the word ‘play’ so it isn’t included in the search term sent to SoundCloud, otherwise each result would have to include ‘play’!

Styling the page

Our app is fully functional but not very aesthetically pleasing. We’ll add a few styles to spruce it up. By setting a max-width of 100% on the images we ensure that they’ll be contained within their containers whether it’s a big waveform image or 100×100 artwork. Also #16161d is the colour that the human eye sees in pitch-black darkness.

body {
    background: #fafafa;
    color: #16161d;
    font-family: sans-serif;
}
img {
    margin: 0 1em 0 0;
    max-width: 100%;
}

Feedback pod

We’ll dock the feedback area to the right-hand side and give it a fixed position, then colour it black and give it an old-school border radius. There are much better ways to display a listening status – a radiating microphone symbol is a common one – but the main thing is that the intent is clear.

Speech input element

If you’re not so confident with JavaScript, you can also get voice input into a text box in WebKit browsers with the proprietary x-webkit-speech attribute. The value could be POSTed back like a standard form and dealt with server-side, or simply read with document.getElementById('voice-input').value. Unfortunately, until type="speech" becomes available to use universally, this only works with Chrome.

Sound off

The Speech API opens up lots of new opportunities for web developers, from traditional site navigation (users could simply say ‘search for houses’) to sending voice commands or adding a new level of interactivity to a game. With further browser and device support it will become more widespread and a genuinely useful tool.
