Bristol Wearable Computing


The Speech Interface


Audio.c is a general-purpose library of audio functions used by multiple applications. Some applications need behaviour that differs only slightly from what others require, so several functions take flags indicating how they should behave. A good example is the RecWord function. At certain times the recogniser software may wish to record a word for recognition only; at other times it may wish to record a word while also watching for requests to render notes. RecWord serves both cases, depending on the arguments passed to it.

The Recording functions

In audio.c there are 3 recording functions:

RecWord records a single word spoken by the user. A socket may be passed to it so that it can also detect note-render requests if desired. The function simply detects when the user starts to speak, records, and finishes recording when the word has been spoken. It does not end-point the word, so the final recorded file contains a short pause before and after the spoken word. This is acceptable because the function is chiefly used for word recognition, where the speech engine end-points the file itself.

This function is used for recording a whole sentence or word from the user. The audio file created is end-pointed. This function may be used to record a cyber jacket response.

This function is used to record free audio from the user; its code is very similar to that of RecWord. It continually records non-end-pointed words spoken by the user, and while this is happening any data received from the jacket microphone is written to a separate file.
For every spoken word or captured vocal fragment, the function uses the recognition engine to detect what was spoken. If an "End of Note" or "Cancel Note" command is detected, recording ends. When this happens, the file to which all input was continually written is compressed and saved under an appropriate file name.

Rendering notes

One of the ways in which notes may be rendered on the cyber jacket is as audio through the ear piece. The audio software must be capable of rendering notes that contain either raw audio data or text data that is to be rendered as text-to-speech. These requests to render notes come directly from the render manager, which sends the audio software a block of data and a service name. The audio software must inspect this service name and treat the data accordingly. By default the audio software handles the service names defined by PCM_SERVICE, TTS_QUERY_SERVICE and AUDIO_SERVICE.
The audio/recogniser software is the ultimate front end to the cyber jacket: it is here that the cyber jacket interfaces with the user most of the time. It is the hypothesis of Locomedia that the computer must serve the user, and not the other way round. Through policy and instruction, the actions of the cyber jacket may be precisely defined so as not to interfere with the user. It is therefore important that note renderings do not interfere with the user. For this reason, audio notes may only be given to the audio software for rendering when the user is not issuing an instruction to the jacket.
This means that if the user is trying to tell the jacket to subscribe to a particular domain, for example, the render manager will not be able to pass on a triggered note to the audio software. The recogniser software initially creates a socket so that the render manager may communicate with it. However, the recogniser/audio software must only allow requests to come down this socket when a note may legally be rendered (that is, when the user is not communicating). This problem is solved by only registering the audio services when an audio note render may occur.
As audio rendering requests are only legal when the user is not trying to say anything, a good place to poll for incoming requests is in the RecWord( ) function in audio.c. It is here that the program continually waits for the user to start speaking a word.
If an incoming request is detected, the program works out which service is being requested. If it is valid, the program attempts to render the data as text-to-speech, as a TTS query, or as audio.

Rendering text to speech notes
If this service is requested, the data is treated as a textual string which is sent to the text-to-speech function. However, the program must still be able to detect whether a new note request is being sent: all audio note renderings may be interrupted at the discretion of the render manager. For this reason a copy of the original socket is passed on to the text-to-speech function to poll on. If a new service request comes in, the program kills the current note render and serves the new one.

Rendering raw audio notes
If this service is requested, the data is assumed to be encoded in base 64. The program therefore decodes the data, which is then passed to the audio play function. The audio data from a note is always assumed to be GSM-compressed.
Again, a rendering of a raw audio note may be interrupted by a new request.

Rendering queries
If this service is requested, the data is treated as a textual string, which is rendered to the user as text-to-speech. However, queries are given priority over notes: the render manager will not be able to interrupt the rendering of a query with a note.
Once the query has been issued to the user, the program then awaits a response from the user. This response is then reported back to the render manager.


Process.c is a library of functions used by multiple applications, chiefly for inter-process communication. Say, for example, a process wished to send 20 bytes of data to another process: a call to send_data would achieve this using the formal inter-process protocol.

The Recognition Engine

The speech recognition engine has been obtained courtesy of Akos Vetek of Hewlett Packard. The engine is a closed system which does not recognise words directly from live input; instead it accepts a pre-recorded spoken word from which recognition takes place. This closed approach allows us to use the engine for user recognition as well as spoken word recognition.

In order to achieve recognition, the engine must be trained with a set of words for a particular user. Before this can happen, a global variance vector must be obtained for the user's voice. This vector must be written to a file so that the recogniser can retrieve it later on.
Every time the user trains a word, a file with a .hmm extension is created. This file contains all the information about the word required to recognise it later on. A number is associated with this .hmm file which may later be used as a unique id for the word.
Once all words have been trained, recognition may take place. For a particular user, the global variance and all .hmm files must be loaded into memory. To recognise a spoken word, the word must be captured to an audio file; passing this file to the recogniser results in either a not-recognised signal or a word id. It is the responsibility of the user of the engine to know which words are associated with which ids.

User recognition

The speech recognition engine has been used to successfully recognise who is talking to the cyber jacket. For user recognition to work, the new wearer of the jacket must say “Hello jacket” into the microphone. Once this has been recorded the program does the following:

The program goes sequentially through all of the registered users' global variances, loading them one by one. Each time one is loaded, the group of .hmm files which belong to that user, and which contain the "Hello jacket" model, are loaded. An attempt is then made to recognise the spoken utterance. If a match is made, the user and the confidence with which the match was made are stored in an array. This process continues for all the registered users.
The program then has a list of all the possible candidates for who the speaker may be. If this list is empty, the jacket asks the user to say "Hello jacket" again for another attempt. If the list is not empty, the program goes through all the possible candidates looking for the one with the highest confidence rating; that candidate is chosen to be the recognised wearer.

Supporting a Grammar

It is not efficient for the speech recognition engine to have all the .hmm files loaded into memory at once. Not only is this costly in memory, but the engine would have far too many words to deal with at once. It is therefore necessary to identify an ordering of the possible spoken words and construct a grammar.

[Figure: sgram.gif — speech grammar diagram]

A grammar may be supported by our recognition system simply by loading in different sets of .hmm files. For our purposes the following sets of words implement our grammar.

Set 1 Random words
There are a few words that cannot really be classed as representing any specific function. These may be used for starting up the jacket, or for telling the jacket to stop recording.


Set 2 Commands
These words instruct the jacket to do something, or report back to the wearer.


Set 3 Locations
These words are all the locations that the user may refer to. This set also contains a few commands which may be issued to the jacket while carrying out user requests on locations. See grammar



Set 4 Domains
These words are all the domains that the user may refer to. This set also contains a few commands which may be issued to the jacket while carrying out user requests on domains. See grammar


Set 5 Note Control Words
These words are used by the user when a note is playing.


The material displayed is provided 'as is' and is subject to use restrictions.
For problems or questions regarding this web contact Cliff Randell.
Last updated: January 14, 2000.
© Copyright Hewlett-Packard 1997-2000.