Speech Recognition


Capturing a spoken word

The speech engine employed in the cyberjacket requires each spoken word to be captured to a WAV file. The following techniques are used to do this:

Normally, audio is recorded using unsigned data types (a byte for 8-bit audio or an unsigned integer for 16-bit audio). On most sound cards the audio signal is offset from the nominal no-signal level (see diagram). The audio capture program is therefore required to shift all audio samples so that silence sits at the 0 level. When the audio card is initialized, this offset level is obtained by sampling a series of audio readings at background noise level. The average of these readings gives the offset of the no-signal level. The maximum and minimum values taken by the samples are also recorded. The average of these two gives a threshold value which can be used to detect whether a person is speaking.
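The calibration step described above can be sketched as follows. This is only an illustration in Python, not the cyberjacket's actual code: the function name calibrate and the sample values are assumptions made for the example.

```python
from statistics import mean

def calibrate(background_samples):
    """Estimate the no-signal offset and a speech threshold from background noise."""
    offset = mean(background_samples)      # average level with nobody speaking
    lowest = min(background_samples)       # minimum value seen during calibration
    highest = max(background_samples)      # maximum value seen during calibration
    threshold = (lowest + highest) / 2     # average of the two extremes, as in the text
    return offset, lowest, highest, threshold

# Hypothetical 8-bit readings hovering around a raised rest level.
readings = [130, 131, 129, 132, 128, 130]
offset, lowest, highest, threshold = calibrate(readings)
```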

The spoken word is captured to a WAV file by starting the recording when the person starts speaking and stopping it when the word has been spoken. The software achieves this by constantly sampling frames of n bytes from the dsp device in a loop. The frame is effectively the first part of a larger buffer designed to capture the whole word. On every pass of the loop the program first shifts all of the samples by subtracting the offset. The program then counts the number of samples above the threshold value. If this number is greater than a certain percentage of all the readings, the program assumes that the word has started to be spoken (see diagram). The program then finds the exact point in the frame where the samples first breached the threshold level. This point in the buffer is saved.
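A minimal sketch of this start-of-word test is given below. The function name detect_start, the default fraction of 20% and the sample values are assumptions for illustration. Comparing the magnitude of each offset-corrected sample against the threshold is also an assumption, since the text only says "above the threshold".

```python
def detect_start(frame, offset, threshold, min_fraction=0.2):
    """Return the index in `frame` where speech appears to start, or None."""
    centred = [s - offset for s in frame]                  # subtract the offset
    # Indices of samples whose magnitude exceeds the threshold (assumed test).
    loud = [i for i, s in enumerate(centred) if abs(s) > threshold]
    if len(loud) > min_fraction * len(frame):              # enough loud samples?
        return loud[0]                                     # first breach of the threshold
    return None

# Hypothetical frames: one of background noise, one where a word begins.
quiet = [128, 129, 127, 128, 128, 129, 128, 127]
spoken = [128, 129, 127, 160, 95, 170, 90, 165]

start_in_quiet = detect_start(quiet, offset=128, threshold=10)
start_in_spoken = detect_start(spoken, offset=128, threshold=10)
```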

The program then continuously reads frames of audio data into the large buffer. On every frame the program checks whether a certain percentage of the samples are below the threshold level. If so, the program assumes the word has finished. Again the program scans through this frame to find the last point at which the samples fell below the threshold level. This point is also recorded. If no definite ending is detected, the program continues to read samples into the buffer until it fills up. This limits the length of a spoken word to around a second.
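The end-of-word search and the full-buffer fallback might look roughly like this. The names detect_end and capture_until_end, the quiet fraction and the sample values are hypothetical. The frames are passed in as a plain list rather than read from the dsp device, so that the sketch stands alone.

```python
def detect_end(frame, offset, threshold, min_fraction=0.8):
    """Return the index just past the last loud sample if the frame is mostly quiet."""
    quiet_count = 0
    last_loud = -1
    for i, s in enumerate(frame):
        if abs(s - offset) > threshold:
            last_loud = i
        else:
            quiet_count += 1
    if quiet_count >= min_fraction * len(frame):
        return last_loud + 1          # point where the signal last dropped below threshold
    return None

def capture_until_end(frames, offset, threshold, max_samples, min_fraction=0.8):
    """Fill a buffer frame by frame; stop at a mostly quiet frame or when full."""
    buffer = []
    for frame in frames:
        room = max_samples - len(buffer)
        if room <= 0:
            break                     # buffer full: no definite ending was found
        frame = frame[:room]
        base = len(buffer)
        buffer.extend(frame)
        end = detect_end(frame, offset, threshold, min_fraction)
        if end is not None:
            return buffer, base + end
    return buffer, len(buffer)

# Hypothetical frames: a loud one, one that trails off, then silence.
frames = [[160, 95, 170, 90], [150, 128, 129, 128], [128, 128, 128, 128]]
buffer, end = capture_until_end(frames, 128, 10, max_samples=100, min_fraction=0.75)

# With no quiet frame at all, the buffer simply fills up.
full, full_end = capture_until_end([[160, 95, 170, 90]] * 3, 128, 10, max_samples=6)
```

The max_samples limit plays the role of the fixed buffer size, which is what caps a word at around a second in the description above.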

The program then copies the contents of the buffer, starting from the saved starting point and finishing at the recorded finishing point, into a file. This effectively captures the spoken word.
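As an illustration of this last step, the sketch below writes the selected slice of the buffer with Python's standard wave module. The function name save_word, the 8000 Hz sample rate and the 8-bit mono format are assumptions, as the text does not state which format the cyberjacket uses.

```python
import os
import tempfile
import wave

def save_word(path, buffer, start, end, sample_rate=8000):
    """Write buffer[start:end] as 8-bit unsigned mono PCM to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)               # mono
        wav.setsampwidth(1)               # 1 byte per sample
        wav.setframerate(sample_rate)     # assumed rate, not taken from the text
        wav.writeframes(bytes(buffer[start:end]))

# Usage with made-up samples: keep indices 2 to 5 of the buffer.
buffer = [128, 128, 160, 95, 170, 90, 128, 128]
path = os.path.join(tempfile.mkdtemp(), "word.wav")
save_word(path, buffer, start=2, end=6)

# Read the file back to confirm what was stored.
with wave.open(path, "rb") as wav:
    frames_written = wav.getnframes()
    data = wav.readframes(frames_written)
```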

The Cyber Jacket Speech interface

Recording free audio

The cyberjacket is required to allow the user to record a note while still interacting with speech recognition, so that the user can delimit the free audio with the word "stop". This is a problem because only a single program may read from or write to the dsp device at one time. This problem is solved by a piece of code which acts as a dsp server. This code accepts multiple requests for data from the dsp device and issues the data to each requester accordingly.
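One possible shape for such a server is sketched below. It is not the cyberjacket's implementation: the class name DspServer, its methods, and the push-style delivery through queues are all assumptions. A canned iterator stands in for the real dsp device.

```python
import queue

class DspServer:
    """Sole reader of the audio device; hands each frame to every client."""

    def __init__(self, read_frame):
        self._read_frame = read_frame     # callable returning the next frame as bytes
        self._clients = []

    def subscribe(self):
        """Register a client and return the queue it will receive frames on."""
        q = queue.Queue()
        self._clients.append(q)
        return q

    def pump(self):
        """Read one frame from the device and deliver it to all clients."""
        frame = self._read_frame()
        for q in self._clients:
            q.put(frame)
        return frame

# Stand-in for the real device: an iterator over canned frames.
fake_device = iter([b"\x80\x81", b"\x90\x70"])
server = DspServer(lambda: next(fake_device))

note_recorder = server.subscribe()        # client storing the free audio
recogniser = server.subscribe()           # client listening for "stop"

server.pump()
server.pump()

first_for_recorder = note_recorder.get_nowait()
first_for_recogniser = recogniser.get_nowait()
```

Because only the server touches the device, the note recorder and the recogniser both see the same audio without competing for it.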