The training data is stored under 'C:\Users\bhemmat\AppData\Local\Temp\google_speech\train'; the datastore lists files such as ...\train\bed\00176480_nohash_0.wav, ...\train\bed\004ae714_nohash_0.wav, and ...\train\bed\004ae714_nohash_1.wav.

Specify the words that you want your model to recognize as commands. Label all words that are not commands as unknown. Labeling words that are not commands as unknown creates a group of words that approximates the distribution of all words other than the commands. The network uses this group to learn the difference between commands and all other words. To reduce the class imbalance between the known and unknown words and speed up processing, include only a fraction of the unknown words in the training set. Use subset (Audio Toolbox) to create a datastore that contains only the commands and the subset of unknown words. Then count the number of examples belonging to each category.
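A minimal sketch of this step, assuming ads is an audioDatastore over the training folder whose Labels property holds the word spoken in each file; the command list and the 20% inclusion fraction are illustrative values, not taken from this excerpt:

commands = categorical(["yes","no","up","down","left","right","on","off","stop","go"]);

isCommand = ismember(ads.Labels,commands);   % files whose label is a command
isUnknown = ~isCommand;                      % every other word is a candidate "unknown"

% Keep only a random fraction of the unknown words to limit class imbalance.
includeFraction = 0.2;                       % illustrative value
idx = find(isUnknown);
idx = idx(randperm(numel(idx),round(includeFraction*numel(idx))));
isUnknown(:) = false;
isUnknown(idx) = true;

ads.Labels(isUnknown) = categorical("unknown");   % relabel the retained non-command words

adsTrain = subset(ads,isCommand|isUnknown);       % commands plus the sampled unknown words
countEachLabel(adsTrain)                          % number of examples per category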
Create a simple network architecture as an array of layers. Use convolutional and batch normalization layers, and downsample the feature maps "spatially" (that is, in time and frequency) using max pooling layers. Add a final max pooling layer that pools the input feature map globally over time. This enforces (approximate) time-translation invariance in the input spectrograms, allowing the network to perform the same classification independent of the exact position of the speech in time. Global pooling also significantly reduces the number of parameters in the final fully connected layer. To reduce the possibility of the network memorizing specific features of the training data, add a small amount of dropout to the input to the last fully connected layer.

The network is small, as it has only five convolutional layers with few filters. numF controls the number of filters in the convolutional layers, for example:

convolution2dLayer(3,numF,Padding="same")

To increase the accuracy of the network, try increasing the network depth by adding identical blocks of convolutional, batch normalization, and ReLU layers. You can also try increasing the number of convolutional filters by increasing numF.

To give each class equal total weight in the loss, use class weights that are inversely proportional to the number of training examples in each class, normalized by their mean:

classWeights = classWeights'/mean(classWeights)

When using the Adam optimizer to train the network, the training algorithm is independent of the overall normalization of the class weights.
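As a rough sketch (not the example's exact code), a layer array matching this description could look like the following, shown with three rather than five convolutional layers for brevity. The input size [numHops numBands], the variables classes, numClasses, classWeights, and timePoolSize (the number of time frames remaining before the global pooling layer) are assumed to be defined earlier.

numF = 12;          % filters in the first convolutional layer (illustrative value)
dropoutProb = 0.2;  % dropout before the last fully connected layer (illustrative value)

layers = [
    imageInputLayer([numHops numBands])            % one auditory spectrogram per utterance

    convolution2dLayer(3,numF,Padding="same")      % conv / batch norm / ReLU block
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,Stride=2,Padding="same")   % downsample in time and frequency

    convolution2dLayer(3,2*numF,Padding="same")
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,Stride=2,Padding="same")

    convolution2dLayer(3,4*numF,Padding="same")
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer([timePoolSize 1])            % pool globally over the remaining time frames
    dropoutLayer(dropoutProb)                      % regularize the final fully connected layer

    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer(Classes=classes,ClassWeights=classWeights)];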
Finally, run the trained network on streaming audio from a microphone. The loop below reads audio from the device, extracts auditory features, classifies each new spectrogram, and declares a detection only when the recent predictions are consistent and confident; the objects and thresholds it relies on are sketched after the loop.

while toc < timeLimit

    % Extract audio samples from the audio device and add the samples to
    % the buffer.
    x = adr();
    write(audioBuffer,x);
    y = read(audioBuffer,fs,fs-adr.SamplesPerFrame);

    spec = helperExtractAuditoryFeatures(y,fs);

    % Classify the current spectrogram, save the label to the label buffer,
    % and save the predicted probabilities to the probability buffer.
    [YPredicted,probs] = classify(trainedNet,spec,ExecutionEnvironment="cpu");
    YBuffer = [YBuffer(2:end); YPredicted];
    probBuffer = [probBuffer(:,2:end) probs(:)];
    % Plot the current waveform and spectrogram.
    wavePlotter(y(end-adr.SamplesPerFrame+1:end))
    colorLimits = [min(spec,[],"all") max(spec,[],"all")];  % color scale for the spectrogram display (assumed form)

    % Now do the actual command detection by performing a thresholding operation.
    % Declare a detection and display it in the figure if the following hold:
    % 1) The most common label is not background.
    % 2) At least countThreshold of the latest frame labels agree.
    % 3) The maximum probability of the predicted label is at least probThreshold.
    [YMode,count] = mode(YBuffer);
    maxProb = max(probBuffer(labels == YMode,:));

    if YMode == "background" || count < countThreshold || maxProb < probThreshold
        % Do not declare a detection.
    else
        % Declare a detection and display it in the figure.
    end

    currentTime = currentTime + adr.SamplesPerFrame/fs;
end
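The loop refers to several objects and parameters that are configured before it runs. A minimal sketch of that setup, under assumed values for the frame rate, buffer lengths, thresholds, and time limit (none of these numbers come from this excerpt), might be:

fs = 16e3;                               % sample rate the network was trained at (assumed)
classificationRate = 20;                 % classifications per second (assumed)

adr = audioDeviceReader(SampleRate=fs,SamplesPerFrame=floor(fs/classificationRate));
audioBuffer = dsp.AsyncBuffer(fs);       % holds the most recent second of audio

labels = trainedNet.Layers(end).Classes;                              % class names in output order
YBuffer = repmat(categorical("background"),classificationRate/2,1);   % recent frame labels
probBuffer = zeros(numel(labels),classificationRate/2);               % recent class probabilities

countThreshold = ceil(classificationRate*0.2);   % frames that must agree (assumed)
probThreshold = 0.7;                             % minimum winning probability (assumed)

currentTime = 0;                         % stream time shown in the waveform plot
timeLimit = 20;                          % run the live demo for 20 seconds (assumed)
tic

The plotting objects are omitted here; a timescope could serve as wavePlotter, and a dsp.MatrixViewer could display the spectrogram using colorLimits.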