Sunday, August 30, 2015

Summary

This post is a summary of my work for GSoC 2015, which includes the following subtasks:

  1. Conceptor Python Module 
  2. Speaker Recognition
  3. Gender Identification
  4. Emotion Detection
  5. Tone Characterisation
  6. Speaker Recognition Program for RedHen Pipeline

All source code can be found in my GitHub repository.


Conceptor Python Module 

Theory

Detailed documentation of the conceptor theory can be found in this technical report by Prof. Herbert Jäger. The basic computations are based on Section 4, and the recognition functions are based on Section 3.12.

Implementation

You can find the module in the folder called conceptor. Detailed documentation of each file can be found in my previous posts: basic module and recognition.

Test

This module is tested in the following IPython notebooks: basic computations and classification. To run the classification notebook, please download the training and testing data used: ae.train and ae.test.

Speaker Recognition

Theory

Gaussian Mixture Models; some classic papers can be found here:
http://web.cs.swarthmore.edu/~turnbull/cs97/f09/paper/reynolds00.pdf
http://www.cs.toronto.edu/~frank/csc401/readings/ReynoldsRose.pdf

Implementation

silence.py
An energy-based voice activity detection and silence removal function.

skgmm.py
A set of GMMs provided by SciKit-Learn.

GmmSpeakerRec.py
A speaker recognition interface, which includes the following functions:
enroll(): enroll new training data
train(): train a GMM for each class
recognize(): read an audio signal and output the recognition results
dump(): save a trained model
load(): load an existing model
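A minimal usage sketch of this interface (the class name, argument formats, and file names below are my assumptions based on the function names above; the actual code in GmmSpeakerRec.py may differ slightly):

     import librosa
     from GmmSpeakerRec import GMMRec   # hypothetical class name

     recognizer = GMMRec()

     # enroll training data for two speakers (file names are placeholders)
     for name, path in [("Obama", "obama_train.wav"), ("Simon", "simon_train.wav")]:
         signal, sr = librosa.load(path, sr=None)
         recognizer.enroll(name, signal, sr)

     recognizer.train()                     # fit one GMM per enrolled speaker
     recognizer.dump("speakers.model")      # save the trained model

     test_signal, sr = librosa.load("unknown.wav", sr=None)
     print(recognizer.recognize(test_signal, sr))   # predicted speaker label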

Test

The usage and performance of this interface are demonstrated in the following IPython notebook:
Obama: a speaker recogniser with 7 minutes of training data for Obama and 40 seconds for David Simon

Gender Identification

Theory

Exactly the same as Speaker Recognition, except that the training voice for each class is a concatenation of many voices of the same gender.
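For example, the per-gender training signals could be built roughly like this (a sketch; the file names and the 16 kHz sampling rate are placeholders, not the actual data I used):

     import numpy as np
     import librosa

     # concatenate several speakers of the same gender into one training signal
     female_files = ["f_speaker1.wav", "f_speaker2.wav", "f_speaker3.wav"]
     male_files = ["m_speaker1.wav", "m_speaker2.wav", "m_speaker3.wav"]

     female_train = np.concatenate([librosa.load(f, sr=16000)[0] for f in female_files])
     male_train = np.concatenate([librosa.load(f, sr=16000)[0] for f in male_files])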

Implementation

See the implementation part of "Speaker Recognition".

Test

The usage and performance of this interface are demonstrated in the following IPython notebook:
Gender: a gender identifier with about 5 minutes of training signal for each gender



Emotion Detection

Theory



This method was proposed by Microsoft Research at last year's Interspeech conference; it uses a deep neural network (DNN) together with an extreme learning machine (ELM).

Implementation


For details, please refer to my last post.

energy.py
Takes a speech signal and returns the indices of frames with top 10% energy.

Given two audio folders (training and validation, see "folder structure" for the structures of these folders), extracts the segment-level features from audio files in these folders for DNN training.

Given one (testing) audio folder, extracts the segment-level features from audio files in the folder for DNN feature extraction.

Trains the ELM with the probability features extracted by the DNN.

Annotates the recognition results for the test files and writes them into Results.txt.

Test

Recognition results were obtained on one section of the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, available from here.


Tone Characterisation

The same as Emotion Detection, except that the training data for each class should be a collection of utterances with the same tone.


Speaker Recognition Program for RedHen Pipeline


The pipeline version has the following updated features:

  • Shifted from Python 3 to Python 2
  • Replaced the GMM from SciKit-Learn with the GMM from PyCASP, so that training is much faster.
  • Added functions to recognize features directly, so that it is ready for the shared features from the pipeline.
  • Returns the log likelihood of each prediction, so that one can reject untrained classes and filter out unreliable prediction results (see the sketch below). You can also use it to search for speakers by looking for predicted speakers with high likelihood.
  • Karan's speaker diarization results are now incorporated.
  • Output file has a format consistent with other RedHen output files.
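As a sketch of how the returned log likelihood can be used for rejection (the return format and the threshold value below are my assumptions for illustration, not the exact pipeline code):

     def accept_prediction(label, log_likelihood, threshold=-50.0):
         # reject predictions whose log likelihood is too low to trust;
         # the threshold is illustrative and should be tuned on held-out data
         return label if log_likelihood >= threshold else "Unknown"

     # hypothetical usage, assuming recognize() returns (label, log_likelihood):
     # label, ll = recognizer.recognize(features)
     # print(accept_prediction(label, ll))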

Implementation

Python Speaker Identification Module written for the RedHen Audio Analysis Pipeline

A pipeline program that makes use of the speaker ID module and the speaker diarisation results to output a .spk file whose format is consistent with other RedHen output files.

Test

Here you can find an example output file produced on 2015-08-07_0050_US_FOX-News_US_Presidential_Politics.

Sunday, July 19, 2015

Emotion Recognition and Tone Characterization by DNN + ELM

Let's face it, emotions are tricky!

Method

After struggling with emotion recognition for a long time, I decided to implement the method proposed by Microsoft Research at last year's Interspeech conference, an approach using a deep neural network (DNN) and an extreme learning machine (ELM). In particular, a DNN is trained to extract segment-level (256 ms) features, and an ELM is trained to make decisions based on the statistics of these features at the utterance level.
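To give an idea of the ELM stage, here is a minimal sketch (not the actual implementation from the paper or my code): utterance-level statistics of the DNN's segment-level class probabilities are passed through a fixed random hidden layer, and only the output weights are solved for in closed form.

     import numpy as np

     def utterance_features(seg_probs):
         # seg_probs: (n_segments, n_classes) probabilities from the DNN for one utterance
         return np.concatenate([seg_probs.max(0), seg_probs.min(0), seg_probs.mean(0)])

     def train_elm(X, Y, n_hidden=120, reg=1e-3):
         # X: (n_utterances, n_features), Y: one-hot labels (n_utterances, n_classes)
         W_in = np.random.randn(X.shape[1], n_hidden)          # fixed random input weights
         H = np.tanh(np.dot(X, W_in))                          # hidden-layer activations
         W_out = np.linalg.solve(np.dot(H.T, H) + reg * np.eye(n_hidden), np.dot(H.T, Y))
         return W_in, W_out

     def predict_elm(x, W_in, W_out):
         return int(np.argmax(np.dot(np.tanh(np.dot(x, W_in)), W_out)))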

The training was done using the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database from here.

Experimental results in the paper demonstrated that this approach outperforms the state-of-the-art, including OpenEAR.

Here you can also find a video presentation by the author Kun Han himself.

My implementation

All code can be found here.

Dependency:

Python Speech Features library: used to extract MFCC features from audio files; it can be replaced by the common feature extraction component once the Red Hen Lab Audio Analysis pipeline is established. To run the current code, however, this library must be included in the working folder.

PDNN: A Python Toolkit for Deep Learning, which is used to extract segment-level features with a deep neural network.

Python files:

energy.py
Takes a speech signal and returns the indices of frames with the top 10% energy (see the sketch after this file list).

Given two audio folders (training and validation, see "folder structure" for the structures of these folders), extracts the segment-level features from audio files in these folders for DNN training.

Given one (testing) audio folder, extracts the segment-level features from audio files in the folder for DNN feature extraction.

Trains the ELM with the probability features extracted by the DNN.

Annotates the recognition results for the test files and writes them into Results.txt.
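As an illustration of the energy.py step above, selecting the top-energy frames could look roughly like this (a sketch; the frame length and hop size are assumed values, not necessarily those used in the actual file):

     import numpy as np

     def top_energy_frames(signal, frame_len=400, hop=160, top_fraction=0.1):
         # split the signal into overlapping frames and compute per-frame energy
         n_frames = 1 + (len(signal) - frame_len) // hop
         frames = np.array([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
         energy = np.sum(frames.astype(float) ** 2, axis=1)
         # return the indices of the frames with the top 10% energy
         n_top = max(1, int(top_fraction * n_frames))
         return np.argsort(energy)[-n_top:]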

Folder Structure:

wav_train (training folder):
has as many subfolders as there are emotions (or tones) to be recognized; each subfolder corresponds to one emotion (or tone) and contains the utterances for that class, one utterance per .wav audio file.

wav_valid (validation folder):
has as many subfolders as there are emotions (or tones) to be recognized; each subfolder corresponds to one emotion (or tone) and contains the utterances for that class, one utterance per .wav audio file.

wav_test (testing folder):
contains one utterance per .wav audio file. These files will be annotated, and the results are written to Results.txt.
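As an example, with two hypothetical emotion classes ("happy" and "sad"), the folders would look like this:

     wav_train/
         happy/
             utt001.wav
             utt002.wav
         sad/
             utt001.wav
             utt002.wav
     wav_valid/
         happy/
             utt101.wav
         sad/
             utt102.wav
     wav_test/
         utt201.wav
         utt202.wav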

Shell Scripts:

train.sh
Extracts features and trains the DNN and ELM. You can specify the path to PDNN, the training folder, and the validation folder in this file.

test.sh
Extracts features of the test files and recognizes the emotions of the utterance files in the testing folder. You can specify the path to PDNN and the testing folder here.

Intermediate Files:

dnn.param
contains the parameters of the trained deep neural network

dnn.cfg
contains the configurations of the trained deep neural network

ELMWeights.pickle.gz
contains weight parameters of the extreme learning machine

LabelNumMap.pickle.gz
contains the mapping between the emotion string labels in training/validation folders and integer labels during DNN training.

Output File:

Results.txt
the recognition results

Usage Example:

To train a new emotion recognizer (after specifying the training and validation data):
./train.sh

To predict emotions on new audio files with trained model: 
./test.sh

Monday, June 22, 2015

Week 2 and Week 4: Gender Identification and Speaker recognition

In this post, I am going to explain my work on gender identification and speaker recognition:

Toolkits used:

Librosa: A Python package for Music and Audio Analysis

SciKit-Learn: Machine Learning in Python

I use Librosa to load audio files and extract features from audio signals. I choose it for now because it is a light-weight open-source library with a nice Python interface and IPython functionality; it can also be integrated with SciKit-Learn to form a feature extraction pipeline for machine learning. This is enough for moderately complex tasks such as speaker recognition.

SciKit-Learn is used for training a UBM/GMM on MFCC features.

Data preprocessing:

I trained a UBM with 32 Gaussian components on a dataset of standardised MFCC vectors extracted from speech signals by multiple female and male speakers.

For every standardised MFCC vector, its probability under each Gaussian component is evaluated, and these probabilities are put together as a feature vector for conceptor classification. The reason for this is to refine the subspace using the Gaussian components; moreover, the probabilities are already in the range [0, 1], so there is no need for normalisation. These feature vectors are then fed into the generic Conceptor recognition framework.

This method makes a decision for every MFCC vector (one every 512 ms), and an example result (of gender detection on a short female/male conversation audio, where 0 indicates female and 1 indicates male) is shown in the plot below:

[Plot: frame-by-frame gender decisions over time]

This output is noisy and will not be very useful in practice, as we usually want recognition decisions over longer periods (multiple seconds) and with less noise. A simple frequency count does not solve this problem, since the noisy decisions will often overwhelm the correct ones. One way to cope with this is to use mid-term statistics (mean, std, median, min, max) of the short-term features, as I did in the demo code for Gender Identification I submitted before. This works, but it is not the best way, since much information is lost during the statistics computation.
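A sketch of the mid-term statistics idea (the window size, MFCC settings, and file name here are illustrative, not the exact values from the demo code):

     import numpy as np
     import librosa

     signal, sr = librosa.load("conversation.wav", sr=None)   # placeholder file name
     mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # shape: (13, n_frames)

     window = 100   # number of short-term frames per mid-term segment
     mid_term = []
     for start in range(0, mfcc.shape[1] - window + 1, window):
         seg = mfcc[:, start:start + window]
         mid_term.append(np.concatenate([seg.mean(axis=1), seg.std(axis=1),
                                         np.median(seg, axis=1),
                                         seg.min(axis=1), seg.max(axis=1)]))
     mid_term = np.array(mid_term)   # one 65-dimensional vector per mid-term segment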

In the next step, I will try recognition on spectrogram segments and use convolutional neural networks (CNNs) to extract features from these segments and feed them to Conceptors.

To catch up with my planned schedule and provide a working solution for now, I implemented a GMM Speaker Recognition system with state-of-the-art performance. This system consists of the following parts:

silence.py
An energy-based voice activity detection and silence removal function.

skgmm.py
A set of GMMs provided by SciKit-Learn.

GmmSpeakerRec.py
A speaker recognition interface, which includes the following functions:
enroll(): enroll new training data
train(): train a GMM for each class
recognize(): read an audio signal and output the recognition results
dump(): save a trained model
load(): load an existing model

The usage and performance of this interface are demonstrated in the following two IPython notebooks:
Gender: a gender identifier with about 5 minutes of training signal for each gender
Obama: a speaker recogniser with 7 minutes of training data for Obama and 40 seconds for David Simon



Sunday, June 14, 2015

Week 3: Generic Conceptor Framework for Speaker Recognition

Based on the description in "Section 3.12 Example: Dynamical Pattern Recognition" of the tech report, a generic framework for pattern recognition has been implemented and added to the Python module.

Edit: 

All conceptor-based functions related to recognition tasks are now put into conceptor.recognition of the Python module.

A usage example:
   
     import conceptor.recognition as recog

     new_recognizer = recog.Recognizer()

     new_recognizer.train(training_data)

     results = new_recognizer.predict(test_data)

Here, training_data is a list of feature_size * sample_size numpy arrays, with each array corresponding to the training dataset of one class; test_data is a feature_size * sample_size numpy array to be recognized; results is a sample_size-dimensional vector whose elements are integer class indices.
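For instance, a toy setup of the expected data shapes (random data, purely illustrative):

     import numpy as np

     feature_size, sample_size = 64, 500
     # three classes, one (feature_size x sample_size) array of training samples per class
     training_data = [np.random.randn(feature_size, sample_size) for _ in range(3)]
     # test data: another (feature_size x n_test_samples) array
     test_data = np.random.randn(feature_size, 200)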

This framework reproduces the results shown in the tech report:
http://nbviewer.ipython.org/github/littleowen/Conceptor/blob/master/ClassifyTest.ipynb

Monday, June 1, 2015

Week 1: Python Module for Conceptors

A Python module for conceptor computation has been implemented based on Section 4 of the technical report and this GitHub repository.

The module consists of the following files:

reservoir:
sets up the reservoir network, drives the reservoir with dynamic patterns, trains output weights to read out the original pattern signals, trains internal weights to reconstruct the original reservoir dynamics, computes the correlation matrices from reservoir states, and computes conceptor matrices from the correlation matrices (see the sketch after this list).

logic:
applies logic operations on conceptors, in particular the AND, OR, NOT, and PHI functions from the original MATLAB implementation.

util:
useful utility functions that are used repeatedly within the module, for example, randomly initialising the weights of a reservoir network.
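As an illustration of the core computation in the reservoir file, a conceptor matrix can be obtained from collected reservoir states roughly as follows. This is a sketch based on the formula C = R (R + aperture^-2 I)^-1 from the technical report; the function and variable names are mine, not the module's:

     import numpy as np

     def compute_conceptor(states, aperture=10.0):
         # states: (reservoir_size x sample_size) array of reservoir states driven by one pattern
         R = np.dot(states, states.T) / states.shape[1]   # state correlation matrix
         n = R.shape[0]
         # C = R (R + aperture^-2 I)^-1
         return np.dot(R, np.linalg.inv(R + aperture ** -2 * np.eye(n)))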

An IPython notebook script was also written to test the above-mentioned module; the results match those in the technical report and can be viewed here: http://nbviewer.ipython.org/github/littleowen/Conceptor/blob/master/ConceptorTest.ipynb