Sunday, August 30, 2015

Summary

This post is a summary of my work for GSoC 2015, which includes the following subtasks:

  1. Conceptor Python Module 
  2. Speaker Recognition
  3. Gender Identification
  4. Emotion Detection
  5. Tone Characterisation
  6. Speaker Recognition Program for RedHen Pipeline

All source code can be found in my GitHub repository.


Conceptor Python Module 

Theory

A detailed exposition of conceptor theory can be found in this technical report by Prof. Herbert Jäger. The basic computations are based on Section 4, and the recognition functions are based on Section 3.12.

Implementation

You can find the module in the folder called conceptor. Detailed documentation of each file can be found from my previous posts:  basic module and recognition.

Test

This module is tested in the following IPython notebooks: basic computations and classification. To run the classification notebook, please download the training and testing data used: ae.train and ae.test.

Speaker Recognition

Theory

The recogniser is based on Gaussian Mixture Models (GMMs); some classic papers can be found here:
http://web.cs.swarthmore.edu/~turnbull/cs97/f09/paper/reynolds00.pdf
http://www.cs.toronto.edu/~frank/csc401/readings/ReynoldsRose.pdf

Implementation

silence.py
An energy-based voice activity detection and silence removal function.
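A minimal sketch of this kind of energy-based silence removal (the frame size and threshold ratio here are illustrative assumptions, not the actual parameters used in silence.py):

```python
import numpy as np

def remove_silence(signal, rate, frame_ms=20, threshold_ratio=0.1):
    """Drop frames whose energy falls below a fraction of the mean frame energy."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    # split the signal into fixed-length, non-overlapping frames
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    # per-frame energy = sum of squared samples
    energies = np.sum(frames.astype(float) ** 2, axis=1)
    # keep only frames above the energy threshold
    keep = energies > threshold_ratio * np.mean(energies)
    return frames[keep].flatten()
```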

skgmm.py
A set of GMMs provided by scikit-learn.

GmmSpeakerRec.py
A speaker recognition interface, which includes the following functions:

  • enroll(): enroll new training data
  • train(): train a GMM for each class
  • recognize(): read an audio signal and output the recognition results
  • dump(): save a trained model
  • load(): load an existing model
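A minimal sketch of the enroll/train/recognize workflow using scikit-learn's GaussianMixture on pre-extracted feature vectors (the real module additionally handles audio I/O, MFCC extraction, and model persistence via dump/load; class names and parameters below are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

class GMMRec:
    def __init__(self, n_components=4):
        self.n_components = n_components
        self.features = {}  # class name -> list of feature arrays
        self.models = {}    # class name -> trained GaussianMixture

    def enroll(self, name, feats):
        # accumulate training features for one class
        self.features.setdefault(name, []).append(feats)

    def train(self):
        # fit one GMM per enrolled class
        for name, chunks in self.features.items():
            gmm = GaussianMixture(n_components=self.n_components,
                                  covariance_type='diag')
            gmm.fit(np.vstack(chunks))
            self.models[name] = gmm

    def recognize(self, feats):
        # pick the class whose GMM gives the highest average log-likelihood
        scores = {name: gmm.score(feats) for name, gmm in self.models.items()}
        return max(scores, key=scores.get)
```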

Test

The usage and performance of this interface is demonstrated in the following IPython notebooks:
Obama: a speaker recogniser trained with 7 minutes of speech from Obama and 40 seconds from David Simon

Gender Identification

Theory

Exactly the same as speaker recognition, except that the training signal for each class is a concatenation of many voices of the same gender.

Implementation

See the implementation part of "Speaker Recognition".

Test

The usage and performance of this interface is demonstrated in the following IPython notebooks:
Gender: a gender identifier trained with about 5 minutes of speech for each gender



Emotion Detection

Theory

This method was proposed by Microsoft Research at the Interspeech 2014 conference: an approach that combines a deep neural network (DNN) with an extreme learning machine (ELM).

Implementation


For details, please refer to my last post.

energy.py
Takes a speech signal and returns the indices of frames with the top 10% energy.
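The frame selection can be sketched as follows (the frame length and default fraction are illustrative assumptions, not energy.py's actual parameters):

```python
import numpy as np

def top_energy_frames(signal, frame_len=256, fraction=0.1):
    """Return the (sorted) indices of the frames in the top `fraction` by energy."""
    n = len(signal) // frame_len
    frames = np.reshape(signal[:n * frame_len], (n, frame_len))
    # per-frame energy = sum of squared samples
    energies = np.sum(frames.astype(float) ** 2, axis=1)
    k = max(1, int(n * fraction))
    # indices of the k highest-energy frames, in ascending order
    return np.sort(np.argsort(energies)[-k:])
```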

Given two audio folders (training and validation, see "folder structure" for the structures of these folders), extracts the segment-level features from audio files in these folders for DNN training.

Given one (testing) audio folder, extracts the segment-level features from audio files in the folder for DNN feature extraction.

Trains an ELM with the probability features extracted by the DNN.

Annotates the recognition results of the test files into Results.txt.

Test

The recognition results on one section of the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database can be found here.


Tone Characterisation

The same as emotion detection, except that the training data for each class should be a collection of utterances with the same tone.


Speaker Recognition Program for RedHen Pipeline


The pipeline version has the following updated features:

  • Shifted from Python3 to Python2
  • Replaced the scikit-learn GMM with the GMM from PyCASP, making training much faster!
  • Added functions to recognize features directly, so that it is ready for the shared features from the pipeline.
  • Returns the log likelihood of each prediction so that one can make rejections on untrained classes and filter out unreliable prediction results. You can also use it to search for speakers, by looking for predicted speakers with high likelihood.
  • Karan's speaker diarization results are now incorporated.
  • Output file has a format consistent with other RedHen output files.
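The likelihood-based rejection described above can be sketched as follows (the threshold value and model dictionary are illustrative assumptions; in practice the threshold would be tuned on held-out data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def predict_with_rejection(models, feats, threshold=-5.0):
    """Score feats against every trained GMM; reject (return None) if even
    the best average log-likelihood falls below the threshold."""
    scores = {name: gmm.score(feats) for name, gmm in models.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return None, scores[best]  # untrained class or unreliable prediction
    return best, scores[best]
```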

Implementation

Python Speaker Identification Module written for the RedHen Audio Analysis Pipeline

A pipeline program that makes use of the speaker ID module and the speaker diarisation results to output a .spk file whose format is consistent with other RedHen output files.

Test

Here you can find an example output file produced on 2015-08-07_0050_US_FOX-News_US_Presidential_Politics.