Giving Robots an Infant’s Sense of Hearing
Recognizing human emotions by extracting features from data such as audio and video has always been a fascinating task. For the first part of my three-part mini series on AI & robotics, I am demonstrating an experimental Speech Emotion Recognition (SER) project to explore its potential.
Before we walk through the implementation, here’s a quick overview of the project:
Project Description: Using a Multi-Layer Perceptron (MLP) classifier to recognize emotion from features extracted from audio recordings.
Data Set: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
- Includes recordings from both actors and actresses
- Contains data for eight emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised
Let’s divide this project into three parts:
- Feature Extraction: extracting features from audio files
- MLP Classifier: building, training, and testing the model
- Live Recognition: detecting emotions using speech features on the go through a microphone
Feature Extraction
To extract features from audio files, we use the Librosa library. Specifically, we’ll be using the MFCC and mel spectrogram functions.
MFCCs (Mel-Frequency Cepstral Coefficients) are derived from a type of cepstral representation of the audio clip (a nonlinear “spectrum-of-a-spectrum”).
The mel scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another.
To learn more about these features and the library, visit this link. If you want a thorough understanding of MFCCs, here is a great tutorial for you.
The following function allows us to extract the features using Librosa’s extraction methods.
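Here’s a minimal sketch of such a function, assuming Librosa and NumPy; the choice of 40 MFCCs and a time-averaged mel spectrogram is an illustrative assumption, not necessarily what the original project uses.

```python
import librosa
import numpy as np

def extract_features(file_name):
    """Extract MFCC and mel spectrogram features from an audio file."""
    # Load the clip at its native sampling rate
    audio, sample_rate = librosa.load(file_name, sr=None)

    # 40 MFCCs, averaged across time frames
    mfccs = np.mean(librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40).T, axis=0)

    # Mel spectrogram, also averaged across time frames
    mel = np.mean(librosa.feature.melspectrogram(y=audio, sr=sample_rate).T, axis=0)

    # Concatenate into a single feature vector per clip
    return np.hstack((mfccs, mel))
```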
MLP Classifier
To prepare the data for our classifier, we read in each file from the dataset, extract its features and append them to X, and parse the emotion label from the filename and append it to y. We can use a code sample like the one below:
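Here’s a minimal sketch of that loading step; the ravdess_data/ folder is a placeholder path for wherever the dataset is unpacked, and the 75/25 split is an assumption for illustration. RAVDESS encodes the emotion as the third field of each filename.

```python
import glob
import os
from sklearn.model_selection import train_test_split

# RAVDESS emotion codes (third field of each filename)
emotions = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

X, y = [], []
# "ravdess_data" is a placeholder for the dataset location
for file in glob.glob("ravdess_data/Actor_*/*.wav"):
    emotion_code = os.path.basename(file).split("-")[2]
    X.append(extract_features(file))
    y.append(emotions[emotion_code])

# Hold out a portion of the data for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=9)
```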
If you are running this code for the first time, or the model needs to be retrained, a training step like the one below should be used. The classifier has a constant learning rate and uses a logistic function for activation.
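Here’s a minimal sketch of that training step, assuming scikit-learn’s MLPClassifier; the hidden layer size, batch size, and other hyperparameters are illustrative assumptions rather than the project’s exact settings.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import pickle

# MLP with one hidden layer, logistic activation, and a constant learning rate
model = MLPClassifier(
    hidden_layer_sizes=(300,),
    activation="logistic",
    learning_rate="constant",
    alpha=0.01,
    batch_size=256,
    max_iter=500,
)
model.fit(x_train, y_train)

# Evaluate on the held-out test set
print("Accuracy:", accuracy_score(y_test, model.predict(x_test)))

# Save the trained model so it can be reloaded later without retraining
with open("ser_model.pkl", "wb") as f:
    pickle.dump(model, f)
```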
Live Recognition
To detect emotions on the go, we can use a microphone and this function:
The function helps us record audio for a certain duration.
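Below is a minimal sketch of such a recording function, assuming the sounddevice library (the original project may use a different audio backend); the 44.1 kHz sampling rate and mono channel are illustrative choices.

```python
import sounddevice as sd
import numpy as np

def record_audio(duration, sample_rate=44100):
    """Record mono audio from the default microphone for `duration` seconds."""
    recording = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1)
    sd.wait()  # block until the recording is finished
    return np.squeeze(recording), sample_rate
```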
Finally, we can use the built model to predict emotion from the features of the audio recording.
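Here’s a minimal sketch of that prediction step; extracting features directly from the in-memory recording (rather than from a saved WAV file) is an assumption made for brevity.

```python
# Record a short clip and compute the same features used during training
audio, sample_rate = record_audio(duration=4)
mfccs = np.mean(librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40).T, axis=0)
mel = np.mean(librosa.feature.melspectrogram(y=audio, sr=sample_rate).T, axis=0)
features = np.hstack((mfccs, mel)).reshape(1, -1)  # single-sample batch

# Predict the emotion with the trained MLP
print("Detected emotion:", model.predict(features)[0])
```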
Conclusion
Just as an infant’s sense of hearing lets it pick up on moods from the features of your voice, this model can be used by robots to recognize our emotions while we’re talking to them.
This code was tested on Shelbot, a Raspberry Pi 4 powered humanoid (one of my ongoing projects).
Full code for this project can be found here.
The next part of this mini series will focus on developing another of an infant’s five senses.
About Me
Laksh Bhambhani is a Student Ambassador in the Inspirit AI Student Ambassadors Program. Inspirit AI is a pre-collegiate enrichment program that exposes curious high school students globally to AI through live online classes. Learn more at https://www.inspiritai.com/.