27/07/15 Perceiving and Predicting Expressive Rhythm with Recurrent Neural Networks

— Andrew J. Lambert, Tillman Weyde and Newton Armstrong

Electronic technology has liberated musical time and changed musical aesthetics. In the past, musical time was considered as a linear medium that was subdivided according to ratios and intervals of a more-or-less steady meter. However, the possibilities of envelope control and the creation of liquid or cloud-like sound morphologies suggests a view of rhythm not as a fixed set of intervals on a time grid, but rather as a continuously flowing, undulating, and malleable temporal substrate upon which events can be scattered, sprinkled, sprayed, or stirred at will. In this view, composition is not a matter of filling or dividing time, but rather of generating time.
— Curtis Roads, 2014

Introduction

When we listen to or perform music, there is one fundamental organising principle which must be obeyed: time. Time in music is often thought of in terms of two related concepts: the ‘pulse’ and the ‘metre’ of the music. The pulse is what we latch on to when we listen to music; it is the periodic rhythm that we can tap our feet to. In fact, the pulse is only one level in a hierarchical structure of time periods which is collectively known as the metre. Lower levels divide the pulse into smaller periods, while higher levels extend the pulse into bars, phrases and even higher-order forms.
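As a toy illustration of this hierarchy (using a hypothetical 120 BPM pulse, not a value from our experiments), the period of each metrical level can be computed as an integer ratio of the pulse period:

```python
# Metrical levels as integer ratios of the pulse period (illustrative only).
pulse_bpm = 120              # hypothetical pulse (crotchet) tempo
pulse_s = 60.0 / pulse_bpm   # pulse period in seconds

# Ratios relative to the pulse: subdivisions below it, groupings above it.
levels = {
    "semiquaver (1/4 pulse)": 0.25,
    "quaver (1/2 pulse)":     0.5,
    "crotchet (pulse)":       1.0,
    "bar of 4/4 (4 pulses)":  4.0,
    "4-bar phrase":           16.0,
}

for name, ratio in levels.items():
    print(f"{name}: {ratio * pulse_s:.3f} s")
```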

A metrical hierarchy

Metrical levels marked with Lerdahl and Jackendoff's ‘dot notation’. The pulse level in this score would be at the crotchet (quarter note) level.

This gives the impression that rhythm is all about dividing or combining periods, perfectly filling time with rhythmic events. In performance, however, this is rarely the case. Humans are not perfect time-keepers and will always stray from where an event ‘should’ fall. These deviations are even expected when we listen to a performance; if a performance is too well-timed it is often perceived as robotic, lacking expression and dynamics.
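These deviations can be pictured as small perturbations of a quantised onset grid. The sketch below adds Gaussian micro-timing to one bar of crotchets; the 10 ms standard deviation is an illustrative value, not a figure drawn from our data:

```python
import random

random.seed(0)

# Quantised onset times (s) for one 4/4 bar of crotchets at 120 BPM.
grid = [0.0, 0.5, 1.0, 1.5]

# Model expressive deviation as a small Gaussian offset per onset
# (sigma of ~10 ms is a hypothetical, illustrative value).
sigma = 0.010
performed = [t + random.gauss(0.0, sigma) for t in grid]
deviations_ms = [(p - t) * 1000 for p, t in zip(performed, grid)]
```

A perfectly ‘robotic’ rendition is the special case where every deviation is zero; real performances scatter around the grid.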

What Roads alludes to in the quote above is that the perception of these imperfectly timed rhythmic events is what gives the listener a subjective experience of time. Roads considers only what he knows best, computer music, where one has direct control over the timing of events, but this view extends readily to every genre of music. As the performer expressively varies the temporal dynamics, waves of metrical dissonance and consonance are formed, affecting our perception of musical time and our expectation of rhythmic events.

Our research concerns this interplay of metric perception, expectational prediction, and rhythmic production with respect to expressive variations on musical timing.

Our Approach

We take a cognitive approach, utilising a neurologically inspired model of rhythm perception known as a Gradient Frequency Neural Network (GFNN). In a GFNN, a network of oscillators is distributed across a frequency spectrum. Internal connections between oscillators in the network can be learned via Hebbian learning. When stimulated by a signal, the GFNN resonates nonlinearly, producing larger amplitude responses at related frequencies along the spectrum. When the frequencies in a GFNN are distributed within a rhythmic range, resonances can occur at integer ratios to the pulse. These resonances can be interpreted as the perception of a hierarchical metrical structure.
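A minimal sketch of this resonance behaviour, assuming a bank of simple Hopf oscillators integrated with Euler's method and no internal (Hebbian) connections; the parameter values are illustrative, not those of our model:

```python
import numpy as np

fs = 200                     # integration rate (Hz)
dt = 1.0 / fs
# Oscillators log-spaced over a rhythmic frequency range (0.5-8 Hz).
freqs = np.logspace(np.log2(0.5), np.log2(8.0), 64, base=2)
omega = 2 * np.pi * freqs    # natural frequencies (rad/s)
alpha, beta = -1.0, -1.0     # damping and saturation (illustrative)

# Stimulus: an impulse every 0.5 s for 10 s, i.e. a steady 2 Hz pulse.
t = np.arange(0, 10, dt)
x = np.zeros_like(t)
x[:: fs // 2] = 1.0

z = np.full(len(freqs), 0.01 + 0j)   # oscillator states
amps = np.zeros((len(t), len(freqs)))
for i, xi in enumerate(x):
    # Hopf oscillator: dz/dt = z(alpha + i*omega + beta*|z|^2) + x(t)
    dz = z * (alpha + 1j * omega + beta * np.abs(z) ** 2) + xi
    z = z + dt * dz                  # Euler step
    amps[i] = np.abs(z)

# Average amplitude over the second half, once transients have decayed.
mean_amp = amps[len(t) // 2:].mean(axis=0)
```

Oscillators near the 2 Hz stimulus frequency (and its integer-ratio relatives) build up larger amplitudes than those at unrelated frequencies, which is the response we read off as the perception of metrical levels.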

GFNNs have shown promise even when dealing with more complex input, such as syncopated rhythms and polyrhythms. The oscillators' entrainment properties make them good candidates for solving the expressive timing problem and so the GFNN forms the basis of our perception layer.

In our system the GFNN is coupled with a Long Short-Term Memory Neural Network (LSTM), which is a type of recurrent neural network able to learn long-term dependencies in a time-series. The LSTM takes the role of prediction in our system. It reads the GFNN's resonances to make predictions about the expected rhythmic events in the piece.
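The prediction step can be sketched as a single LSTM cell reading the GFNN's amplitude envelope and emitting an onset probability per time step. The weights below are random stand-ins: this is an untrained illustration of the data flow, not our trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: gates computed from input x and previous state h."""
    z = W @ x + U @ h + b                  # stacked gate pre-activations
    i, f, o, g = np.split(z, 4)
    i, f, o = (1 / (1 + np.exp(-v)) for v in (i, f, o))  # sigmoid gates
    c = f * c + i * np.tanh(g)             # cell-state update
    h = o * np.tanh(c)                     # hidden-state output
    return h, c

n_osc, n_hidden = 64, 32                   # GFNN size, LSTM size (illustrative)
W = rng.normal(0, 0.1, (4 * n_hidden, n_osc))
U = rng.normal(0, 0.1, (4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)
w_out = rng.normal(0, 0.1, n_hidden)       # linear readout to an onset logit

h = c = np.zeros(n_hidden)
resonances = rng.random((100, n_osc))      # stand-in for GFNN amplitudes
preds = []
for x in resonances:
    h, c = lstm_step(x, h, c, W, U, b)
    preds.append(1 / (1 + np.exp(-w_out @ h)))  # onset probability
```

In the actual system the readout is trained against the target rhythm; the point here is only the flow from resonances to per-step rhythmic predictions.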

Once seeded with some initial values, the GFNN-LSTM can be used for production. That is, the generation of new expressive timing structures based on its own output and/or other music agents' output.
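Production then amounts to a closed loop in which the model's own predictions are appended to its input. With the trained predictor replaced by a trivial stub, the loop looks like this:

```python
def predict_next(history):
    """Hypothetical stand-in for the trained GFNN-LSTM predictor:
    here it simply repeats the last inter-onset interval."""
    return history[-1]

seed = [0.5, 0.5, 0.25]        # seed inter-onset intervals in seconds
generated = list(seed)
for _ in range(8):             # closed loop: output becomes the next input
    generated.append(predict_next(generated))
```

In the full system `predict_next` is the GFNN-LSTM itself, and the history may also include other musical agents' output.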

The GFNN-LSTM system architecture

An overview of our GFNN-LSTM system showing (A) audio input, (B) mid-level representation, (C) GFNN, (D) LSTM, and (E) rhythm prediction output. The variable ν can be a mean field function or full connectivity.

Results

Our results are promising and appear to be in line with state-of-the-art beat-tracking systems. Here we present some visual and audio examples of the system's output.

A total of 12 different network topologies were trained, varying system parameters such as oscillator type, GFNN learning, and network connectivity. For further details, including numerical results such as the F-measure, we refer you to the accompanying papers, which will be linked here when published.

Please note that the examples here are all from test data, i.e. data that the networks did not see during training.

Many thanks to Alvaro Correia, Julien Krywyk, and Jean-Baptiste Rémy for helping to curate the audio examples.

Glossary

Critical: oscillators resonate with input, but their amplitude decays over time in the absence of input
Detune: oscillators change their natural frequency more freely, especially in response to strong stimuli
NoLearn: no learning in the GFNN layer
Online: online Hebbian learning in the GFNN layer
InitOnline: online Hebbian learning in the GFNN layer, with initial generic connections
Full: full connectivity between the GFNN and LSTM
Mean: mean-field connectivity between the GFNN and LSTM

Plots

Audio examples

Critical, NoLearn, Full network

Target

Prediction

Critical, NoLearn, Mean network

Target

Prediction

Detune, NoLearn, Full network

Target

Prediction

Detune, NoLearn, Mean network

Target

Prediction

Critical, Online, Full network

Target

Prediction

Critical, Online, Mean network

Target

Prediction

Detune, Online, Full network

Target

Prediction

Detune, Online, Mean network

Target

Prediction

Critical, InitOnline, Full network

Target

Prediction

Critical, InitOnline, Mean network

Target

Prediction

Detune, InitOnline, Full network

Target

Prediction

Detune, InitOnline, Mean network

Target

Prediction

Read Paper »

paper, software, music


© Andrew Elmsley 2017 | andy [at] andyroid.co.uk