Electronic technology has liberated musical time and changed musical aesthetics. In the past, musical time was considered as a linear medium that was subdivided according to ratios and intervals of a more-or-less steady meter. However, the possibilities of envelope control and the creation of liquid or cloud-like sound morphologies suggests a view of rhythm not as a fixed set of intervals on a time grid, but rather as a continuously flowing, undulating, and malleable temporal substrate upon which events can be scattered, sprinkled, sprayed, or stirred at will. In this view, composition is not a matter of filling or dividing time, but rather of generating time.
— Curtis Roads, 2014
When we listen to or perform music, there is one fundamental organising principle which must be obeyed: time. Time in music is often thought of in terms of two related concepts: the ‘pulse’ and the ‘metre’ of the music. The pulse is what we latch on to when we listen to music; it is the periodic rhythm within the music that we can tap our feet to. In fact, the pulse is only one level in a hierarchical structure of time periods which is collectively known as the metre. Lower levels divide the pulse into smaller periods, and higher levels extend the pulse into bars, phrases and even higher-order forms.
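As a minimal numerical illustration of this hierarchy (the tempo and time signature here are arbitrary choices, not from any piece discussed on this page): each metrical level relates to the pulse by a simple integer ratio, so a single pulse period determines the whole grid.

```python
# Hypothetical example: metrical levels as integer ratios of the pulse.
pulse_bpm = 120
pulse_period = 60.0 / pulse_bpm          # seconds per beat -> 0.5 s

levels = {
    "sixteenth": pulse_period / 4,       # lower levels divide the pulse
    "eighth":    pulse_period / 2,
    "pulse":     pulse_period,
    "bar (4/4)": pulse_period * 4,       # higher levels extend it
}
for name, period in levels.items():
    print(f"{name:10s} period = {period:.3f} s  freq = {1 / period:.2f} Hz")
```

Viewed as frequencies (the last column), the levels span roughly 0.5–8 Hz, which is the range the GFNN below covers.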
This gives the impression that rhythm is all about dividing or combining periods, perfectly filling time with rhythmic events. In performance, however, this is rarely the case. Humans are not perfect time-keepers and will always stray from where an event ‘should’ fall. Indeed, such deviations are expected when we listen to a performance; if a performance is too well-timed it is often heard as robotic, lacking expression and dynamics.
What Roads is alluding to in the above quote is that the perception of these imperfectly timed rhythmic events provides a subjective experience of time to the listener. Roads considers only what he knows best, computer music, where one has direct control over the timing of these events. It is quite possible, though, to extend this view onto every genre of music. As the performer expressively varies the temporal dynamics, waves of metrical dissonance and consonance are formed, affecting our perception of musical time and our expectation of rhythmic events.
Our research concerns this interplay of metric perception, expectational prediction, and rhythmic production with respect to expressive variations on musical timing.
We take a cognitive approach, utilising a neurologically inspired model of rhythm perception known as a Gradient Frequency Neural Network (GFNN). In a GFNN, a network of oscillators is distributed across a frequency spectrum. Internal connections between oscillators in the network can be learned via Hebbian learning. When stimulated by a signal, the GFNN resonates nonlinearly, producing larger-amplitude responses at related frequencies along the spectrum. When the oscillators' frequencies are distributed within the rhythmic range, resonances can occur at integer ratios to the pulse. These resonances can be interpreted as the perception of a hierarchical metrical structure.
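The idea can be sketched with a much-simplified oscillator bank. The snippet below drives Hopf-type oscillators, log-spaced over the rhythmic range, with an impulse train at 2 Hz; the parameter values and network size are illustrative choices, not those of our trained models, and the full canonical GFNN adds higher-order coupling terms that also produce subharmonic and other rational-ratio resonances, which this toy version omits.

```python
import numpy as np

fs = 100.0                                    # integration rate (Hz)
freqs = np.logspace(-1, 3, 48, base=2.0)      # 0.5-8 Hz, log-spaced
alpha, beta = -0.1, -0.2                      # decay and amplitude saturation
z = np.full(freqs.shape, 0.001 + 0j)          # complex oscillator states

t = np.arange(0.0, 10.0, 1.0 / fs)
x = np.zeros_like(t)
x[::int(fs / 2)] = 1.0                        # impulse train: a 2 Hz pulse

acc = np.zeros(len(freqs))
n = 0
for k, xi in enumerate(x):
    # Exponential-Euler step of dz/dt = z(alpha + i*2*pi*f + beta*|z|^2) + x(t)
    z = z * np.exp((alpha + beta * np.abs(z) ** 2 + 2j * np.pi * freqs) / fs) + xi
    if t[k] >= 5.0:                           # average the settled response
        acc += np.abs(z)
        n += 1
amps = acc / n

for f0 in (1.0, 2.0, 3.0, 4.0):
    i = int(np.argmin(np.abs(freqs - f0)))
    print(f"~{f0:.0f} Hz oscillator: mean amplitude {amps[i]:.2f}")
```

Oscillators near harmonics of the pulse (2 Hz, 4 Hz) settle at much larger amplitudes than incommensurate ones (e.g. 3 Hz); capturing the 1 Hz subharmonic level of the metrical hierarchy needs the full model's nonlinear coupling.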
GFNNs have shown promise even when dealing with more complex input, such as syncopated rhythms and polyrhythms. The oscillators' entrainment properties make them good candidates for modelling expressive timing, and so the GFNN forms the basis of our perception layer.
In our system the GFNN is coupled with a Long Short-Term Memory Neural Network (LSTM), which is a type of recurrent neural network able to learn long-term dependencies in a time-series. The LSTM takes the role of prediction in our system. It reads the GFNN's resonances to make predictions about the expected rhythmic events in the piece.
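To show the shape of this coupling, here is a minimal, untrained LSTM cell in plain numpy that reads a vector of GFNN amplitudes at each step and emits an event probability. The dimensions, random weights, and sigmoid readout are all hypothetical stand-ins; our actual models are trained, and their topology varies as described below.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, b):
    # One LSTM step: input, forget, and output gates plus candidate cell.
    z = W @ np.concatenate([x, h]) + b
    i, f, g, o = np.split(z, 4)
    i, f, o = (1.0 / (1.0 + np.exp(-v)) for v in (i, f, o))  # sigmoid gates
    g = np.tanh(g)
    c = f * c + i * g                 # cell state carries long-term context
    h = o * np.tanh(c)
    return h, c

n_osc, n_hid = 48, 16                 # 48 GFNN amplitudes in, 16 hidden units
W = rng.normal(0.0, 0.1, (4 * n_hid, n_osc + n_hid))
b = np.zeros(4 * n_hid)
w_out = rng.normal(0.0, 0.1, n_hid)   # readout weights for event probability

h, c = np.zeros(n_hid), np.zeros(n_hid)
for _ in range(100):                  # feed an amplitude sequence
    amps = rng.random(n_osc)          # random stand-in for GFNN resonances
    h, c = lstm_step(amps, h, c, W, b)
p_event = 1.0 / (1.0 + np.exp(-(w_out @ h)))
print(f"predicted event probability: {p_event:.3f}")
```

The cell state `c` is what lets the network retain metrical context over long spans, which a plain recurrent network struggles to do.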
Once seeded with some initial values, the GFNN-LSTM can be used for production: that is, the generation of new expressive timing structures based on its own output and/or the output of other musical agents.
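Structurally, production is a closed loop: the system's predicted events are fed back as the stimulus for its own perception layer. The sketch below shows only that loop; `gfnn_step` and `toy_lstm` here are invented stand-ins, not our trained components.

```python
# Closed-loop production sketch with hypothetical interfaces:
# gfnn_step(stimulus_sample) -> resonance amplitudes
# lstm_step(amplitudes)      -> event probability

def generate(gfnn_step, lstm_step, seed_events, n_steps, threshold=0.5):
    events = list(seed_events)
    for _ in range(n_steps):
        amps = gfnn_step(events[-1])           # perceive its own output
        p = lstm_step(amps)                    # predict the next event
        events.append(1.0 if p >= threshold else 0.0)
    return events[len(seed_events):]

# Toy stand-ins so the loop runs end to end:
state = {"a": 0.0}
def toy_gfnn(x):
    state["a"] = 0.9 * state["a"] + x          # leaky-resonance stand-in
    return [state["a"]]
def toy_lstm(amps):
    return 1.0 if amps[0] < 0.5 else 0.1       # fire when resonance fades

print(generate(toy_gfnn, toy_lstm, seed_events=[1.0], n_steps=8))
```

Replacing `events[-1]` with another agent's output turns the same loop into an accompaniment setting.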
Our results are promising and appear to be in line with state-of-the-art beat-tracking systems. Here we present some visual and audio examples of the system's output.
A total of 12 different network topologies were trained, varying system parameters such as oscillator type, GFNN learning, and network connectivity. For further details, including numerical results such as the F-measure, we refer you to the accompanying papers, which will be linked here when published.
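For readers unfamiliar with the metric, the F-measure as commonly defined for beat-tracking evaluation counts a predicted event as a hit when it falls within a tolerance window (often ±70 ms) of an unmatched annotation. The sketch below is that generic definition, not code from our papers.

```python
def f_measure(predicted, annotated, tol=0.07):
    """Beat-tracking F-measure: greedy one-to-one matching within +/-tol s."""
    annotated = sorted(annotated)
    used = [False] * len(annotated)
    hits = 0
    for p in sorted(predicted):
        for i, a in enumerate(annotated):
            if not used[i] and abs(p - a) <= tol:
                used[i] = True                 # each annotation matches once
                hits += 1
                break
    if not predicted or not annotated or hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(annotated)
    return 2 * precision * recall / (precision + recall)

# Expressively timed beats against a metronomic annotation:
print(f_measure([0.02, 0.51, 1.03, 1.62], [0.0, 0.5, 1.0, 1.5]))  # -> 0.75
```

Note how the tolerance window deliberately forgives the small expressive deviations discussed above; only the final beat, 120 ms late, counts as a miss.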
Please note that the examples here are all from test data, i.e. data that the networks did not see during training.
Many thanks to Alvaro Correia, Julien Krywyk, and Jean-Baptiste Rémy for helping to curate the audio examples.
| Label | Description |
| --- | --- |
| Critical | Oscillators resonate with the input, but their amplitude decays over time in the absence of input |
| Detune | Oscillators change their natural frequency more freely, especially in response to strong stimuli |
| NoLearn | No learning in the GFNN layer |
| Online | Online Hebbian learning in the GFNN layer |
| InitOnline | Online Hebbian learning in the GFNN layer, with initial generic connections |
| Full | Full connectivity between the GFNN and LSTM |
| Mean | Mean-field connectivity between the GFNN and LSTM |