Enhance MIDI Generation with Harmonic and Rhythmic Features


Music generation at Taiwan AI Labs is based on generating note sequences. This approach preserves most details of a piece of music, but to the human ear, music is not only a set of notes but also the patterns those notes form. Capturing these patterns is what chorder and groover aim to achieve: they are packages designed to extract information on harmony and groove, respectively.

By extracting harmonic and rhythmic features, computers can look at a piece of music on a larger scale than individual notes. However, readers should be aware that there is no fully objective right or wrong in the perception of music, so the extracted harmonic and rhythmic features are far from the sole correct analysis of a piece.


The main feature of chorder is chord detection from MIDI files. To accomplish this, chorder uses 12-dimensional vectors to represent the distribution of semitones. For a given time period, a vector v sums the durations of the notes by pitch class. As a reference, each chord quality has its own weight vector w. For example,

w_{\text{major}} = [1.0, -0.2, -0.1, -0.2, 1.0, -0.5, -0.2, 1.0, -0.2, -0.1, -0.2, 0.0]

This means that the pitch classes 0, 4, and 7 semitones above the root of a major chord contribute the most, while the pitch class 5 semitones above the root is a negative indicator. The quality and the pitch class of the root can then be expressed as follows:

\operatorname*{argmax}_{(r, q) \in R \times Q} \; v_r \cdot w_q

Here r is an integer from 0 to 11 representing the pitch class of the root, and v_r is v rotated left by r positions, so that the search finds the root that best fits the chord weights w_q. For now, Q contains six types of basic chords: major, minor, diminished, augmented, sus2, and sus4. Note that a segment is labeled as having no chord if the dot product of v_r and w_q is less than the total duration of the segment.
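The search over roots and qualities can be sketched in a few lines of NumPy. This is only an illustration of the idea, not chorder's actual code or API: the function names, the (MIDI pitch, duration) note representation, and the minor weight vector are assumptions made for this example. Only the major weights come from the text above.

```python
import numpy as np

# Weight vector for a major chord, as given in the text.
W_MAJOR = np.array([1.0, -0.2, -0.1, -0.2, 1.0, -0.5, -0.2, 1.0, -0.2, -0.1, -0.2, 0.0])
# Hypothetical minor weights, invented for this sketch (not chorder's values).
W_MINOR = np.array([1.0, -0.2, -0.1, 1.0, -0.5, -0.2, -0.2, 1.0, -0.2, -0.1, -0.2, 0.0])
WEIGHTS = {"major": W_MAJOR, "minor": W_MINOR}


def pitch_class_durations(notes):
    """Sum note durations by pitch class; notes are (midi_pitch, duration) pairs."""
    v = np.zeros(12)
    for pitch, duration in notes:
        v[pitch % 12] += duration
    return v


def detect_chord(notes, segment_duration):
    """Return (root_pitch_class, quality, score), or None for 'no chord'."""
    v = pitch_class_durations(notes)
    best = None
    for r in range(12):
        v_r = np.roll(v, -r)  # rotate left by r so the candidate root sits at index 0
        for quality, w in WEIGHTS.items():
            score = float(v_r @ w)
            if best is None or score > best[0]:
                best = (score, r, quality)
    score, root, quality = best
    if score < segment_duration:  # threshold described in the text
        return None
    return root, quality, score
```

For instance, `detect_chord([(60, 1.0), (64, 1.0), (67, 1.0)], 1.0)` picks root 0 (C) with quality "major", since the rotated vector lines up with the three positive major weights.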

The bass note, which may differ from the root, is simply the lowest note in the segment whose combined duration is at least 1/8 of the duration of the entire segment. Given the bass note and the basic quality, certain rules are applied to correct the quality of the chord or to add a seventh. For example, an Fmaj chord with D in the bass should be considered a Dmin7 chord instead.
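A minimal sketch of the bass search and of the single correction rule mentioned above might look like this. The function names are hypothetical, durations are assumed to be summed per exact MIDI pitch (the text does not say whether octaves are merged), and chorder's full rule set is not reproduced here.

```python
from collections import defaultdict


def find_bass(notes, segment_duration, min_fraction=1 / 8):
    """Return the lowest MIDI pitch whose summed duration reaches the threshold.

    notes are (midi_pitch, duration) pairs; returns None if no pitch qualifies.
    """
    durations = defaultdict(float)
    for pitch, duration in notes:
        durations[pitch] += duration
    for pitch in sorted(durations):  # scan from the lowest pitch upward
        if durations[pitch] >= min_fraction * segment_duration:
            return pitch
    return None


def apply_bass_rule(root, quality, bass_pitch_class):
    """Illustrate one rule: a major triad over a bass a minor third below its root
    is relabeled as a minor seventh chord on the bass (Fmaj over D becomes Dmin7)."""
    if quality == "major" and (root - bass_pitch_class) % 12 == 3:
        return bass_pitch_class, "min7"
    return root, quality
```

With F as pitch class 5 and D as pitch class 2, `apply_bass_rule(5, "major", 2)` returns `(2, "min7")`, matching the Fmaj-to-Dmin7 example.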

chorder uses segments of 1 and 2 beats. If a 2-beat segment has a higher alignment score than both of its 1-beat segments, the chord of the 2-beat segment is applied to both beats. Otherwise, the two beats are assigned their separate chords.
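The choice between one 2-beat chord and two 1-beat chords can be written as a small comparison. The sketch below assumes each segment has already been reduced to a (chord, score) pair and that scores are comparable across segment lengths (for example, normalized by duration); both are assumptions of this example, as is the function name.

```python
def choose_segmentation(two_beat, first_beat, second_beat):
    """Each argument is a (chord, score) pair; returns the chords for the two beats."""
    chord_2, score_2 = two_beat
    chord_a, score_a = first_beat
    chord_b, score_b = second_beat
    # Use the 2-beat chord only if it beats both 1-beat alternatives.
    if score_2 > score_a and score_2 > score_b:
        return [chord_2, chord_2]
    return [chord_a, chord_b]
```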


Unlike chords, grooves have no universally agreed-upon symbols. That is why, in groover, the groove representations are simply classifications of rhythmic patterns derived from a given MIDI dataset.

The rhythmic patterns are set at a certain length (for example, a measure) and divided into quantized periods. Each note contributes an intensity value to the pattern, and the intensity increases with lower pitch and higher note velocity. The patterns are then clustered, not with Euclidean distance, but with a modified version of cosine similarity. The modification is to apply a "blurring" before calculating cosine similarity. Let v_{i, l} be v shifted left by i positions and v_{i, r} be v shifted right by i positions; the blur over n positions is then defined as

v_{\text{blurred}} = v + \sum_{i=1}^{n} \left(1 - \frac{i}{n}\right)(v_{i, l} + v_{i, r})

The result can be rescaled to make it a weighted average, but this is unnecessary for cosine similarity, which ignores magnitude. Blurring gives partial credit to intensities at nearby positions, rather than a contribution of 0 whenever they do not coincide exactly. With this customized similarity, we can apply clustering algorithms such as k-means, which groover currently uses to label rhythmic patterns.
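The blurred similarity can be sketched with NumPy as follows. This is an illustration rather than groover's implementation: the function names are made up, and the shifts are assumed to be circular (wrapping around the end of the pattern), which the text does not specify.

```python
import numpy as np


def blur(v, n):
    """Apply the blur formula: v plus linearly weighted left and right shifts."""
    v = np.asarray(v, dtype=float)
    blurred = v.copy()
    for i in range(1, n + 1):
        weight = 1.0 - i / n  # weight reaches 0 at i == n, as in the formula
        # Assumption: circular shifts, so the pattern wraps around.
        blurred += weight * (np.roll(v, -i) + np.roll(v, i))
    return blurred


def blurred_cosine_similarity(a, b, n):
    """Cosine similarity between the blurred versions of two patterns."""
    a_blur, b_blur = blur(a, n), blur(b, n)
    denom = np.linalg.norm(a_blur) * np.linalg.norm(b_blur)
    return float(a_blur @ b_blur / denom) if denom > 0 else 0.0
```

As a quick check, two 8-step patterns with a single hit at positions 0 and 1 have a plain cosine similarity of 0, but with n = 2 their blurred similarity is positive, while a hit at position 4 remains at 0 similarity with the hit at position 0.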

Below is an audio rendering of the harmonic and rhythmic features of an 8-bar excerpt. The velocity of the chords represents the intensity value of the detected pattern.

The original 8-bar excerpt:

Its audio rendering of harmonic and rhythmic features:


By:  Joshua Chang and Yi-Hsuan Yang (Yating Music Team, Taiwan AI Labs)