Making A Song Recognition Algorithm - A Signal Processing Project

This article presents a signal processing project submitted during my first year at Mines Paris - PSL. In this project, I developed a song recognition algorithm from scratch, based on the method of Avery Li-Chun Wang, co-founder of Shazam.

SIGNAL PROCESSING · PYTHON · MINES PARIS

Tanguy Favin-Lévêque

4/5/2021 · 3 min read

Project Overview

The algorithm is structured around several key steps. The key idea is to work on a sliding-window Fourier transform representation of the signal (its spectrogram) and to identify local maxima. By grouping those local maxima into clusters, we obtain a set of patterns that fully characterizes a song, like a fingerprint.

1. Signal Encoding

The first step involves encoding a music track to generate its unique signature. I created an Encoding class that processes audio files, visualizes waveforms, and generates spectrograms within specified time frames. The spectrogram is computed using the Fourier transform, capturing the energy distribution across frequencies at different time steps. By analyzing a 10-second sample, Avery Li-Chun Wang observed that less than 10% of the spectrogram coefficients accounted for 90% of the signal’s energy. This insight allowed for significant data reduction while preserving essential information.
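To make this step concrete, here is a minimal sketch of how the spectrogram can be computed with scipy; the file name and variable names are illustrative and do not reproduce the actual Encoding class.

    import numpy as np
    from scipy.io import wavfile
    from scipy import signal

    # Load a WAV excerpt (hypothetical file name) and keep a single channel
    rate, samples = wavfile.read("track.wav")
    if samples.ndim > 1:
        samples = samples[:, 0]

    # Sliding-window Fourier transform: one column of energies per time step
    frequencies, times, spectrogram = signal.spectrogram(samples, fs=rate)
    # spectrogram[i, j] is the energy at frequency index i and time step j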

2. Peak Detection and Constellations

To identify distinctive features in the audio, I implemented a peak detection algorithm using the peak_local_max function from scikit-image. The key parameter here is MIN_DISTANCE, which controls the minimum separation between peaks. Through empirical testing, I determined that setting MIN_DISTANCE to 130 provided an optimal balance between information retention and data sparsity. Remarkably, only 0.01% of the spectrogram’s coefficients were sufficient to characterize the music, leading to the formation of sparse yet informative constellations.
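As a rough sketch of this step, assuming the spectrogram array computed above, peak detection boils down to a single call to peak_local_max:

    from skimage.feature import peak_local_max

    MIN_DISTANCE = 130   # minimum separation between peaks, chosen empirically

    # Each row of `peaks` is a (frequency_index, time_index) coordinate pair
    peaks = peak_local_max(spectrogram, min_distance=MIN_DISTANCE)
    print(f"{len(peaks)} peaks kept out of {spectrogram.size} coefficients")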

3. Signature Formation

The next phase involved constructing the song’s signature from these constellations. Two global parameters, DELTA_T and DELTA_F, were introduced to define the temporal and frequency proximity required for pairing peaks. These pairs, or anchor-target combinations, were encoded as dictionaries of the form {'t': t1, 'hash': (t2 - t1, f1, f2)}. The hash tuple is invariant to time translation, making it a robust feature for matching.
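A minimal sketch of the pairing step could look as follows; the numerical values of DELTA_T and DELTA_F are placeholders, not the ones used in the project.

    DELTA_T = 100   # maximum time-index gap between an anchor and its targets (placeholder)
    DELTA_F = 50    # maximum frequency-index gap between paired peaks (placeholder)

    signature = []
    for f1, t1 in peaks:                      # anchor peak
        for f2, t2 in peaks:                  # candidate target peak
            if 0 < t2 - t1 <= DELTA_T and abs(f2 - f1) <= DELTA_F:
                # only the difference t2 - t1 enters the hash,
                # which makes it invariant to time translation
                signature.append({'t': t1, 'hash': (t2 - t1, f1, f2)})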

Graph of the raw signal of 10 seconds of a song
The spectrogram representation of the song sequence
The spectrogram of a small sequence of a song, where the local maxima have been plotted

4. Database Creation, Matching and Comparison

Using the database.py script, I processed and serialized the signatures of nine test tracks, storing them in a compact, searchable format. This database serves as the reference for subsequent song identification tasks. To identify a song, the algorithm compares the signature of an unknown excerpt against the database entries. It searches for matching hashes, allowing a small tolerance (EPSILON) for timing discrepancies. Matched anchors form a scatter plot of (t1, t2) pairs, where a linear alignment indicates a positive match, typically following the equation t2 = t1 + δ, with δ representing a constant time offset between the excerpt and the reference track.
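Here is a sketch of how the lookup against the serialized database might work; the file name, the EPSILON value, and the function name are assumptions for illustration, and the real Matching class may be organized differently.

    import pickle

    EPSILON = 2   # tolerance on the time gap inside a hash (placeholder value)

    # Hypothetical file produced by database.py: {track_name: signature list}
    with open("database.pkl", "rb") as f:
        database = pickle.load(f)

    def matching_pairs(excerpt_signature, track_signature):
        """Collect (t1, t2) pairs whose hashes coincide up to EPSILON on the time gap."""
        pairs = []
        for a in excerpt_signature:
            dt_a, f1_a, f2_a = a['hash']
            for b in track_signature:
                dt_b, f1_b, f2_b = b['hash']
                if f1_a == f1_b and f2_a == f2_b and abs(dt_a - dt_b) <= EPSILON:
                    pairs.append((a['t'], b['t']))
        return pairs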

A point cloud giving the time positions of matching constellations in the first and the second song

5. Histogram Analysis and Decision Criterion

The algorithm then analyzes the distribution of time offsets through histograms. A clear peak signifies a strong match. The decision criterion is set such that the highest peak must be at least ten times larger than the second-highest to confirm a match. This threshold balances sensitivity and specificity, accounting for repeated musical elements such as choruses. It amounts to identifying the points that form a line in the previous figure.
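The criterion itself can be sketched in a few lines; the number of histogram bins is an arbitrary choice here, and only the factor of ten comes from the project.

    import numpy as np

    def is_match(pairs, n_bins=100, ratio=10):
        """True when the largest offset bin is at least `ratio` times the second largest."""
        offsets = [t2 - t1 for t1, t2 in pairs]
        if not offsets:
            return False
        counts, _ = np.histogram(offsets, bins=n_bins)
        first, second = np.sort(counts)[-2:][::-1]
        return second == 0 or first >= ratio * second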

The resulting histogram for two non-matching songs

6. Performance Evaluation

To assess the algorithm’s reliability, I conducted extensive tests using varying sample lengths and multiple iterations. The results demonstrated a 100% success rate with rapid processing times (~0.11s per test). Performance improved with longer samples but at the cost of increased computation time.
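Such a test harness essentially loops over random excerpts and times each identification. A sketch under assumed helper names (random_excerpt and identify, neither of which is the project's real API) would be:

    import random
    import time

    def evaluate(database, sample_length=10.0, n_trials=100):
        correct, start = 0, time.time()
        for _ in range(n_trials):
            track = random.choice(list(database))             # pick a reference track
            excerpt = random_excerpt(track, sample_length)    # hypothetical helper
            if identify(excerpt, database) == track:          # hypothetical helper
                correct += 1
        # return accuracy and average time per identification
        return correct / n_trials, (time.time() - start) / n_trials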

Technical Details and Tools

You can access the GitHub repository here, as well as Avery Li-Chun Wang's article here.

The project utilized numpy, scipy, and scikit-image for numerical computations and image processing. Audio samples were stored in WAV format at a sampling rate of 55 kHz. The source code comprises:

  • algorithm.py: Defines the Encoding and Matching classes.

  • database.py: Generates and stores track signatures.

  • demo.py: Selects random excerpts and identifies corresponding tracks.

Author: Tanguy Favin-Lévêque

The resulting histogram for two matching songs