Part 1 (Symbolic)

Lecture 1: Course Introduction

When: January 12th 2026, 14:00–15:00 Room 52.329

What: Introduction to the course, its structure, philosophy, and evaluation.

Before taking this lecture, students are expected to have watched the following videos from The Sound of AI’s Generative Music AI Course:

  1. What’s Generative Music?
  2. History of Generative Music
  3. Use Cases
  4. Ethical Implications
  5. Symbolic Vs Audio Generation
  6. Generative Techniques
  7. Limitations and Future Vision

Lecture 2: End-to-End Generative Music Project

When: January 12th 2026, 15:00–16:30 Room 52.329

What:

  • Steps to run a generative music project in a real-world setting
  • Types of symbolic music data
  • Real-world challenges, tips and tricks

Lecture 3: Evaluation

When: January 13th 2026, 10:30–11:30 Room 52.101

What:

  • Objective evaluation metrics
  • Subjective evaluation metrics
  • Expert-based evaluation metrics
  • Market-driven evaluation metrics
  • Real-world problems, and possible solutions

Lecture 4: Genetic Algorithms

When: January 13th 2026, 11:30–13:00 Room 52.101

What:

  • Genetic algorithms for music generation
  • Real-world experience / challenges implementing this technique
  • GenJam
  • Exercises and practical challenges

Before taking this lecture, students are expected to have watched the following videos and coded along with the code walkthroughs from The Sound of AI’s Generative Music AI Course:

  1. Genetic Algorithms [video] [slides]
  2. Melody Harmonization with Genetic Algorithms [video] [code]
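
As a warm-up for the lecture, the genetic-algorithm loop (selection, crossover, mutation against a fitness function) can be sketched on a toy melody task. This is a minimal illustration, not GenJam or the course walkthrough; the fitness function, MIDI pitch range, and hyperparameters are arbitrary choices:

```python
import random

random.seed(0)

C_MAJOR = {0, 2, 4, 5, 7, 9, 11}  # pitch classes of the C major scale

def fitness(melody):
    """Reward in-scale notes and small melodic steps (a toy objective)."""
    in_scale = sum(1 for p in melody if p % 12 in C_MAJOR)
    smoothness = sum(1 for a, b in zip(melody, melody[1:]) if abs(a - b) <= 2)
    return in_scale + smoothness

def crossover(a, b):
    """Single-point crossover of two parent melodies."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(melody, rate=0.1):
    """Nudge each note up or down a semitone with probability `rate`."""
    return [p + random.choice([-1, 1]) if random.random() < rate else p
            for p in melody]

def evolve(pop_size=30, length=16, generations=50):
    # Random initial population of MIDI pitches around middle C
    pop = [[random.randint(60, 72) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection: keep the fittest half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

After a few dozen generations the best melody is almost entirely in-scale and stepwise; the interesting design decisions (as in GenJam) are all in the fitness function.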

Lecture 5: Transformers

When: January 26th 2026, 14:00–16:30 Room 52.329

What:

  • Transformers for music generation
  • Flipped classroom: Group interview activity
  • Real-world experience / challenges implementing this technique

Before taking this lecture, students are expected to have watched the following videos and coded along with the code walkthroughs from The Sound of AI’s Generative Music AI Course:

  1. Transformers Explained Easily: Part 1 [video] [slides]
  2. Transformers Explained Easily: Part 2 [video] [slides]
  3. Melody Generation with Transformers [video] [code]
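
The heart of the architecture covered in these videos is scaled dot-product attention. A minimal single-head sketch in pure Python (no learned projections or masking; purely illustrative, not the course code):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: out_i = sum_j softmax(q_i . k_j / sqrt(d)) v_j."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three toy "token" embeddings; in self-attention Q, K, and V all come from x
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = attention(x, x, x)
```

Each output row is a convex combination of the value rows, which is why every component stays within the range of the inputs; the learned Q/K/V projections and multiple heads are what a real Transformer layer adds on top.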

Lecture 6: Transformer Tokenizers

When: January 27th 2026, 10:30–11:30 Room 54.004

What:

  • Tokenizers for MIDI / symbolic representations
  • MidiTok
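
MidiTok implements tokenizations such as REMI, which serialize notes into Bar / Position / Pitch / Duration tokens. A toy sketch of that idea (pure Python; this is not MidiTok’s API, and the note tuple format here is invented for illustration):

```python
# Each note: (MIDI pitch, start, duration), with time in 16th-note steps,
# assuming 4/4 time so one bar = 16 steps. These values are made up.
notes = [(60, 0, 4), (64, 4, 4), (67, 8, 8), (72, 16, 16)]

def tokenize(notes, steps_per_bar=16):
    """Serialize notes into a REMI-style token sequence."""
    tokens, current_bar = [], -1
    for pitch, start, dur in sorted(notes, key=lambda n: n[1]):
        bar = start // steps_per_bar
        while current_bar < bar:          # emit a Bar token for each new bar
            tokens.append("Bar")
            current_bar += 1
        tokens.append(f"Position_{start % steps_per_bar}")
        tokens.append(f"Pitch_{pitch}")
        tokens.append(f"Duration_{dur}")
    return tokens

tokens = tokenize(notes)
# e.g. ['Bar', 'Position_0', 'Pitch_60', 'Duration_4', ...]
```

Real tokenizers also emit Velocity, Tempo, and Chord tokens and build a fixed vocabulary, but the core move is the same: turn polyphonic MIDI into a flat sequence a language model can consume.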

Lecture 7: Advanced Transformer Architectures

When: January 27th 2026, 11:30–13:00 Room 54.004

What:

  • SOTA transformer systems for symbolic music generation
  • Discuss papers and debate the systems’ outputs
  • Museformer
  • MuPT

Before taking this lecture, students are expected to have read the following papers:

  • Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation [paper][website]
  • MuPT: Symbolic Music Generative Pre-trained Transformer [blog][paper]

Optional reading:

  • Music Transformer (the first transformer for symbolic music generation) [paper][blog + demos]
  • Anticipatory Music Transformer: A Controllable Infilling Model for Music [blog][paper]

Lecture 8: Code Assignments

When: February 16th 2026, 14:00–16:00 Room 52.329

What:

  • Present the two code assignments
  • Evaluate results together + get feedback

Lecture 9: Career Advice

When: February 16th 2026, 16:00–16:30 Room 52.329

What:

  • How to get a career in generative music
  • Q&A

Lecture 10: Final Project Presentations

When: February 17th 2026, 10:30–13:00 Room 52.105

What:

  • Present final project
  • Get feedback

Extra Tutorial Class: Hugging Face Transformers

This class is taught by Fernando and Andreas.

When: January 29th 2026, 14:00–16:30 Room 52.329

What:

  • Inference + fine-tuning with Hugging Face Transformers
  • Using pre-trained symbolic models

Before taking this class, students are expected to have installed the following libraries and coded along with the relevant walkthroughs:

  • Hugging Face Transformers [blog]
  • Hugging Face MidiGPT2 [blog]
  • Hugging Face PEFT (Parameter-Efficient Fine-Tuning) [blog]
  • Hugging Face LoRA [blog]
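
The PEFT/LoRA items above rest on one idea: instead of updating a full weight matrix W, train a low-rank pair B, A and add the scaled product to the frozen W. A minimal numeric sketch of that update (pure Python; the dimensions, alpha, and values are made up for illustration, and this is not the PEFT API):

```python
# LoRA idea: freeze W and learn a low-rank update B @ A, so only
# r * (d_in + d_out) parameters train instead of d_in * d_out.
d_in, d_out, r, alpha = 8, 8, 2, 4

# Frozen pretrained weight (identity here, just for a readable example)
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]
A = [[0.1] * d_in for _ in range(r)]    # trainable, shape r x d_in
B = [[0.2] * r for _ in range(d_out)]   # trainable, shape d_out x r

def matmul(X, Y):
    """Plain list-of-lists matrix multiply."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

delta = matmul(B, A)                    # low-rank update, shape d_out x d_in
scale = alpha / r                       # LoRA's alpha/r scaling
W_eff = [[W[i][j] + scale * delta[i][j] for j in range(d_in)]
         for i in range(d_out)]

trainable = r * d_in + d_out * r        # 32 parameters instead of 64
```

In the PEFT library this merge happens inside adapted linear layers; the point of the sketch is just the parameter count: the low-rank factors scale linearly in the layer width rather than quadratically.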

Part 2 (Audio)

Slides [CMC_0_Intro]

Week 1: Audio Modeling Introduction; Sound Model Factory

When: Monday, February 9th 2026, 14:00–17:00 Room 52.329

What:

  • Introduction to the second part of the course, on generative audio
  • Discussion of the main ideas, audio representations, and architectures commonly used
  • Sound Model Factory approach to creating playable audio models

Lecture preparation:

  1. Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A. (2019). Gansynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710. [Link]
  2. Wyse, L., Kamath, P., & Gupta, C. (2022, April). Sound model factory: An integrated system architecture for generative audio modelling. In International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar) (pp. 308-322). Cham: Springer International Publishing. [Link]
  3. (A quick browse of this one is enough.) Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 12.
Pre-lecture Quiz: [Link]

Slides [CMC_1_DataDrivenSoundModeling.pptx] (download to access embedded audios)

Slides [CMC_2a_Representation&SoundModeling.pptx] (download to access embedded audios)


— Week XX: Back to Symbolic —

When: February 16th–17th 2026 (symbolic final projects)


Week 2: Representation & Codecs

When: Monday, February 23rd 2026, 14:00–17:00 Room 52.329

What:

  • From Audio representations to Codecs

Lecture preparation:

  1. Kumar, R., Seetharaman, P., Luebs, A., Kumar, I., & Kumar, K. (2024). High-fidelity audio compression with improved RVQGAN. Advances in Neural Information Processing Systems, 36. [Link] - The Descript Audio Codec (DAC) that we will look at more closely next week.

  2. Garcia, H. F., Seetharaman, P., Kumar, R., & Pardo, B. (2023). Vampnet: Music generation via masked acoustic token modeling. arXiv preprint arXiv:2307.04686. [Link] - Uses the DAC in fun and interesting ways, helps to understand and motivate tokenization.

Optional:

  1. Van Den Oord, A., & Vinyals, O. (2017). Neural discrete representation learning. Advances in neural information processing systems, 30. [Link] - 5000+ citations; a historically important paper, with a figure that Kumar et al. would have done well to include, and a section specifically on audio.
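
The "RVQ" in Improved RVQGAN stands for residual vector quantization: a cascade of codebooks, each quantizing the error left by the previous stage. A toy two-stage sketch (pure Python; the codebooks here are hand-picked for illustration, unlike the learned ones in DAC or Encodec):

```python
def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))

def rvq_encode(v, codebooks):
    """Residual VQ: each stage quantizes what the previous stages missed."""
    residual, codes = list(v), []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes, residual

# Two toy stages: a coarse codebook, then a finer one for the leftover error
coarse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fine   = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]]

codes, err = rvq_encode([0.9, 0.2], [coarse, fine])
# codes is one index per codebook; err is the remaining quantization error
```

The decoder simply sums the selected entries from each codebook. Stacking more stages shrinks the error further, which is why neural codecs can trade bitrate for quality by varying the number of codebooks.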

Pre-lecture Quiz: [Link]

Slides: [CMC_2b_Representation&SoundModeling.pptx]

Week 3: Style, DDSP and Rave

When: Monday, March 2nd 2026, 14:00–17:00 Room 52.329

What:

  • DDSP, RAVE, BRAVE

Lecture preparation:

  1. Engel, J., Hantrakul, L., Gu, C., & Roberts, A. (2020). DDSP: Differentiable digital signal processing. arXiv preprint arXiv:2001.04643. [Link]

  2. Caillon, A., & Esling, P. (2021). RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. arXiv preprint arXiv:2111.05011. [Link]
  3. (Only two pages, and relevant to codec-space exploration.) Tokui, N., & Baker, T. (2025). Latent Granular Resynthesis using Neural Audio Codecs. arXiv preprint arXiv:2507.19202.
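
DDSP’s central trick is generating audio with differentiable signal-processing modules, e.g. a harmonic oscillator whose per-frame parameters a network predicts. A non-differentiable toy version of just the oscillator (pure Python; the sample rate, duration, and fixed amplitudes are arbitrary choices, not the DDSP library):

```python
import math

def harmonic_synth(f0, amps, sr=16000, dur=0.01):
    """Sum of sinusoids at integer multiples of f0 (the harmonic-oscillator idea)."""
    n = int(sr * dur)
    return [sum(a * math.sin(2 * math.pi * (k + 1) * f0 * t / sr)
                for k, a in enumerate(amps))
            for t in range(n)]

# 220 Hz fundamental with three harmonics of decaying amplitude
signal = harmonic_synth(220.0, [1.0, 0.5, 0.25])
```

In DDSP proper, the network outputs time-varying f0 and harmonic amplitudes (plus filtered noise), and the whole synth stays differentiable so it can be trained end-to-end on a spectral loss; this inductive bias is what makes the models small and real-time capable.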

Pre-lecture Quiz [Link]

Come prepared to share your experience with Codec Exploration, and your training of rnencodec. (See assignments page)

Slides [Link]

Week 4: Transformers for Audio

When: Monday, March 9th 2026, 14:00–17:00 Room 52.329

What:

  • Core transformer architecture, considerations for audio

Lecture preparation:

  1. Video: Peter Bloem, Lecture 12.1: Transformers (20 minutes) [Link]
  2. Video: Peter Bloem, Lecture 12.2: Transformers (20 minutes) [Link]
  3. Video: Visualizing transformers and attention (60 minutes [no need to watch the Q&A]) [Link]
  4. Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., … & Défossez, A. (2023). Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 47704-47720. [Link] (This is the “MusicGen” paper from Meta)

The videos are a “review” of the fundamentals of Transformers. You’ve looked at Transformers before, I know, but they are here because you may not have all the details clear in your mind, and they are excellent (Bloem for clear explanations, and 3Blue1Brown for visualization).

The Copet paper is a classic. It is actually text-to-audio, but it uses a token-based autoregressive Transformer network at its core, with language as conditioning. Pretty cool, and a good transition to the more “proper” text-to-audio that we will look at next week.

Not required, but highly recommended for your personal enrichment:

  1. Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., … & Zeghidour, N. (2023). Audiolm: a language modeling approach to audio generation. IEEE/ACM transactions on audio, speech, and language processing, 31, 2523-2533. [Link]

Pre-lecture Quiz [Link]

Come prepared to share your experience with your training of rnencodec (and Encodec exploration if you would like). (See assignments page)

Slides [set1_RAVE] [set2_transformers]

Week 5: Text2Audio & Evaluation for generative models

When: Monday, March 16th 2026, 14:00–17:00 Room 52.329

What:

  • Overview of Diffusion and Transformer models for text-to-audio; CLAP
  • Objective and subjective approaches to evaluating generative audio

Lecture preparation:

  1. Valle, R., Badlani, R., Kong, Z., Lee, S. G., Goel, A., Santos, J. F., … & Catanzaro, B. Fugatto 1: Foundational Generative Audio Transformer Opus 1. In The Thirteenth International Conference on Learning Representations. [Link] (I consider this state-of-the-art, though not an easy read. It is from NVIDIA, but no code is available.)
  2. Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., … & Plumbley, M. D. (2023). Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503.

Optional, but worth a look for understanding CLAP:

  1. Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., & Dubnov, S. (2023, June). Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE. [Link]

Pre-lecture quiz [Link]

Slides [Link_transformers] [Link_TTA]
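
CLAP, which both papers above lean on, embeds audio and text into a shared space so that matching pairs have high cosine similarity. A toy retrieval sketch (pure Python; the embeddings and file names are invented for illustration, not the output of any real encoder):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical joint-space embeddings from an audio encoder and a text encoder
audio_embs = {"dog_bark.wav": [0.9, 0.1, 0.0], "piano.wav": [0.1, 0.9, 0.2]}
text_embs  = {"a dog barking": [1.0, 0.0, 0.1], "solo piano music": [0.0, 1.0, 0.1]}

def best_caption(audio_name):
    """Retrieve the caption whose embedding is most similar to the audio's."""
    a = audio_embs[audio_name]
    return max(text_embs, key=lambda t: cosine(a, text_embs[t]))

match = best_caption("piano.wav")
```

Training pushes matching audio-text pairs together and mismatched pairs apart with a contrastive loss; the same similarity score then doubles as an objective evaluation metric (e.g. CLAP score) for text-to-audio systems.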


A few good historical background papers on Generative Audio

  • Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 12.

  • Kumar, K., Kumar, R., De Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., … & Courville, A. C. (2019). Melgan: Generative adversarial networks for conditional waveform synthesis. Advances in neural information processing systems, 32. [keywords: Vocoder; Phase construction]

  • Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A. (2019). Gansynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710. [keywords: conditional training]

  • Engel, J., Hantrakul, L., Gu, C., & Roberts, A. (2020). DDSP: Differentiable digital signal processing. arXiv preprint arXiv:2001.04643.

    [keywords: inductive bias, signal processing units, real time]

  • Caillon, A., & Esling, P. (2021). RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. arXiv preprint arXiv:2111.05011. [keywords: conditional training]

  • Huzaifah, M., & Wyse, L. (2021). Deep generative models for musical audio synthesis. Handbook of artificial intelligence for music: foundations, advanced approaches, and developments for creativity, 639-678. [keywords: “review” paper]

  • Wyse, L., Kamath, P., & Gupta, C. (2022, April). Sound model factory: An integrated system architecture for generative audio modelling. In International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar) (pp. 308-322). Cham: Springer International Publishing. [keywords: playability; latent space]

  • Garcia, H. F., Seetharaman, P., Kumar, R., & Pardo, B. (2023). Vampnet: Music generation via masked acoustic token modeling. arXiv preprint arXiv:2307.04686.

    [keywords: transformer, in-painting, masking for training, codecs]

  • Evans, Z., Parker, J. D., Carr, C. J., Zukowski, Z., Taylor, J., & Pons, J. (2024). Stable audio open. arXiv preprint arXiv:2407.14358. [keywords: Text-2-audio; open data, weights, and code; latent diffusion]