Part 1
Lecture 1: Course Introduction
When: January 12th 2026, 14:00–15:00 Room 52.329
What: Introduction to the course, its structure, philosophy, and evaluation.
Before taking this lecture, students are expected to have watched the following videos from The Sound of AI’s Generative Music AI Course:
- What’s Generative Music?
- History of Generative Music
- Use Cases
- Ethical Implications
- Symbolic Vs Audio Generation
- Generative Techniques
- Limitations and Future Vision
Lecture 2: End-to-End Generative Music Project
When: January 12th 2026, 15:00–16:30 Room 52.329
What:
- Steps to run a generative music project in a real-world setting
- Types of symbolic music data
- Real-world challenges, tips and tricks
Lecture 3: Evaluation
When: January 13th 2026, 10:30–11:30 Room 52.101
What:
- Objective evaluation metrics
- Subjective evaluation metrics
- Expert-based evaluation metrics
- Market-driven evaluation metrics
- Real-world problems, and possible solutions
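Objective evaluation usually means computing statistics over generated pieces and comparing them to a reference corpus. As one illustrative example (my own sketch, not necessarily a metric used in the course), the Shannon entropy of a melody's pitch-class histogram is a common proxy for tonal diversity:

```python
import math
from collections import Counter

def pitch_class_entropy(midi_pitches):
    """Shannon entropy (bits) of the pitch-class histogram.

    Higher entropy means pitches are spread over more of the 12 classes;
    0 means a single pitch class is used.
    """
    counts = Counter(p % 12 for p in midi_pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A C-major arpeggio spans 3 pitch classes -> entropy = log2(3) ~ 1.585 bits
print(round(pitch_class_entropy([60, 64, 67, 72, 76, 79]), 3))  # prints 1.585
```

Metrics like this are cheap to compute but only capture surface statistics, which is exactly why the lecture pairs them with subjective, expert-based, and market-driven evaluation.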
Lecture 4: Genetic Algorithms
When: January 13th 2026, 11:30–13:00 Room 52.101
What:
- Genetic algorithms for music generation
- Real-world experience / challenges implementing this technique
- GenJam
- Exercises and practical challenges
Before taking this lecture, students are expected to have watched the following videos and coded along the code walkthrough from The Sound of AI’s Generative Music AI Course:
- Genetic Algorithms [video] [slides]
- Melody Harmonization with Genetic Algorithms [video] [code]
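The core GA loop (fitness, selection, crossover, mutation) can be sketched in a few lines. The toy below evolves a melody toward a fixed target scale; it is a minimal illustration of the technique, not GenJam or the course's harmonization exercise, and the target, pitch range, and GA parameters are all arbitrary choices:

```python
import random

random.seed(0)  # reproducible toy run

TARGET = [60, 62, 64, 65, 67, 69, 71, 72]  # C-major scale as the "ideal" melody
PITCH_RANGE = list(range(55, 80))

def fitness(melody):
    # Negative total pitch distance to the target; 0 is a perfect match.
    return -sum(abs(a - b) for a, b in zip(melody, TARGET))

def crossover(a, b):
    cut = random.randrange(1, len(a))  # single-point crossover
    return a[:cut] + b[cut:]

def mutate(melody, rate=0.1):
    return [random.choice(PITCH_RANGE) if random.random() < rate else p
            for p in melody]

population = [[random.choice(PITCH_RANGE) for _ in TARGET] for _ in range(40)]
for _ in range(200):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == 0:
        break
    parents = population[:10]  # truncation selection with elitism
    population = parents + [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(30)
    ]

best = max(population, key=fitness)
print(best, fitness(best))
```

Real musical GAs replace the trivial distance-to-target fitness with rule-based, learned, or (as in GenJam) interactive human feedback, which is where the real-world challenges discussed in the lecture come in.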
Lecture 5: Transformers
When: January 26th 2026, 14:00–16:30 Room 52.329
What:
- Transformers for music generation
- Flipped classroom: Group interview activity
- Real-world experience / challenges implementing this technique
Before taking this lecture, students are expected to have watched the following videos and coded along the code walkthrough from The Sound of AI’s Generative Music AI Course:
- Transformers Explained Easily: Part 1 [video] [slides]
- Transformers Explained Easily: Part 2 [video] [slides]
- Melody Generation with Transformers [video] [code]
Lecture 6: Tokenization
When: January 27th 2026, 10:30–11:30 Room 54.004
What:
- Tokenizers for MIDI / symbolic representations
- MidiTok
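The idea behind REMI-style MIDI tokenizers can be shown in a few lines: each note becomes Position/Pitch/Duration tokens on a fixed time grid, with Bar tokens marking measure boundaries. This is a toy illustration of the concept, not MidiTok's actual API; real tokenizers also handle velocity, tempo, chords, and configurable vocabularies:

```python
# Toy REMI-style tokenization. Notes are (pitch, start_step, duration_steps)
# on a 16th-note grid; output is a flat token sequence a language model
# could be trained on.
BEATS_PER_BAR = 4
STEPS_PER_BEAT = 4  # 16th-note grid

def tokenize(notes):
    """notes: list of (pitch, start_step, duration_steps), sorted by start."""
    tokens, current_bar = [], -1
    steps_per_bar = BEATS_PER_BAR * STEPS_PER_BEAT
    for pitch, start, dur in notes:
        bar = start // steps_per_bar
        while current_bar < bar:  # emit one Bar token per new measure
            tokens.append("Bar_None")
            current_bar += 1
        tokens += [f"Position_{start % steps_per_bar}",
                   f"Pitch_{pitch}",
                   f"Duration_{dur}"]
    return tokens

# Two quarter notes in bar 1, then a half note starting bar 2
print(tokenize([(60, 0, 4), (64, 4, 4), (67, 16, 8)]))
```

Tokenizer design (what gets a token, how time is quantized) directly shapes what a downstream transformer can learn, which is why it gets its own lecture.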
Lecture 7: SOTA Transformer Systems
When: January 27th 2026, 11:30–13:00 Room 54.004
What:
- SOTA transformer systems for symbolic music generation
- Discuss papers and debate the systems’ outputs
- MuseFormer
- MuPT
Before taking this lecture, students are expected to have read the following papers:
- Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation [paper][website]
- MuPT: Symbolic Music Generative Pre-trained Transformer [blog][paper]
Optional reading:
- Music Transformer (the first transformer for symbolic music generation) [paper][blog + demos]
- Anticipatory Music Transformer: A Controllable Infilling Model for Music [blog][paper]
Lecture 8: Code Assignments
When: February 16th 2026, 14:00–16:00 Room 52.329
What:
- Present 2x code assignments
- Evaluate results together + get feedback
Lecture 9: Career Advice
When: February 16th 2026, 16:00–16:30 Room 52.329
What:
- How to get a career in generative music
- Q&A
Lecture 10: Final Project Presentations
When: February 17th 2026, 10:30–13:00 Room 52.105
What:
- Present final project
- Get feedback
The following class is taught by Fernando and Andreas.
When: January 29th 2026, 14:00–16:30 Room 52.329
What:
- Inference + Fine Tuning with Hugging Face Transformers
- Using pre-trained symbolic models
Before taking this lecture, students are expected to have installed the following libraries and coded along the code walkthrough:
- Hugging Face Transformers [blog]
- Hugging Face MidiGPT2 [blog]
- Hugging Face PEFT (Parameter-Efficient Fine-Tuning) [blog]
- Hugging Face LoRA [blog]
Part 2 (Audio)
Slides [CMC_0_Intro]
Week 1: Audio Modeling Introduction; Sound Model Factory
When: Monday, February 9th 2026, 14:00–17:00 Room 52.329
What:
- Introduction to the second part of the course on generative audio.
- Discussion about the main ideas, audio representations, and architectures commonly used.
- Sound Model Factory approach to creating playable audio models.
- Audio representations
Lecture preparation:
- Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A. (2019). Gansynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710. [Link]
- Wyse, L., Kamath, P., & Gupta, C. (2022, April). Sound model factory: An integrated system architecture for generative audio modelling. In International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar) (pp. 308-322). Cham: Springer International Publishing. [Link]
- Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 12. (OK - just a quick browse of this one.)
Slides [CMC_1_DataDrivenSoundModeling.pptx] (download to access embedded audios)
Slides [CMC_2a_Representation&SoundModeling.pptx] (download to access embedded audios)
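The most common audio representation after the raw waveform is the magnitude spectrogram, built from short-time Fourier transforms. As a minimal illustration (my own sketch, not course code), one spectrogram column is just the DFT magnitudes of a windowed frame; real systems use FFTs plus mel scaling:

```python
import math

def dft_magnitudes(frame):
    """Naive DFT magnitude spectrum of one audio frame (real input)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):  # keep only the non-redundant bins
        re = sum(x * math.cos(-2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(-2 * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

# 8-sample frame of a sinusoid completing exactly one cycle:
# all energy lands in bin 1.
frame = [math.sin(2 * math.pi * t / 8) for t in range(8)]
mags = dft_magnitudes(frame)
print([round(m, 3) for m in mags])  # prints [0.0, 4.0, 0.0, 0.0, 0.0]
```

Whether a model consumes waveforms (WaveNet), spectrograms (GANSynth), or learned latents is one of the central design choices the lecture discusses.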
— Week XX: Back to Symbolic —
When: February 16th–17th 2026 - Symbolic final projects (Lectures 8–10 above)
Week 2: Representation & Codecs
When: Monday, February 23rd 2026, 14:00–17:00 Room 52.329
What:
- From Audio representations to Codecs
Lecture preparation:
- Kumar, R., Seetharaman, P., Luebs, A., Kumar, I., & Kumar, K. (2024). High-fidelity audio compression with improved RVQGAN. Advances in Neural Information Processing Systems, 36. [Link] - The Descript Audio Codec (DAC) that we will look at more closely next week.
- Garcia, H. F., Seetharaman, P., Kumar, R., & Pardo, B. (2023). VampNet: Music generation via masked acoustic token modeling. arXiv preprint arXiv:2307.04686. [Link] - Uses the DAC in fun and interesting ways; helps to understand and motivate tokenization.
Optional:
- Van Den Oord, A., & Vinyals, O. (2017). Neural discrete representation learning. Advances in Neural Information Processing Systems, 30. [Link] - Historically important paper (≈5,000 citations), with a good figure that Kumar et al. should really have included, and a section specifically on audio.
Pre-lecture Quiz: [Link]
Slides: [CMC_2b_Representation&SoundModeling.pptx]
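The improved-RVQGAN/DAC codecs in the readings are built on residual vector quantization: each stage quantizes the residual left over by the previous stage, so codes refine coarse-to-fine. A minimal sketch of the idea, using scalar values and hand-picked codebooks rather than learned vector codebooks, purely for illustration:

```python
def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each codebook handles the remaining residual."""
    codes, residual = [], x
    for book in codebooks:
        # pick the nearest codebook entry for the current residual
        idx = min(range(len(book)), key=lambda i: abs(book[i] - residual))
        codes.append(idx)
        residual -= book[idx]
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the selected entries across stages.
    return sum(book[i] for book, i in zip(codebooks, codes))

# A coarse codebook, then two finer ones covering ever-smaller residuals.
codebooks = [[-1.0, 0.0, 1.0], [-0.4, -0.2, 0.0, 0.2, 0.4], [-0.1, 0.0, 0.1]]
codes = rvq_encode(0.73, codebooks)
print(codes, rvq_decode(codes, codebooks))
```

Dropping later stages degrades reconstruction gracefully, which is exactly the property that lets neural codecs trade bitrate against fidelity.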
Week 3: Style, DDSP and RAVE
When: Monday, March 2nd 2026, 14:00–17:00 Room 52.329
What:
Lecture preparation:
- Engel, J., Hantrakul, L., Gu, C., & Roberts, A. (2020). DDSP: Differentiable digital signal processing. arXiv preprint arXiv:2001.04643. [Link]
- Caillon, A., & Esling, P. (2021). RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. arXiv preprint arXiv:2111.05011. [Link]
- Tokui, N., & Baker, T. (2025). Latent Granular Resynthesis using Neural Audio Codecs. arXiv preprint arXiv:2507.19202. (Only two pages, and relevant to codec-space exploration.)
Pre-lecture Quiz [Link]
Come prepared to share your experience with codec exploration and your training of rnencodec (see the assignments page).
Slides [Link]
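DDSP's central move is making classic synthesis blocks differentiable so a network can drive their controls. A toy, non-differentiable version of its harmonic synthesizer, with fixed f0 and amplitudes instead of network-predicted envelopes, just to show the underlying signal model (sample rate and parameters are arbitrary choices of mine):

```python
import math

SR = 16000  # sample rate in Hz (an arbitrary choice for this sketch)

def harmonic_synth(f0, amplitudes, n_samples):
    """Sum of sinusoids at integer multiples of f0, one amplitude each."""
    return [
        sum(a * math.sin(2 * math.pi * f0 * (k + 1) * t / SR)
            for k, a in enumerate(amplitudes))
        for t in range(n_samples)
    ]

# 100 ms of a 220 Hz tone with three decaying harmonics
audio = harmonic_synth(220.0, [1.0, 0.5, 0.25], SR // 10)
print(len(audio))
```

In DDSP proper, a network outputs time-varying f0, per-harmonic amplitudes, and filtered-noise parameters, and gradients flow through this synthesis back into the network; the strong inductive bias is what makes the models small and controllable.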
Week 4: Transformers for Audio
When: Monday, March 9th 2026, 14:00–17:00 Room 52.329
What:
- Core transformer architecture, considerations for audio
Lecture preparation:
- Video: Peter Bloem, Lecture 12.1: Transformers (20 minutes) [Link]
- Video: Peter Bloem, Lecture 12.2: Transformers (20 minutes) [Link]
- Video: Visualizing transformers and attention (60 minutes; no need to watch the Q&A) [Link]
- Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., … & Défossez, A. (2023). Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 47704-47720. [Link] (This is the “MusicGen” paper from Meta.)
The videos are a review of the fundamentals of Transformers. You have looked at Transformers before, I know, but they are here because you may not have all the details clear in your mind, and they are excellent (Bloem for clear explanation, 3Blue1Brown for visualization).
The Copet paper is a classic. It is actually text-to-audio, but it uses a token-based autoregressive Transformer network at its core, with language as conditioning. Pretty cool, and a good transition to the more “proper” text-to-audio that we will look at next week.
And not required, but highly recommended for your personal enrichment:
- Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., … & Zeghidour, N. (2023). Audiolm: a language modeling approach to audio generation. IEEE/ACM transactions on audio, speech, and language processing, 31, 2523-2533. [Link]
Pre-lecture Quiz [Link]
Come prepared to share your experience with your training of rnencodec (and Encodec exploration if you would like). (See assignments page)
Slides[set1_RAVE] [set2_transformers]
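The core operation the videos build up to is scaled dot-product attention. Here it is in plain Python on tiny toy vectors, as a sketch of the mechanism only; audio models apply it over thousands of codec-token embeddings with multiple heads and learned projections:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Q, K, V: lists of equal-dimension vectors (lists of floats)."""
    d = len(K[0])
    out = []
    for q in Q:
        # similarity of the query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # output is the weights-blended mixture of the values
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))  # the query matches the first key more strongly
```

The audio-specific considerations in the lecture (sequence length, codebook interleaving as in MusicGen) all sit on top of this same primitive.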
Week 5: Text2Audio & Evaluation for generative models
When: Monday, March 16th 2026, 14:00–17:00 Room 52.329
What:
- Overview of Diffusion and Transformer models for text-to-audio - CLAP
- Objective and subjective approaches to evaluating generative audio
Lecture preparation:
- Valle, R., Badlani, R., Kong, Z., Lee, S. G., Goel, A., Santos, J. F., … & Catanzaro, B. Fugatto 1: Foundational Generative Audio Transformer Opus 1. In The Thirteenth International Conference on Learning Representations. [Link] (I consider this state-of-the-art, though not an easy read. It is from NVIDIA, but no code is available.)
- Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., … & Plumbley, M. D. (2023). AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503.
Optional, but worth a look for understanding CLAP:
- Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., & Dubnov, S. (2023, June). Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE. [Link]
Pre-lecture quiz [Link]
Slides [Link_transformers] [Link_TTA]
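A widely used objective metric for generative audio is Fréchet Audio Distance (FAD), which fits Gaussians to embeddings of real vs. generated audio and measures the distance between them. The sketch below is only the scalar (1-D) analogue of that computation, with toy numbers standing in for embedding values, to show the shape of the formula:

```python
import statistics

def frechet_1d(real, fake):
    """Frechet distance between 1-D Gaussians fit to two samples:
    (mu_r - mu_f)^2 + var_r + var_f - 2 * sqrt(var_r * var_f)."""
    mu_r, mu_f = statistics.mean(real), statistics.mean(fake)
    var_r, var_f = statistics.pvariance(real), statistics.pvariance(fake)
    return (mu_r - mu_f) ** 2 + var_r + var_f - 2 * (var_r * var_f) ** 0.5

print(frechet_1d([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]))  # identical stats -> ~0
print(frechet_1d([0.0, 1.0, 2.0], [5.0, 6.0, 7.0]))  # shifted mean -> ~25
```

Real FAD uses multivariate Gaussians over VGGish or CLAP-style embeddings (the sqrt becomes a matrix square root of the covariance product), and, as the lecture will stress, such distribution-level metrics still need to be paired with subjective listening tests.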
A few good background papers on generative audio, going way back
- Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 12.
- Kumar, K., Kumar, R., De Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., … & Courville, A. C. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. Advances in Neural Information Processing Systems, 32. [keywords: vocoder; phase construction]
- Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A. (2019). GANSynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710. [keywords: conditional training]
- Engel, J., Hantrakul, L., Gu, C., & Roberts, A. (2020). DDSP: Differentiable digital signal processing. arXiv preprint arXiv:2001.04643. [keywords: inductive bias; signal-processing units; real time]
- Caillon, A., & Esling, P. (2021). RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. arXiv preprint arXiv:2111.05011. [keywords: conditional training]
- Huzaifah, M., & Wyse, L. (2021). Deep generative models for musical audio synthesis. Handbook of Artificial Intelligence for Music: Foundations, Advanced Approaches, and Developments for Creativity, 639-678. [keywords: review paper]
- Wyse, L., Kamath, P., & Gupta, C. (2022, April). Sound model factory: An integrated system architecture for generative audio modelling. In International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar) (pp. 308-322). Cham: Springer International Publishing. [keywords: playability; latent space]
- Garcia, H. F., Seetharaman, P., Kumar, R., & Pardo, B. (2023). VampNet: Music generation via masked acoustic token modeling. arXiv preprint arXiv:2307.04686. [keywords: transformer; in-painting; masking for training; codecs]
- Evans, Z., Parker, J. D., Carr, C. J., Zukowski, Z., Taylor, J., & Pons, J. (2024). Stable audio open. arXiv preprint arXiv:2407.14358. [keywords: text-to-audio; open data, weights, and code; latent diffusion]