Multimodal music tokenization with residual quantization for generative retrieval
2025
Recent advances in generative retrieval allow large language models (LLMs) to recommend items by generating their identifiers token by token, rather than using nearest-neighbor search over embeddings. This approach requires each item, such as a music track, to be represented by a compact and semantically meaningful token sequence that LLMs can generate. We propose a multimodal music tokenizer (3MToken) that transforms rich metadata from a music database, including audio, credits, semantic tags, song and artist descriptions, musical characteristics, release dates, and consumption patterns, into discrete tokens using a Residual-Quantized Variational Autoencoder. Our method learns hierarchical representations, capturing coarse features at early quantization levels and refining them at later levels, thereby preserving fine-grained information. We train and evaluate our model on a large-scale dataset of 1.6 million tracks, where it achieves +40.0%, +43.4%, and +15.8% improvements in Precision@k, Recall@k, and Hit@k, respectively, over the baselines.
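To make the residual-quantization idea concrete, here is a minimal sketch of how an RQ-VAE bottleneck can turn a fused multimodal embedding into a hierarchical sequence of discrete tokens, with each level quantizing the residual left by the previous one. This is not the paper's implementation; the class name, codebook sizes, number of levels, and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    """Illustrative residual quantizer: level l encodes the residual
    left over after levels 1..l-1, yielding coarse-to-fine tokens."""

    def __init__(self, num_levels=4, codebook_size=256, dim=64):
        super().__init__()
        # One learnable codebook per quantization level (sizes are assumptions).
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_levels)
        )

    def forward(self, z):
        # z: (batch, dim) latent from a multimodal encoder (audio, tags, text, ...).
        residual = z
        quantized = torch.zeros_like(z)
        codes = []
        for codebook in self.codebooks:
            # Nearest codebook entry to the current residual (Euclidean distance).
            distances = torch.cdist(residual, codebook.weight)  # (batch, codebook_size)
            idx = distances.argmin(dim=-1)                      # one discrete token per level
            selected = codebook(idx)
            quantized = quantized + selected
            residual = residual - selected                      # remainder goes to the next level
            codes.append(idx)
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = z + (quantized - z).detach()
        return quantized, torch.stack(codes, dim=-1)            # codes: (batch, num_levels)
```

Under this sketch, the token sequence for a track would be the per-level code indices, with early levels capturing coarse attributes and later levels refining them, matching the coarse-to-fine structure described above.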