Cold-start audiobook recommendation via cross-domain sub-tower fusion
2025
For music streaming services expanding into audiobooks, cold-start personalization presents a critical challenge: because audiobooks are a newly introduced content type, the vast majority of existing users have no audiobook listening history. This domain-level cold-start scenario differs from traditional item or user cold-start scenarios, since personalization must begin before any behavioral data exists in the target domain. Yet these same users have rich engagement histories in the platform's established music and podcast offerings, creating an opportunity to transfer cross-modal signals for early-stage audiobook recommendations. We present a lightweight framework designed for scalability and minimal retraining, and show that cross-modal transfer can yield strong personalization even in sparse domains. Our framework, studied in the context of a large-scale music streaming service, adopts a two-tower design with two key choices: (1) the user side is frozen and structured into modality-specific sub-towers, preserving existing signals without retraining overhead; and (2) an adaptive fusion mechanism integrates these signals, while the item side learns audiobook embeddings. To further enrich content representations, we incorporate BAAI's BGE model for text encoding, which injects semantic knowledge into the towers. This combination yields consistent and substantial relative gains: offline precision improves by more than 100% over popularity baselines and by more than 50% over single-domain collocation-based methods, with strong complementarity between modalities. Our method scales to millions of users with minimal training cost and generalizes to public datasets, enabling both open research and industrial adoption. Large-scale A/B testing in the US marketplace demonstrates an improvement of approximately 10% in first audiobook listens compared to popularity baselines.
These results demonstrate that frozen multi-modal sub-towers with pretrained text enrichment offer a principled approach to cross-domain cold-start personalization, providing a generalizable architecture for efficient content expansion on streaming platforms diversifying into new media types.
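To make the architecture concrete, the following is a minimal sketch of how frozen modality-specific user embeddings might be fused and scored against audiobook embeddings. The abstract does not specify the form of the adaptive fusion mechanism, so the softmax gate used here is an assumption, and all names (`fuse`, `rank_items`, `gate_vectors`, the toy embeddings) are hypothetical and chosen for illustration only.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(sub_tower_embeddings, gate_vectors):
    """Combine frozen sub-tower outputs into one user vector.

    ASSUMPTION: fusion is a softmax-weighted sum, with one gate vector
    per modality as the only trainable user-side parameters. The
    sub-tower embeddings themselves are treated as fixed inputs.
    """
    modalities = sorted(sub_tower_embeddings)
    logits = [dot(gate_vectors[m], sub_tower_embeddings[m]) for m in modalities]
    weights = softmax(logits)
    dim = len(sub_tower_embeddings[modalities[0]])
    fused = [0.0] * dim
    for weight, modality in zip(weights, modalities):
        for i, value in enumerate(sub_tower_embeddings[modality]):
            fused[i] += weight * value
    return fused, dict(zip(modalities, weights))


def rank_items(user_vector, item_embeddings, k):
    """Return the k item ids with the highest dot-product score."""
    scores = {item: dot(user_vector, emb) for item, emb in item_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy inputs (hypothetical): one user with music and podcast embeddings,
# and three audiobooks embedded in the same 2-dimensional space.
user_embeddings = {"music": [1.0, 0.0], "podcast": [0.0, 1.0]}
gate_vectors = {"music": [0.0, 0.0], "podcast": [0.0, 0.0]}
audiobooks = {"book_a": [1.0, 0.0], "book_b": [0.0, 2.0], "book_c": [-1.0, -1.0]}

user_vector, fusion_weights = fuse(user_embeddings, gate_vectors)
top_two = rank_items(user_vector, audiobooks, k=2)
print(fusion_weights, top_two)
```

In this sketch only the gate vectors and the audiobook embeddings would be learned, which mirrors the low training cost the abstract describes. A user with no audiobook history can still be ranked against audiobooks because the user vector depends only on music and podcast signals.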