MusiCoT:
Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation
[Paper]
Max W. Y. Lam*†, Yijin Xing*, Weiya You, Jingcheng Wu, Zongyu Yin,
Fuqiang Jiang, Hangyu Liu, Feng Liu, Xingda Li, Wei-Tsung Lu, Hanyu Chen,
Tong Feng, Tianwei Zhao, Chien-Hung Liu, Xuchen Song†, Yang Li, Yahui Zhou
Skywork AI
*: Equal contribution; †: Corresponding author
Abstract.
Autoregressive (AR) models have demonstrated impressive capabilities in generating high-fidelity music. However, the conventional next-token prediction paradigm in AR models does not align with the human creative process in music composition, potentially compromising the musicality of generated samples.
To overcome this limitation, we introduce MusiCoT, a novel chain-of-thought (CoT) prompting technique tailored for music generation. MusiCoT empowers the AR model to first outline an overall music structure before generating audio tokens, thereby enhancing the coherence and creativity of the resulting compositions. By leveraging the contrastive language-audio pretraining (CLAP) model, we establish a chain of "musical thoughts", making MusiCoT scalable and independent of human-labeled data, in contrast to conventional CoT methods.
Moreover, MusiCoT allows for in-depth analysis of music structure, such as instrumental arrangements, and supports music referencing -- accepting variable-length audio inputs as optional style references. This innovative approach effectively addresses copying issues, positioning MusiCoT as a vital practical method for music prompting.
Our experimental results indicate that MusiCoT consistently achieves superior performance across both objective and subjective metrics, producing music quality that rivals state-of-the-art generation models.
Overview of MusiCoT
MusiCoT is a novel chain-of-thought prompting technique designed for high-fidelity music generation, which aligns the autoregressive model's creative process with human-like musical thought by deriving an overall music structure before generating audio tokens.
By leveraging the contrastive language-audio pretraining (CLAP) model, MusiCoT enhances structural analyzability and supports music referencing, resulting in superior generation performance compared to traditional music generation models.
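To make the two-stage idea concrete, here is a minimal, hypothetical sketch of MusiCoT-style decoding. All names and shapes are illustrative assumptions, not the paper's actual API: stage 1 stands in for the AR model emitting a coarse "chain of musical thoughts" as one CLAP-like embedding per structural segment, and stage 2 stands in for the audio-token decoder conditioned on that outline.

```python
import numpy as np

rng = np.random.default_rng(0)

def outline_structure(n_segments: int, clap_dim: int = 8) -> np.ndarray:
    """Stage 1 (stand-in): emit one CLAP-space 'thought' per segment,
    e.g., intro / verse / chorus. Real MusiCoT predicts these
    autoregressively before any audio tokens."""
    v = rng.normal(size=(n_segments, clap_dim))
    # CLAP embeddings are typically L2-normalized.
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def generate_audio_tokens(outline: np.ndarray,
                          tokens_per_segment: int = 4,
                          vocab: int = 1024) -> list:
    """Stage 2 (stand-in): decode audio tokens conditioned on each
    segment's thought. A real model would attend over the embedding;
    here we just hash it into a deterministic token run."""
    tokens = []
    for seg in outline:
        seed = int(abs(seg.sum()) * 1e6) % vocab
        tokens.extend((seed + i) % vocab for i in range(tokens_per_segment))
    return tokens

outline = outline_structure(n_segments=3)
tokens = generate_audio_tokens(outline)
print(len(tokens))  # 3 segments x 4 tokens per segment
```

The point of the sketch is the ordering: structure embeddings are produced first and fixed before any audio token is sampled, which is what lets the outline be inspected (e.g., for instrumental arrangement) or swapped for a CLAP embedding of a reference track.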
Generation Samples
In this section, we provide 10 randomly selected samples (no cherry-picking) out of the 100 songs used to compare the different music generators.
Each of the 10 samples presents the same lyrics rendered by six music generators: Base model w/ MusiCoT, Base model w/o MusiCoT, YuE, Suno V4, Udio V1.5, and Mureka V5.5. (The original page embeds the lyric text and an audio player for each row; these are not reproducible in this text version.)