Hands-On Code: Fine-Tuning a Diffusion Model to Generate Audio in Your Own Style

🕗 Published 2024-10-08 22:55 · Tags: audio and video, diffusion models, audio generation, fine-tuning, deep learning

Diffusion Models column index: introduction and hands-on practice

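Preface: The success of diffusion models in the image domain is well known, but they also work well for audio: given only short audio clips as input, a pre-trained model can be fine-tuned to generate audio in your own musical style. This post starts from the code and walks through fine-tuning a pre-trained audio diffusion model to do exactly that.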

Loading the Pre-Trained Model

import torch, random
import numpy as np
import torch.nn.functional as F
from tqdm.auto import tqdm
from IPython.display import Audio
from matplotlib import pyplot as plt
from diffusers import DiffusionPipeline
from torchaudio import transforms as AT
from torchvision import transforms as IT

# Load a pre-trained audio diffusion pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-instrumental-hiphop-256").to(device)

Run a quick inference pass with the original model to see what it produces:

# Sample from the pipeline and display the outputs:
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))

Converting Audio to a Spectrogram

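The 'waveform' of an audio clip represents the source audio over time; for example, it might be the electrical signal received from a microphone. Working directly with this 'time domain' representation can be tricky, so a more common approach is to convert it into another form, usually called a spectrogram. A spectrogram shows the intensity of the signal at different frequencies (y-axis) over time (x-axis).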
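
To make the time-versus-frequency idea concrete, here is a minimal pure-Python sketch of a spectrogram. It is not how torchaudio or the pipeline computes one (real implementations use windowed, overlapping frames and an FFT); the helper names `dft_power` and `toy_spectrogram` and the toy signal are assumptions made up for illustration.

```python
import cmath
import math

def dft_power(frame):
    """Power |X_k|^2 of the non-negative frequency bins of one frame (naive DFT)."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        x_k = sum(sample * cmath.exp(-2j * math.pi * k * i / n)
                  for i, sample in enumerate(frame))
        powers.append(abs(x_k) ** 2)
    return powers

def toy_spectrogram(signal, frame_size):
    """Split the signal into non-overlapping frames and take the power spectrum of each."""
    frames = [signal[s:s + frame_size]
              for s in range(0, len(signal) - frame_size + 1, frame_size)]
    return [dft_power(frame) for frame in frames]

# Hypothetical test signal: 0.5 s of a 4 Hz tone followed by 0.5 s of a 10 Hz tone.
sample_rate = 64   # samples per second (toy value)
frame_size = 32    # samples per frame, so each bin is sample_rate / frame_size = 2 Hz wide
signal = [math.sin(2 * math.pi * (4 if i < 32 else 10) * i / sample_rate)
          for i in range(64)]

spec = toy_spectrogram(signal, frame_size)  # one column of bin powers per time frame
peak_hz = [max(range(len(col)), key=col.__getitem__) * sample_rate / frame_size
           for col in spec]
print(peak_hz)  # [4.0, 10.0]
```

Each time frame yields one column of per-frequency power values, and the strongest bin moves from 4 Hz to 10 Hz as the tone changes, which is exactly the information a spectrogram image encodes.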

# Calculate and show a spectrogram for our generated audio sample using torchaudio
spec_transform = AT.Spectrogram(power=2)
spectrogram = spec_transform(torch.tensor(output.audios[0]))
print(spectrogram.min(), spectrogram.max())
log_spectrogram = spectrogram.log()
plt.imshow(log_spectrogram[0], cmap='gray');

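The spectrogram we just computed has values ranging from 0.0000000000001 to 1, with most of them close to the lower limit. That is not ideal for visualization or for modeling; in fact, we had to take the log of the values to get a grayscale image that shows any detail. For the same reason, we typically use a special kind of spectrogram called a Mel spectrogram, which rescales the frequency components to match how human hearing perceives sound, making the important information easier to extract.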
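
The two ideas in this paragraph, log compression and the mel scale, can be sketched in a few lines. The mel formula below is the common HTK-style variant; the pipeline may use a different one (for example the Slaney variant), so treat it as an illustration. `hz_to_mel`, `to_log_scale`, and the `eps` value are hypothetical names and choices.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def to_log_scale(power, eps=1e-13):
    """Log-compress a power value; eps guards against log(0)."""
    return math.log(power + eps)

# Raw power values span about 13 orders of magnitude...
raw = [1e-13, 1e-6, 1.0]
# ...while their logs fall in a range a grayscale image can display.
logged = [to_log_scale(p) for p in raw]
print([round(v, 1) for v in logged])

# Equal steps in Hz become smaller steps in mel as frequency rises.
for f in (500, 1000, 2000, 4000, 8000):
    print(f, "Hz ->", round(hz_to_mel(f)), "mel")
```

Doubling the frequency from 4000 Hz to 8000 Hz adds far fewer mels than doubling from 500 Hz to 1000 Hz, so a mel spectrogram spends more of its rows on the low frequencies where hearing is most discriminating.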

https://colab.research.google.com/github/darcula1993/diffusion-models-class-CN/blob/main/unit4/02_diffusion_for_audio_CN.ipynb

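Fortunately, we don't need to worry much about these transforms: the mel component of the pipeline handles the details for us. With it, we can convert a spectrogram image back to audio like this: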

a = pipe.mel.image_to_audio(output.images[0])
a.shape

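The audio is represented as one long array of numbers. To play it back we need one more key piece of information: the sample rate. How many samples (individual values) make up one second of audio?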

We can check the sample rate the pipeline uses like this:

sample_rate_pipeline = pipe.mel.get_sample_rate()
sample_rate_pipeline
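
The relationship between array length, sample rate, and duration is simple arithmetic. The sketch below uses the 22050 Hz rate that appears in the inference code later in this post; the helper names are hypothetical.

```python
def duration_seconds(num_samples, sample_rate):
    """Length of a clip in seconds when played back at sample_rate."""
    return num_samples / sample_rate

def samples_needed(seconds, sample_rate):
    """Number of samples required to store a clip of the given length."""
    return round(seconds * sample_rate)

print(samples_needed(1.0, 22050))        # 22050
print(duration_seconds(110250, 22050))   # 5.0
# Playing the same array at double the rate halves its duration (and raises the pitch).
print(duration_seconds(110250, 44100))   # 2.5
```

This is why the `rate` argument passed to `Audio` has to match the rate the array was produced at.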

Loading the Dataset

Now that we have a rough idea of how the pipeline works, let's fine-tune it on some new audio data!

The dataset is a collection of audio clips in different genres, and we can load it from the Hub like this:

from datasets import load_dataset
dataset = load_dataset('lewtun/music_genres', split='train')
dataset

You can use the code below to see how many samples the dataset contains for each genre:

for g in list(set(dataset['genre'])):
  print(g, sum(x==g for x in dataset['genre']))

Pop 945
Blues 58
Punk 2582
Old-Time / Historic 408
Experimental 1800
Folk 1214
Electronic 3071
Spoken 94
Classical 495
Country 142
Instrumental 1044
Chiptune / Glitch 1181
International 814
Ambient Electronic 796
Jazz 306
Soul-RnB 94
Hip-Hop 1757
Easy Listening 13
Rock 3095
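
The loop above rescans the whole genre column once per genre. As an optional alternative, the same counts can be gathered in a single pass with `collections.Counter`; the sketch below uses a made-up toy list rather than the real dataset, and `genre_counts` is a hypothetical helper.

```python
from collections import Counter

def genre_counts(genres):
    """Return (genre, count) pairs sorted from most to least common."""
    return Counter(genres).most_common()

# Toy stand-in for dataset['genre'].
toy_genres = ['Rock', 'Electronic', 'Rock', 'Jazz', 'Rock', 'Electronic']
for genre, count in genre_counts(toy_genres):
    print(genre, count)
```

With the real data this would be called as `genre_counts(dataset['genre'])`.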

The dataset stores the audio as arrays:

audio_array = dataset[0]['audio']['array']
sample_rate_dataset = dataset[0]['audio']['sampling_rate']
print('Audio array shape:', audio_array.shape)
print('Sample rate:', sample_rate_dataset)
display(Audio(audio_array, rate=sample_rate_dataset))

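Note that the sample rate of this audio is higher; to use it with our pipeline we need to 'resample' it to match. The clips are also longer than the length the pipeline is set up for. Fortunately, when we load audio with pipe.mel, it automatically slices the clip into shorter sections.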

a = dataset[0]['audio']['array'] # Get the audio array
pipe.mel.load_audio(raw_audio=a) # Load it with pipe.mel
pipe.mel.audio_slice_to_image(0) # View the first 'slice' as a spectrogram
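
Conceptually, the slicing step cuts the long array into fixed-length pieces. The sketch below assumes the trailing remainder is simply dropped, as the "excluding the last short slice" comment later in this post suggests; `slice_audio` and `slice_size` are hypothetical and not part of the `pipe.mel` API.

```python
def slice_audio(samples, slice_size):
    """Cut samples into consecutive slices of slice_size, dropping any short remainder."""
    num_slices = len(samples) // slice_size
    return [samples[i * slice_size:(i + 1) * slice_size] for i in range(num_slices)]

clip = list(range(10))          # toy "audio" of 10 samples
slices = slice_audio(clip, 4)   # two full slices; samples 8 and 9 are dropped
print(len(slices), slices)      # 2 [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Each slice then becomes one fixed-size spectrogram image, which is what the model was trained to generate.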

We need to remember to adjust the sample rate, because the data in this dataset has twice as many samples per second.

sample_rate_dataset = dataset[0]['audio']['sampling_rate']
sample_rate_dataset
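
To illustrate what resampling does, here is a deliberately naive sketch based on linear interpolation. It is not what torchaudio's `Resample` does internally (that transform uses band-limited sinc interpolation, which avoids the aliasing this toy version can introduce); `resample_linear` and the 44100 to 22050 rates are assumptions for illustration.

```python
def resample_linear(samples, orig_rate, new_rate):
    """Naive resampling by linear interpolation between neighbouring samples."""
    if not samples:
        return []
    num_out = int(len(samples) * new_rate / orig_rate)
    out = []
    for j in range(num_out):
        pos = j * orig_rate / new_rate        # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)    # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

one_second = [float(i) for i in range(44100)]   # toy ramp, one second at 44100 Hz
halved = resample_linear(one_second, 44100, 22050)
print(len(halved))   # 22050
print(halved[:3])    # [0.0, 2.0, 4.0]
```

Halving the rate keeps the duration the same while halving the number of samples, which is the change needed before the audio can be fed to the pipeline.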

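Here we use torchaudio's transforms (imported as AT) to resample the audio, the pipeline's mel to turn audio into an image, and torchvision's transforms (imported as IT) to turn images into tensors. The function below converts an audio clip into a spectrogram image, which is later converted to a tensor for training: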

resampler = AT.Resample(sample_rate_dataset, sample_rate_pipeline, dtype=torch.float32)
to_t = IT.ToTensor()

def to_image(audio_array):
  audio_tensor = torch.tensor(audio_array).to(torch.float32)
  audio_tensor = resampler(audio_tensor)
  pipe.mel.load_audio(raw_audio=np.array(audio_tensor))
  num_slices = pipe.mel.get_number_of_slices()
  slice_idx = random.randint(0, num_slices-1) # Pick a random slice each time (excluding the last short slice)
  im = pipe.mel.audio_slice_to_image(slice_idx)
  return im

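We use our to_image() function inside a custom collate function to turn the dataset into a dataloader for training. The collate function defines how a batch of examples from the dataset is transformed into the final batch of training data. In this case, we convert each audio clip into a spectrogram image and stack the resulting tensors: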

def collate_fn(examples):
  # to image -> to tensor -> rescale to (-1, 1) -> stack into batch
  audio_ims = [to_t(to_image(x['audio']['array']))*2-1 for x in examples]
  return torch.stack(audio_ims)

# Create a dataset containing only songs of the chosen genre
batch_size=4 # 4 on colab, 12 on A100
chosen_genre = 'Electronic' # <<< Try training on different genres <<<
indexes = [i for i, g in enumerate(dataset['genre']) if g == chosen_genre]
filtered_dataset = dataset.select(indexes)
dl = torch.utils.data.DataLoader(filtered_dataset.shuffle(), batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
batch = next(iter(dl))
print(batch.shape)
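
The "rescale to (-1, 1)" step in the collate function can be traced on plain numbers. The sketch below assumes 8-bit grayscale pixels, which `ToTensor` first scales to [0, 1] before the `*2-1` shift; nested lists stand in for tensors, and all function names are hypothetical.

```python
def pixel_to_model_range(pixel):
    """Map an 8-bit pixel (0..255) to [-1, 1]: first to [0, 1], then x * 2 - 1."""
    unit = pixel / 255.0
    return unit * 2 - 1

def model_range_to_pixel(value):
    """Inverse mapping, clamped to the valid 8-bit range."""
    unit = (value + 1) / 2
    return max(0, min(255, round(unit * 255)))

def toy_collate(images):
    """Rescale every pixel of every image; stands in for to_t(...)*2-1 plus torch.stack."""
    return [[[pixel_to_model_range(p) for p in row] for row in im] for im in images]

# Two tiny 2x2 "spectrogram images" form a batch of size 2.
batch_sketch = toy_collate([[[0, 255], [128, 64]], [[255, 255], [0, 0]]])
print(batch_sketch[0])
```

Centering the inputs around zero matches the range of the Gaussian noise that is mixed in during training, and the inverse mapping is what turns a generated sample back into a viewable image.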

Training

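Here is a simple training loop that reads batches from the dataloader and fine-tunes the pipeline's UNet for a few epochs. You can skip this block and load an already fine-tuned pipeline with the code in the next section.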

epochs = 3
lr = 1e-4

pipe.unet.train()
pipe.scheduler.set_timesteps(1000)
optimizer = torch.optim.AdamW(pipe.unet.parameters(), lr=lr)

for epoch in range(epochs):
    for step, batch in tqdm(enumerate(dl), total=len(dl)):
        
        # Prepare the input images
        clean_images = batch.to(device)
        bs = clean_images.shape[0]

        # Sample a random timestep for each image
        timesteps = torch.randint(
            0, pipe.scheduler.config.num_train_timesteps, (bs,), device=clean_images.device
        ).long()

        # Add noise to the clean images according to the noise magnitude at each timestep
        noise = torch.randn(clean_images.shape).to(clean_images.device)
        noisy_images = pipe.scheduler.add_noise(clean_images, noise, timesteps)

        # Get the model prediction
        noise_pred = pipe.unet(noisy_images, timesteps, return_dict=False)[0]

        # Calculate the loss
        loss = F.mse_loss(noise_pred, noise)
        loss.backward()

        # Update the model parameters with the optimizer
        optimizer.step()
        optimizer.zero_grad()
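
The core of the loop is: noise a clean image to a random timestep, ask the network to predict that noise, and penalize the squared error. The pure-Python sketch below reproduces the noising formula x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε on a toy vector. It assumes a linear beta schedule from 1e-4 to 0.02 over 1000 steps (common DDPM defaults; the schedule stored in this pipeline's scheduler config may differ), and the function names are hypothetical rather than the diffusers API.

```python
import math
import random

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(clean, noise, t, alpha_bars):
    """Forward process: blend the clean signal with noise according to timestep t."""
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    return [a * x + b * e for x, e in zip(clean, noise)]

def mse(pred, target):
    """Mean squared error, the loss used on the predicted noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

random.seed(0)
alpha_bars = linear_alpha_bars()
clean = [0.5, -0.25, 1.0, -1.0]                   # toy "image" in [-1, 1]
noise = [random.gauss(0.0, 1.0) for _ in clean]
for t in (0, 500, 999):
    noisy = add_noise(clean, noise, t, alpha_bars)
    print(t, round(alpha_bars[t], 4), [round(v, 2) for v in noisy])

# A perfect noise predictor scores zero loss; predicting all zeros does not.
print(mse(noise, noise), mse([0.0] * len(noise), noise) > 0)
```

At t = 0 the noisy sample is almost identical to the clean one, while near t = 999 it is almost pure noise, so sampling timesteps uniformly teaches the UNet to denoise at every noise level.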

Inference

pipe = DiffusionPipeline.from_pretrained("johnowhitaker/Electronic_test").to(device)

output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=22050))

# Make a longer sample by passing in a starting noise tensor with a different shape
noise = torch.randn(1, 1, pipe.unet.sample_size[0],pipe.unet.sample_size[1]*4).to(device)
output = pipe(noise=noise)
display(output.images[0])
display(Audio(output.audios[0], rate=22050))
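
Widening the noise tensor lengthens the clip because each spectrogram column covers one hop of audio. The sketch below estimates the resulting duration; the hop length of 512 samples and the base width of 256 columns are assumptions (check `pipe.mel` for the real values), while 22050 Hz is the rate used in the playback calls above. `clip_seconds` is a hypothetical helper.

```python
def clip_seconds(width_px, hop_length, sample_rate):
    """Approximate audio duration covered by a spectrogram that is width_px columns wide."""
    return width_px * hop_length / sample_rate

hop_length = 512      # assumed samples between neighbouring spectrogram columns
sample_rate = 22050   # rate used for playback in this post
base_width = 256      # assumed default spectrogram width

for factor in (1, 4):
    seconds = clip_seconds(base_width * factor, hop_length, sample_rate)
    print(f"width x{factor}: about {seconds:.1f} s")
```

Under these assumptions the default sample lasts roughly six seconds and the four-times-wider one roughly four times as long.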

Original article: https://blog.csdn.net/qq_41895747/article/details/142316769
